cosmocode / docsearch

Search through uploaded documents in DokuWiki
http://www.dokuwiki.org/plugin:docsearch

Memory consumption for extracted zip files #6

Closed benzolo closed 9 years ago

benzolo commented 14 years ago

I wrote the zip2txt.sh script (available on the plugin page) to index zip files. Because it joins all converted files from a given zip file into one huge txt file, the indexer consumes a lot of memory while working on that file.

On our wiki (4 years old, around 600 pages, 9 GB in size) we have a zip file that contains scientific literature as multiple PDF documents. After the conversions are joined together, the indexer/lexer needs a huge amount of memory. I had to set the PHP memory limit to 250 MB to avoid a crash on the generated text file. Here is the output of wc for this file:

    wc ./literatur.zip.txt
    lines  words    bytes
    78897  1242650  8856762  ./literatur.zip.txt

That means the indexer had to handle a single 8.8 MB txt file, which is of course not easy because the current indexing process tries to index the file in one big step.

Is there a way for this plugin to index each file found in a zip file separately but still return the zip file as the origin in a search? An unpacker script could return a list of file names instead of one big file. Or would it be better to change the DokuWiki indexing process so that it handles such big files with less memory consumption?

Andreas
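
The "list of file names" idea above can be illustrated with a minimal sketch. This is not the zip2txt.sh script and not part of the plugin; the function name `unpack_to_text_files` and the output naming scheme are hypothetical, and the conversion step is only a placeholder that decodes each member as UTF-8 text (a real unpacker would call a PDF or office converter at that point).

```python
import zipfile
from pathlib import Path


def unpack_to_text_files(zip_path, out_dir):
    """Write one text file per zip member and return the list of paths.

    Hypothetical helper for illustration only.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for index, info in enumerate(archive.infolist()):
            if info.is_dir():
                continue  # directory entries carry no content
            raw = archive.read(info)
            # Placeholder conversion: treat the member as UTF-8 text.
            # A real unpacker would run a format-specific converter here.
            text = raw.decode("utf-8", errors="replace")
            # Prefix with the member index so equal base names do not collide.
            target = out_dir / f"{index:04d}_{Path(info.filename).name}.txt"
            target.write_text(text, encoding="utf-8")
            written.append(str(target))
    return written
```

Each returned path could then be indexed on its own, so the indexer never sees more than one member's text at a time. How the search result would be mapped back to the zip file is left open here.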

splitbrain commented 9 years ago

There are two limits in the DokuWiki core indexer: memory and time. Both are usually restricted for a single request, and we tried to find a middle ground that works for typical wiki pages. The docsearch plugin reuses the same indexing mechanism, but since documents are usually much larger than wiki pages, this isn't always ideal.

To answer your question: there is currently no way around the huge memory requirements you are experiencing. Patches would be welcome, either to the plugin or to the DokuWiki core (assuming they still work for typical wiki pages within the usual limits).
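
As a rough illustration of what "less memory" could mean for a single large text file, the sketch below reads the file in bounded batches of whole lines instead of loading it in one step. It is not how the DokuWiki indexer works and is not a proposed patch; the function name `iter_line_batches` and the `max_chars` parameter are assumptions made for this example.

```python
def iter_line_batches(path, max_chars=256 * 1024):
    """Yield the file's text in batches of whole lines.

    A batch is emitted as soon as it reaches max_chars characters,
    so only one batch is held in memory at a time.
    """
    batch, size = [], 0
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            batch.append(line)
            size += len(line)
            if size >= max_chars:
                yield "".join(batch)
                batch, size = [], 0
    if batch:
        # Emit whatever is left after the last full batch.
        yield "".join(batch)
```

A consumer would process each batch and discard it before requesting the next one. Whether the existing index format can be updated incrementally in this way is a separate question that this sketch does not answer.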