Bookworm-project / BookwormDB

Tools for text tokenization and encoding
MIT License

Efficiency with larger data sets #82

Open tpmccallum opened 9 years ago

tpmccallum commented 9 years ago

Hi, I have a question about ingesting text files in stages (as opposed to running the makefile in one sitting). When running the makefile over a very large number of records I get the following message, and I can't help thinking that there may be a more efficient way of ingesting the items:

```
parallel: Warning: No more processes: Decreasing number of running jobs to 1. Raising ulimit -u or /etc/security/limits.conf may help.
```
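
The warning above means GNU parallel has hit the per-user process limit and is throttling itself. One standard GNU parallel option (not mentioned in the thread, so treat this as a hedged suggestion) is to cap the number of concurrent jobs explicitly with `-j`. A minimal sketch against the same catalog and parser paths used in the commands below, with `$@` standing for the Makefile rule's target as in the original recipe:

```sh
# Cap GNU parallel at 8 concurrent jobs so it never tries to fork more
# workers than `ulimit -u` allows; tune -j to the machine.
cat files/metadata/jsoncatalog.txt \
  | parallel -j 8 --pipe python bookworm/MetaParser.py > $@
```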

Just to clarify: as far as I know there are no issues with the files or the catalog (the encoding is good, UTF-8 only, etc.). I run smaller sets from time to time for testing and they work fine. This efficiency issue only presents itself when ingesting over, say, 10 million records.

Please also see the following `ulimit -a` output:

```
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 31559
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 9000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 31559
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
```
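
The output above shows `max user processes` at 31559, which is the limit the parallel warning is bumping into. The warning's own alternative suggestion, raising that limit, would look roughly like this (the numbers and username are placeholders, not values from the thread):

```sh
# Raise the soft limit on user processes for the current shell session;
# it cannot exceed the hard limit (check that with `ulimit -Hu`).
ulimit -u 63118

# To make it persistent, add lines like these to /etc/security/limits.conf
# (replace <username> with the account running the build):
#   <username>  soft  nproc  63118
#   <username>  hard  nproc  63118
```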

Thanks so much, Tim

tpmccallum commented 9 years ago

Toying with an idea on line 81:

```
parallel -a files/metadata/jsoncatalog.txt --block 100M --pipepart python bookworm/MetaParser.py > $@
```

instead of

```
cat files/metadata/jsoncatalog.txt | parallel --pipe python bookworm/MetaParser.py > $@
```
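
For context on the difference: `--pipe` funnels the whole catalog through a single stdin pipe and chops it into blocks on the fly, while `--pipepart` (which requires `-a FILE`) lets each job read its block directly from the file, so it is usually much faster on large inputs. A small self-contained comparison on a throwaway file (the paths here are illustrative, not from the Bookworm build):

```sh
# Generate a throwaway input, then count lines per block both ways;
# --pipe streams everything through one pipe, --pipepart seeks into the file.
seq 1000000 > /tmp/demo.txt
cat /tmp/demo.txt | parallel --pipe --block 1M wc -l
parallel --pipepart -a /tmp/demo.txt --block 1M wc -l
```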

Will report back soon :)