Bookworm-project / BookwormDB

Tools for text tokenization and encoding
MIT License

Ulimit issue #75

Closed tpmccallum closed 9 years ago

tpmccallum commented 9 years ago

I am running the makefile and getting a message partway through about raising ulimit. I have raised it (pam.d, etc.), which allows it to run a bit further, but it still does not finish successfully. Is there a way to chunk-load (append) rather than processing all the files at once? Or is there something else that might help? I have about 100,000 files to process. Thanks, guys

bmschmidt commented 9 years ago

Hmm, this is an unfamiliar problem to me.

  1. Could you paste in the error?
  2. Which input format are you using?
  3. What is the system configuration?
william-bratches commented 9 years ago

Is this during the conversion from raw files to input.txt/jsoncatalog.txt? I commonly ran into memory issues. You can try removing the stack limit with `ulimit -s unlimited`. If it then hits a segmentation fault, you're running out of memory and have to attack the problem programmatically or by physically adding RAM to your hardware.
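If it helps, the same limits can be inspected and raised from inside a Python process with the standard-library `resource` module. This is only a minimal sketch, and it assumes the kernel is complaining about the stack limit; with ~100,000 files it could just as well be the open-files limit (`RLIMIT_NOFILE`), so check which one the error actually names:

```python
import resource

# Inspect the current soft/hard stack limits for this process
# (the same limit that "ulimit -s" controls in the shell).
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("stack limit: soft=%s hard=%s" % (soft, hard))

# Raise the soft limit up to the hard limit for this process and its
# children; raising the hard limit itself normally requires root.
resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
```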

I'm not sure how well optimized make is for processing jobs this large; you may want to try rolling your own script in something like Python, which has automatic garbage collection and multithreading. Make sure your algorithms run in linear time!
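Something along these lines, for example. It is a rough, untested sketch: the directory names, the batch size, and the tab-separated one-document-per-line output are assumptions on my part, so check the Bookworm docs for the exact shape input.txt expects:

```python
import gc
from pathlib import Path

# Hypothetical paths and batch size -- adjust to your layout.
SOURCE_DIR = Path("texts")
OUTPUT = Path("input.txt")
BATCH_SIZE = 1000

def batches(paths, size):
    """Yield successive lists of `size` paths so only one batch is held at a time."""
    batch = []
    for p in paths:
        batch.append(p)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

with OUTPUT.open("a", encoding="utf-8") as out:
    for batch in batches(sorted(SOURCE_DIR.glob("*.txt")), BATCH_SIZE):
        for path in batch:
            text = path.read_text(encoding="utf-8", errors="replace")
            # One document per line: id, tab, then the text collapsed
            # onto a single line (assumed format -- verify against the docs).
            out.write(path.stem + "\t" + " ".join(text.split()) + "\n")
        gc.collect()  # release per-batch memory before the next chunk
```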

tpmccallum commented 9 years ago

Hi, I think it had to do with the character encoding of the text files. At the time of the issue, I noticed errors relating to character encoding and to subsequent processes being spawned (which I believe caused the ulimit message). I am not at the computer at present, so I can't provide any logs or evidence; however, earlier today I ran a home-made script over the text files (keeping only valid UTF-8 characters) and it all worked. Thanks for your help! Really appreciate it! Tim
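For reference, the script was essentially just a permissive decode-and-rewrite pass like the sketch below (the directory names are made up; `errors="replace"` swaps invalid bytes for U+FFFD, or use `errors="ignore"` to drop them instead):

```python
from pathlib import Path

# Hypothetical directory names -- the idea is simply to decode each file
# permissively and rewrite it as clean UTF-8.
SRC = Path("texts_raw")
DST = Path("texts_clean")
DST.mkdir(exist_ok=True)

for path in SRC.glob("*.txt"):
    raw = path.read_bytes()
    # Replace any byte sequences that are not valid UTF-8.
    text = raw.decode("utf-8", errors="replace")
    (DST / path.name).write_text(text, encoding="utf-8")
```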
