anoopkunchukuttan / indic_nlp_library

Resources and tools for Indian language Natural Language Processing
http://anoopkunchukuttan.github.io/indic_nlp_library/
MIT License

Tokenization failing for IITB Monolingual corpus #15

Closed shantipriyap closed 6 years ago

shantipriyap commented 6 years ago

Getting the below error while trying to tokenize the IITB monolingual corpus, while the same works fine for the parallel corpus (target language: Hindi):

```
Traceback (most recent call last):
  File "indic_tokenize.py", line 67, in <module>
    for line in ifile.readlines():
  File "/usr/lib/python2.7/codecs.py", line 676, in readlines
    return self.reader.readlines(sizehint)
  File "/usr/lib/python2.7/codecs.py", line 585, in readlines
    data = self.read()
  File "/usr/lib/python2.7/codecs.py", line 474, in read
    newchars, decodedbytes = self.decode(data, self.errors)
MemoryError
```

anoopkunchukuttan commented 6 years ago

The problem is that the command-line interface reads the entire file into memory before processing, which causes the memory error for large files. I have now changed it to read line by line. Let me know if that solves the problem.
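For illustration, the change amounts to streaming the input instead of calling `readlines()` (which materializes the whole file, as the traceback shows). This is a minimal sketch only; `tokenize_file` and its trivial whitespace tokenizer are hypothetical stand-ins, not the library's actual API:

```python
import codecs


def tokenize_file(infile, outfile, tokenize=lambda line: line.split()):
    # Hypothetical example, not indic_nlp_library's real interface.
    # Iterating over the file object streams it line by line, so memory
    # use stays constant; ifile.readlines() would load the whole corpus
    # at once and raise MemoryError on very large monolingual files.
    with codecs.open(infile, "r", encoding="utf-8") as ifile, \
         codecs.open(outfile, "w", encoding="utf-8") as ofile:
        for line in ifile:
            ofile.write(" ".join(tokenize(line.strip())) + "\n")
```

The memory footprint no longer depends on the corpus size, only on the length of the longest line.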

shantipriyap commented 6 years ago

Thanks, Anoop, for the quick fix. I have tried it, and it now works without the memory error. I am closing the issue.