Assertion error when processing Gujarati (an Indic language)

OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch

MIT License

When I run preprocess.py from the main directory on a Gujarati text file, I get the same assertion error as here. After some investigation, the error seems to stem from the split_corpus function in misc.py.

Below are two items for Gujarati: first the original text, then the output of this function for that text.

દરેક ફિલ્ટર ડ્રાઇવર રિ પર્સ પોઇન્ટ સાથે સંકળાય ેલ છે કે કેમ તે જોવા માટે રિ પર્સ ડેટા ની ચકાસણી કરે છે અને જો તે ફિલ્ટર ડ્રાઇવર મેચ થાય છે તેવું ન ક્કી કરશે તો ત્યાર બાદ ફાઇલ સિસ્ટમ કોલ ને ખંડ િત કરી નાખ શે અને તે ના ખાસ કાર્ય નો અમલ કરશે. બિલ બોર્ડ હોટ 100માં ટોચ ના 40 માંથી 21, નવ #1 મુખ્ય ધારા ના રોક હિટ્સ, ચાર ગ્રે મી એવોર્ડ્સ, અને 10 MTV વિ ડિઓ મ્યુઝિક પુરસ્કાર ો આ બૅ ન્ડે અંક ે કર્યા છે.
[b'\xe0\xaa\xa6\xe0\xaa\xb0\xe0\xab\x87\xe0\xaa\x95 \xe0\xaa\xab\xe0\xaa\xbf\xe0\xaa\xb2\xe0\xab\x8d\xe0\xaa\x9f\xe0\xaa\xb0 \xe0\xaa\xa1\xe0\xab\x8d\xe0\xaa\xb0\xe0\xaa\xbe\xe0\xaa\x87\xe0\xaa\xb5\xe0\xaa\xb0 \xe0\xaa\xb0\xe0\xaa\xbf \xe0\xaa\xaa\xe0\xaa\xb0\xe0\xab\x8d\xe0\xaa\xb8 \xe0\xaa\xaa\xe0\xab\x8b\xe0\xaa\x87\xe0\xaa\xa8\xe0\xab\x8d\xe0\xaa\x9f \xe0\xaa\xb8\xe0\xaa\xbe\xe0\xaa\xa5\xe0\xab\x87 \xe0\xaa\xb8\xe0\xaa\x82\xe0\xaa\x95\xe0\xaa\xb3\xe0\xaa\xbe\xe0\xaa\xaf \xe0\xab\x87\xe0\xaa\xb2 \xe0\xaa\x9b\xe0\xab\x87 \xe0\xaa\x95\xe0\xab\x87 \xe0\xaa\x95\xe0\xab\x87\xe0\xaa\xae \xe0\xaa\xa4\xe0\xab\x87 \xe0\xaa\x9c\xe0\xab\x8b\xe0\xaa\xb5\xe0\xaa\xbe \xe0\xaa\xae\xe0\xaa\xbe\xe0\xaa\x9f\xe0\xab\x87 \xe0\xaa\xb0\xe0\xaa\xbf \xe0\xaa\xaa\xe0\xaa\xb0\xe0\xab\x8d\xe0\xaa\xb8 \xe0\xaa\xa1\xe0\xab\x87\xe0\xaa\x9f\xe0\xaa\xbe \xe0\xaa\xa8\xe0\xab\x80 \xe0\xaa\x9a\xe0\xaa\x95\xe0\xaa\xbe\xe0\xaa\xb8\xe0\xaa\xa3\xe0\xab\x80 \xe0\xaa\x95\xe0\xaa\xb0\xe0\xab\x87 \xe0\xaa\x9b\xe0\xab\x87 \xe0\xaa\x85\xe0\xaa\xa8\xe0\xab\x87 \xe0\xaa\x9c\xe0\xab\x8b \xe0\xaa\xa4\xe0\xab\x87 \xe0\xaa\xab\xe0\xaa\xbf\xe0\xaa\xb2\xe0\xab\x8d\xe0\xaa\x9f\xe0\xaa\xb0 \xe0\xaa\xa1\xe0\xab\x8d\xe0\xaa\xb0\xe0\xaa\xbe\xe0\xaa\x87\xe0\xaa\xb5\xe0\xaa\xb0 \xe0\xaa\xae\xe0\xab\x87\xe0\xaa\x9a \xe0\xaa\xa5\xe0\xaa\xbe\xe0\xaa\xaf \xe0\xaa\x9b\xe0\xab\x87 \xe0\xaa\xa4\xe0\xab\x87\xe0\xaa\xb5\xe0\xab\x81\xe0\xaa\x82 \xe0\xaa\xa8 \xe0\xaa\x95\xe0\xab\x8d\xe0\xaa\x95\xe0\xab\x80 \xe0\xaa\x95\xe0\xaa\xb0\xe0\xaa\xb6\xe0\xab\x87 \xe0\xaa\xa4\xe0\xab\x8b \xe0\xaa\xa4\xe0\xab\x8d\xe0\xaa\xaf\xe0\xaa\xbe\xe0\xaa\xb0 \xe0\xaa\xac\xe0\xaa\xbe\xe0\xaa\xa6 \xe0\xaa\xab\xe0\xaa\xbe\xe0\xaa\x87\xe0\xaa\xb2 \xe0\xaa\xb8\xe0\xaa\xbf\xe0\xaa\xb8\xe0\xab\x8d\xe0\xaa\x9f\xe0\xaa\xae \xe0\xaa\x95\xe0\xab\x8b\xe0\xaa\xb2 \xe0\xaa\xa8\xe0\xab\x87 \xe0\xaa\x96\xe0\xaa\x82\xe0\xaa\xa1 \xe0\xaa\xbf\xe0\xaa\xa4 \xe0\xaa\x95\xe0\xaa\xb0\xe0\xab\x80 \xe0\xaa\xa8\xe0\xaa\xbe\xe0\xaa\x96 \xe0\xaa\xb6\xe0\xab\x87 
\xe0\xaa\x85\xe0\xaa\xa8\xe0\xab\x87 \xe0\xaa\xa4\xe0\xab\x87 \xe0\xaa\xa8\xe0\xaa\xbe \xe0\xaa\x96\xe0\xaa\xbe\xe0\xaa\xb8 \xe0\xaa\x95\xe0\xaa\xbe\xe0\xaa\xb0\xe0\xab\x8d\xe0\xaa\xaf \xe0\xaa\xa8\xe0\xab\x8b \xe0\xaa\x85\xe0\xaa\xae\xe0\xaa\xb2 \xe0\xaa\x95\xe0\xaa\xb0\xe0\xaa\xb6\xe0\xab\x87.\n', b'\xe0\xaa\xac\xe0\xaa\xbf\xe0\xaa\xb2 \xe0\xaa\xac\xe0\xab\x8b\xe0\xaa\xb0\xe0\xab\x8d\xe0\xaa\xa1 \xe0\xaa\xb9\xe0\xab\x8b\xe0\xaa\x9f 100\xe0\xaa\xae\xe0\xaa\xbe\xe0\xaa\x82 \xe0\xaa\x9f\xe0\xab\x8b\xe0\xaa\x9a \xe0\xaa\xa8\xe0\xaa\xbe 40 \xe0\xaa\xae\xe0\xaa\xbe\xe0\xaa\x82\xe0\xaa\xa5\xe0\xab\x80 21, \xe0\xaa\xa8\xe0\xaa\xb5 #1 \xe0\xaa\xae\xe0\xab\x81\xe0\xaa\x96\xe0\xab\x8d\xe0\xaa\xaf \xe0\xaa\xa7\xe0\xaa\xbe\xe0\xaa\xb0\xe0\xaa\xbe \xe0\xaa\xa8\xe0\xaa\xbe \xe0\xaa\xb0\xe0\xab\x8b\xe0\xaa\x95 \xe0\xaa\xb9\xe0\xaa\xbf\xe0\xaa\x9f\xe0\xab\x8d\xe0\xaa\xb8, \xe0\xaa\x9a\xe0\xaa\xbe\xe0\xaa\xb0 \xe0\xaa\x97\xe0\xab\x8d\xe0\xaa\xb0\xe0\xab\x87 \xe0\xaa\xae\xe0\xab\x80 \xe0\xaa\x8f\xe0\xaa\xb5\xe0\xab\x8b\xe0\xaa\xb0\xe0\xab\x8d\xe0\xaa\xa1\xe0\xab\x8d\xe0\xaa\xb8, \xe0\xaa\x85\xe0\xaa\xa8\xe0\xab\x87 10 MTV \xe0\xaa\xb5\xe0\xaa\xbf \xe0\xaa\xa1\xe0\xaa\xbf\xe0\xaa\x93 \xe0\xaa\xae\xe0\xab\x8d\xe0\xaa\xaf\xe0\xab\x81\xe0\xaa\x9d\xe0\xaa\xbf\xe0\xaa\x95 \xe0\xaa\xaa\xe0\xab\x81\xe0\xaa\xb0\xe0\xaa\xb8\xe0\xab\x8d\xe0\xaa\x95\xe0\xaa\xbe\xe0\xaa\xb0 \xe0\xab\x8b \xe0\xaa\x86 \xe0\xaa\xac\xe0\xab\x85 \xe0\xaa\xa8\xe0\xab\x8d\xe0\xaa\xa1\xe0\xab\x87 \xe0\xaa\x85\xe0\xaa\x82\xe0\xaa\x95 \xe0\xab\x87 \xe0\xaa\x95\xe0\xaa\xb0\xe0\xab\x8d\xe0\xaa\xaf\xe0\xaa\xbe \xe0\xaa\x9b\xe0\xab\x87.\n']

The same function seems to work normally for English text. Again, the original text first, then the function's output:

Each filter driver examines the reparse data to see whether it is associated with that reparse point, and if that filter driver determines a match, then it intercepts the file system request and performs its special functionality. The band has scored twenty-one Top 40 hits on the "Billboard" Hot 100, nine # 1 Mainstream Rock hits, four Grammy Awards, six American Music Awards, and ten MTV Video Music Awards.
[b'Each filter driver examines the reparse data to see whether it is associated with that reparse point, and if that filter driver determines a match, then it intercepts the file system request and performs its special functionality.\n', b'The band has scored twenty-one Top 40 hits on the "Billboard" Hot 100, nine # 1 Mainstream Rock hits, four Grammy Awards, six American Music Awards, and ten MTV Video Music Awards.\n']

I am quite new to this field and am using this framework for the first time. Any help would be appreciated.

Those are UTF-8 encoded bytes. Python displays the bytes that are also printable ASCII as regular characters and everything else as \x escapes, which is why the English output looks like plain text. You can check this by decoding. Scroll to the end of this box, where I'm just calling .decode('utf-8'):

b'\xe0\xaa\xa6\xe0\xaa\xb0\xe0\xab\x87\xe0\xaa\x95 \xe0\xaa\xab\xe0\xaa\xbf\xe0\xaa\xb2\xe0\xab\x8d\xe0\xaa\x9f\xe0\xaa\xb0 \xe0\xaa\xa1\xe0\xab\x8d\xe0\xaa\xb0\xe0\xaa\xbe\xe0\xaa\x87\xe0\xaa\xb5\xe0\xaa\xb0 \xe0\xaa\xb0\xe0\xaa\xbf \xe0\xaa\xaa\xe0\xaa\xb0\xe0\xab\x8d\xe0\xaa\xb8 \xe0\xaa\xaa\xe0\xab\x8b\xe0\xaa\x87\xe0\xaa\xa8\xe0\xab\x8d\xe0\xaa\x9f \xe0\xaa\xb8\xe0\xaa\xbe\xe0\xaa\xa5\xe0\xab\x87 \xe0\xaa\xb8\xe0\xaa\x82\xe0\xaa\x95\xe0\xaa\xb3\xe0\xaa\xbe\xe0\xaa\xaf \xe0\xab\x87\xe0\xaa\xb2 \xe0\xaa\x9b\xe0\xab\x87 \xe0\xaa\x95\xe0\xab\x87 \xe0\xaa\x95\xe0\xab\x87\xe0\xaa\xae \xe0\xaa\xa4\xe0\xab\x87 \xe0\xaa\x9c\xe0\xab\x8b\xe0\xaa\xb5\xe0\xaa\xbe \xe0\xaa\xae\xe0\xaa\xbe\xe0\xaa\x9f\xe0\xab\x87 \xe0\xaa\xb0\xe0\xaa\xbf \xe0\xaa\xaa\xe0\xaa\xb0\xe0\xab\x8d\xe0\xaa\xb8 \xe0\xaa\xa1\xe0\xab\x87\xe0\xaa\x9f\xe0\xaa\xbe \xe0\xaa\xa8\xe0\xab\x80 \xe0\xaa\x9a\xe0\xaa\x95\xe0\xaa\xbe\xe0\xaa\xb8\xe0\xaa\xa3\xe0\xab\x80 \xe0\xaa\x95\xe0\xaa\xb0\xe0\xab\x87 \xe0\xaa\x9b\xe0\xab\x87 \xe0\xaa\x85\xe0\xaa\xa8\xe0\xab\x87 \xe0\xaa\x9c\xe0\xab\x8b \xe0\xaa\xa4\xe0\xab\x87 \xe0\xaa\xab\xe0\xaa\xbf\xe0\xaa\xb2\xe0\xab\x8d\xe0\xaa\x9f\xe0\xaa\xb0 \xe0\xaa\xa1\xe0\xab\x8d\xe0\xaa\xb0\xe0\xaa\xbe\xe0\xaa\x87\xe0\xaa\xb5\xe0\xaa\xb0 \xe0\xaa\xae\xe0\xab\x87\xe0\xaa\x9a \xe0\xaa\xa5\xe0\xaa\xbe\xe0\xaa\xaf \xe0\xaa\x9b\xe0\xab\x87 \xe0\xaa\xa4\xe0\xab\x87\xe0\xaa\xb5\xe0\xab\x81\xe0\xaa\x82 \xe0\xaa\xa8 \xe0\xaa\x95\xe0\xab\x8d\xe0\xaa\x95\xe0\xab\x80 \xe0\xaa\x95\xe0\xaa\xb0\xe0\xaa\xb6\xe0\xab\x87 \xe0\xaa\xa4\xe0\xab\x8b \xe0\xaa\xa4\xe0\xab\x8d\xe0\xaa\xaf\xe0\xaa\xbe\xe0\xaa\xb0 \xe0\xaa\xac\xe0\xaa\xbe\xe0\xaa\xa6 \xe0\xaa\xab\xe0\xaa\xbe\xe0\xaa\x87\xe0\xaa\xb2 \xe0\xaa\xb8\xe0\xaa\xbf\xe0\xaa\xb8\xe0\xab\x8d\xe0\xaa\x9f\xe0\xaa\xae \xe0\xaa\x95\xe0\xab\x8b\xe0\xaa\xb2 \xe0\xaa\xa8\xe0\xab\x87 \xe0\xaa\x96\xe0\xaa\x82\xe0\xaa\xa1 \xe0\xaa\xbf\xe0\xaa\xa4 \xe0\xaa\x95\xe0\xaa\xb0\xe0\xab\x80 \xe0\xaa\xa8\xe0\xaa\xbe\xe0\xaa\x96 \xe0\xaa\xb6\xe0\xab\x87 
\xe0\xaa\x85\xe0\xaa\xa8\xe0\xab\x87 \xe0\xaa\xa4\xe0\xab\x87 \xe0\xaa\xa8\xe0\xaa\xbe \xe0\xaa\x96\xe0\xaa\xbe\xe0\xaa\xb8 \xe0\xaa\x95\xe0\xaa\xbe\xe0\xaa\xb0\xe0\xab\x8d\xe0\xaa\xaf \xe0\xaa\xa8\xe0\xab\x8b \xe0\xaa\x85\xe0\xaa\xae\xe0\xaa\xb2 \xe0\xaa\x95\xe0\xaa\xb0\xe0\xaa\xb6\xe0\xab\x87.\n'.decode('utf-8')

'દરેક ફિલ્ટર ડ્રાઇવર રિ પર્સ પોઇન્ટ સાથે સંકળાય ેલ છે કે કેમ તે જોવા માટે રિ પર્સ ડેટા ની ચકાસણી કરે છે અને જો તે ફિલ્ટર ડ્રાઇવર મેચ થાય છે તેવું ન ક્કી કરશે તો ત્યાર બાદ ફાઇલ સિસ્ટમ કોલ ને ખંડ િત કરી નાખ શે અને તે ના ખાસ કાર્ય નો અમલ કરશે.\n'

Just as

b'Each filter driver examines the reparse data to see whether it is associated with that reparse point, and if that filter driver determines a match, then it intercepts the file system request and performs its special functionality.\n'.decode('utf-8')

gives

'Each filter driver examines the reparse data to see whether it is associated with that reparse point, and if that filter driver determines a match, then it intercepts the file system request and performs its special functionality.\n'

You can also test it the other way around, by calling .encode("utf-8") on the unicode string:

u"દરેક ફિલ્ટર ડ્રાઇવર રિ પર્સ પોઇન્ટ સાથે સંકળાય ેલ છે કે કેમ તે જોવા માટે રિ પર્સ ડેટા ની ચકાસણી કરે છે અને જો તે ફિલ્ટર ડ્રાઇવર મેચ થાય છે તેવું ન ક્કી કરશે તો ત્યાર બાદ ફાઇલ સિસ્ટમ કોલ ને ખંડ િત કરી નાખ શે અને તે ના ખાસ કાર્ય નો અમલ કરશે. બિલ બોર્ડ હોટ 100માં ટોચ ના 40 માંથી 21, નવ #1 મુખ્ય ધારા ના રોક હિટ્સ, ચાર ગ્રે મી એવોર્ડ્સ, અને 10 MTV વિ ડિઓ મ્યુઝિક પુરસ્કાર ો આ બૅ ન્ડે અંક ે કર્યા છે.".encode("utf-8")

gives the same byte string. (Note the u"..." if you're using Python 2.7)
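For a shorter check, here is a minimal round-trip sketch in Python 3 (so no u"..." prefix is needed). The sample is a short fragment of the Gujarati sentence above that mixes Gujarati letters with ASCII digits; the variable names are only for illustration.

```python
# Short fragment of the Gujarati text above, with ASCII digits mixed in.
text = "ટોચ ના 40 માંથી 21"

encoded = text.encode("utf-8")     # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

# The bytes repr keeps printable ASCII (digits, spaces) readable and
# shows every other byte as a \xNN escape.
print(encoded)
print(decoded == text)  # True: nothing is lost in the round trip
```

The printed bytes should look just like the output in the question: escapes for the Gujarati characters, plain digits for the numbers.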

So, I'm not so sure that what you're seeing is what's causing the problem.

Looking at the issue you linked above, it looks like your data might contain stray line-break sequences. Try running wc -l <source file> and then wc -l <target file>, which prints the number of newline characters in each file. The two numbers will likely differ.
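The same comparison can be sketched in Python. This is only an illustration and makes one assumption: that the mismatch comes from characters other than \n which str.splitlines() treats as line breaks (for example U+2028). Whether that is what happens inside split_corpus is not verified here. The helper names count_newlines, count_split_lines and write_temp are hypothetical, not part of OpenNMT-py.

```python
import os
import tempfile


def count_newlines(path):
    """Count newline bytes in a file, which is what `wc -l` reports."""
    with open(path, "rb") as f:
        return f.read().count(b"\n")


def count_split_lines(path, encoding="utf-8"):
    """Count the lines str.splitlines() produces after decoding the file."""
    with open(path, "rb") as f:
        return len(f.read().decode(encoding).splitlines())


def write_temp(text):
    """Write text to a temporary UTF-8 file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return path


# Two newline-terminated lines, but the first contains U+2028 (LINE SEPARATOR).
sample = write_temp("first\u2028half\nsecond\n")
print(count_newlines(sample))     # 2
print(count_split_lines(sample))  # 3
os.remove(sample)
```

If the two counts differ for your source or target file, the file contains characters that some line-splitting code treats as extra line breaks, and the sentence pairs will no longer line up.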

Assertion error when processing Gujarati (an Indic language) #1332