raulpuric opened this issue 6 years ago
Sentences containing escape codes for non-ASCII characters in plaintext didn't work in the data processing step. I was using --loose_json and --lazy. IIRC, when not using --lazy, the model did train, but the resulting model had issues and predicted non-ASCII characters.
I tried modifying the repo code to fix this, but for reasons I didn't understand, every change only partially fixed the problem: I still had issues with non-ASCII escape sequences concatenated in the middle of words, like b\xe2\xe2g.
In the end I just put some draconian restrictions on my input data before running it through the repo code.
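Concretely, the restriction I applied was roughly along these lines (a minimal sketch rather than the exact script I used; the file names are placeholders, and I'm assuming --loose_json means one JSON object per line, which is how I was using it):

```python
import json

def drop_non_ascii(text):
    # Keep only text that survives a strict ASCII round trip; anything with
    # stray \xNN escape bytes or real unicode characters is thrown away.
    try:
        text.encode('ascii')
        return text
    except UnicodeEncodeError:
        return None

with open('raw.json') as fin, open('clean.json', 'w') as fout:
    for line in fin:
        record = json.loads(line)  # one JSON object per line (loose json)
        cleaned = drop_non_ascii(record.get('text', ''))
        if cleaned:
            record['text'] = cleaned
            fout.write(json.dumps(record) + '\n')
```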
Got it. I'll try to play around with non-ASCII chars in plaintext to see if we can make our pipeline handle it.
When you supplied b\xe2\xe2g, I'm assuming it was in plain text like "abcdb\xe2\xe2gghijklmnop". Is this assumption correct?
Yes, that's right. I'd already filtered out actual unicode characters from the raw data.
I wonder if some of this is caused by locale issues inside Docker, as in https://github.com/pytorch/text/issues/77
It seems like my issues might be fixed after adding the following to my Dockerfile:
RUN apt-get update && apt-get install -y --no-install-recommends language-pack-en
ENV LANG en_US.UTF-8
ENV LC_ALL en_US.UTF-8
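If anyone wants to confirm whether they're hitting the same locale problem before rebuilding their image, a quick diagnostic from inside the container is something like this (just a sketch of the check, not part of the repo):

```python
import locale

# Under the default POSIX/C locale this typically reports 'ANSI_X3.4-1968',
# i.e. plain ASCII; after setting LANG/LC_ALL to en_US.UTF-8 it should
# report 'UTF-8', and text I/O stops choking on non-ASCII bytes.
print(locale.getpreferredencoding())
```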
We're working this week on addressing some pain points in our data pipeline.
One problem in particular seems to be that the byte encoding of certain words/phrases causes crashes.
While we work to address this problem, it would be helpful if those who've been having issues with certain phrases respond to this issue with sets of phrases that didn't work (or did work) for them.
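One quick way to collect candidate phrases might be something like the following (a rough sketch only; the file name and the UTF-8 assumption are mine, not necessarily the pipeline's actual decode path):

```python
# Flag lines whose raw bytes don't decode cleanly as UTF-8; these are good
# candidates to paste into this issue as "phrases that didn't work".
with open('corpus.txt', 'rb') as f:
    for lineno, raw in enumerate(f, 1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError as err:
            print(lineno, err, raw[:80])
```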
Thanks