NVIDIA / sentiment-discovery

Unsupervised Language Modeling at scale for robust sentiment classification

Addressing Encoding Errors #27

Open raulpuric opened 6 years ago

raulpuric commented 6 years ago

We're working this week on addressing some pain points in our data pipeline.

One particular problem seems to be with the byte encoding of certain words/phrases causing crashes.

While we work to address this problem, it would be helpful if those who've been having issues with certain phrases could respond to this issue with the sets of phrases that didn't work (or did work) for them.

Thanks

rainjacket commented 6 years ago

Sentences containing escape codes for non-ASCII characters in plaintext didn't work in the data-processing step. I was using --loose_json and --lazy. IIRC, when not using --lazy, it did train the model, but the resulting model had issues and predicted non-ASCII characters.

I tried modifying the repo code to fix this problem, but for reasons I didn't understand, whatever I tried only partially fixed it and still had issues with concatenated non-ASCII escape codes in the middle of words, like b\xe2\xe2g.

In the end I just put some draconian restrictions on my input data before proceeding to the repo code.
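A "draconian restriction" along these lines can be sketched as a pre-filter that keeps only pure-ASCII lines with no literal `\xNN` escape codes in them. This is a hypothetical reconstruction of the kind of filtering described above, not the actual code used; the regex and `keep_line` helper are assumptions for illustration:

```python
import re

# Matches a literal backslash-x escape code appearing as plain text,
# e.g. the four characters "\xe2" inside "b\xe2\xe2g".
ESCAPE_RE = re.compile(r"\\x[0-9a-fA-F]{2}")

def keep_line(line: str) -> bool:
    """Keep only pure-ASCII lines that contain no literal \\xNN codes."""
    return line.isascii() and not ESCAPE_RE.search(line)

lines = ["hello world", r"b\xe2\xe2g in the middle", "café"]
print([l for l in lines if keep_line(l)])  # only "hello world" survives
```

Anything the filter drops never reaches the repo's data pipeline, which trades some data loss for not crashing on mixed encodings.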

raulpuric commented 6 years ago

Got it. I'll try playing around with non-ASCII chars in plaintext to see if we can make our pipeline handle them.

When you supplied b\xe2\xe2g, I'm assuming it was in plain text, like "abcdb\xe2\xe2gghijklmnop". Is this assumption correct?

rainjacket commented 6 years ago

Yes, that's right. I'd already filtered out actual Unicode characters from the raw data.
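The distinction being confirmed here can be made concrete with a small sketch (my own illustration, not code from the repo): a string with a real non-ASCII character versus a string where the escape codes appear as literal ASCII text, plus why the raw bytes crash a naive UTF-8 decode:

```python
# A real non-ASCII character vs. the literal text "\xe2\xe2" appearing
# inside an otherwise-ASCII word (the case that still broke the pipeline).
actual = "bâg"            # contains one real non-ASCII character
literal = r"b\xe2\xe2g"   # ten ASCII characters spelling out escape codes

print(actual.isascii(), literal.isascii())  # False True

# The raw bytes 0xe2 0xe2 are not valid UTF-8: 0xe2 opens a three-byte
# sequence, and the next byte must be a continuation byte (0x80-0xBF),
# which 0xe2 is not. Decoding such data therefore raises an error:
try:
    b"b\xe2\xe2g".decode("utf-8")
except UnicodeDecodeError as e:
    print("decode failed:", e.reason)
```

So filtering out real Unicode characters doesn't help when the offending bytes (or their literal escape spellings) are embedded mid-word in otherwise-ASCII text.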

rainjacket commented 6 years ago

I wonder if some of this is caused by locale issues inside Docker, as in https://github.com/pytorch/text/issues/77

rainjacket commented 6 years ago

It seems like my issues might be fixed after adding the following to my Dockerfile:

```dockerfile
RUN apt-get update && apt-get install -y --no-install-recommends language-pack-en
ENV LANG en_US.UTF-8
ENV LC_ALL en_US.UTF-8
```
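A quick way to check whether the locale fix took effect inside the container (a sanity-check sketch, assuming Python is available in the image) is to inspect Python's preferred encoding, since that is what file I/O defaults to:

```python
import locale

# With LANG/LC_ALL set to en_US.UTF-8, this should report a UTF-8 encoding.
# In a bare Docker image with no locale configured, it often reports
# ANSI_X3.4-1968 (plain ASCII), which makes reading non-ASCII bytes crash.
print(locale.getpreferredencoding())
```

This matches the pytorch/text issue linked above, where an unconfigured container locale caused the same class of decoding failures.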