Ubuntu IRC broken encoding, impacting generative models downstream

EleutherAI / the-pile

MIT License


Ubuntu IRC broken encoding, impacting generative models downstream #102

Open briansemrau opened 1 year ago

briansemrau commented 1 year ago

The Ubuntu IRC dataset appears to contain broken character encoding, which noticeably impacts generated output from models trained on The Pile in certain situations.

For example, the file https://irclogs.ubuntu.com/2020/08/23/%23ubuntu.txt contains Â¯\_(ãƒ„)_/Â¯, which should instead appear as ¯\_(ツ)_/¯ if it were properly decoded.

I can't currently inspect the data directly in The Pile, because the-eye.eu and eaidata.bmk.sh are both inaccessible right now. However, I have seen lots of garbled output from GPT-J that looks remarkably similar to this broken encoding, e.g. Â¯_(ã)_/Â¯

It looks like this dataset could be cleaned using the ftfy Python library (https://ftfy.readthedocs.io/en/latest/). In my very brief testing, it fixes the broken encoding in the file linked above.
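The garbling in the example above is consistent with UTF-8 bytes being decoded as Windows-1252. The following is a minimal standard-library sketch of that assumption (it does not use ftfy, and the variable names are illustrative only): it reproduces the broken string from the correct one and then reverses the mistake.

```python
# Correctly encoded text, as it should appear in the log.
original = "¯\\_(ツ)_/¯"

# Simulate the suspected bug: the UTF-8 bytes are misread as Windows-1252.
garbled = original.encode("utf-8").decode("cp1252")

# Reverse the mistake: recover the original bytes, then decode them as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```

This simple round trip only works when every garbled character exists in Windows-1252 and the text was garbled exactly once, which is one reason a dedicated tool may be the more robust choice for a whole dataset.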

Mistobaan commented 1 year ago

~Could we download them again without errors, or are they gone?~ My guess is that this is a UTF-8-to-ASCII error. Maybe the server is messing with the encoding? Try requesting UTF-8 when making the GET request.

briansemrau commented 1 year ago

I don't believe you can specify character encoding in HTTP requests. I'll try to contact the author of the bot that scrapes for irclogs.ubuntu.com to get some insight, or report a bug (no way the data has been encoded wrong for over a decade, right?...)

briansemrau commented 1 year ago

Found the solution. The .txt files use mixed encodings that vary line by line.

This dataset must be properly decoded before use. This can be done fairly simply:

https://github.com/mgedmin/irclog2html/blob/ab7759e4b54f146f9c585d2c71d321fbda5c1e1c/src/irclog2html/irclog2html.py#L199-L208

https://github.com/mgedmin/irclog2html/blob/ab7759e4b54f146f9c585d2c71d321fbda5c1e1c/src/irclog2html/irclog2html.py#L141-L154
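For readers who cannot follow the links, here is a rough sketch of per-line decoding for a mixed-encoding log. It is not copied from irclog2html: the function names and the fallback order (UTF-8, then Windows-1252, then Latin-1) are assumptions made for illustration, and the linked code remains the authoritative reference.

```python
def decode_line(raw: bytes) -> str:
    """Decode one log line, trying UTF-8 first, then legacy encodings."""
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a code point, so this last resort cannot fail.
    return raw.decode("latin-1")


def decode_log(data: bytes) -> str:
    """Decode a whole log by handling each line independently."""
    return "".join(decode_line(line) for line in data.splitlines(keepends=True))


# Hypothetical sample: one UTF-8 line followed by one Latin-1 line.
sample = "¯\\_(ツ)_/¯\n".encode("utf-8") + "café\n".encode("latin-1")
print(decode_log(sample))
```

The key point is that the file is read as bytes and each line is decoded on its own, rather than decoding the whole file with a single encoding.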

keunwoochoi commented 1 year ago

@briansemrau do you know if huggingface would decode this properly? i'm not sure where i should look in https://github.com/huggingface/datasets/tree/main/src/datasets/utils

briansemrau commented 1 year ago

> do you know if huggingface would decode this properly?

I would not expect it to. This dataset has strange encoding to work around a specific technical problem with IRC compatibility. You should use the code from the links I posted above to make sure the data is being properly decoded.

keunwoochoi commented 1 year ago

i see. thank you very much!