DistrictDataLabs / baleen

An automated ingestion service for blogs to construct a corpus for NLP research.
MIT License

Unicode decode error #73

Open bbengfort opened 8 years ago

bbengfort commented 8 years ago

The pymongo driver is very strict: if it can't decode a MongoDB document, it raises an exception.

This is turning up in export: after 12 minutes or so, a Post with an encoding error comes up and crashes the entire export process, which is bad.
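For illustration, here is a minimal sketch of the kind of failure described above, using only Python's built-in UTF-8 codec rather than pymongo itself. The byte string is made up for the example; it is not data from the corpus. Strict decoding raises on the first bad byte, while a lenient error handler substitutes a replacement character and carries on.

```python
# Made-up bytes for illustration: 0xed opens a three-byte UTF-8 sequence,
# but the next byte (0x20, a space) is not a valid continuation byte.
raw = b"caf\xed latte"


def decode_strict(data):
    """Decode as strict UTF-8; return (text, error), where error is None on success."""
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, exc


# Strict mode fails outright, which is what takes down a long-running export.
text, error = decode_strict(raw)

# A lenient handler swaps the bad byte for U+FFFD instead of raising.
lenient = raw.decode("utf-8", errors="replace")
```

Whether a lenient handler is acceptable depends on the use case: it keeps the export running but silently alters the affected text.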

bbengfort commented 8 years ago

https://github.com/mongodb-labs/mongo-connector/issues/101

bbengfort commented 8 years ago

To fix this, I wrote out a whitelist of Post IDs:

Wrote 566142 Posts IDs in 644.759 seconds
Post.objects.count() == 566142

bbengfort commented 8 years ago

Wrote a script called blaze.py, which goes through the posts and attempts to find any with decoding errors:

100%|███████████████████████████████████████████████████████████████████████████████| 566142/566142 [02:09<00:00, 4365.45id/s]
Phase One: wrote 566142 Posts IDs in 2 minutes 9 seconds
100%|█████████████████████████████████████████████████████████████████████████████| 566142/566142 [37:09<00:00, 267.01posts/s]
Phase Two: wrote 2 Post errors in 37 minutes 9 seconds

It only came up with 2 errors:

"571a333ac1808103a0d6067c",'utf-8' codec can't decode byte 0xed in position 48824: invalid continuation byte
"57726c2ac1808103a5ed63d6",'utf-8' codec can't decode byte 0xed in position 21004: invalid continuation byte
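The two-phase approach can be sketched roughly as follows. This is not the actual blaze.py, which is not shown in this thread; `scan_for_decode_errors`, `fetch`, and the in-memory `store` are hypothetical stand-ins for loading one Post at a time by ID. The idea is that each document is decoded on its own, so one bad Post is recorded instead of aborting the whole run.

```python
def scan_for_decode_errors(ids, fetch):
    """Fetch each document by ID individually; return [(id, message)] for failures."""
    errors = []
    for doc_id in ids:
        try:
            fetch(doc_id)
        except UnicodeDecodeError as exc:
            # Record the failure and keep going rather than crashing.
            errors.append((doc_id, str(exc)))
    return errors


# Hypothetical stand-in for the database: IDs mapped to raw stored bytes.
store = {
    "a1": b"fine",
    "b2": b"bad \xed byte",               # invalid UTF-8
    "c3": "ok \u00ed".encode("utf-8"),    # valid UTF-8
}


def fetch(doc_id):
    """Hypothetical loader: decodes strictly, as a strict driver would."""
    return store[doc_id].decode("utf-8")


ids = list(store)                            # phase one: collect IDs only
errors = scan_for_decode_errors(ids, fetch)  # phase two: decode one by one
```

An export could then skip the IDs listed in `errors`, or handle them separately, while processing everything else normally.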