bbengfort opened this issue 8 years ago
To fix this, I wrote a whitelist of Post IDs:

```
Wrote 566142 Post IDs in 644.759 seconds
Post.objects.count() == 566142
```
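A minimal sketch of what writing the whitelist might look like (`write_whitelist` and the file path are hypothetical names, not from the actual script):

```python
import time

def write_whitelist(post_ids, path):
    """Write one Post ID per line to a whitelist file; return the count."""
    start = time.time()
    count = 0
    with open(path, "w") as f:
        for pid in post_ids:
            f.write("%s\n" % pid)
            count += 1
    print("Wrote %d Post IDs in %0.3f seconds" % (count, time.time() - start))
    return count
```

In the real script the IDs would presumably come from a queryset like `Post.objects.scalar('id')` or a raw pymongo cursor.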
Wrote a script called blaze.py, which goes through every post and attempts to find any decoding errors:
```
100%|███████████████████████████████████████████████████████████████████████████████| 566142/566142 [02:09<00:00, 4365.45id/s]
```

Phase One: wrote 566142 Post IDs in 2 minutes 9 seconds

```
100%|█████████████████████████████████████████████████████████████████████████████| 566142/566142 [37:09<00:00, 267.01posts/s]
```

Phase Two: wrote 2 Post errors in 37 minutes 9 seconds
It only came up with 2 errors:
"571a333ac1808103a0d6067c",'utf-8' codec can't decode byte 0xed in position 48824: invalid continuation byte
"57726c2ac1808103a5ed63d6",'utf-8' codec can't decode byte 0xed in position 21004: invalid continuation byte
The pymongo driver is very strict: if it can't decode a mongo document, it raises an exception.
This is surfacing during export: after 12 minutes or so, a Post with an encoding error turns up and crashes the entire export process. Which is bad.
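One way to keep a single bad Post from killing the whole run is to catch the decode error per document and either skip it or decode with a lenient error handler. A stdlib-only sketch (`export_posts` is hypothetical; pymongo itself also exposes `CodecOptions(unicode_decode_error_handler='replace')` for this purpose, if I recall the API correctly):

```python
def export_posts(raw_posts, errors="strict"):
    """Decode each (post_id, raw_bytes) pair. With errors='strict',
    undecodable Posts are skipped and reported instead of crashing the
    export; with errors='replace', bad bytes become U+FFFD so nothing
    is dropped."""
    exported, skipped = [], []
    for pid, raw in raw_posts:
        try:
            exported.append((pid, raw.decode("utf-8", errors=errors)))
        except UnicodeDecodeError:
            skipped.append(pid)  # log and continue rather than abort
    return exported, skipped
```

Skipping loses the two bad Posts but keeps the export honest about what it dropped; `errors='replace'` keeps every Post at the cost of mangling the undecodable bytes.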