Closed rebeccabilbro closed 7 years ago
good (terrible) example: https://www.ssa.gov/foia/html/FY08CSV.csv
another great one: http://www.planecrashinfo.com/1920/1920.htm
Great examples - I definitely noticed all the error messages that came up as you were experimenting! The underlying error seems to be a Unicode decoding error, which is potentially more serious: it raises the question of whether these files are actually Unicode-encoded at all, or whether they use some other scheme (which would make things much more difficult).
Hmm, sounds like it's potentially related to my #43 then?
Potentially, though encoding detection is an annoyingly hard problem to get right. You could use the `file` command in your terminal to see whether your computer can identify the encoding. It's definitely something I'll take a look at.
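As an alternative to shelling out to `file`, a crude detection pass can be done in pure Python by trying candidate encodings in order. This is only a sketch, not code from this project, and the candidate list is my own choice; note that `latin-1` accepts any byte sequence, so it only works as a last-resort fallback:

```python
# Minimal encoding-guess sketch (not from this project): try candidate
# encodings in order and return the first one that decodes cleanly.
# Caveat: latin-1 never fails, so it catches everything the others miss.
def guess_encoding(data: bytes, candidates=("utf-8", "utf-16", "latin-1")):
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

print(guess_encoding("héllo".encode("utf-8")))      # "utf-8"
print(guess_encoding(b"\xff\xfeh\x00i\x00"))        # "utf-16" (BOM present)
```

Real-world detectors (like the `chardet` library or libmagic behind `file`) use byte-frequency heuristics instead, which is why this remains a hard problem.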
Found some more test data that might help with this issue, see: https://github.com/okfn/messytables/tree/7e4f12abef257a4d70a8020e0d024df6fbb02976/horror
Ok, so specifically for the files that @rebeccabilbro first linked, the Unicode decode errors have been resolved, presumably by the Python 3.x upgrade: the default encoding is now utf-8 instead of ascii, so these utf-8 (or utf-8-subset) encoded files no longer cause Unicode errors. Specifically, I tested today the CSVs from https://www.ssa.gov/foia/html/FY08CSV.csv and https://catalog.data.gov/dataset/veterans-health-administration-2008-hospital-report-card-patient-satisfaction, and the HTML from http://www.planecrashinfo.com/1920/1920.htm, with both the file storage and S3 backends, and none of them caused an error.
How we want to deal with files that are not utf-8 encoded is a much broader question. For example, a utf-16le encoded file (i.e. https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/horror/utf-16le_encoded.csv) won't work right now, since utf-16le isn't a subset of utf-8, but I'm not sure yet whether we really care. IMHO, it was unreasonable to expect ascii encoding, but it is not unreasonable to expect utf-8 (or a utf-8 subset).
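The incompatibility is easy to demonstrate in isolation. The sample row below is invented; the point is that any non-ASCII character encoded as utf-16le produces byte sequences that are invalid utf-8, while a utf-8 round-trip works fine:

```python
# A utf-16le-encoded row with a non-ASCII character cannot be read as
# utf-8, while utf-8 (and its ASCII subset) round-trips cleanly.
row = "café,2\n"  # hypothetical CSV row, 'é' forces non-ASCII bytes

utf8_bytes = row.encode("utf-8")
utf16_bytes = row.encode("utf-16le")

print(utf8_bytes.decode("utf-8"))  # round-trips fine
try:
    utf16_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print("utf-16le bytes are not valid utf-8:", err)
```

One subtlety: a *pure-ASCII* utf-16le file may decode as utf-8 without raising at all, yielding garbage text interleaved with NUL bytes, so the absence of a UnicodeDecodeError is not proof the encoding was right.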
So, given all of that, I am going to close this issue in terms of the scope of the initial bug. However, I will add a note to the roadmap issue for more general consideration of how we want to handle non-utf-8-subset encodings in this project going forward. cc @rebeccabilbro @ojedatony1616 @bbengfort @looselycoupled
I'm getting an error when I attempt to upload datasets that have missing values in some of the columns/rows. I noticed this because a lot of gov't datasets use the first few rows of a table to provide metadata info.
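For illustration, here is one stdlib-only way to cope with both problems at once (the file contents, row offsets, and column names below are invented, and the real fix would depend on how this project's upload pipeline parses CSVs):

```python
import csv
import io

# Hypothetical gov't-style CSV: the first rows are metadata, not headers,
# and some cells are empty (missing values).
raw = """Report: Hospital Satisfaction
Fiscal Year: 2008

state,score
TX,85
CA,
"""

reader = csv.reader(io.StringIO(raw))
rows = list(reader)[3:]  # skip the metadata lines before the real header
header, *data = rows
# Make missing values explicit instead of leaving empty strings
cleaned = [[cell or None for cell in record] for record in data]

print(header)   # ['state', 'score']
print(cleaned)  # [['TX', '85'], ['CA', None]]
```

Hard-coding the number of metadata rows to skip is fragile, of course; a more robust loader would scan forward until it finds a row that parses as a plausible header.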