Closed rebeccabilbro closed 7 years ago
good (terrible) example: https://www.ssa.gov/foia/html/FY08CSV.csv
another great one: http://www.planecrashinfo.com/1920/1920.htm
Great examples - I definitely noticed all the error messages that came up as you were experimenting! The underlying error seems to be a Unicode decoding error, which is potentially more serious: it raises the question of whether these files are actually Unicode-encoded at all, or whether they use some other scheme (which would make things much more difficult).
Hmm, sounds like it's potentially related to my #43 then?
Potentially, though encoding detection is an annoyingly hard problem to get right. You could use the `file` command in your terminal to see whether your computer can identify the encoding. It's definitely something I'll take a look at.
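As an alternative to shelling out to `file`, a crude detection pass can be done in pure Python by trying candidate encodings in order. This is only a sketch, not code from this project, and the candidate list is my own choice; note that `latin-1` accepts any byte sequence, so it only works as a last-resort fallback:

```python
# Minimal encoding-guess sketch (not from this project): try candidate
# encodings in order and return the first one that decodes cleanly.
# Caveat: latin-1 never fails, so it catches everything the others miss.
def guess_encoding(data: bytes, candidates=("utf-8", "utf-16", "latin-1")):
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

print(guess_encoding("héllo".encode("utf-8")))      # "utf-8"
print(guess_encoding(b"\xff\xfeh\x00i\x00"))        # "utf-16" (BOM present)
```

Real-world detectors (like the `chardet` library or libmagic behind `file`) use byte-frequency heuristics instead, which is why this remains a hard problem.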
Found some more test data that might help with this issue, see: https://github.com/okfn/messytables/tree/7e4f12abef257a4d70a8020e0d024df6fbb02976/horror
Ok, so specifically for the files that @rebeccabilbro first linked, the Unicode decode errors have been resolved, presumably by the Python 3.x upgrade: the default encoding is now utf-8 instead of ascii, so these utf-8 (or utf-8-subset) encoded files no longer cause Unicode errors. Specifically, I tested today the CSVs from https://www.ssa.gov/foia/html/FY08CSV.csv and https://catalog.data.gov/dataset/veterans-health-administration-2008-hospital-report-card-patient-satisfaction, and the HTML from http://www.planecrashinfo.com/1920/1920.htm, with both the file storage and S3 backends, and none of them caused an error.
How we want to deal with files that are not utf-8 encoded is a much broader question. For example, a utf-16le encoded file (i.e. https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/horror/utf-16le_encoded.csv) won't work right now, since utf-16le isn't a subset of utf-8, but I'm not sure yet whether we really care. IMHO, it was unreasonable to expect ascii encoding, but it is not unreasonable to expect utf-8 (or a utf-8 subset).
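The incompatibility is easy to demonstrate in isolation. The sample row below is invented; the point is that any non-ASCII character encoded as utf-16le produces byte sequences that are invalid utf-8, while a utf-8 round-trip works fine:

```python
# A utf-16le-encoded row with a non-ASCII character cannot be read as
# utf-8, while utf-8 (and its ASCII subset) round-trips cleanly.
row = "café,2\n"  # hypothetical CSV row, 'é' forces non-ASCII bytes

utf8_bytes = row.encode("utf-8")
utf16_bytes = row.encode("utf-16le")

print(utf8_bytes.decode("utf-8"))  # round-trips fine
try:
    utf16_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print("utf-16le bytes are not valid utf-8:", err)
```

One subtlety: a *pure-ASCII* utf-16le file may decode as utf-8 without raising at all, yielding garbage text interleaved with NUL bytes, so the absence of a UnicodeDecodeError is not proof the encoding was right.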
So, given all of that, I am going to close this issue in terms of the scope of the initial bug. However, I will add a note to the roadmap issue for more general consideration of how we want to handle non-utf-8-subset encodings in this project going forward. cc @rebeccabilbro @ojedatony1616 @bbengfort @looselycoupled
I'm getting an error when I attempt to upload datasets that have missing values in some of the columns/rows. I noticed this because a lot of gov't datasets use the first few rows of a table to provide metadata info.
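For illustration, here is one stdlib-only way to cope with both problems at once (the file contents, row offsets, and column names below are invented, and the real fix would depend on how this project's upload pipeline parses CSVs):

```python
import csv
import io

# Hypothetical gov't-style CSV: the first rows are metadata, not headers,
# and some cells are empty (missing values).
raw = """Report: Hospital Satisfaction
Fiscal Year: 2008

state,score
TX,85
CA,
"""

reader = csv.reader(io.StringIO(raw))
rows = list(reader)[3:]  # skip the metadata lines before the real header
header, *data = rows
# Make missing values explicit instead of leaving empty strings
cleaned = [[cell or None for cell in record] for record in data]

print(header)   # ['state', 'score']
print(cleaned)  # [['TX', '85'], ['CA', None]]
```

Hard-coding the number of metadata rows to skip is fragile, of course; a more robust loader would scan forward until it finds a row that parses as a plausible header.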