ryscher closed this 11 months ago
I've still got #2405 open but I might close that in favor of this one—unless you think these are different tickets?
I've read up on the documentation and checked many of the unhelpful "errors". A lot of the type errors are due to frictionless guessing the number of header rows incorrectly, and thus marking the second header row as an error.
I've tried decreasing the field_confidence with which frictionless casts data, and increasing the sample_size frictionless uses to cast data. Each just moved the bad type errors around a document. The only thing that seems to actually stop the errors is declaring all field_type='any', which prevents data type casting entirely. That's kind of the nuclear option. But maybe worthwhile, to take the focus off these very prevalent and useless alerts about type and refocus on more worthwhile errors?
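To illustrate why a misread second header row turns into a type error, here is a toy model of confidence-based type guessing. It is only a sketch and not frictionless's actual algorithm: `guess_type`, `type_errors`, and the example column are hypothetical, and the `confidence` argument only loosely mirrors the idea behind field_confidence.

```python
def looks_like_int(value: str) -> bool:
    """True if the cell text parses as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def guess_type(sample, confidence):
    """Guess 'integer' if at least `confidence` of the non-blank sample cells
    parse as integers; otherwise fall back to 'string'. (Hypothetical helper.)"""
    cells = [c for c in sample if c.strip() != ""]
    if not cells:
        return "string"
    share = sum(looks_like_int(c) for c in cells) / len(cells)
    return "integer" if share >= confidence else "string"


def type_errors(column, field_type):
    """Return 0-based positions of non-blank cells that do not match the type.
    Any type other than 'integer' (e.g. 'any') is never checked here."""
    if field_type != "integer":
        return []
    return [i for i, c in enumerate(column) if c.strip() and not looks_like_int(c)]


# A column whose first "data" cell is really a second header row (a unit label).
column = ["mm", "12", "15", "9", "11", "14", "10", "13", "8", "16"]

guessed = guess_type(column, confidence=0.9)   # 9 of 10 cells are integers
errors = type_errors(column, guessed)          # the unit label is flagged
no_errors = type_errors(column, "any")         # nothing is cast, nothing is flagged
```

In this toy model, changing the threshold or the sample only changes which columns get a guessed type, so the flagged cells move rather than disappear; skipping the cast entirely is the only setting that removes them.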
I've been discussing this with the curators and am leaning toward setting the universal field type to 'any' and turning off the type errors.
We could build an extra tool, for people to use optionally, where we have users choose the type that each column in their table is meant to be, and then check a file based on that declaration. But I think that would be way too burdensome to ask people to do as a required step, and with frictionless guessing at data types instead it more often than not gets them wrong.
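A rough idea of what such an optional tool could look like, using only the Python standard library. This is a sketch under assumptions, not a proposal for the real implementation: `check_table`, the `PARSERS` table, the type names, and the sample data are all hypothetical, and blank cells are treated as null values rather than errors.

```python
import csv
import io


def _is_integer(value: str) -> bool:
    try:
        int(value)
        return True
    except ValueError:
        return False


def _is_number(value: str) -> bool:
    try:
        float(value)
        return True
    except ValueError:
        return False


# Hypothetical set of types a user could pick for a column.
PARSERS = {
    "integer": _is_integer,
    "number": _is_number,
    "string": lambda value: True,
}


def check_table(text, declared):
    """Check CSV text against user-declared column types.

    Returns a list of (row_number, column, cell) tuples for cells that do not
    match the declared type. Row 1 is the header; blank cells are skipped.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    for row_number, row in enumerate(reader, start=2):
        for column, type_name in declared.items():
            cell = (row.get(column) or "").strip()
            if cell == "":
                continue  # blank cell: treated as null, not as a type error
            if not PARSERS[type_name](cell):
                problems.append((row_number, column, cell))
    return problems


# Made-up sample: one blank cell and one cell with two integers joined by "|".
sample = "id,length\n1,12\n2,\n3,14|15\n"
declared = {"id": "integer", "length": "integer"}
problems = check_table(sample, declared)
```

Because the types come from the user instead of a guess, only cells that contradict the declaration are reported, and nothing is flagged in columns the user leaves undeclared.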
I'm going to turn off the type errors in #2405. This ticket we can keep open for discussion of items 2 and 3 in the list in the first comment on this ticket.
@ahamelers Alright, I went through the files and agree that these 'type' errors are pretty much useless at best (why it can't just say "this cell is blank" like it does with headers is confusing to me) and misleading at worst (why it characterizes some columns with only text-based strings as yearmonth is really odd). It's also odd to me that it seemed to miss blank cells in the last column of the MaleSizeMMCB* files. The only example I really saw of it flagging something that might be useful for an author was in biometrics, where there was a column of integers and then one cell with two integers separated by "|"; we wouldn't send that back to be changed per se (only explained), but I'm also not sure that's the best way to represent whatever it's supposed to represent. The errors for blank headers and duplicated column headers are still good ones to ping authors about.
For #2405 I've managed to greatly reduce the type errors without completely removing them, so for example the case 'where there was a column of integers and then one cell with two integers separated by "|" ' is still highlighted 🎉
Blank cells are no longer considered a bad 'data type' but just a null value for the row and column; only blank labels and rows are marked blank/missing.
Audrey will make a user guide to complete this ticket...
Copied the new guide page content to https://docs.google.com/document/d/1tTkJNACQwuF5R7DRvg3rj75Opilg07K8oCa0Nfs3N3g/edit for feedback
Improve the error messages using the results of #2305. Is there a way to make the most fixable messages more prominent? Should any of these errors prevent users from submitting?