datadryad / dryad-product-roadmap

Repository of issues for Dryad project boards
https://github.com/orgs/datadryad/projects

Improve Frictionless error messages #2838

Closed ryscher closed 11 months ago

ryscher commented 1 year ago

Improve the error messages using the results of #2305. Is there a way to make the most fixable messages more prominent? Should any of these errors prevent users from submitting?

ahamelers commented 1 year ago

I've still got #2405 open but I might close that in favor of this one—unless you think these are different tickets?

  1. I think many of the type errors, which are the most numerous, are not actually helpful, and I plan to look at the frictionless documentation again to see if there's anything we can do to reduce those.
  2. The next step we'd discussed was making a user guide covering prominent error types, which I think will help a lot with the missing/blank/duplicate data issues (and which I do think are actual issues!). We can even redesign the delivery of the frictionless report to include links to documentation on each error type.
    1. We should learn from the README user testing and focus on WHY people should make their data accessible and interoperable and bother about things like un-merging table cells, not just on a how-to. A lot of frustration around frictionless, even on the part of the curators, seems to stem from thinking that if a human being can struggle enough and figure out table content, it's fine. But that's not actually meeting FAIR principles.
  3. Currently no frictionless errors prevent submission. If any were to start preventing submission, I think the most rare errors (format and encoding) would be the only candidates to consider there—these might actually prevent the files from opening. Here's detailed information about these: bad_errors.csv
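
A rough sketch of how the redesigned report delivery in items 2 and 3 might look: group the errors by type, put the ones authors can most easily fix first, attach a guide link to each, and mark only format and encoding errors as candidates for blocking submission. The error type strings follow frictionless naming as I understand it; `DOCS_BASE`, the ranking in `FIXABLE_FIRST`, and the `summarize` helper are all hypothetical and not part of Dryad's code.

```python
from collections import Counter

# Hypothetical guide location; not an actual Dryad page.
DOCS_BASE = "https://example.org/frictionless-guide"

# Assumed ranking of the errors authors can most easily fix.
FIXABLE_FIRST = ["blank-label", "duplicate-label", "blank-row", "missing-cell", "extra-cell"]
# Only the rare errors that may stop a file from opening at all.
BLOCKING_CANDIDATES = {"format-error", "encoding-error"}


def summarize(errors):
    """Group errors by type, most fixable first, each with a guide link."""
    counts = Counter(error["type"] for error in errors)

    def rank(error_type):
        # Known fixable types keep their listed order; the rest follow alphabetically.
        if error_type in FIXABLE_FIRST:
            return (0, FIXABLE_FIRST.index(error_type))
        return (1, error_type)

    return [
        {
            "type": error_type,
            "count": counts[error_type],
            "guide": f"{DOCS_BASE}#{error_type}",
            "blocks_submission": error_type in BLOCKING_CANDIDATES,
        }
        for error_type in sorted(counts, key=rank)
    ]
```

Keeping the blocking set separate from the ranking means the "should this prevent submission?" decision can change without touching how the report is ordered.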

ahamelers commented 1 year ago

I've read up on the documentation and checked many of the unhelpful "errors". A lot of the type errors are due to frictionless guessing the number of header rows incorrectly, and thus flagging the second header row as an error.
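
To illustrate the header-row problem, here is a toy model in plain Python (not frictionless itself). It assumes every column holds integers and takes the number of header rows as a parameter; `type_errors` and the sample table are made up for the example.

```python
def looks_like_integer(cell):
    """True when the cell text parses as an integer."""
    try:
        int(cell)
        return True
    except ValueError:
        return False


def type_errors(rows, header_rows):
    """Return (row_number, column_index) for non-integer cells in the data rows.

    Toy model: every column is assumed to hold integers, and the first
    `header_rows` rows are skipped as labels.
    """
    errors = []
    for row_number, row in enumerate(rows[header_rows:], start=header_rows + 1):
        for column_index, cell in enumerate(row):
            if cell.strip() and not looks_like_integer(cell):
                errors.append((row_number, column_index))
    return errors


table = [
    ["length", "mass"],  # first header row
    ["mm", "g"],         # second header row (units)
    ["12", "340"],
    ["15", "410"],
]

wrong_guess = type_errors(table, header_rows=1)  # units row treated as data
right_guess = type_errors(table, header_rows=2)  # both header rows skipped
```

With one header row assumed, every cell of the units row is reported as a type error; with two, the same file is clean.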

ahamelers commented 1 year ago

I've tried decreasing the field_confidence with which frictionless casts data, and increasing the sample_size frictionless uses to cast data. Each just moved the bad type errors around a document. The only thing that seems to actually stop the errors is declaring all field_type='any', which prevents data type casting entirely—kind of the nuclear option. But maybe worthwhile, to take the focus off these very prevalent and useless alerts about type and refocus on more worthwhile errors?
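
Here is a toy model of why tuning just moves the errors around. It is not frictionless's actual detection algorithm: `guess_type` is a made-up stand-in, with `sample_size` and `confidence` playing the roles of `sample_size` and `field_confidence`, and the `"any"` type standing in for `field_type='any'`.

```python
def casts(value, type_name):
    """Whether a cell's text can be cast to the named type."""
    if type_name == "any":
        return True
    try:
        {"integer": int, "number": float}[type_name](value)
        return True
    except ValueError:
        return False


def guess_type(column, sample_size, confidence):
    """Pick the first type that fits at least `confidence` of the sampled cells."""
    sample = column[:sample_size]
    if not sample:
        return "any"
    for type_name in ("integer", "number"):
        share = sum(casts(value, type_name) for value in sample) / len(sample)
        if share >= confidence:
            return type_name
    return "any"


def count_type_errors(column, type_name):
    """Number of cells that fail to cast to the chosen type."""
    return sum(not casts(value, type_name) for value in column)


def errors_per_column(columns, sample_size, confidence):
    """Guess each column's type from a sample, then count errors over the whole column."""
    return {
        name: count_type_errors(cells, guess_type(cells, sample_size, confidence))
        for name, cells in columns.items()
    }


columns = {
    "a": ["1", "2", "3", "4", "n/a", "6", "7", "8", "site A", "10"],
    "b": ["x", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
}

small_sample = errors_per_column(columns, sample_size=4, confidence=0.9)
large_sample = errors_per_column(columns, sample_size=10, confidence=0.9)
forced_any = {name: count_type_errors(cells, "any") for name, cells in columns.items()}
```

In this toy data the small sample flags column `a` and leaves `b` alone, the large sample does the reverse, and only forcing `"any"` clears both.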

ahamelers commented 1 year ago

I've been discussing this with the curators and am leaning toward setting the universal field type to 'any' and turning off the type errors.

We could build an extra tool, for people to use optionally, where we have users choose the type that each column in their table is meant to be, and then check a file based on that declaration. But I think that would be way too burdensome to ask people to do as a required step, and when frictionless guesses at data types instead, it more often than not gets them wrong.
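
A minimal sketch of what that optional tool could do, using only the Python standard library. The `check_declared_types` function, the three type names, and the choice to skip blank cells are assumptions for illustration, not an existing Dryad or frictionless API.

```python
import csv
import io

# Types a user could declare per column (an assumed, deliberately short list).
CASTERS = {"integer": int, "number": float, "string": str}


def check_declared_types(csv_text, declared):
    """Return (row_number, column, value) for cells that do not match the declared type.

    `declared` maps a column label to "integer", "number", or "string".
    Blank cells are treated as missing values, not as type errors.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column, type_name in declared.items():
            value = (row.get(column) or "").strip()
            if not value:
                continue
            try:
                CASTERS[type_name](value)
            except ValueError:
                problems.append((row_number, column, value))
    return problems


sample = "id,count,site\n1,4,A\n2,5|6,B\n3,,C\n"
declared = {"id": "integer", "count": "integer", "site": "string"}
problems = check_declared_types(sample, declared)
```

Because the user states the intended type up front, the only cell reported here is the one that genuinely breaks the declaration, and nothing is guessed.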

ahamelers commented 1 year ago

I'm going to turn off the type errors in #2405. This ticket we can keep open for discussion of items 2 and 3 in the list in the first comment on this ticket.

bryanmgee commented 1 year ago

@ahamelers Alright, I went through the files and agree that these 'type' errors are pretty much useless at best (why it can't just say "this cell is blank" like it does with headers is confusing to me) and misleading at worst (why it characterizes some columns with only text-based strings as yearmonth is really odd). Odd to me also that it seemed to miss blank cells in the last column of the MaleSizeMMCB* files. The only example I really saw of it flagging something that maybe would be useful for an author was in biometrics, where there was a column of integers and then one cell with two integers separated by "|"; we wouldn't send back for that to be changed per se (only explained), but I'm also not sure that's the best way to represent whatever it's supposed to represent. The headers being blank and column headers being duplicated are still good errors to ping authors about.

ahamelers commented 12 months ago

For #2405 I've managed to greatly reduce the type errors without completely removing them, so for example the case 'where there was a column of integers and then one cell with two integers separated by "|"' is still highlighted 🎉

Blank cells are no longer considered a bad 'data type' but just a null value for the row and column; only blank labels and rows are marked blank/missing.
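
In plain Python terms, the blank handling now behaves roughly like this sketch. It is an illustration of the rule, not the code from #2405; `check_table` and the error tuples are hypothetical.

```python
def check_table(rows):
    """Return (errors, records) for a table given as a list of rows.

    Blank labels and fully blank rows are reported; blank cells in an
    otherwise filled row simply become None.
    """
    labels, *data = rows
    errors = [
        ("blank-label", 1, position)
        for position, label in enumerate(labels, start=1)
        if not label.strip()
    ]
    records = []
    for row_number, row in enumerate(data, start=2):
        if all(not cell.strip() for cell in row):
            errors.append(("blank-row", row_number, None))
            continue
        # A blank cell is a null value, not an error.
        records.append([cell.strip() or None for cell in row])
    return errors, records


rows = [
    ["id", "", "mass"],
    ["1", "a", "3.2"],
    ["", "", ""],
    ["2", "", "4.1"],
]
errors, records = check_table(rows)
```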

ryscher commented 11 months ago

Audrey will make a user guide to complete this ticket...

ahamelers commented 11 months ago

Copied the new guide page content to https://docs.google.com/document/d/1tTkJNACQwuF5R7DRvg3rj75Opilg07K8oCa0Nfs3N3g/edit for feedback