HumanCellAtlas / ingest-central

Ingest Central is the hub repository for the ingest service
Apache License 2.0
0 stars 1 forks source link

[Importer] Importing spreadsheet fails with incorrect field names that actually aren't used in the spreadsheet. #318

Closed malloryfreeberg closed 5 years ago

malloryfreeberg commented 5 years ago

Give the bug a title

[Importer] Importing spreadsheet fails with incorrect field names that actually aren't used in the spreadsheet.

To Reproduce

  1. Upload spreadsheet in the UI
  2. Click the upload button and wait about 5 second
  3. See an error in a red box

Expected behaviour

Actual behaviour

Impact

No user will be able to import a spreadsheet to test validation.

Screenshots or Messages

screen shot 2019-02-28 at 10 20 42

In this example, gross_image was updates to gross_images in the schema, but this field isn't actually used in the spreadsheet.

screen shot 2019-02-28 at 10 30 01

Nothing gets imported so I can't even check other validation issues.

justincc commented 5 years ago

Incorrect field names were not causing this error before?

malloryfreeberg commented 5 years ago

No. Incorrect field names that didn't have any values filled in were just ignored.

justincc commented 5 years ago

@rdgoite is this likely connected with the recent array/module tab changes?

tburdett commented 5 years ago

It is an absolute requirement that the spreadsheet importer MUST NOT fail when confronted with unrecognised column headers.

As we have discussed many times, any content errors MUST be surfaced by the validator, not the spreadsheet importer; this is explicitly not the responsibility of the importer.

The spreadsheet importer needs to be highly defensive and always produce it's "best attempt" at output JSON from base principles, not "perfect" output JSON as defined by the schema. This goes as far as producing "incorrect" fields from badly named or submitter-added columns using a base rule (e.g. lowercase the header and replace whitespace with underscores). This content will fail to validate, but if the submitter-added heading is a valuable addition the wranglers can modify the schema to ensure the content validates without needing to reingest.

There are many other requirements that are enabled by this importer principle (e.g. @malloryfreeberg's "I have to go change one column, re-upload, see another unrecognized column, change it, upload again, etc…" scenario) so please be very sure this is the current behaviour before closing this ticket

tburdett commented 5 years ago

To clarify the expected behaviour and possible tests that are needed, please confirm the following:

  1. Field names that don't have any values filled are ignored (irrespective of whether they are found in the schema or not)
  2. Field names that do have values are imported and if they are missing from the schema, they are detected as invalid by the validator
  3. In the case where programmatic field names are missing altogether, the importer attempts a lookup based on user friendly names and converts where programmatic names are found
  4. If user friendly name lookup fails to find a single matching programmatic name, a default rule based conversion to JSON is attempted
  5. No content in a spreadsheet is ever ignored (i.e. there is at least one JSON entity per tab and one JSON field per column as a default fallback)
rdgoite commented 5 years ago

Handling of unknown column headers has been added back to the Importer. However, the original cause of the issue in this ticket was hidden columns; the unknown headers were in the spreadsheet but they were hidden and so they weren't searchable.

malloryfreeberg commented 5 years ago

However, the original cause of the issue in this ticket was hidden columns

This was the cause of Matt's error, but not mine. The cause of my error was that the field name was wrong (it had been changed recently), and I could see that it was wrong, but the importer should not have failed.

malloryfreeberg commented 5 years ago

In the spirit of tracking the requirement that "the spreadsheet importer needs to be highly defensive and always produce it's "best attempt" at output JSON from base principles", I am reporting the following:

Uploaded a spreadsheet in dev. One of the fields - project.contributors.corresponding_contributor - is boolean but the user entered "yes - primary". The importer failed with a cryptic message (although I could track it down because the value was unique in the spreadsheet) and didn't display any other metadata.

screen shot 2019-03-01 at 11 08 02

@morrisonnorman @justincc

malloryfreeberg commented 5 years ago

Fixed the above metadata error ("yes - primary") and re-uploaded the spreadsheet in dev. The spreadsheet contains an Excel equation that the contributor entered in order to submit a particular number. The importer failed with a cryptic message and didn't display any other metadata. I am unable to search this equation string "=7*60*60" in the spreadsheet to find the offending value, but as a metadata team member, I know that some fields ask for numbers in seconds and this looks like a calculation for finding seconds (*60*60). So, I went to the Specimen from Organism tab which has a few "in seconds" fields, and found the offending value in col R row 37 (specimen_from_organism.state_of_specimen.ischemic_time).

screen shot 2019-03-01 at 11 23 32

@morrisonnorman

malloryfreeberg commented 5 years ago

@morrisonnorman @justincc @tburdett should I keep reporting issues in this ticket? Or be making new tickets?

justincc commented 5 years ago

I accidentally closed the ticket but it will soon get split so that we're not trying to tackle multiple issues in one ticket.

@malloryfreeberg please create new tickets.

rdgoite commented 5 years ago

The last few updates in this ticket seem to be pointing to bugs of different nature, so yeah, I think it's best to split this.

As for evaluating mathematical expressions in the spreadsheet, I think the assumption at the moment is we won't have those. I think support for expression evaluation is more of an enhancement than a bug fix.

malloryfreeberg commented 5 years ago

Comments have been migrated to new issues.

As for evaluating mathematical expressions in the spreadsheet, I think the assumption at the moment is we won't have those.

Having expressions in an Excel sheet is very common. Not just for integer/number values, but also for generating IDs (e.g. =CONCAT(B2,"-",B3)). These fields obviously don't look like expressions to the contributor, but rather they look like the value they want to enter.

Handling Excel expressions versus values is a common issue. I used to do this in python using openpyxl.

This may be an enhancement, but I would argue it has very high value for wranglers at the moment.

Cross-posting in new issue #323

claymfischer commented 5 years ago

@rdgoite this is an assumption we will have to change. Nearly every spreadsheet I've helped labs with has had expressions, and complex formulas are not uncommon either.

justincc commented 5 years ago

Are these requirements/assumptions documented anywhere apart from in these Github tickets?