Closed malloryfreeberg closed 5 years ago
Incorrect field names were not causing this error before?
No. Incorrect field names that didn't have any values filled in were just ignored.
@rdgoite is this likely connected with the recent array/module tab changes?
It is an absolute requirement that the spreadsheet importer MUST NOT fail when confronted with unrecognised column headers.
As we have discussed many times, any content errors MUST be surfaced by the validator, not the spreadsheet importer; this is explicitly not the responsibility of the importer.
The spreadsheet importer needs to be highly defensive and always produce it's "best attempt" at output JSON from base principles, not "perfect" output JSON as defined by the schema. This goes as far as producing "incorrect" fields from badly named or submitter-added columns using a base rule (e.g. lowercase the header and replace whitespace with underscores). This content will fail to validate, but if the submitter-added heading is a valuable addition the wranglers can modify the schema to ensure the content validates without needing to reingest.
There are many other requirements that are enabled by this importer principle (e.g. @malloryfreeberg's "I have to go change one column, re-upload, see another unrecognized column, change it, upload again, etc…" scenario) so please be very sure this is the current behaviour before closing this ticket
To clarify the expected behaviour and possible tests that are needed, please confirm the following:
Handling of unknown column headers has been added back to the Importer. However, the original cause of the issue in this ticket was hidden columns; the unknown headers were in the spreadsheet but they were hidden and so they weren't searchable.
However, the original cause of the issue in this ticket was hidden columns
This was the cause of Matt's error, but not mine. The cause of my error was that the field name was wrong (it had been changed recently), and I could see that it was wrong, but the importer should not have failed.
In the spirit of tracking the requirement that "the spreadsheet importer needs to be highly defensive and always produce it's "best attempt" at output JSON from base principles", I am reporting the following:
Uploaded a spreadsheet in dev. One of the fields - project.contributors.corresponding_contributor
- is boolean but the user entered "yes - primary". The importer failed with a cryptic message (although I could track it down because the value was unique in the spreadsheet) and didn't display any other metadata.
@morrisonnorman @justincc
Fixed the above metadata error ("yes - primary") and re-uploaded the spreadsheet in dev. The spreadsheet contains an Excel equation that the contributor entered in order to submit a particular number. The importer failed with a cryptic message and didn't display any other metadata. I am unable to search this equation string "=7*60*60" in the spreadsheet to find the offending value, but as a metadata team member, I know that some fields ask for numbers in seconds and this looks like a calculation for finding seconds (*60*60). So, I went to the Specimen from Organism tab which has a few "in seconds" fields, and found the offending value in col R row 37 (specimen_from_organism.state_of_specimen.ischemic_time
).
@morrisonnorman
@morrisonnorman @justincc @tburdett should I keep reporting issues in this ticket? Or be making new tickets?
I accidentally closed the ticket but it will soon get split so that we're not trying to tackle multiple issues in one ticket.
@malloryfreeberg please create new tickets.
The last few updates in this ticket seem to be pointing to bugs of different nature, so yeah, I think it's best to split this.
As for evaluating mathematical expressions in the spreadsheet, I think the assumption at the moment is we won't have those. I think support for expression evaluation is more of an enhancement than a bug fix.
Comments have been migrated to new issues.
As for evaluating mathematical expressions in the spreadsheet, I think the assumption at the moment is we won't have those.
Having expressions in an Excel sheet is very common. Not just for integer/number values, but also for generating IDs (e.g. =CONCAT(B2,"-",B3)). These fields obviously don't look like expressions to the contributor, but rather they look like the value they want to enter.
Handling Excel expressions versus values is a common issue. I used to do this in python using openpyxl.
This may be an enhancement, but I would argue it has very high value for wranglers at the moment.
Cross-posting in new issue #323
@rdgoite this is an assumption we will have to change. Nearly every spreadsheet I've helped labs with has had expressions, and complex formulas are not uncommon either.
Are these requirements/assumptions documented anywhere apart from in these Github tickets?
Give the bug a title
[Importer] Importing spreadsheet fails with incorrect field names that actually aren't used in the spreadsheet.
To Reproduce
Expected behaviour
Actual behaviour
Impact
No user will be able to import a spreadsheet to test validation.
Screenshots or Messages
In this example,
gross_image
was updates togross_images
in the schema, but this field isn't actually used in the spreadsheet.Nothing gets imported so I can't even check other validation issues.