NHMDenmark / Mass-Digitizer

Common repo for the DaSSCo team
Apache License 2.0

Review post-processing and workbench mapping in light of changes to the DigiApp #398

Closed PipBrewer closed 9 months ago

PipBrewer commented 11 months ago

What is the issue?

We need to make sure that the Digi App output, and the post-processing that follows it, work as intended.

Detailed description of the issue.

The question is: Can the Digi App data export be transformed sufficiently to be successfully imported into Specify via workbench?

Why is it needed/relevant?

This will be a measure of the post processing maturity.

Estimate level of effort required.

easy

What is the expected acceptable result?

A full validation of the post processing output in Specify workbench.

How to approach it?

What could be the challenges?

There will be multiple versions of the GREL script, each corresponding to a version and a date range, so we need to decide case by case which GREL script to use. Path to GREL scripts: N:\SCI-SNM-DigitalCollections\DaSSCo\Digi App\Data\GREL scripts

Warning: the tests for NHMD and NHMA Entomology could NOT be completed, for the reasons stated in the final comment.

jlegind commented 10 months ago

Test of output from Digi App v. 1.1.18

The purpose of the test is to follow the Digi App output through the post-processing and then check how it behaves in the Specify workbench tool (WB).

Testing mock-up records from NHMD Vascular Plants:

Workbench (WB) validation complains because the encoding garbles non-ASCII characters. The files must be exported from OpenRefine encoded as "Windows 1252: Western European".

When exported this way, the mock-up file for NHMD Vascular Plants passed the test.
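For a file that has already been exported as UTF-8, re-encoding it outside OpenRefine is one possible workaround. The following is a minimal sketch, not part of the documented procedure: it assumes the export is a UTF-8 text file, and the function name and file paths are hypothetical. Any character that has no Windows-1252 equivalent raises an error rather than being silently replaced.

```python
def reencode_to_cp1252(src, dst):
    """Copy a UTF-8 text file to dst, re-encoded as Windows-1252 (cp1252).

    Assumption: src is valid UTF-8. newline="" keeps the original line
    endings untouched so CSV quoting is not disturbed.
    """
    with open(src, "r", encoding="utf-8", newline="") as fin:
        text = fin.read()
    # errors="strict" raises UnicodeEncodeError for characters that
    # cannot be represented in cp1252, instead of writing garbage.
    with open(dst, "w", encoding="cp1252", errors="strict", newline="") as fout:
        fout.write(text)
```

A call would look like `reencode_to_cp1252("export_utf8.csv", "export_cp1252.csv")`, where both file names are placeholders.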

Testing mock-up records from NHMD Entomology:

Could not complete the test, because the WB validation would not accept "containertype" with our values ['Multiple specimens on one object', 'One specimen on multiple objects'].
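A rejection like this could be caught before upload by comparing the column against the values the target collection accepts. The sketch below is only an illustration under stated assumptions: the function name is hypothetical, the file is assumed to be a comma-separated export with a header row encoded as Windows-1252, and the set of allowed values has to be supplied by hand from the collection's pick list, since it is not known here.

```python
import csv
from collections import Counter


def unexpected_values(csv_path, column, allowed, encoding="cp1252"):
    """Return {value: count} for non-empty values in `column` not in `allowed`.

    Assumptions: csv_path is a comma-separated file with a header row;
    `allowed` is a set of strings copied manually from the target pick list.
    """
    counts = Counter()
    with open(csv_path, newline="", encoding=encoding) as f:
        for row in csv.DictReader(f):
            value = (row.get(column) or "").strip()
            if value and value not in allowed:
                counts[value] += 1
    return dict(counts)
```

An empty result means every non-empty value in the column is in the supplied set; it does not guarantee that the workbench validation passes.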

Testing mock-up records from NHMA Entomology:

Could not complete the test, because the validation threw a new error: { "uploaderstatus": { "operation": "validating", "taskid": "db7f4b1c-9d22-40aa-b578-6f1387a2b89e" }, "taskstatus": "FAILURE", "taskinfo": "DoesNotExist('Spdataset matching query does not exist.',)" }

A solution to the NHMA issue is currently not obvious.

The Digi App output test procedure is described here: https://github.com/NHMDenmark/Mass-Digitizer/blob/main/documentation/Pre_release%20testing%20Digi%20App%20output%20file.docx

FedorSteeman commented 10 months ago

Testing mock-up records from NHMD Vascular Plants:

Workbench (WB) validation complains because the encoding garbles non-ASCII characters. The files must be exported from OpenRefine encoded as "Windows 1252: Western European".

The upcoming upgrade of Specify 7 allows the encoding to be specified upon upload, and this should preferably be UTF-8. Until then, stick to ANSI/Windows-1252.

Testing mock-up records from NHMD Entomology:

Could not complete the test, because the WB validation would not accept "containertype" with our values ['Multiple specimens on one object', 'One specimen on multiple objects'].

New container types have now been added to NHMD Entomology, so please try again. The test database is pointing at NHMDtest again, so please try there.

Testing mock-up records from NHMA Entomology:

Could not complete the test, because the validation threw a new error: { "uploaderstatus": { "operation": "validating", "taskid": "db7f4b1c-9d22-40aa-b578-6f1387a2b89e" }, "taskstatus": "FAILURE", "taskinfo": "DoesNotExist('Spdataset matching query does not exist.',)" }

A new ticket has been raised here: https://github.com/NHMDenmark/Mass-Digitizer/issues/436

FedorSteeman commented 10 months ago

Redoing the test & review may take a few hours, but let's give it a day.

FedorSteeman commented 10 months ago

I realized the taxon number is not duplicated for each taxon rank in the processed file. For NHMA this is necessary for the workbench to match the correct taxa.

FedorSteeman commented 9 months ago

The script results appear to check out. @jlegind will have to adapt the documentation to match.