CatalogueOfLife / backend

Complete backend of COL ChecklistBank
Apache License 2.0

Unite dataset imported in CLB lacks Higher taxonomy #1373

Closed camiplata closed 1 week ago

camiplata commented 1 week ago

The UNITE dataset imported in CLB lacks higher taxonomy. This issue appeared with the latest version of the dataset on CLB; the GBIF source doesn't have this problem.

Example: SH0887892.09FU

On CLB, lost classification: https://www.checklistbank.org/dataset/30477/taxon/SH0887892.09FU
On UNITE, with classification: https://unite.ut.ee/bl_forw_sh.php?sh_name=SH0887892.09FU#fndtn-panel2
On GBIF, with classification: https://www.gbif.org/species/201763784

camiplata commented 1 week ago

This is the reason for the reported issues:
https://github.com/CatalogueOfLife/data/issues/763
https://github.com/CatalogueOfLife/data/issues/774
https://github.com/CatalogueOfLife/data/issues/775
https://github.com/CatalogueOfLife/data/issues/777
https://github.com/CatalogueOfLife/data/issues/778
https://github.com/CatalogueOfLife/data/issues/779

mdoering commented 1 week ago

Both GBIF & CLB have datasets from September 30, 2024 from this URL: https://s3.hpc.ut.ee/plutof-public/original/e4511696-7293-4d41-b1b7-fc827a0c3c37.zip

For some reason the parentIDs of 1.76 million records are not found.

For your example above there is also an acceptedNameUsageID given, which does exist: https://www.checklistbank.org/dataset/30477/taxon/SH0887892.09FU

I suspect the GBIF importer uses that and CLB doesn't, though it should.

mdoering commented 1 week ago

Hm, the record is present in the source file. We only have 2,203,833 verbatim records in CLB, while the DwC-A has 2,335,419.

mdoering commented 1 week ago

Got it. The rows for the Linnean name records with integer IDs have one column too few. Such rows are skipped by CLB's implementation, but not by GBIF's, which simply adds empty columns if any are missing.
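A quick way to spot this kind of problem is to count how many fields each row has. The following is only a minimal Python sketch, not the CLB or GBIF code; it assumes a tab-delimited file without quoting, and the function name and sample rows are made up for illustration.

```python
from collections import Counter


def column_count_histogram(lines, delimiter="\t"):
    """Count how many rows have each number of fields (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        row = line.rstrip("\r\n")  # strip only line endings, keep trailing tabs
        if not row:
            continue  # ignore blank lines
        counts[len(row.split(delimiter))] += 1
    return counts


# Made-up sample: a 4-column header, one complete row, one row missing a column.
sample = [
    "taxonID\tparentNameUsageID\tscientificName\ttaxonRank\n",
    "SH0000001.09FU\t42\tExample species\tspecies\n",
    "42\t\tExample genus\n",
]

histogram = column_count_histogram(sample)
for width, rows in sorted(histogram.items()):
    print(f"{rows} row(s) with {width} column(s)")
```

Any width other than the header's width points to rows that a strict reader would skip.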

mdoering commented 1 week ago

I have modified the importer to pad missing columns - that might impact quite a few datasets in ChecklistBank - hopefully all for the good ;)
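For illustration, padding short rows could look roughly like the sketch below. This is not the actual importer change (the backend is not written in Python); it assumes a tab-delimited file with a header row, and `pad_row`, `read_rows`, and the sample data are hypothetical.

```python
def pad_row(fields, expected_width):
    """Append empty strings so the row has at least expected_width fields."""
    missing = expected_width - len(fields)
    if missing > 0:
        return fields + [""] * missing
    return fields


def read_rows(lines, delimiter="\t"):
    """Yield one dict per data row, padding short rows instead of skipping them."""
    iterator = iter(lines)
    header = next(iterator).rstrip("\r\n").split(delimiter)
    for line in iterator:
        fields = line.rstrip("\r\n").split(delimiter)
        yield dict(zip(header, pad_row(fields, len(header))))


# Made-up sample: the second data row lacks the last column.
sample = [
    "taxonID\tparentNameUsageID\tscientificName\ttaxonRank\n",
    "SH0000001.09FU\t42\tExample species\tspecies\n",
    "42\t\tExample genus\n",
]

rows = list(read_rows(sample))
```

With padding, the short row is kept and its missing `taxonRank` is read as an empty string, so records pointing at ID `42` as their parent can still resolve it.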

mdoering commented 1 week ago

Fix deployed and new import running, let's see

mdoering commented 1 week ago

All the issues are gone and the classification is back again.