Closed camiplata closed 1 week ago
This is the reason for the reported issues https://github.com/CatalogueOfLife/data/issues/763 https://github.com/CatalogueOfLife/data/issues/774 https://github.com/CatalogueOfLife/data/issues/775 https://github.com/CatalogueOfLife/data/issues/777 https://github.com/CatalogueOfLife/data/issues/778 https://github.com/CatalogueOfLife/data/issues/779
Both GBIF & CLB have datasets from September 30, 2024 from this URL: https://s3.hpc.ut.ee/plutof-public/original/e4511696-7293-4d41-b1b7-fc827a0c3c37.zip
For some reason the parentID of 1.76 million records is not found.
For your example above there is also an acceptedNameUsageID given, which does exist: https://www.checklistbank.org/dataset/30477/taxon/SH0887892.09FU
I suspect the GBIF importer uses that and CLB doesn't, though it should.
Hm, the record is present in the source file. We only have 2,203,833 verbatim records in CLB, while the DwC-A has 2,335,419.
Got it. The rows for the Linnean name records with integer ids have one column too few. Such rows are skipped by CLB's implementation, but not by the GBIF one, which simply adds empty columns if any are missing.
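For anyone wanting to check another archive for the same problem, here is a minimal sketch that tallies data rows by field count against the header width. It assumes a tab-delimited, unquoted core file; the function name and the sample rows are made up for illustration and are not taken from the UNITE archive or the CLB code.

```python
import csv
import io
from collections import Counter


def column_count_histogram(lines, delimiter="\t"):
    """Return (header_width, Counter of field count -> number of data rows).

    Hypothetical helper: assumes the first line is a header and that
    fields are not quoted.
    """
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader)
    counts = Counter(len(row) for row in reader)
    return len(header), counts


# Made-up sample: the second data row (integer id) is one column short.
sample = io.StringIO(
    "taxonID\tparentNameUsageID\tscientificName\ttaxonRank\n"
    "SH1\t42\tSH1 name\tspecies\n"
    "42\tFungi\tkingdom\n"
)
width, counts = column_count_histogram(sample)
short_rows = sum(n for fields, n in counts.items() if fields < width)
```

Any non-zero `short_rows` means some records have fewer fields than the header and may be dropped by a strict reader.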
I have modified the importer to pad missing columns - that might impact quite a few datasets in ChecklistBank - hopefully all for the good ;)
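Roughly, the padding idea looks like the sketch below. This is not the actual ChecklistBank code, just an illustration in Python; it assumes the missing columns are always the trailing ones, and `pad_row` and `read_padded` are hypothetical names.

```python
import csv
import io


def pad_row(row, width, fill=""):
    """Append empty values so the row has at least `width` fields."""
    if len(row) >= width:
        return list(row)
    return list(row) + [fill] * (width - len(row))


def read_padded(lines, delimiter="\t"):
    """Yield one dict per data row, padding short rows instead of skipping them."""
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader)
    for row in reader:
        yield dict(zip(header, pad_row(row, len(header))))


# Made-up sample: the second data row lacks its last column.
sample = io.StringIO(
    "taxonID\tparentNameUsageID\tscientificName\ttaxonRank\n"
    "SH1\t42\tSH1 name\tspecies\n"
    "42\t7\tSome genus\n"
)
records = list(read_padded(sample))
```

With padding, the short row is kept and its last field is simply empty, so records that point to it via parentID can still resolve.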
Fix deployed and new import running, let's see
All the issues are gone and the classification is back again:
The UNITE dataset imported in CLB lacks higher taxonomy. This issue appeared in the latest version of the dataset on CLB; the GBIF source doesn't have this problem.
Example: SH0887892.09FU
On CLB, lost classification: https://www.checklistbank.org/dataset/30477/taxon/SH0887892.09FU
On UNITE, with classification: https://unite.ut.ee/bl_forw_sh.php?sh_name=SH0887892.09FU#fndtn-panel2
On GBIF, with classification: https://www.gbif.org/species/201763784