SpeciesFileGroup / taxonworks

Workbench for biodiversity informatics.
http://taxonworks.org

ChecklistBank DWCA exporter -> TaxonWorks DWCA importer testing #3658

Open gdower opened 10 months ago

gdower commented 10 months ago

I've been experimenting with importing a DarwinCore archive from ChecklistBank's DWCA exporter using the TW DWCA checklist importer. To get it mostly working, I had to make several modifications:

1) taxonomicStatus values "accepted" and "provisionally accepted" have to be changed to "valid".
2) For invalid names, ChecklistBank exports taxonID as the acceptedNameUsageID and the scientificNameID joined by a hyphen, which breaks relationships with originalNameUsageID. This could be a ChecklistBank bug, since it arguably should hyphenate those two IDs in the originalNameUsageID column as well. Or should the TW DWCA importer be using scientificNameID instead of taxonID for originalNameUsageID relationships?
3) The DWCA importer expects genera to be registered for species group synonyms, and ChecklistBank won't be able to provide parent genera for most (if not all) species group synonyms, so there might need to be a process for creating missing genus groups for synonyms.
4) Even if ChecklistBank could provide parent genera for species group synonyms, I think scientificNameAuthorship might currently be required, and ChecklistBank won't have authorship data (there's a nil error if it isn't filled in).
5) ChecklistBank does not fill in acceptedNameUsageID for valid/accepted names, and the TW DWCA importer seems to require it, although it works if taxonID is set as the acceptedNameUsageID.
6) originalNameUsageID is required by the TW DWCA importer and is almost never filled in by the ChecklistBank exporter. I think this is a ChecklistBank bug and will open a bug ticket, because we provided basionymIDs for all names except family group names in the COLDP archive and only 2 were exported.
7) Possibly the dwc: column prefixes need to be removed, although they don't seem to be causing problems.
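For illustration, modifications 1 and 5 could be scripted along the following lines. This is only a rough sketch, not what the TW importer or ChecklistBank actually do: `STATUS_MAP` and `normalize_row` are hypothetical names, and it assumes each taxon row has already been read into a dict keyed by unprefixed Darwin Core column names.

```python
from typing import Dict

# Assumption (from point 1): the importer wants "valid" for these statuses.
STATUS_MAP = {"accepted": "valid", "provisionally accepted": "valid"}


def normalize_row(row: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of one taxon row with points 1 and 5 applied."""
    out = dict(row)  # leave the caller's row untouched

    # Point 1: remap accepted-like statuses; other statuses stay as they are.
    status = out.get("taxonomicStatus")
    if status is not None:
        out["taxonomicStatus"] = STATUS_MAP.get(status.strip().lower(), status)

    # Point 5: a valid name with no acceptedNameUsageID points to itself.
    if out.get("taxonomicStatus") == "valid" and not out.get("acceptedNameUsageID"):
        out["acceptedNameUsageID"] = out.get("taxonID", "")

    return out


# Usage with two made-up rows (IDs and statuses are illustrative only).
rows = [
    {"taxonID": "t1", "taxonomicStatus": "accepted", "acceptedNameUsageID": ""},
    {"taxonID": "t1-n2", "taxonomicStatus": "synonym", "acceptedNameUsageID": "t1"},
]
cleaned = [normalize_row(r) for r in rows]
```

Points 2, 3, 4 and 6 are deliberately left out, since they depend on decisions that are still open above (which ID to key on, and how to create missing genera).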

Even with this corrected archive, I can't get original combinations to import without an error being raised that the object taxon name ID is already taken. I could be doing something wrong though:

image

We might also get other flavors of DarwinCore from other biodiversity informatics infrastructures. Perhaps we need a setting that specifies the source of the archive, plus a pre-processing step that reformats the archive accordingly? Or we can try to get the infrastructures better aligned.
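To make the "source setting plus pre-processing step" idea a bit more concrete, here is a rough sketch of how it might be wired up. Everything in it is hypothetical (the `PREPROCESSORS` registry, the `register` decorator, the `"checklistbank"` label), and the single ChecklistBank rule shown, stripping the `dwc:` column prefix, is just a stand-in for whatever reshaping that source really needs.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Preprocessor = Callable[[List[Row]], List[Row]]

# Hypothetical registry: archive source label -> function that reshapes rows.
PREPROCESSORS: Dict[str, Preprocessor] = {}


def register(source: str) -> Callable[[Preprocessor], Preprocessor]:
    """Decorator that files a pre-processor under a source label."""
    def decorator(fn: Preprocessor) -> Preprocessor:
        PREPROCESSORS[source] = fn
        return fn
    return decorator


def _strip_dwc(name: str) -> str:
    """Drop a leading 'dwc:' namespace prefix from a column name."""
    return name[4:] if name.startswith("dwc:") else name


@register("checklistbank")
def from_checklistbank(rows: List[Row]) -> List[Row]:
    # Stand-in rule only: rename prefixed columns, keep values as they are.
    return [{_strip_dwc(k): v for k, v in row.items()} for row in rows]


def preprocess(rows: List[Row], source: str) -> List[Row]:
    """Run the pre-processor for `source`; unknown sources pass through."""
    fn = PREPROCESSORS.get(source)
    return fn(rows) if fn else [dict(r) for r in rows]
```

With this shape, supporting another infrastructure's flavor would mean registering one more function, and the importer itself would only need to be told which source label to use.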

gdower commented 10 months ago

https://github.com/CatalogueOfLife/backend/issues/1279