SpeciesFileGroup / taxonworks

Workbench for biodiversity informatics.
http://taxonworks.org

DWC-A from ITIS #2058

Open teleaslamellatus opened 3 years ago

teleaslamellatus commented 3 years ago

ITIS import using DWC-A

I sort of gave up on the Castor import of nomenclature, but I still think an import using ITIS would be very useful, so I tried to import DwC-A files from ITIS, but they do not work. Is it because the DwC import can only be done if the taxon hierarchy is already present? 9259859.zip

LocoDelAssembly commented 3 years ago

@teleaslamellatus the file you uploaded had taxa.txt incorrectly named, but even after fixing that there is another problem: inconsistent line termination, which had to be solved by re-saving the files with a text editor so that all lines end the same way. Finally, there was a mapping error in meta.xml, with all indices offset by one. After fixing all that the file will get staged, but that is still not enough to import because acceptedNameUsageID must point to a valid name regardless of taxonomicStatus (if valid, it must be equal to taxonID), so you'll end up with all names in NotReady status.

Could you point me to the tools you use to produce the DwC-As? I'd like to take a closer look to see what is going on. Is it either https://github.com/GlobalNamesArchitecture/dwca_hunter/ or https://github.com/gaurav/dwca-hunter ? Which one?

Thanks

PS: Here is the corrected file, but please note that although the importer accepts it and displays the grid view, it will not import anything in its current form: 9259859-new.zip https://github.com/SpeciesFileGroup/taxonworks/files/6026570/9259859-new.zip

teleaslamellatus commented 3 years ago

Hello Hernan,

I got the file straight from the ITIS website! This sounds ridiculous!

I know Symbiota can import files from ITIS; would it perhaps be reasonable to reach out to Ed Gilbert?

Thanks a lot!

Istvan


-- István Mikó PhD Collection Manager Don Chandler Entomological Collection Department of Biological Sciences University of New Hampshire Spaulding Hall Durham, NH 03824

tmcelrath commented 3 years ago

This may be repetitive, but has anyone pointed out that you can download a DwC-A of any taxon and its children from ITIS by navigating to that taxon and clicking the "Download DwC-A" button?

teleaslamellatus commented 3 years ago

Yes, the DwC-A Hernán is working with was downloaded from ITIS.

LocoDelAssembly commented 3 years ago

@tmcelrath 🤦‍♂️ I don't know what I was doing but the only option was some ITIS custom text format, no option of DwC-A... Thanks

So they indeed have a problem. Third party to confirm:

https://tools.gbif.org/dwca-reports/054-897625691578541428.html (Gryllus 7536072.zip)
https://tools.gbif.org/dwca-reports/054-2536582649780533159.html (@teleaslamellatus sample with only taxa.txt filename corrected)
https://tools.gbif.org/dwca-reports/054-1547452434238719386.html (fully corrected @teleaslamellatus sample)

BTW, I think I'll revisit relaxing the acceptedNameUsageID requirement in cases where taxonomicStatus equals valid, unless I find something against this idea. The reference datasets always use acceptedNameUsageID, and GBIF's guidelines also recommend doing that (because ICN is more complex than ICZN, and it could happen that a name is valid but another is the accepted one?)

mitchelldf commented 3 years ago

@LocoDelAssembly , Wallace - the ITIS DwCA download generator - is designed to have NULL acceptedNameUsageID when taxonomicStatus equals valid. From the Wallace data application manual:

The Darwin Core term dwc:acceptedNameUsageID is the identifier for the currently valid or accepted taxon. The ITIS DwCA implementation uses the Taxonomic Serial Number (TSN) from the field tsn_accepted (table synonym_links) to populate acceptedNameUsageID. The field is null when dwc:taxonomicStatus is 'accepted' or 'valid', and is non-null when dwc:taxonomicStatus is 'synonym', 'homotypic synonym', 'heterotypic synonym', 'proParteSynonym', or 'misapplied'. In ITIS an invalid/not accepted synonym can have more than one valid/accepted name. When this occurs the result will be multiple records differing only in values found in acceptedNameUsageID and possibly in the dwc:modified field and the hierarchy attribute fields.

If acceptedNameUsageID becomes nullable for valid/accepted names, and Wallace's meta.xml usage is fixed so the field count begins at zero (ITIS can fix that), and the user updates the name of the archive contents before upload (eml_9259859.xml to eml.xml and taxa_9259859.txt to taxa.txt - an identity 'feature' that perhaps could change in the future), then everything should be a-ok.

I did not notice the issue of inconsistent line termination. Wallace's record terminator should be the Unix-style linefeed character \n. I will investigate further.

LocoDelAssembly commented 3 years ago

Thanks @mitchelldf! I will change the import on my side to allow NULL acceptedNameUsageID for valid names; it looks safe. I guess GBIF's guidelines/examples show it always being used to simplify parsing.
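The relaxed rule can be sketched in a few lines of Python. This is only an illustration of the idea discussed here, not the TaxonWorks importer's actual code: the function name `resolve_accepted_id`, the `VALID_STATUSES` set, and the row-as-dict representation are all assumptions made for the example.

```python
# Hypothetical helper illustrating the relaxed rule; not TaxonWorks code.
# Assumption: each taxa.txt record has been read into a dict keyed by
# Darwin Core term names, with empty strings (or None) for NULL values.
VALID_STATUSES = {"valid", "accepted"}


def resolve_accepted_id(row):
    """Return the taxonID that this row's name resolves to.

    A valid/accepted name with an empty acceptedNameUsageID falls back to
    its own taxonID. For any other status the field remains mandatory.
    """
    taxon_id = row["taxonID"]
    accepted_id = (row.get("acceptedNameUsageID") or "").strip()
    status = (row.get("taxonomicStatus") or "").strip().lower()

    if accepted_id:
        return accepted_id
    if status in VALID_STATUSES:
        return taxon_id
    raise ValueError(
        f"taxonID {taxon_id}: status {status or 'unknown'!r} "
        "requires acceptedNameUsageID"
    )
```

With this sketch a valid name with a NULL acceptedNameUsageID resolves to itself, a synonym resolves to whatever it points at, and a synonym without a target is rejected.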

Regarding the inconsistent line termination, I mentioned it on Gitter but forgot to reiterate it here. The problem is that the header line uses a different terminator (Unix-style) than the data lines (Windows-style):

$ tail -n +1 taxa_7536072.txt | file - # Entire file
/dev/stdin: UTF-8 Unicode text, with very long lines, with CRLF, LF line terminators
$ tail -n +2 taxa_7536072.txt | file - # Skip header line
/dev/stdin: UTF-8 Unicode text, with very long lines, with CRLF line terminators
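
The same check can be done without the `file` utility. The following Python sketch counts each kind of terminator in the raw bytes and normalizes everything to `\n`; the function names `count_line_terminators` and `normalize_to_lf` are made up for this example and are not part of any tool mentioned in this thread.

```python
from collections import Counter


def count_line_terminators(data: bytes) -> Counter:
    """Count CRLF, bare LF and bare CR terminators in raw file bytes."""
    crlf = data.count(b"\r\n")
    counts = Counter()
    counts["CRLF"] = crlf
    counts["LF"] = data.count(b"\n") - crlf   # LF not preceded by CR
    counts["CR"] = data.count(b"\r") - crlf   # CR not followed by LF
    return +counts  # unary plus drops entries whose count is zero


def normalize_to_lf(data: bytes) -> bytes:
    """Rewrite every CRLF or bare CR terminator as a single LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Usage (the file name is taken from the transcript above):
#   with open("taxa_7536072.txt", "rb") as fh:
#       print(count_line_terminators(fh.read()))
```

A file is consistent when the returned counter has a single key. A header ending in LF followed by CRLF data lines, as in the archive above, yields both an `LF` and a `CRLF` entry.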

mitchelldf commented 3 years ago

The ITIS DwCA download has been updated. Now all lines are terminated with \n, and the meta.xml reflects this:

encoding="UTF-8" linesTerminatedBy="\n"

And the field index count now begins at zero in the meta.xml file.

See attached 5680265.zip

If the _5680265 ID string of the eml and taxa file names is removed, the import should run smoothly.
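Renaming by hand works, but the step can also be scripted. Below is a minimal Python sketch that copies an archive while stripping the numeric ID suffix from member names. It assumes members are named like `taxa_5680265.txt` and `eml_5680265.xml`, as described above; `strip_id_suffix` and `repack_without_ids` are hypothetical names, and the sketch leaves the contents of meta.xml untouched.

```python
import re
import zipfile

# Assumption: ITIS names archive members "<stem>_<digits>.<ext>",
# e.g. "taxa_5680265.txt". Names without that pattern are left alone.
ID_SUFFIX = re.compile(r"^(?P<stem>[A-Za-z]+)_\d+(?P<ext>\.\w+)$")


def strip_id_suffix(name: str) -> str:
    """Turn 'taxa_5680265.txt' into 'taxa.txt'; return other names as-is."""
    match = ID_SUFFIX.match(name)
    return f"{match['stem']}{match['ext']}" if match else name


def repack_without_ids(src_path: str, dst_path: str) -> list[str]:
    """Copy a zip archive, renaming members that carry a numeric ID suffix.

    Returns the member names written to the new archive.
    """
    new_names = []
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            new_name = strip_id_suffix(info.filename)
            dst.writestr(new_name, src.read(info))
            new_names.append(new_name)
    return new_names
```

For example, `repack_without_ids("5680265.zip", "5680265-renamed.zip")` would write a new archive next to the original; the output file name here is arbitrary.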

LocoDelAssembly commented 3 years ago

Thanks for the update @mitchelldf!

After renaming files to strip out numbers I get this report: https://tools.gbif.org/dwca-reports/083-88078739738051714.html

The fix for the first error would be to add <id index="0" /> in the same section where the <field> elements are located (it is absolutely fine to use the same index taxonID is using; there is no need to add an extra column in the data file). The problem with superfamily is harder to solve: it requires creating an extension of your own and adding it in meta.xml (taking advantage of the fact that extension data can be mapped to the same data file as the core data). I believe standard practice is to just ignore unknown terms rather than fail to process the archive, so I don't think fixing superfamily is important. Having the file names match those in meta.xml is important, though, as processors cannot read the data otherwise.
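A rough pre-upload check for the meta.xml points raised in this thread (missing <id>, indices not starting at zero, file names that do not match) could look like the Python sketch below. It is a heuristic written for this discussion, not a replacement for the GBIF validator: the function name `check_core` is hypothetical, and treating "lowest column index is not 0" as a problem is an assumption that fits the off-by-one error seen here rather than a rule of the standard.

```python
import xml.etree.ElementTree as ET

# Namespace of the Darwin Core text guide metafile (meta.xml).
DWC_TEXT_NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}


def check_core(meta_xml, archive_files):
    """Return human-readable problems found in the <core> section.

    meta_xml is the content of meta.xml (str or bytes); archive_files is
    the set of member names actually present in the zip archive.
    """
    core = ET.fromstring(meta_xml).find("dwc:core", DWC_TEXT_NS)
    if core is None:
        return ["meta.xml has no <core> element"]

    problems = []
    id_element = core.find("dwc:id", DWC_TEXT_NS)
    if id_element is None:
        problems.append("<core> has no <id index=.../> element")

    # Collect every mapped column index (fields with only a default
    # value carry no index attribute and are skipped).
    indexed = core.findall("dwc:field", DWC_TEXT_NS)
    if id_element is not None:
        indexed.append(id_element)
    indices = [int(el.get("index")) for el in indexed
               if el.get("index") is not None]
    if indices and min(indices) != 0:
        problems.append(
            f"lowest column index is {min(indices)}, expected 0 "
            "(possible off-by-one mapping)"
        )

    for location in core.findall("dwc:files/dwc:location", DWC_TEXT_NS):
        if location.text not in archive_files:
            problems.append(
                f"{location.text} is listed in meta.xml "
                "but missing from the archive"
            )
    return problems
```

An empty list means none of the three issues was detected. The set of archive member names can be obtained with `zipfile.ZipFile(path).namelist()`.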

mitchelldf commented 3 years ago

@LocoDelAssembly , thanks for that tip regarding <id index="0" />. ITIS will update the meta.xml file. The superfamily field was user-requested as part of the core file even though the Darwin Core standard did not support the term. I will have to research custom extensions. Perhaps that could be a solution for other desired ranks within the core file in the future. I will also see about getting the meta.xml file updated with the properly derived file names.