EOL / ContentImport

A placeholder for DATA tickets every time Jira is unavailable.

ChecklistBank pipelines #14

Open jhammock opened 2 weeks ago

jhammock commented 2 weeks ago

Something new and different!

This is a potential source of datasets for EOL (geographic distribution and habitat data) and also for our colleagues at ITIS, who are mostly interested in the taxonomic data. The process of converting these files to ITIS import format will probably always require manual curation, whatever we do, but I'd like to offer them a couple of apps to make that process more efficient. This exercise may resemble the BOLD->iNat pipeline project.

Here's a sample dataset: https://www.checklistbank.org/dataset/1172/download

Anyone registered on the GBIF platform can contribute a dataset, and the datasets do seem to vary in vocabulary and formatting within fields, but the five I sampled are fairly consistently structured. I used these settings:

- format: dwca
- Choose root taxon: -
- Exclude ranks below: -
- Extended: yes
- Include synonyms: yes

The archive will usually include two files of interest: Taxon and Distribution. Taxon contains most of the target data. Two columns from Distribution, locality and occurrenceStatus, should also be merged in, using taxonID as an index. In the Taxon file, references are kept in the namePublishedIn column. There's a lot of duplication in this column, and it will require a lot of processing, so I'd like to pull it out into a separate table and deduplicate it, then merge the finished records back into the Taxon table after processing. This temporary table will need an index for that re-merge; the original contents of the namePublishedIn field would do for an index, but if you prefer something more formal, go ahead.
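
The Distribution merge described above could be sketched roughly as follows. This is a minimal sketch, not a finished implementation: it assumes the files are tab-separated with headers such as `dwc:taxonID`, and it assumes a taxon with several Distribution rows should yield one output row per distribution record. The function names and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict

def read_tsv(text):
    """Parse tab-separated text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def merge_distribution(taxa, distributions, key="dwc:taxonID",
                       columns=("dwc:locality", "dwc:occurrenceStatus")):
    """Left-join the selected Distribution columns onto Taxon rows by taxonID."""
    by_taxon = defaultdict(list)
    for row in distributions:
        by_taxon[row[key]].append(row)
    merged = []
    for taxon in taxa:
        # A taxon with no distribution rows still yields one row with blanks.
        matches = by_taxon.get(taxon[key]) or [{}]
        for dist in matches:
            out = dict(taxon)
            for col in columns:
                out[col] = dist.get(col, "")
            merged.append(out)
    return merged

# Hypothetical sample data, only to show the shape of the result.
taxon_tsv = "dwc:taxonID\tdwc:scientificName\nT1\tAus bus\nT2\tAus cus\n"
dist_tsv = ("dwc:taxonID\tdwc:locality\tdwc:occurrenceStatus\n"
            "T1\tBrazil\tnative\nT1\tPeru\tintroduced\n")
merged = merge_distribution(read_tsv(taxon_tsv), read_tsv(dist_tsv))
```

If one row per taxon turns out to be preferable, the same lookup table could instead be used to join multiple localities into a single delimited cell.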

Several columns will require mapping. As a starting point, I suggest we try presenting the user with deduplicated lists of values for each of these columns and letting them make the mapping, e.g.:

Taxon file, taxonomicStatus column:

- misapplied
- synonym
- accepted
- ambiguous synonym

This could be either a webform or a template file to download, fill in, and upload.

The columns to be mapped are:

- Taxon file: taxonomicStatus, taxonRank, nomenclaturalStatus (if populated), taxonRemarks
- Distribution file: locality and occurrenceStatus (if present/populated)
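
One way to produce the deduplicated value lists and a fill-in template is sketched below. The three-column template layout (`column`, `source_value`, `target_value`) and the `dwc:`-prefixed header names are assumptions for illustration, not an agreed format.

```python
import csv
import io

def unique_values(rows, column):
    """Return the distinct non-empty values of a column, in first-seen order."""
    seen = {}
    for row in rows:
        value = (row.get(column) or "").strip()
        if value:
            seen.setdefault(value, None)
    return list(seen)

def mapping_template(rows, columns):
    """Build a TSV template: one line per (column, source value), target left blank."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["column", "source_value", "target_value"])
    for column in columns:
        for value in unique_values(rows, column):
            writer.writerow([column, value, ""])
    return buf.getvalue()

# Hypothetical rows standing in for a parsed Taxon file.
rows = [
    {"dwc:taxonomicStatus": "accepted", "dwc:taxonRank": "species"},
    {"dwc:taxonomicStatus": "synonym", "dwc:taxonRank": "species"},
    {"dwc:taxonomicStatus": "accepted", "dwc:taxonRank": "genus"},
    {"dwc:taxonomicStatus": "", "dwc:taxonRank": "genus"},
]
template = mapping_template(rows, ["dwc:taxonomicStatus", "dwc:taxonRank"])
```

The same `unique_values` output could feed a web form instead of a downloadable file.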

jhammock commented 2 weeks ago

Overall field mapping for ITIS

(I'll edit this in place if needed.) Some of these fields can be mapped directly, some will require some checks to determine the destination, and some will come from the user-mapped columns.

dwc field | TWB field
-- | --
dwc:taxonID | scientific_nameID
dwc:parentNameUsageID | parent_nameID
dwc:acceptedNameUsageID | accepted_nameID
dwc:taxonomicStatus | name_usage
dwc:taxonRank | rank_name
dwc:scientificNameAuthorship | taxon_author
dwc:genericName | unit_name1
dwc:infragenericEpithet | unit_name2
dwc:specificEpithet | IF dwc:infragenericEpithet absent: unit_name2 \| IF dwc:infragenericEpithet present: unit_name3
dwc:infraspecificEpithet | IF dwc:infragenericEpithet absent: unit_name3 \| IF dwc:infragenericEpithet present: unit_name4
dwc:cultivarEpithet | IF dwc:infragenericEpithet absent: unit_name3 \| IF dwc:infragenericEpithet present: unit_name4
dwc:namePublishedIn | PULL OUT INTO NEW TABLE
dwc:locality | geographic_value
dwc:occurrenceStatus | origin

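
The conditional unit_name rules in this mapping could be sketched as below. The function name `to_twb` is hypothetical, and only the directly mapped and conditional fields are handled; the user-mapped columns are left out. The mapping does not say what to do when a row has both an infraspecific and a cultivar epithet, so this sketch assumes the infraspecific epithet takes the slot.

```python
# Fields the mapping copies one-to-one (user-mapped columns are not handled here).
DIRECT = {
    "dwc:taxonID": "scientific_nameID",
    "dwc:parentNameUsageID": "parent_nameID",
    "dwc:acceptedNameUsageID": "accepted_nameID",
    "dwc:scientificNameAuthorship": "taxon_author",
    "dwc:genericName": "unit_name1",
}

def to_twb(row):
    """Map one Taxon row (dict keyed by dwc term) to TWB name fields."""
    out = {twb: row.get(dwc, "") for dwc, twb in DIRECT.items()}
    infrageneric = row.get("dwc:infragenericEpithet", "")
    # A present infrageneric epithet takes unit_name2 and pushes the
    # lower epithets one slot to the right.
    shift = 1 if infrageneric else 0
    out.update({"unit_name2": infrageneric, "unit_name3": "", "unit_name4": ""})
    out[f"unit_name{2 + shift}"] = row.get("dwc:specificEpithet", "")
    # Assumption: the infraspecific epithet wins if a cultivar epithet is also present.
    lower = (row.get("dwc:infraspecificEpithet", "")
             or row.get("dwc:cultivarEpithet", ""))
    out[f"unit_name{3 + shift}"] = lower
    return out

# Hypothetical names, only to exercise both branches.
plain = to_twb({"dwc:taxonID": "T1", "dwc:genericName": "Aus",
                "dwc:specificEpithet": "bus", "dwc:infraspecificEpithet": "cus"})
with_subgenus = to_twb({"dwc:taxonID": "T2", "dwc:genericName": "Aus",
                        "dwc:infragenericEpithet": "Xus",
                        "dwc:specificEpithet": "bus",
                        "dwc:infraspecificEpithet": "cus"})
```
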
jhammock commented 2 weeks ago

References (the namePublishedIn column) will be messy, so our aim here will be to make life a bit easier for a human reviewer. The ITIS bibliographic format is structured, in several fields, and a bit idiosyncratic. I presume using a bibliographic parser is the best first step. I used https://anystyle.io/, which was well reviewed in a couple of recent lists, but if you prefer another parser, send me the output from our sample dataset's references and we can do the mapping from there.

eliagbayani commented 1 day ago

@jhammock Clarifications (attachment: DwCA_from_ChecklistBank.zip).

  1. So the task is for us to create a web form whose input is a DwCA generated by the ChecklistBank web tool: https://www.checklistbank.org/dataset/1172/download
  2. A sample DwCA input is attached (DwCA_from_ChecklistBank.zip).
  3. The output of our form will be two files. The 1st file is the table you described here. The 2nd file is a References file with 2 columns (ReferenceID, Reference), where Reference is the deduplicated list of Taxon!dwc:namePublishedIn values and ReferenceID is unique within this file. Then we have 2 options:
    1. either we use the ReferenceID to auto-populate the field Taxon!dwc:namePublishedIn,
    2. or we create a 3rd column, taxonIDs, in this References file, holding pipe ("|") separated taxonID values.
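
Option 1 could look roughly like the sketch below: deduplicate the namePublishedIn strings, give each a ReferenceID, and write that ID back into the Taxon rows so that several taxa can point at one reference. The `REF1`, `REF2`, ... ID scheme and the function name are placeholders, not a decided format.

```python
def extract_references(taxa, column="dwc:namePublishedIn"):
    """Split references into their own table and link Taxon rows to them by ID."""
    ref_ids = {}   # reference text -> ReferenceID, in first-seen order
    linked = []
    for row in taxa:
        new_row = dict(row)
        text = (row.get(column) or "").strip()
        if text:
            if text not in ref_ids:
                # Hypothetical ID scheme: a simple running counter.
                ref_ids[text] = f"REF{len(ref_ids) + 1}"
            new_row[column] = ref_ids[text]
        linked.append(new_row)
    references = [{"ReferenceID": rid, "Reference": text}
                  for text, rid in ref_ids.items()]
    return references, linked

# Hypothetical rows: two taxa share one reference string.
taxa = [
    {"dwc:taxonID": "T1", "dwc:namePublishedIn": "Smith 1900. A monograph."},
    {"dwc:taxonID": "T2", "dwc:namePublishedIn": "Smith 1900. A monograph."},
    {"dwc:taxonID": "T3", "dwc:namePublishedIn": "Jones 1950. Notes."},
]
references, linked = extract_references(taxa)
```

After the references have been parsed and curated, the ReferenceID in each Taxon row is enough to merge the finished records back in.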

Question: Sorry, but I don't understand the step where we deduplicate the lists of values for several fields:

- Taxon file: taxonomicStatus, taxonRank, nomenclaturalStatus (if populated), taxonRemarks
- Distribution file: locality and occurrenceStatus (if present/populated)

and then let the user make the mapping, via either a web form or a template file to download, fill in, and upload. Can you please explain this more? :-) Thanks.

jhammock commented 1 day ago

1-3 above check out, thanks

The columns mentioned in the confusing part cannot be copied directly into the output file, because their destination has a rigid controlled vocabulary. They may, however, contain useful information that should be included in the output file. Usually, the dataset creator will have used their own personal vocabulary for something like taxonomicStatus. What we usually do in a case like this for an EOL dataset (create a dictionary of likely text strings and a mapping to the controlled vocabulary) might work for many ChecklistBank datasets, and that's an option for this project.

I expect, though, that these strings may vary more widely than we're used to, so I thought it might be more robust to let the widget user help us create the mapping. Hence: we extract and deduplicate the values for a column; those are the source strings for the mapping. The user fills in the output strings and hands the mapping back to the widget. The widget applies the mappings to the DwCA file.
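
The last step, applying a user-completed mapping to a column, might be sketched like this. The target strings `valid` and `invalid` are illustrative placeholders, not the confirmed controlled vocabulary, and reporting unmapped values back to the user is an added assumption rather than part of the plan above.

```python
def apply_mapping(rows, column, mapping):
    """Replace source strings with their mapped targets; collect unmapped values."""
    mapped_rows = []
    unmapped = set()
    for row in rows:
        new_row = dict(row)
        value = (row.get(column) or "").strip()
        target = mapping.get(value, "")
        if target:
            new_row[column] = target
        elif value:
            # Left unchanged so a reviewer can see what still needs a mapping.
            unmapped.add(value)
        mapped_rows.append(new_row)
    return mapped_rows, sorted(unmapped)

# Hypothetical rows and a partially filled-in mapping.
rows = [
    {"dwc:taxonomicStatus": "accepted"},
    {"dwc:taxonomicStatus": "ambiguous synonym"},
    {"dwc:taxonomicStatus": "misapplied"},
]
mapping = {"accepted": "valid", "ambiguous synonym": "invalid"}
mapped, unmapped = apply_mapping(rows, "dwc:taxonomicStatus", mapping)
```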

If this is not practical, let me know. Oh- if it helps, we could ask the user to select the output strings from a list, since I know the controlled vocabulary for the columns in question. If all versions of this idea are too many moving parts, likely to break, unwise for any reason, then I think our usual mapping method is a decent fallback option, in which case I can use the samples I've seen to make you a first draft mapping.
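
If the user types the output strings rather than picking them from a list, a small check against the controlled vocabulary could catch typos before the mapping is applied. This is only a sketch; the allowed values shown are placeholders until the real lists are available.

```python
def invalid_targets(mapping, allowed):
    """Return the mapping entries whose target is not in the controlled vocabulary."""
    allowed = set(allowed)
    return {src: tgt for src, tgt in mapping.items() if tgt not in allowed}

# Placeholder vocabulary and a mapping containing one typo.
allowed_name_usage = ["valid", "invalid"]
user_mapping = {"accepted": "valid", "synonym": "invalid", "misapplied": "invald"}
problems = invalid_targets(user_mapping, allowed_name_usage)
```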

Does that help?

eliagbayani commented 1 day ago

@jhammock Thanks! I understand now. And yes, it would be nice if, after we provide the deduplicated raw values, we also show the correct controlled vocabulary list for each field as a guide for the user.

jhammock commented 1 day ago

Cool. I'll get to work on the controlled vocabulary lists.

One more belated thing, for 3.ii : it's the reference that should be identified for the reconnection, one way or another, not the taxon. The relationship may be several taxa -> one reference.

jhammock commented 1 day ago

FTR I'm not wedded to AnyStyle if you want to try an alternative product. We could also try looking the references up (Google Scholar or something?) instead of parsing them, if that's an option. One of the issues I've run into is incomplete references (e.g. title, author, and date but no journal name), which might benefit from some reference matching, so that could be a value add...

eliagbayani commented 10 hours ago

@jhammock Yesterday I checked the other options for citation parsers (ParaCite, ParCite etc.) and for reference lookup, like CrossRef. But I find AnyStyle parsing to be sound, and it ranks high on the lists of parsers. Until now I could only run it locally, but @JRice was able to install AnyStyle on eol-archive and also fix up Ruby 2.5. Thanks Jeremy! Now we can use AnyStyle on the server and in our upcoming web form tool.

Jen, maybe we can use both a citation parser (AnyStyle) and a lookup (CrossRef or Google Scholar) for added value in our References output. We will see. Thanks.

jhammock commented 10 hours ago

Yup, that sounds good. Glad to hear AnyStyle is available to us.