Open jhammock opened 2 weeks ago
Overall field mapping for ITIS
(I'll edit this in place if needed.) Some of these fields can be mapped directly, some will require some checks to determine the destination, and some will come from the user-mapped columns.
References (the namePublishedIn column) will be messy, so our aim here will be to make life a bit easier for a human reviewer. The ITIS bibliographic format is structured, in several fields, and a bit idiosyncratic. I presume using a bibliographic parser is the best first step. I used https://anystyle.io/, which was well reviewed in a couple of recent lists, but if you prefer another parser, send me the output from our sample dataset's references and we can do the mapping from there.
DwCA_from_ChecklistBank.zip @jhammock Clarifications.
Question: Sorry, but I don't understand the step where we deduplicate lists of values for several fields:
-Taxon file: taxonomicStatus, taxonRank, nomenclaturalStatus (if populated), taxonRemarks
-Distribution file: locality and occurrenceStatus (if present/populated)
and then let the user make the mapping through either a web form or a template file to download, fill in, and upload. Can you please explain this more? :-) Thanks.
1-3 above check out, thanks
The columns mentioned in the confusing part cannot be copied directly into the output file, because their destination has a rigid controlled vocabulary. They may, however, contain useful information that should be included in the output file. Usually, the dataset creator will have used their own personal vocabulary for something like taxonomicStatus. What we usually do in a case like this for an EOL dataset (create a dictionary of likely text strings and a mapping to the controlled vocabulary) might work for many ChecklistBank datasets, and that's an option for this project.
I expect, though, that these strings may vary more widely than we're used to, so I thought it might be more robust to let the widget user help us create the mapping. Hence: we extract and deduplicate the values for a column; those are the source strings for the mapping. The user fills in the output strings and hands the mapping back to the widget. The widget applies the mapping to the DwCA file.
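A minimal sketch of that round trip, assuming the Taxon rows are already loaded as dictionaries keyed by column name. The function names and the `TARGET_A` / `TARGET_B` strings are hypothetical placeholders, not the real ITIS vocabulary:

```python
import csv
import io


def distinct_values(rows, column):
    """Return the distinct non-empty values of one column, in first-seen order."""
    seen = {}
    for row in rows:
        value = (row.get(column) or "").strip()
        if value:
            seen.setdefault(value, None)
    return list(seen)


def mapping_template(values):
    """Build a two-column TSV template: source string plus a blank target to fill in."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["source", "target"])
    for value in values:
        writer.writerow([value, ""])
    return out.getvalue()


def apply_mapping(rows, column, mapping):
    """Return copies of the rows with the column replaced by its mapped value.

    Values with no mapping are left unchanged so a reviewer can spot them.
    """
    result = []
    for row in rows:
        new_row = dict(row)
        key = (row.get(column) or "").strip()
        if mapping.get(key):
            new_row[column] = mapping[key]
        result.append(new_row)
    return result


rows = [
    {"taxonID": "1", "taxonomicStatus": "accepted"},
    {"taxonID": "2", "taxonomicStatus": "synonym"},
    {"taxonID": "3", "taxonomicStatus": "accepted"},
    {"taxonID": "4", "taxonomicStatus": "ambiguous synonym"},
]
values = distinct_values(rows, "taxonomicStatus")
template = mapping_template(values)

# Placeholder targets only; the user would supply the real controlled terms.
user_mapping = {
    "accepted": "TARGET_A",
    "synonym": "TARGET_B",
    "ambiguous synonym": "TARGET_B",
}
mapped = apply_mapping(rows, "taxonomicStatus", user_mapping)
```

Leaving unmapped values untouched, rather than blanking them, is one possible choice; it keeps the original string visible for the human reviewer.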
If this is not practical, let me know. Oh, if it helps, we could ask the user to select the output strings from a list, since I know the controlled vocabulary for the columns in question. If every version of this idea has too many moving parts, is likely to break, or is unwise for any other reason, then I think our usual mapping method is a decent fallback option, in which case I can use the samples I've seen to make you a first draft mapping.
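If the user picks output strings from a known list, the widget could also check a returned mapping before applying it. A rough sketch, where `check_mapping` and the `allowed` terms are hypothetical stand-ins for the real controlled vocabulary:

```python
def check_mapping(source_values, mapping, allowed):
    """Report source strings with no target and targets outside the vocabulary.

    Returns (unmapped, invalid): a list of source strings the user left blank,
    and a dict of source -> target for targets not found in `allowed`.
    """
    unmapped = [value for value in source_values if not mapping.get(value)]
    invalid = {
        source: target
        for source, target in mapping.items()
        if target and target not in allowed
    }
    return unmapped, invalid


# Placeholder vocabulary for illustration only.
allowed = {"TARGET_A", "TARGET_B"}
source_values = ["accepted", "synonym", "misapplied"]
mapping = {"accepted": "TARGET_A", "synonym": "TARGT_B", "misapplied": ""}

unmapped, invalid = check_mapping(source_values, mapping, allowed)
```

Here `unmapped` would flag "misapplied" and `invalid` would flag the misspelled target for "synonym", so both can be sent back to the user instead of silently written to the output.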
Does that help?
@jhammock Thanks! I understand now. And yes, it would be nice if, along with the deduplicated raw values, we also show the correct controlled vocabulary list for each field as a guide for the user.
Cool. I'll get to work on the controlled vocabulary lists.
One more belated thing, for 3.ii: it's the reference that should be identified for the reconnection, one way or another, not the taxon. The relationship may be several taxa -> one reference.
FTR I'm not wedded to AnyStyle if you want to try an alternative product. We could also try looking the references up (Google Scholar or something?) instead of parsing them, if that's an option. One of the issues I've run into is incomplete references (eg: title, author, date but no journal name), which might benefit from some reference-matching, so that could be a value add...
@jhammock Yesterday I checked the other options for citation parsers (ParaCite, ParCite, etc.) and for reference lookup, like CrossRef. But I find AnyStyle parsing to be sound and high up on the list of parsers. Unfortunately, I could only run it locally until @JRice was able to install AnyStyle in eol-archive and also fix up Ruby 2.5. Thanks Jeremy! Now we can use AnyStyle on the server and in our upcoming web form tool.
Jen, maybe we can use both a citation parser (AnyStyle) and a lookup (CrossRef or Google Scholar) for added value in our References output. We will see. Thanks.
Yup, that sounds good. Glad to hear AnyStyle is available to us.
Something new and different!
This is a potential source of datasets for EOL (geographic distribution and habitat data) and also for our colleagues at ITIS, who are mostly interested in the taxonomic data. The process of converting these files to ITIS import format will probably always require manual curation, whatever we do, but I'd like to offer them a couple of apps to make that process more efficient. This exercise may resemble the BOLD->iNat pipeline project.
Here's a sample dataset: https://www.checklistbank.org/dataset/1172/download
Anyone registered on the GBIF platform can contribute a dataset, and the datasets do seem to vary in vocabulary and formatting within fields, but the sample that I looked at (five datasets) is fairly consistently structured. I used these settings:
format: dwca
Choose root taxon: -
Exclude ranks below: -
Extended: yes
Include synonyms: yes
The archive will usually include two files of interest: Taxon and Distribution. Taxon contains most of the target data. Two columns from Distribution, locality and occurrenceStatus, should also be merged in, using taxonID as an index. In the Taxon file, references are kept in the namePublishedIn column. There's a lot of duplication in this column, and it will require a lot of processing, so I'd like to pull it out into a separate table and deduplicate, then merge the finished records back into the Taxon table after processing. This temporary table will need an index for that re-merge; the original contents of the namePublishedIn field would do for an index, but if you prefer something more formal, go ahead.
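A rough sketch of both steps, assuming the Taxon and Distribution files are already loaded as lists of dictionaries. The function names, the `referenceRecord` output key, and the " | " separator for taxa with several Distribution rows are assumptions for illustration:

```python
from collections import defaultdict


def merge_distribution(taxa, distributions):
    """Copy locality and occurrenceStatus onto each taxon row, matched by taxonID.

    A taxon with several Distribution rows gets its values joined with " | ".
    """
    by_taxon = defaultdict(list)
    for dist in distributions:
        by_taxon[dist["taxonID"]].append(dist)
    merged = []
    for taxon in taxa:
        row = dict(taxon)
        matches = by_taxon.get(taxon["taxonID"], [])
        for column in ("locality", "occurrenceStatus"):
            row[column] = " | ".join(m[column] for m in matches if m.get(column))
        merged.append(row)
    return merged


def extract_references(taxa, column="namePublishedIn"):
    """Return the distinct reference strings, in first-seen order.

    The raw string itself serves as the index for the later re-merge.
    """
    refs = {}
    for taxon in taxa:
        raw = (taxon.get(column) or "").strip()
        if raw:
            refs.setdefault(raw, None)
    return list(refs)


def remerge_references(taxa, processed, column="namePublishedIn"):
    """Attach the processed record for each taxon's reference (None if absent)."""
    result = []
    for taxon in taxa:
        row = dict(taxon)
        raw = (taxon.get(column) or "").strip()
        row["referenceRecord"] = processed.get(raw)
        result.append(row)
    return result


taxa = [
    {"taxonID": "1", "namePublishedIn": "Smith 1900"},
    {"taxonID": "2", "namePublishedIn": "Smith 1900"},
    {"taxonID": "3", "namePublishedIn": ""},
]
distributions = [
    {"taxonID": "1", "locality": "Peru", "occurrenceStatus": "present"},
    {"taxonID": "1", "locality": "Chile", "occurrenceStatus": ""},
]

merged = merge_distribution(taxa, distributions)
references = extract_references(taxa)
# Stand-in for the parsed and reviewed output, keyed by the raw string.
processed = {"Smith 1900": {"author": "Smith", "year": "1900"}}
final = remerge_references(merged, processed)
```

Because the reference table is keyed by the reference string rather than by taxonID, several taxa that cite the same reference all reconnect to the one processed record.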
Several columns will require mapping. As a starting point, I suggest we try presenting the user with deduplicated lists of values for each of these columns and letting them make the mapping, eg:
Taxon file, taxonomicStatus column:
misapplied
synonym
accepted
ambiguous synonym
This could be either a webform or a template file to download, fill in, and upload.
The columns to be mapped are:
-Taxon file: taxonomicStatus, taxonRank, nomenclaturalStatus (if populated), taxonRemarks
-Distribution file: locality and occurrenceStatus (if present/populated)