EOL / ContentImport

A placeholder for DATA tickets every time Jira is unavailable.

ChecklistBank pipelines #14

Open jhammock opened 5 months ago

jhammock commented 5 months ago

Something new and different!

This is a potential source of datasets for EOL (geographic distribution and habitat data) and also for our colleagues at ITIS, who are mostly interested in the taxonomic data. The process of converting these files to ITIS import format will probably always require manual curation, whatever we do, but I'd like to offer them a couple of apps to make that process more efficient. This exercise may resemble the BOLD->iNat pipeline project.

Here's a sample dataset: https://www.checklistbank.org/dataset/1172/download

Anyone registered on the GBIF platform can contribute a dataset, and the datasets do seem to vary in vocabulary and formatting within fields, but the samples I looked at (five datasets) are fairly consistently structured. I used these download settings:

- format: dwca
- Choose root taxon: -
- Exclude ranks below: -
- Extended: yes
- Include synonyms: yes

The archive will usually include two files of interest: Taxon and Distribution. Taxon contains most of the target data. Two columns from Distribution should be merged in also- locality and occurrenceStatus, using taxonID as an index. In the Taxon file, references are kept in the namePublishedIn column. There's a lot of duplication in this column, and it will require a lot of processing, so I'd like to pull it out into a separate table and deduplicate, then merge the finished records back into the Taxon table after processing. This temporary table will need an index for that re-merge; the original contents of the namePublishedIn field would do for an index, but if you prefer something more formal, go ahead.
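The merge and pull-out steps described above can be sketched with the standard library alone. This is a rough sketch, not the eventual implementation: the column names come from the DwC-A described above, but the toy rows and in-memory tables are invented for illustration.

```python
import csv
from io import StringIO

# Toy stand-ins for the two DwC-A member files of interest (tab-separated).
taxon_tsv = "taxonID\tscientificName\tnamePublishedIn\n1\tAus bus\tSmith 1900\n2\tAus cus\tSmith 1900\n"
dist_tsv = "taxonID\tlocality\toccurrenceStatus\n1\tAfrica\tNative\n"

taxa = list(csv.DictReader(StringIO(taxon_tsv), delimiter="\t"))
dist = {r["taxonID"]: r for r in csv.DictReader(StringIO(dist_tsv), delimiter="\t")}

# Merge the two Distribution columns into each Taxon row, keyed on taxonID.
for row in taxa:
    d = dist.get(row["taxonID"], {})
    row["locality"] = d.get("locality", "")
    row["occurrenceStatus"] = d.get("occurrenceStatus", "")

# Pull namePublishedIn out into a deduplicated references table; the raw
# string itself serves as the index for the later re-merge, as suggested above.
references = sorted({row["namePublishedIn"] for row in taxa if row["namePublishedIn"]})
```

Taxa with no Distribution row simply get empty locality and occurrenceStatus fields.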

Several columns will require mapping. As a starting point, I suggest we present the user with deduplicated lists of values for each of these columns and let them make the mapping, e.g.:

Taxon file, taxonomicStatus column:

- misapplied
- synonym
- accepted
- ambiguous synonym

This could be either a webform or a template file to download, fill in, and upload.

The columns to be mapped are:

- Taxon file: taxonomicStatus, taxonRank, nomenclaturalStatus (if populated), taxonRemarks
- Distribution file: locality and occurrenceStatus (if present/populated)
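Extracting the deduplicated value lists for those columns could look something like the sketch below; the column groupings mirror the list above, and the sample rows are invented.

```python
# Columns whose values need user mapping, per DwC-A member file.
MAP_COLUMNS = {
    "Taxon": ["taxonomicStatus", "taxonRank", "nomenclaturalStatus", "taxonRemarks"],
    "Distribution": ["locality", "occurrenceStatus"],
}

def dedup_values(rows, column):
    """Deduplicated, sorted list of the non-empty values in one column."""
    return sorted({r.get(column, "").strip() for r in rows} - {""})

rows = [
    {"taxonomicStatus": "accepted"},
    {"taxonomicStatus": "synonym"},
    {"taxonomicStatus": "accepted"},
    {"taxonomicStatus": ""},
]
# The user would see this list and supply a controlled-vocabulary equivalent
# for each value, via web form or template file.
print(dedup_values(rows, "taxonomicStatus"))  # ['accepted', 'synonym']
```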

jhammock commented 5 months ago

Overall field mapping for ITIS

(I'll edit this in place if needed.) Some of these fields can be mapped directly, some will require some checks to determine the destination, and some will come from the user-mapped columns.

dwc field | TWB field
-- | --
dwc:taxonID | scientific_nameID
dwc:parentNameUsageID | parent_nameID
dwc:acceptedNameUsageID | accepted_nameID
dwc:taxonomicStatus, nomenclaturalStatus, or taxonRemarks | name_usage and unacceptability_reason
dwc:taxonRank | rank_name
dwc:scientificNameAuthorship | taxon_author
dwc:genericName | unit_name1
dwc:infragenericEpithet | unit_name2
dwc:specificEpithet | IF dwc:infragenericEpithet absent: unit_name2 \| IF dwc:infragenericEpithet present: unit_name3
dwc:infraspecificEpithet | IF dwc:infragenericEpithet absent: unit_name3 \| IF dwc:infragenericEpithet present: unit_name4
dwc:cultivarEpithet | IF dwc:infragenericEpithet absent: unit_name3 \| IF dwc:infragenericEpithet present: unit_name4
dwc:namePublishedIn | PULL OUT INTO NEW TABLE
dwc:locality | geographic_value
dwc:occurrenceStatus | origin

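The conditional epithet placement in the mapping can be sketched as a small function. This is only an illustration of the branching logic, assuming infraspecificEpithet and cultivarEpithet are mutually exclusive within a record:

```python
def unit_names(rec):
    """Distribute DwC name parts into ITIS unit_name1..4.

    Implements the 'IF dwc:infragenericEpithet absent/present' branching
    for specificEpithet, infraspecificEpithet, and cultivarEpithet.
    Assumes infraspecificEpithet and cultivarEpithet never co-occur.
    """
    units = [rec.get("genericName", "")]
    if rec.get("infragenericEpithet"):
        units.append(rec["infragenericEpithet"])
    for col in ("specificEpithet", "infraspecificEpithet", "cultivarEpithet"):
        if rec.get(col):
            units.append(rec[col])
    units += [""] * (4 - len(units))  # pad to the four unit_name slots
    return dict(zip(["unit_name1", "unit_name2", "unit_name3", "unit_name4"], units))
```

For example, a trinomial with no infrageneric epithet lands in unit_name1..3, while adding an infrageneric epithet shifts everything down one slot.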
jhammock commented 5 months ago

References (the namePublishedIn column) will be messy, so our aim here will be to make life a bit easier for a human reviewer. The ITIS bibliographic format is structured, in several fields, and a bit idiosyncratic. I presume using a bibliographic parser is the best first step. I used https://anystyle.io/, which was well reviewed in a couple of recent lists, but if you prefer another parser, send me the output from our sample dataset's references and we can do the mapping from there.

eliagbayani commented 4 months ago

DwCA_from_ChecklistBank.zip

@jhammock Clarifications:

  1. So the task is for us to create a web form whose input is a DwCA generated by the ChecklistBank web tool: https://www.checklistbank.org/dataset/1172/download
  2. Sample DwCA input is attached (DwCA_from_ChecklistBank.zip)
  3. The output of our form will be two files:
     - 1st file: the table you described here
     - 2nd file: a References file with two columns (ReferenceID, Reference), where Reference is the deduplicated list of the Taxon!dwc:namePublishedIn values and ReferenceID is unique within this file. Then we have two options:
    1. either we use the ReferenceID to auto populate the field: Taxon!dwc:namePublishedIn
    2. or we create a 3rd column taxonIDs in this References file, which will be a pipe "|" separated values of taxonIDs
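Option 1 above (replacing namePublishedIn with a ReferenceID) could be sketched like this; the sequential-ID scheme and the sample rows are illustrative assumptions, not the final design.

```python
def build_references(taxa_rows):
    """Assign a ReferenceID to each distinct namePublishedIn string and
    point each taxon row at it (option 1 above)."""
    ref_ids = {}
    for row in taxa_rows:
        raw = row.get("namePublishedIn", "")
        if raw and raw not in ref_ids:
            ref_ids[raw] = len(ref_ids) + 1  # simple sequential ReferenceID
        row["referenceID"] = ref_ids.get(raw, "")
    # The two-column References table: (ReferenceID, Reference)
    return [(i, raw) for raw, i in ref_ids.items()]

taxa = [{"namePublishedIn": "Smith 1900"}, {"namePublishedIn": "Smith 1900"},
        {"namePublishedIn": "Jones 1950"}]
refs = build_references(taxa)
```

Duplicate reference strings collapse to one row in the References table, while every taxon row keeps a pointer back to it.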

Question: Sorry, but I don't understand the step where we deduplicate lists of values for several fields:

- Taxon file: taxonomicStatus, taxonRank, nomenclaturalStatus (if populated), taxonRemarks
- Distribution file: locality and occurrenceStatus (if present/populated)

and let the user make the mapping via either a web form or a template file to download, fill in, and upload. Can you please explain this more :-) , thanks.

jhammock commented 4 months ago

1-3 above check out, thanks

The columns mentioned in the confusing part cannot be copied directly into the output file, because their destinations have rigid controlled vocabularies. They may, however, contain useful information that should be included in the output file. Usually, the dataset creator will have used their own personal vocabulary for something like taxonomicStatus. What we usually do in a case like this for an EOL dataset (create a dictionary of likely text strings and a mapping to the controlled vocabulary) might work for many ChecklistBank datasets, and that's an option for this project.

I expect, though, that these strings may vary more widely than we're used to, so I thought it might be more robust to let the widget user help us create the mapping. Hence- we extract and deduplicate values for a column; that's the source strings for the mapping. The user fills in the output strings and hands the mapping back to the widget. The widget applies the mappings to the dwca file.
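The extract-map-apply loop described above might look like this in outline; the mapping dict stands in for whatever the user hands back via the form or template file, and the sample values are invented.

```python
def apply_mapping(rows, column, mapping):
    """Replace each source string in `column` with the user's
    controlled-vocabulary choice; values the user left unmapped pass
    through unchanged."""
    for row in rows:
        raw = row.get(column, "")
        if raw in mapping and mapping[raw]:
            row[column] = mapping[raw]
    return rows

# The mapping as a user might return it (source string -> controlled value).
user_map = {"misapplied": "not accepted", "ambiguous synonym": "not accepted",
            "accepted": "accepted"}
rows = apply_mapping([{"taxonomicStatus": "misapplied"},
                      {"taxonomicStatus": "mystery value"}],
                     "taxonomicStatus", user_map)
```

The widget would run this once per mapped column before writing the output tables.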

If this is not practical, let me know. Oh- if it helps, we could ask the user to select the output strings from a list, since I know the controlled vocabulary for the columns in question. If all versions of this idea are too many moving parts, likely to break, unwise for any reason, then I think our usual mapping method is a decent fallback option, in which case I can use the samples I've seen to make you a first draft mapping.

Does that help?

eliagbayani commented 4 months ago

@jhammock Thanks! I understand now. And yes, it will be nice if, after we provide the deduplicated raw values, we also show the correct controlled vocabulary list for each field as a guide for the user.

jhammock commented 4 months ago

Cool. I'll get to work on the controlled vocabulary lists.

One more belated thing, for 3.ii : it's the reference that should be identified for the reconnection, one way or another, not the taxon. The relationship may be several taxa -> one reference.

jhammock commented 4 months ago

FTR I'm not wedded to AnyStyle if you want to try an alternative product. We could also try looking the references up (Google Scholar or something?) instead of parsing them, if that's an option. One of the issues I've run into is incomplete references (e.g. title, author, date but no journal name), which might benefit from some reference-matching, so that could be a value add...
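If we go the lookup route, one candidate is the public CrossRef REST API, which accepts free-text bibliographic queries. A minimal sketch that only builds the query URL (fetching and match-scoring are left out, and whether CrossRef covers these older references well is an open question):

```python
from urllib.parse import urlencode

def crossref_query_url(raw_reference, rows=1):
    """Build a CrossRef works query for a free-text reference string,
    using the REST API's `query.bibliographic` parameter."""
    params = urlencode({"query.bibliographic": raw_reference, "rows": rows})
    return "https://api.crossref.org/works?" + params

url = crossref_query_url("Chretien 1922 Les Lepidopteres du Maroc")
```

An actual lookup would GET this URL and compare the top hit against the parsed fields to fill gaps like a missing journal name.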

eliagbayani commented 4 months ago

@jhammock Yesterday I checked the other options for citation parsers (ParaCite, ParCite, etc.) and reference lookups like CrossRef. But I find AnyStyle's parsing to be sound and high up on the list of parsers. Unfortunately I could only run it locally until @JRice installed AnyStyle on eol-archive and also fixed up Ruby 2.5. Thanks Jeremy! Now we can use AnyStyle on the server and in our upcoming web form tool.

Jen, maybe we can use both a citation parser (AnyStyle) and a lookup (CrossRef or Google Scholar) for added value in our References output. We will see. Thanks.

jhammock commented 4 months ago

Yup, that sounds good. Glad to hear AnyStyle is available to us.

eliagbayani commented 4 months ago

@jhammock Attached are the References for the given dataset we are testing. The first column "raw" is the original reference (untouched). The succeeding columns are generated/parsed by AnyStyle. This exercise is for if we want to improve how AnyStyle behaves. If you see any columns from AnyStyle that need fixing, please provide me the correct (author, title, container-title, etc.) values so I can train the parsing model to our needs. Thanks. References.txt

jhammock commented 4 months ago

Thanks, Eli! Sorry; I haven't forgotten that I owe you controlled vocabulary lists for those other fields.

Meanwhile: there's a special character issue, which, interestingly, I didn't encounter when I used the AnyStyle web interface, e.g.:

Palästinas -> PalÃ¤stinas
Södra -> SÃ¶dra
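That pattern is the classic symptom of UTF-8 bytes being mis-decoded as Latin-1 somewhere in the server pipeline. A small repair sketch (the real fix is of course to get the encoding right at the source):

```python
def fix_mojibake(s):
    """Repair UTF-8 text that was mis-decoded as Latin-1 (the
    'Ã¤'-for-'ä' pattern); strings that don't round-trip are returned
    unchanged rather than corrupted further."""
    try:
        return s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return s

print(fix_mojibake("PalÃ¤stinas"))  # Palästinas
```

Already-correct text like "Södra" fails the round-trip and passes through untouched.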

jhammock commented 4 months ago

Here's an attempt at regular parsing feedback. I'm sort of following how the web interface invites feedback, but tell me what form/format would be best:

Chrétien P. Lés Lépidoptères du Maroc. Galleriinae - Micropterygidae. In: Oberthür C (Ed) Études de Lépidoptérologie comparée. Oberthür, Rennes, 324-379. (1922).

type- book
title- Galleriinae - Micropterygidae
container-title- Études de Lépidoptérologie comparée.
publisher- Oberthür
location- Rennes

(The other fields were correct.) Obviously this one may have been affected by the special character issue. FWIW, I saw similar errors for books in my web-interface attempt when the special characters were OK.

jhammock commented 4 months ago

Controlled vocabulary (always case sensitive):

taxonomicStatus, nomenclaturalStatus, or taxonRemarks (taxonomicStatus seems most commonly populated, but these will collapse into one output field. I'm game to treat collisions however is easiest for you; e.g. we could establish a priority order: use taxonomicStatus if populated; if not, try nomenclaturalStatus; if not, try taxonRemarks.) (Technically, we could sometimes narrow this down based on the ancestry of the record in question, see () below, but I wonder if that's practical.)
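The suggested priority order is a simple coalesce; a sketch, assuming the collision rule lands as proposed above:

```python
def source_status(row):
    """Priority order suggested above: taxonomicStatus first, then
    nomenclaturalStatus, then taxonRemarks; first populated value wins."""
    for col in ("taxonomicStatus", "nomenclaturalStatus", "taxonRemarks"):
        value = row.get(col, "").strip()
        if value:
            return value
    return ""
```

The returned string is what the user-facing mapping step would then translate into name_usage and/or unacceptability_reason.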

- accepted (Archaeplastida, SAR (Stramenopiles, Alveolates, Rhizaria), Fungi)
- not accepted (Archaeplastida, SAR (Stramenopiles, Alveolates, Rhizaria), Fungi)
- valid (Metazoa, Bacteria, Archaea)
- invalid (Metazoa, Bacteria, Archaea)

unacceptability_reason

- junior synonym
- original name/combination
- subsequent name/combination
- junior homonym
- homonym & junior synonym
- unavailable, database artifact
- unavailable, literature misspelling
- unavailable, incorrect orig.[inal] spelling
- unavailable, suppressed by ruling
- unavailable, nomen nudum
- unavailable, other
- unjustified emendation
- unnecessary replacement
- nomen oblitum
- misapplied
- pro parte
- other, see comments
- nomen dubium

taxonRank

Kingdom, Subkingdom, Infrakingdom, Superdivision, Superphylum, Division, Phylum, Subdivision, Subphylum, Infradivision, Infraphylum, Parvdivision, Parvphylum, Superclass, Class, Subclass, Infraclass, Superorder, Order, Suborder, Infraorder, Section, Subsection, Superfamily, Family, Subfamily, Tribe, Subtribe, Genus, Subgenus, Species, Subspecies, Variety, Form, Subvariety, Race, Stirp, Morph, Aberration, Subform

locality

- Antarctica/Southern Ocean
- North America
- Middle America
- Caribbean
- South America
- Europe & Northern Asia (excluding China)
- Africa
- Southern Asia
- Australia
- Oceania
- Eastern Atlantic Ocean
- Western Atlantic Ocean
- Indo-West Pacific
- East Pacific

occurrenceStatus

- Native
- Introduced
- Native & Introduced
- Incidental
- Native & Extirpated
- Native & Extinct

eliagbayani commented 4 months ago

@jhammock Question. For example, dataset has these unique values for taxonomicStatus:

Our user will choose to map each of these values to our available options: 'accepted' or 'not accepted'. Will the un-mapped values be excluded from the final table? Or, since there is no mapping, will they just remain as is? Thanks.

jhammock commented 4 months ago

Oh, good catch- I left out one of the most interesting bits. OK:

taxonomicStatus, nomenclaturalStatus, or taxonRemarks will actually populate up to two output columns: name_usage, which takes the accepted/unaccepted/valid/invalid values, and unacceptability_reason, which contains subcategories for unaccepted or invalid records. I'll edit the mapping table and vocabulary lists accordingly.

We could handle this a couple of ways. The constraints we want to meet are that unacceptability_reason must be empty if name_usage is accepted or valid, and it must be populated if name_usage is unaccepted or invalid. The source columns could contain information for either or both output columns. They might, for instance, indicate an accepted name; once that is mapped, we can infer that unacceptability_reason is blank. Or they might indicate one of the unacceptability_reason values, e.g. "junior synonym", in which case we can infer that name_usage is unaccepted or invalid.
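Those two constraints (plus the later requirement that name_usage itself never be empty) reduce to a small validity check; a sketch of the rule, not of any particular widget code:

```python
ACCEPTED = {"accepted", "valid"}
UNACCEPTED = {"not accepted", "invalid"}

def check_row(name_usage, unacceptability_reason):
    """True iff the pair satisfies the constraints described above:
    accepted/valid rows carry no reason; not accepted/invalid rows must
    carry one; name_usage must always be populated."""
    if name_usage in ACCEPTED:
        return unacceptability_reason == ""
    if name_usage in UNACCEPTED:
        return unacceptability_reason != ""
    return False
```

Running this over the mapped output before export would catch rows the ITIS import would otherwise reject.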

To help me narrow down my suggested approach- do you plan to offer all four name_usage values, or do you prefer to infer whether it's accepted/unaccepted or valid/invalid by the ancestry? Or, here's an idea, should we just provide a toggle for the user to select which system the records are in? If we did something like that, I'd be inclined, for this mapping, to offer a combined menu of controlled vocabulary, not including unaccepted/invalid, but inferring them if an unacceptability_reason is selected.

That's not as clear as I had hoped- but we'll get there. Please ask questions!

eliagbayani commented 4 months ago

@jhammock Thanks. Right now I don't use ancestry to decide which system will be used; rather, I check the deduplicated list of taxonomicStatus values. If there is a value 'accepted', then I use the 'accepted' option; otherwise I use the 'valid' option. BUT this is not foolproof. Your suggestion of a toggle the user can choose is better. And our widget can just suggest a default option using my deduplicated-list check.

jhammock commented 4 months ago

Oh, good idea! (You've anticipated a scope-creep idea of mine to pre-populate the mapping if suggestive text strings are detected.) For now, though, I like your toggle-with-default-suggestion-if-available plan. That simplifies this 3 column -> 2 column situation. I can imagine a few ways of presenting this part of the mapping, but I'll wait to see your prototype.

eliagbayani commented 4 months ago

@jhammock You can now test the tool. https://editors.eol.org/eol_php_code/applications/CheckListBank_tool/main.php

What is missing for now:

  1. References.txt doesn't have the AnyStyle parts. I'm having problem running it in eol-archive for now. Will fix next.
  2. No toggle yet between 'Accepted' and 'Valid' options for the user. But the default toggle-with-default-suggestion-if-available is working.

Mapping exercise is working as well. The Taxa.txt has referenceID, as link to the References.txt. Thanks.

jhammock commented 4 months ago

Eli, this is splendid! Two comments thus far:

- You have a "working column" inadvertently appearing in the Taxa file: "pre_name_usage".
- I've realized belatedly that for ITIS import, there are fields that can be empty and others that cannot, which affects the mapping tool. I think the only change needed now is that the name_usage | unacceptability_reason section should not permit nulls. The outcome is that name_usage must be populated, and if invalid/unaccepted, unacceptability_reason must also be populated.

Fingers crossed for the AnyStyle install...

eliagbayani commented 4 months ago

Hi Jen, Both issues are now fixed. Thanks.

jhammock commented 4 months ago

Nepticuloidea_references.tsv.zip

Working around the AnyStyle issue for now, let's append the final step of the process. Recall that the references file will probably need manual curation anyway, after AnyStyle and any other automated steps we add. The finished references file will include the column with the original dwc text, plus some of the following columns:

reference_author, title, publication_name, actual_pub_date, listed_pub_date, publisher, pub_place, pages, isbn, issn, pub_comment

These columns are required to be present and populated:

reference_author, publication_name, actual_pub_date

Other columns may be present but should be ignored. Rows that don't contain the required elements can also be ignored, I think (until/unless we get feedback that some other treatment is preferred).

The column with the original dwc text can be called "raw" or whatever you like, to tie it back to the first step, which produced the starter references file. This final step is to insert the contents of the finished references file into the taxa file from the first step, using the original dwc text as an index. Does that make sense?
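The final insert step described above, with the skip-incomplete-rows rule, could be sketched like this; the field names come from the lists above and the sample rows are invented.

```python
REQUIRED = ("reference_author", "publication_name", "actual_pub_date")

def merge_references(taxa_rows, finished_refs):
    """Insert finished reference fields into taxa rows, keyed on the raw
    dwc:namePublishedIn text; refs missing a required field are skipped."""
    by_raw = {}
    for ref in finished_refs:
        if all(ref.get(c, "").strip() for c in REQUIRED):
            by_raw[ref["raw"]] = ref
    for row in taxa_rows:
        ref = by_raw.get(row.get("namePublishedIn", ""))
        if ref:
            for col in REQUIRED:
                row[col] = ref[col]
    return taxa_rows

taxa = [{"namePublishedIn": "Smith 1900"}]
refs = [{"raw": "Smith 1900", "reference_author": "Smith",
         "publication_name": "J. Nat.", "actual_pub_date": "1900"}]
merged = merge_references(taxa, refs)
```

Taxa whose reference was skipped simply keep their original columns, matching the "can also be ignored" treatment above.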

eliagbayani commented 4 months ago

Hi Jen, You said: "This final step is to insert the contents of the finished references file into the taxa file". Do you mean these fields and probably more will be added to our Taxa.txt as added columns in the final step?

jhammock commented 4 months ago

Correct! Any populated columns should become columns in taxa.txt. If you prefer, all 11 columns could be added and left empty if empty. Thanks!

eliagbayani commented 4 months ago

geographic_value.mhtml.zip

@jhammock Hi Jen, please see the attached geographic_value.mhtml. It will be a big help to users if the widget can suggest values under "geographic_value". For example, if the string "South Africa" or "Ethiopia" is found, then we default the value to "Africa". Or if "Brazil" is found, then we default to "South America". If you can give me a list of countries or specific locations under each region, then the widget can provide default values. Thanks.

jhammock commented 4 months ago

Good idea! Let's start with this. I anticipate lots of tweaks, maybe just us, maybe also some beta users, so I'd advise making this a live doc of some kind. Feel free to move it to github or wherever. We may not want wide open access (only you and I can edit this copy today) but if we're lucky we'll get a few requests for edit access.

eliagbayani commented 4 months ago

geographic_value_2.mhtml.zip

@jhammock Thanks. Your geographic_value assignments work nicely. I'm not sure if we want more of this, but I saw another example that maybe needs default values. See attached geographic_value_2.mhtml.zip. Thanks.

jhammock commented 4 months ago

hmm... I'm struggling with your attachment. What is it?

eliagbayani commented 4 months ago

Oh, it is just more geographic_values that we might want to assign regions for, so our widget can provide default values. more geographic_values

eliagbayani commented 4 months ago

Hi Jen, You can now test the tool.

Thanks.

jhammock commented 4 months ago

Yes, you're right, some of those realms can probably be mapped. Some of them are awkward semi-overlaps, but there are some we can get away with. Shall I add them to the google sheet or are you keeping them somewhere else?

This raises another case, too. If more than one output value is matched in one cell, none should be suggested, I think. This field doesn't seem to support multiple values. Can you enforce that?

Arg, and for now it's also best to suggest nothing if either "Palearctic" or "Neotropic" is detected; those embrace multiple regions. I suppose the same treatment could go for "global", "worldwide" and "cosmopolitan". If you template how that should look in the mapping file (a string that, if detected, rules out any suggestion), I'll continue to fill those in as I think of them.
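Putting the suggestion rules together (substring lookup, suppress on multiple matches, suppress on blocker terms) could look like the sketch below. The term-to-region entries are a tiny invented excerpt; the real mapping lives in the shared sheet.

```python
REGION_TERMS = {  # illustrative excerpt of the live mapping sheet
    "South Africa": "Africa", "Ethiopia": "Africa",
    "Brazil": "South America", "Bahrain": "Southern Asia",
}
NO_SUGGESTION = {"Palearctic", "Neotropic", "global", "worldwide", "cosmopolitan"}

def suggest_region(locality):
    """Suggest a geographic_value, or None when a blocker term appears or
    more than one distinct region matches (the field is single-valued)."""
    text = locality.lower()
    if any(term.lower() in text for term in NO_SUGGESTION):
        return None
    hits = {region for term, region in REGION_TERMS.items() if term.lower() in text}
    return hits.pop() if len(hits) == 1 else None
```

Two terms that map to the same region still count as one match, so "South Africa and Ethiopia" would safely suggest "Africa".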

jhammock commented 4 months ago

Oh yes, and generally, if an output value can't be found, geographic_value should be blank. This field is optional and many records won't be populated, but an unrecognized value will break the import to ITIS.

jhammock commented 4 months ago

Well, the product is grand, as I have come to expect of you :D

I have two more scope-creep items:

One of the options for unacceptability_reason is "other, see comments", which I found myself using a lot. It would be nice to allow the user to add that field (pub_comment) also, directly from the mapping. I'm not inclined to do anything fancy, like present the extra field only if "other, see comments" is selected. I could see users having selected a different value and still wanting to add a comment. This is an area where they may need to preserve information from the source file into their ITIS import, and deal with it downstream. Could we just present both unacceptability_reason and pub_comment for output throughout this section?

And for the references file, (which is great- thanks for getting started on the parsing while we wait for anyStyle), the manual intervention here is enough work that I can see some users wanting to port it to their preferred tool and then bring the output of that back to be inserted into the taxa file. Could we provide a route for that? Say, in addition to the completed taxa file and the refs file you're providing, include also a taxa file before refs insert, and an interface where the user can offer their own two files- taxa and refs- just for the refs insert step?

eliagbayani commented 4 months ago

@jhammock Thanks for the feedback Jen. I will proceed and just ask questions as I go along. And yes, we can just append below your google spreadsheet for additional mappings.

eliagbayani commented 4 months ago

> Yes, you're right, some of those realms can probably be mapped. Some of them are awkward semi-overlaps, but there are some we can get away with. Shall I add them to the google sheet or are you keeping them somewhere else?
>
> This raises another case, too. If more than one output value is matched in one cell, none should be suggested, I think. This field doesn't seem to support multiple values. Can you enforce that?
>
> Arg, and for now, it's also best to suggest nothing if either "Palearctic" or "Neotropic" is detected; those embrace multiple regions. I suppose the same treatment could go for "global", "worldwide" and "cosmopolitan". If you template how that should look in the mapping file, when a string, if detected, should rule out any suggestion, I'll continue to fill those in as I think of them.

Jen, Yes, our widget can detect multiple hits for a certain location: e.g. [Bahrain] is found in both:

It is your call if we want to:

  1. pickup the first one (current implementation)
  2. don't suggest anything
  3. or we append both: e.g. Aruba: [North America, Caribbean]

For case 3, it will be a text box and no longer a dropdown. What do you think? Thanks.

jhammock commented 4 months ago

Excellent! We may get feedback otherwise, but let's start by not suggesting anything in multiple-match cases.

eliagbayani commented 4 months ago

Hi Jen, @jhammock You can now test the tool with latest changes.

geographic_value.

unacceptability_reason

Others

To do:

Thanks.

jhammock commented 3 months ago

Those updates all work as expected, thanks! I forgot one more thing. For any record where no output value is produced for geographic_value, origin should be left blank.

eliagbayani commented 3 months ago

Hi Jen, @jhammock You can now check tool with latest changes.

jhammock commented 3 months ago

This looks grand!

Oh, for heaven's sake. I forgot this because it's so much more comfortable to read as it is: the output files should all be pipe separated.

No, sorry, that's wrong- only the files for import into ITIS need that: Taxa_final from either interface. They also shouldn't include working columns, which I believe would choke the import process, so referenceID will have to go. That feels like a loss of information; I wonder if we should just make two versions: leave this one as it is and add a new file, say Taxa_formatted, with the pipes and without the ref IDs.
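Writing the ITIS-ready variant (pipe-separated, working columns dropped) is a one-liner with the csv module; a sketch with an invented sample row:

```python
import csv
import os
import tempfile

def write_taxa_formatted(rows, path, drop=("referenceID",)):
    """Write the Taxa_formatted file: pipe-separated, with working
    columns such as referenceID removed."""
    if not rows:
        return
    fields = [c for c in rows[0] if c not in drop]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, delimiter="|",
                                extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)

path = os.path.join(tempfile.gettempdir(), "Taxa_formatted.txt")
write_taxa_formatted([{"taxonID": "1", "referenceID": "7", "rank_name": "Species"}], path)
header = open(path, encoding="utf-8").readline().strip()
```

The original tab-separated Taxa file (with referenceID) stays as is, so no information is lost.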

jhammock commented 3 months ago

I just tried our sample file and I'm still seeing "Native" values (origin column) for records where geographic_value is blank. Cache, maybe?

eliagbayani commented 3 months ago

> I just tried our sample file and I'm still seeing "Native" values (origin column) for records where geographic_value is blank. Cache, maybe?

My bad, working on it.

eliagbayani commented 3 months ago

> This looks grand!
>
> Oh, for heaven's sake. I forgot this because it's so much more comfortable to read as it is: the output files should all be pipe separated.
>
> No, sorry, that's wrong- only the files for import into ITIS need that- Taxa_final from either interface. They also shouldn't include working columns, which I believe will choke the import process, so referenceID will have to go. That feels like a loss of information; I wonder if we should just make two versions- leave this one as it is and add a new file, say, Taxa_formatted, with the pipes and without the ref IDs.

Yes, two versions sound good. Will add Taxa_formatted to both interfaces, with pipes and without the ref IDs. Thanks.

eliagbayani commented 3 months ago

> I just tried our sample file and I'm still seeing "Native" values (origin column) for records where geographic_value is blank. Cache, maybe?

This is now fixed. Thanks.

eliagbayani commented 3 months ago

Hi Jen, @jhammock Taxa_formatted (pipe-delimited, without the referenceID column) now available for both interfaces. Thanks.

jhammock commented 3 months ago

Sweet! This is all as expected, but I imagine we'll have more iterations once we've got some user feedback. More soon!

jhammock commented 3 months ago

We have a mysterious error! David from ITIS reports:

I go here: https://editors.eol.org/eol_php_code/applications/CheckListBank_tool/main.php
Click browse, select DwCA_from_ChecklistBank.zip, click "convert archive file to ITIS format" => invalid file type error:

InvalFileType

When I try the same file, received from David, it works! DwCA_from_ChecklistBank (1).zip

eliagbayani commented 3 months ago

@jhammock It seems that if you zip a folder with its contents, it doesn't work. But if you select the files and then zip them, it works. That is the status for now. But I'm working on having the zipped folder work too. Thanks.
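Handling both layouts (files zipped flat vs. inside a top-level folder) is straightforward if the tool resolves archive members by base filename; a sketch of that idea, with a toy archive built in memory:

```python
import io
import zipfile

def archive_members(zip_bytes):
    """Map base filenames to full archive paths, so a DwC-A zipped as a
    folder (e.g. 'DwCA/Taxon.tsv') resolves the same as one zipped flat."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
        return {name.rsplit("/", 1)[-1]: name
                for name in z.namelist() if not name.endswith("/")}

# Build a zip whose members live inside a top-level folder, as David's did.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("DwCA/Taxon.tsv", "taxonID\n1\n")
members = archive_members(buf.getvalue())
```

The tool would then open `members["Taxon.tsv"]` regardless of how the user built the archive. (Note this doesn't address the separate MIME-type rejection discussed below.)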

jhammock commented 3 months ago

A few of the ITIS users have tried it now with the same problem. They don't think they have a folder zipped into their zip archives. What gave you the impression that was the issue? FWIW they wonder if there's a security layer between them and the widget that is interfering.

jhammock commented 3 months ago

I just tried to access the BOLD-iNat widget to gather tangential clues. I get a Forbidden error at https://editors.eol.org/eol_php_code/applications/BOLD2iNAT/ now. If it's not a clue, it's probably not very important, as that tool will, I imagine, be migrated shortly with everything else...

eliagbayani commented 3 months ago

Hi @jhammock, You can ask an ITIS user to try the tool again. I simply added their file type to the allowed list: "application/x-zip-compressed". Thanks.