NaturalHistoryMuseum / scratchpads2

Scratchpads 2.0
http://scratchpads.org
GNU General Public License v2.0

Taxonomic scope for GBIF DwC #6373

Open benscott opened 3 years ago

benscott commented 3 years ago

Overview

Allow users to set the taxonomic scope for the DwC-A file.

Possible Solutions

Add a basic text entry field to the form at admin/config/content/scratchpads_gbif_registry_client

If set, include this XML in the eml.xml file.

<coverage>
<taxonomicCoverage>
<generalTaxonomicCoverage>Myriapoda and Onychophora</generalTaxonomicCoverage>
</taxonomicCoverage>
</coverage>
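
For illustration only (the module itself is Drupal/PHP, and this is not its code), the "if set, include this XML" rule could be sketched in Python as below. The function name `build_taxonomic_coverage` is hypothetical; the element names are the ones given above.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def build_taxonomic_coverage(scope: Optional[str]) -> Optional[ET.Element]:
    """Return a <coverage> element for the given scope, or None if unset."""
    text = (scope or "").strip()
    if not text:
        # Nothing configured: leave the fragment out of eml.xml entirely.
        return None
    coverage = ET.Element("coverage")
    taxonomic = ET.SubElement(coverage, "taxonomicCoverage")
    general = ET.SubElement(taxonomic, "generalTaxonomicCoverage")
    # ElementTree escapes &, < and > when the tree is serialised.
    general.text = text
    return coverage


fragment = build_taxonomic_coverage("Myriapoda and Onychophora")
print(ET.tostring(fragment, encoding="unicode"))
```

Returning `None` for an empty value keeps the caller's check simple: only append the fragment when something was actually entered.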

Anything Else?

For http://myriatrix.myspecies.info

therobyouknow commented 3 years ago

Thank you Ben, I have some questions:

1 - which form is this?

If I use the URL sub-path you gave in my local setup, e.g. http://127.0.0.1:8080/admin/config/content/scratchpads_gbif_registry_client, I get the content authoring config page: image

2 - is the basic text entry field to be added to a form or an entity such as a content type or a custom database table?

If it's adding to a form: can you advise where the hook_form_alter is? If it's adding to an entity: can you advise on the entity? (Could it be one of the content types defined in the sites/all/modules/custom/darwincore module?) Or is it a custom database table, e.g. as defined in sites/scratchpads-dev/sites/all/modules/custom/dwcarchiver/dwcarchiver.install?

3 - "If set, include this XML in the eml.xml file." - in the XML given:

<coverage>
<taxonomicCoverage>
<generalTaxonomicCoverage>Myriapoda and Onychophora</generalTaxonomicCoverage>
</taxonomicCoverage>
</coverage>

Is the XML always exactly as above, whatever value is entered into the basic text entry field? Or does it serve as an example, with "Myriapoda and Onychophora" standing in for whatever is entered in the field, i.e. should the entered value appear in the XML between the <generalTaxonomicCoverage>...</generalTaxonomicCoverage> tags?

4 - is there a generic function for generating the XML, so that if I add the field in one of the ways determined in item 2 above, the generic XML builder code will output this field value?

Thank you!

Archilegt commented 3 years ago

@therobyouknow, to reach the form you will probably first need to go to Structure->Tools->GBIF Registry and turn the GBIF Registry client on. Once that tool is turned on, you should find a "GBIF registration settings" tool on the content authoring config page:

Picture1

Clicking on the tool takes you to the "GBIF registration settings" form:

Picture2

The task is to add the text field "Taxonomic coverage". I suggest displaying it above "Dataset description" by default, but maybe users could be allowed to customize the display, as the form certainly needs additional fields to be added. The value entered in the "Taxonomic coverage" field, e.g. "Myriapoda and Onychophora", should then be stored in the eml.xml file of the DarwinCore Archive, wrapped in the XML tags mentioned by @benscott.

Some documentation on how this works on GBIF's side: http://gbif.github.io/parsers/apidocs/org/gbif/api/model/registry/eml/TaxonomicCoverage.html If you read that documentation together with the EML schema documentation (https://eml.ecoinformatics.org/schema/eml-coverage_xsd.html#TaxonomicCoverage), you may come to the conclusion that the suggested tag <generalTaxonomicCoverage></generalTaxonomicCoverage> is not what we need, and that we need a tag like <taxonRankValue></taxonRankValue> instead, as described here (https://eml.ecoinformatics.org/schema/eml-coverage_xsd.html#TaxonomicClassificationType_taxonRankValue).

It should be evaluated whether the field should be a basic text field, or whether it should allow comma-separated values that are later parsed, e.g. "Myriapoda and Onychophora" entered as "Myriapoda, Onychophora", then tagged as: `<taxonRankValue>Myriapoda</taxonRankValue> <taxonRankValue>Onychophora</taxonRankValue>`

taxonRankValue is meant to match scientificName on GBIF's side, as described for DarwinCore (https://dwc.tdwg.org/terms/#dwc:scientificName). Anyway, I am no expert. Just trying to contribute to choosing the right field for the form and the right tagging for the eml.xml file. :)
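
A rough sketch of the comma-separated option, in Python purely for illustration (the module is Drupal/PHP). The helper names `parse_taxon_names` and `build_rank_coverage` are hypothetical, and wrapping each `taxonRankValue` in a `taxonomicClassification` is an assumption based on the eml.xml output posted later in this thread.

```python
import xml.etree.ElementTree as ET
from typing import List


def parse_taxon_names(raw: str) -> List[str]:
    """Split a comma-separated field value into unique, trimmed names."""
    names: List[str] = []
    for part in raw.split(","):
        name = part.strip()
        # Skip empty entries (e.g. a trailing comma) and repeated names.
        if name and name not in names:
            names.append(name)
    return names


def build_rank_coverage(raw: str) -> ET.Element:
    """Build <taxonomicCoverage> with one <taxonRankValue> per name."""
    taxonomic = ET.Element("taxonomicCoverage")
    for name in parse_taxon_names(raw):
        classification = ET.SubElement(taxonomic, "taxonomicClassification")
        ET.SubElement(classification, "taxonRankValue").text = name
    return taxonomic


print(ET.tostring(build_rank_coverage("Myriapoda, Onychophora"), encoding="unicode"))
```

Splitting only on commas means a value such as "Myriapoda and Onychophora" stays a single name, so users would have to be told to use commas.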

Archilegt commented 3 years ago

A question about formatting comments with prettier "code": how can I produce line breaks when inserting code?

therobyouknow commented 3 years ago

Thank you so much for the very useful info, @Archilegt! And of course thanks to my colleague @benscott for the original issue description.

On a first look, this detail looks very helpful, @Archilegt. Much appreciated; I look forward to resuming this tomorrow morning. Thank you again!

And I will look at the line breaks question too.

benscott commented 3 years ago

Thank you @Archilegt, it's great to get your input! I'm far from an expert too, but looking into the information you sent and the documentation on GBIF's IPT (Ref: https://ipt.gbif.org/manual/en/ipt/2.5/gbif-metadata-profile#taxonomic-coverage) I think it's up to us. We can use generalTaxonomicCoverage for just a textual description; or taxonRankValue if we wanted to break it down; or both together.

In the first instance, I'm inclined to go for just the basic text description. It's simpler to do, and we can create another issue to scope out adding taxonRankValue alongside it (for example, should we validate the names against the GBIF API?). What do you think?
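
To help scope that follow-up issue, here is a hedged Python sketch of checking a name against GBIF's species match service (`/v1/species/match` with a `name` parameter, whose response includes a `matchType` field). The function names are hypothetical, and treating every `matchType` other than `NONE` as acceptable is an assumption; a stricter policy might accept only `EXACT`.

```python
import json
from typing import Callable, Dict
from urllib.parse import urlencode
from urllib.request import urlopen

GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"


def fetch_match(name: str) -> Dict:
    """Query GBIF's species match service for one name (needs network access)."""
    url = GBIF_MATCH_URL + "?" + urlencode({"name": name})
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def is_known_name(name: str, fetch: Callable[[str], Dict] = fetch_match) -> bool:
    """Treat any matchType other than NONE as a recognised name (assumed policy)."""
    result = fetch(name)
    return result.get("matchType", "NONE") != "NONE"
```

Passing the fetch function as a parameter lets the check be exercised without network access, e.g. `is_known_name("Myriapoda", fetch=lambda n: {"matchType": "EXACT"})`.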

@therobyouknow - to answer some of your other questions...the form is scratchpads_gbif_registry_client_admin_settings, and its other values are just stored as variables. I would do the same for this new variable.

The EML is created in _dwcarchiver_get_eml - https://github.com/NaturalHistoryMuseum/scratchpads2/blob/cebb178c6ed02848c970e14b865a721dde7431da/sites/all/modules/custom/dwcarchiver/dwcarchiver.rebuild.inc

therobyouknow commented 3 years ago

Thank you @benscott !

therobyouknow commented 3 years ago

Thanks @benscott,

Confirming that I can see the function for form definition scratchpads_gbif_registry_client_admin_settings in file: sites/all/modules/custom/scratchpads/scratchpads_gbif_registry_client/scratchpads_gbif_registry_client.pages.inc

Confirming that I can add the text variable there.

Confirming I see that the EML is created in _dwcarchiver_get_eml - presumably this will see the new text variable field value and include it in the XML output, is that right?

Do I need to add anything to the text variable, e.g. attributes, to get the desired XML output, like the example provided above:

<coverage>
<taxonomicCoverage>
<generalTaxonomicCoverage>Myriapoda and Onychophora</generalTaxonomicCoverage>
</taxonomicCoverage>
</coverage>

Thanks again.

therobyouknow commented 3 years ago

Pull request for @benscott to review please: https://github.com/NaturalHistoryMuseum/scratchpads2/pull/6410 Thank you :)

therobyouknow commented 3 years ago

testing steps and progress

Testing seems promising so far: the XML contains the scope. The remaining item is to validate the eml.xml on https://www.gbif.org/tools/data-validator, but this needs a login.

test steps

1 - enter test data at: admin/config/content/scratchpads_gbif_registry_client

Put "Myriapoda and Onychophora" in the "DATASET DESCRIPTION" and "TAXONOMIC SCOPE FOR DWC-A FILE" fields. Tick the "Enable GBIF registration" check box. Click "Save Configuration".

2 - go to admin/config/content/dwcarchiver . On "GBIF DwCA" row, click Download.

3 - A zip downloads; extract it and examine the eml.xml

4 - observe that it contains

<coverage>
  <taxonomicCoverage>
    <generalTaxonomicCoverage>Myriapoda and Onychophora</generalTaxonomicCoverage>
  </taxonomicCoverage>
</coverage>

as required - as Ben specified in initial comment above - https://github.com/NaturalHistoryMuseum/scratchpads2/issues/6373#issue-915372593

5 - validate using https://www.gbif.org/tools/data-validator (seems to require account).

Got to step 4 and confirmed that it does contain the XML fragment. The full eml.xml file is below. I need to look into getting an account to complete step 5.


<?xml version="1.0"?>
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/terms/" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd" xml:lang="en" packageId="619a4b95-1a82-4006-be6a-7dbe3c9b33c5/v7" system="http://gbif.org" scope="system"><dataset xmlns=""><title xml:lang="en">Scratchpads Dev</title><abstract><para>Myriapoda and Onychophora</para></abstract><creator><individualName><givenName/><surName/></individualName><electronicMailAddress/></creator><contact><individualName><givenName/><surName/></individualName><electronicMailAddress/></contact><metadataProvider><individualName><givenName>Scratchpads</givenName><surName>Team</surName></individualName><address><deliveryPoint>Natural History Museum, Cromwell Road</deliveryPoint><city>London</city><administrativeArea>London</administrativeArea><postalCode>SW7 5BD</postalCode><country>UK</country></address><electronicMailAddress>scratchpads@nhm.ac.uk</electronicMailAddress><onlineUrl>http://scratchpads.org/</onlineUrl></metadataProvider><language>en</language><pubDate>2021-09-28</pubDate><distribution scope="document"><online><url function="information">http://127.0.0.1:8080/classification/4</url></online></distribution><coverage><taxonomicCoverage><taxonomicClassification><taxonRankValue>Pinus</taxonRankValue></taxonomicClassification></taxonomicCoverage></coverage><project><title>Scratchpads Dev</title></project><intellectualRights><para>This work is licensed under a Creative Commons 4.0 License - https://creativecommons.org/licenses/by/4.0/<ulink url="https://creativecommons.org/licenses/by/4.0/"/></para></intellectualRights><coverage><taxonomicCoverage><generalTaxonomicCoverage>Myriapoda and Onychophora</generalTaxonomicCoverage></taxonomicCoverage></coverage></dataset><additionalMetadata xmlns=""><metadata><gbif><dateStamp>2021-09-28T15:42:29</dateStamp><hierarchyLevel>dataset</hierarchyLevel><citation>http://127.0.0.1:8080/ - 
Scratchpads Dev</citation></gbif></metadata></additionalMetadata></eml:eml>
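
Steps 3 and 4 could also be checked with a short script rather than by eye. This is a Python sketch under two assumptions: that eml.xml sits at the top level of the downloaded zip, and that the coverage elements are un-namespaced, as in the file above. The function name `read_taxonomic_coverage` is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import BinaryIO, Dict, List, Union


def read_taxonomic_coverage(archive: Union[str, BinaryIO]) -> Dict[str, List[str]]:
    """Collect taxonomic coverage values from eml.xml inside a DwC-A zip."""
    with zipfile.ZipFile(archive) as dwca:
        # Assumes eml.xml is at the top level of the archive.
        root = ET.fromstring(dwca.read("eml.xml"))
    return {
        "general": [el.text or "" for el in root.iter("generalTaxonomicCoverage")],
        "ranks": [el.text or "" for el in root.iter("taxonRankValue")],
    }


# Usage with an in-memory archive standing in for the downloaded zip.
sample_eml = (
    '<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"><dataset>'
    "<coverage><taxonomicCoverage><taxonomicClassification>"
    "<taxonRankValue>Pinus</taxonRankValue>"
    "</taxonomicClassification></taxonomicCoverage></coverage>"
    "<coverage><taxonomicCoverage>"
    "<generalTaxonomicCoverage>Myriapoda and Onychophora</generalTaxonomicCoverage>"
    "</taxonomicCoverage></coverage>"
    "</dataset></eml:eml>"
)
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as dwca:
    dwca.writestr("eml.xml", sample_eml)
print(read_taxonomic_coverage(sample))
```

Listing the `taxonRankValue` entries alongside the general description also surfaces any other coverage values already present in the file, such as the "Pinus" entry in the output above.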

image

therobyouknow commented 3 years ago

Completed step 5: signed up to the GBIF validator with my GitHub account.

Step 5 - went to https://www.gbif.org/tools/data-validator

Uploaded the whole zip file downloaded in test step 2, not just the XML file.

Validation report below.

(page: https://www.gbif.org/tools/data-validator/1631281283248 )

Summary at top of report says: "The file can be indexed by GBIF"

Some issues were detected by the validator:

Metadata Content: The resource creator is missing or is incomplete

GBIF Taxon Interpretation: Vernacular name invalidScientificName assembled

@benscott is this sufficient for the pull request to pass testing? Thanks for your help.

image

Archilegt commented 3 years ago

@therobyouknow, you can visually check the validator results against an older validated version that is published in GBIF (https://www.gbif.org/en/dataset/994e75fa-b187-4b07-a30e-665f4acbe394). I think there used to be an automatic setup in which the admins of a given Scratchpad were the Originator/Administrative point of contact and the Scratchpads team was the metadata author (see the GBIF link above). The information for all "resource creators" then appears in the "Contacts" section of the GBIF dataset. The validator error "The resource creator is missing or is incomplete" may be due to any or all of the contacts missing.

You could consider implementing a part manual, part automatic solution:

1) I suggest adding a field for manual setup of the Originator/Administrative point of contact, because not all admins of a given Scratchpad may want to be, or should be, cited as such.

2) I suggest adding the metadata author (Scratchpads) automatically. See also https://www.gbif.org/publisher/315b3c03-4a0a-424e-83a5-d25aa748e666

Archilegt commented 3 years ago

@therobyouknow, there is a phantom "Pinus" taxon in the Taxonomic coverage section at https://www.gbif.org/tools/data-validator/1631281283248/document. I don't know where that value is coming from, but it does not belong to Myriatrix's metadata.

therobyouknow commented 3 years ago

Thanks @Archilegt, I've begun comparing my eml.xml with the eml.xml file from your link https://www.gbif.org/en/dataset/994e75fa-b187-4b07-a30e-665f4acbe394

I ran a side-by-side comparison of the two files using Scooter Software's Beyond Compare. My eml.xml file is on the right, and differences are in red.

There are several differences, as we can see, but I would need to consult further with Ben and yourself to determine whether they are significant.

Noted your comment about the phantom "Pinus" taxon - I will look into that also.

image

therobyouknow commented 3 years ago

@Archilegt thank you regarding your comment: https://github.com/NaturalHistoryMuseum/scratchpads2/issues/6373#issuecomment-931241823

Can you advise:

cc @benscott, can you advise whether the XML output from my proposed implementation looks correct? It is on the right-hand side in the screenshot, with @Archilegt's example XML alongside as the reference.

If you need anything else from me to help confirm that this issue has the correct implementation, please let me know!

therobyouknow commented 2 years ago

The functionality for this ticket was released in 2.10.1. I tested it on https://strumigenys.myspecies.info/, which now runs 2.10.1.

Test steps used: https://github.com/NaturalHistoryMuseum/scratchpads2/issues/6373#issuecomment-929309671