CatalogueOfLife / data

Repository for COL content

Undefined codes #583

Open aoern opened 9 months ago

aoern commented 9 months ago

There are 101.000+ low taxon names (species or infraspecific) and 64.000+ high taxon names without a defined nomenclatural Code.

Examples of datasets that totally lack nomenclatural codes: Gymnodinium, Tortricid.net, WCVP-Fabaceae, Microsporidia. Strangely enough, WCVP-Fabaceae does list the codes of all subspecies taxa (1800+).

There are also several datasets that list the nomenclatural codes of low taxa but not of generic taxa and higher.

In most cases the correct code is easy to deduce, but because of missing codes it is impossible to design a general purpose script or program that needs to know the code, for example a species name syntax checker.
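To illustrate why a syntax checker needs the code, here is a minimal Python sketch. It relies on one well-known difference: zoological infraspecific names are plain trinomina, while botanical ones carry a rank marker such as "subsp." or "var.". The function name `check_infraspecific` and the two regular expressions are hypothetical and heavily simplified (no authorship, hybrids or subgenera); the code values "zoological" and "botanical" are the ones used in ChecklistBank.

```python
import re
from typing import Optional

# Simplified patterns, for illustration only; real names have many more forms
# (authorship, hybrid markers, subgenera, ...).
ZOOLOGICAL_TRINOMEN = re.compile(r"^[A-Z][a-z]+ [a-z]+ [a-z]+$")
BOTANICAL_INFRASPECIFIC = re.compile(r"^[A-Z][a-z]+ [a-z-]+ (subsp|var|f)\. [a-z-]+$")


def check_infraspecific(name: str, code: Optional[str]) -> Optional[bool]:
    """Return True/False if the name can be judged, None if the code is missing."""
    if code == "zoological":
        return bool(ZOOLOGICAL_TRINOMEN.match(name))
    if code == "botanical":
        return bool(BOTANICAL_INFRASPECIFIC.match(name))
    # Without a code there is no way to know which rule applies.
    return None
```

The same string can be valid under one code and invalid under the other, so a record without a code can only be skipped or guessed at.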

mdoering commented 9 months ago

There are 3 places to fix this.

1) we have control over the source archives and can declare the code in the coldp default.yaml file.

2) If an entire source dataset follows a single code we should declare that in the dataset settings and it would be applied to all its names upon the next import - which we should then trigger.

3) Otherwise we can also set the code in each sector's settings - which also allows a code to be applied to mixed sources such as ITIS.
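The three places can be thought of as a fallback chain. The Python sketch below is only a mental model, not ChecklistBank code: the function `resolve_code` and the precedence order (record value, then sector setting, then dataset setting, then archive default) are assumptions made for illustration, and in practice the dataset setting takes effect on import while the sector setting takes effect on sync.

```python
from typing import Optional


def resolve_code(
    record_code: Optional[str],
    sector_code: Optional[str] = None,
    dataset_code: Optional[str] = None,
    archive_default: Optional[str] = None,
) -> Optional[str]:
    """Return the first code that is set, or None if none is declared.

    The precedence order here is an ASSUMPTION for illustration only.
    """
    for candidate in (record_code, sector_code, dataset_code, archive_default):
        if candidate:
            return candidate
    return None
```

A name ends up without a code only when all four levels are empty, which is the situation reported in this issue.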

yroskov commented 9 months ago

It's not nice to learn that some GSDs have problems with Codes again. A year or so ago, all GSDs in CoL were associated with Codes. We'll look at the problem again.

Well, neither @gdower nor I have control over Tortricid.net and WCVP-Fabaceae. These GSDs were converted and imported into CLB by other people. (@dhobern, could you please apply the ICZN to Tortricid.net during re-import?)

Gymnodinium & Microsporidia, on the other hand, were moved into CLB from AC19. We'll see with Geoff what we can do here.

mdoering commented 9 months ago

Like I said above, there are various ways to fix this; you don't need to change the archive files if you don't have access.

yroskov commented 9 months ago

WCVP-Fabaceae already has the "botanical" Code in the settings.

I am re-syncing now (2023-11-13).

mdoering commented 9 months ago

the dataset setting is for imports, not syncs...

mdoering commented 9 months ago

But the Fabaceae dataset does have the code imported already: https://www.checklistbank.org/dataset/2304/names?NOM_CODE=botanical

yroskov commented 9 months ago

@gdower, @mdoering, could you please describe here what exactly you have done today with Gymnodinium, Microsporidia, WCVP-Fabaceae and Tortricid.net? I.e. we need to document (1) which of the 3 fixes was applied, (2) whether the re-import was completed, (3) whether the re-sync was completed.

| GSD Name | Where the Code was set up | Re-imported | Re-synced |
| --- | --- | --- | --- |
| WCVP-Fabaceae | No action: the Code was already defined in ver. 2023v.4 / 2023-08-02 | no | re-synced 2023-11-13 |
| Gymnodinium | set up for the sector | no | re-synced 2023-11-13 |
| Microsporidia | set up for the sector | no | re-synced 2023-11-13 |
| Tortricid.net | set up for GSD | re-imported 2023-11-13 | re-synced 2023-11-13 |

yroskov commented 9 months ago

@aoern, you gave 4 GSDs as an example. Do you have a full list of GSDs where Code is not defined?

mdoering commented 9 months ago

I only added code=zoological to the Tortricid.net dataset options and reimported it

gdower commented 9 months ago

@yroskov and I added code to the sectors for Gymnodinium and Microsporidia and resynced them.

dhobern commented 9 months ago

Thanks - should this property move into (or be duplicated in) the main metadata page/document? Or maybe the Options tab should be merged into the Metadata tab but only visible to those editing the page or to administrators. It could then be made a mandatory field for completion.

mdoering commented 9 months ago

That is how we started, but I feel it is cleaner to separate configurations/settings from metadata that informs you but is not used for interpretations.

Apart from the nomenclatural code you can also declare extinct or the default environment in the settings. And to be honest I would not mind being even more generic in the future and allowing defaults for any coldp or dwc term just as we do with default.yaml. Maybe splitting default values from other settings helps to visualise them on the main metadata page?

yroskov commented 9 months ago

allow defaults for any coldp or dwc term

I support this idea. But: please use controlled vocabularies, where interface users have the option to choose a value from a list.

yroskov commented 9 months ago

Does this report http://api.checklistbank.org/dataset/3/nameusage/search?nomCode=_NULL&facet=rank&facet=SECTOR_DATASET_KEY have an answer to the question, which GSDs in CoL still have no Code assignment?

How to assign Code to the nodes in the management classification (there are no sectors there)?

aoern commented 9 months ago

@yroskov, apart from the 4 GSDs mentioned earlier (Gymnodinium, Tortricid.net, WCVP-Fabaceae, and Microsporidia) there are no other datasets with specific and infraspecific taxa without a defined code. However, there are still 56.000+ high rank taxa without a defined code in several datasets.

yroskov commented 9 months ago

...there are still 56.000+ high rank taxa without a defined code in several datasets.

Thank you, @aoern! It looks like the majority of them belong to the "management classification", i.e. they are above/outside sectors and GSDs.

@mdoering, do we have a mechanism to assign Code to the taxa outside GSD sectors? Not sure that "3 places to fix this" work for management classification.

Separate question, what to do with taxa of ranks which are not regulated by the Code (ICZN regulates names from species-group to family-group and not above)?

DaveNicolson commented 9 months ago

ICZN does regulate names above superfamily, just in fewer ways. So they still have to be properly published, but they are not subject to Priority, for example. Art. 1.2.2 notes the Articles that apply for names above the family-group ranks.

aoern commented 9 months ago

@yroskov, you write that "It looks like majority of them belong to the "management classification"". However, source 0 and IRMNG account for only 4.000+ cases and they are not included in these 56.000 cases.

yroskov commented 9 months ago

source 0 and IRMNG count only to 4.000+ cases and they are not included in these 56.000 cases.

That is interesting. If these taxa are inside sectors, @mdoering, we need a list of parent GSDs where Code is not designated yet. There's not a quick way to get the info out of the API, as @gdower found.

mdoering commented 9 months ago

Does this report http://api.checklistbank.org/dataset/3/nameusage/search?nomCode=_NULL&facet=rank&facet=SECTOR_DATASET_KEY have an answer to the question, which GSDs in CoL still have no Code assignment?

Yes, the facet tells you which source datasets (sectorDatasetKey) and ranks are involved. If you limit the search to zero you only get the facet to view: http://api.checklistbank.org/dataset/3/nameusage/search?nomCode=_NULL&facet=rank&facet=SECTOR_DATASET_KEY&limit=0

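You can see that most come from PBDB (https://www.checklistbank.org/dataset/268676/names?nomCode=_NULL). Then SF+ (https://www.checklistbank.org/dataset/2073/names?nomCode=_NULL).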

Btw, you can also add the filter nomCode=_NULL to the UI search URL and it uses it to filter even though the filter is not present in the forms yet (ping @thomasstjerne, see https://github.com/CatalogueOfLife/checklistbank/issues/1319).
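For anyone who wants to script this report, here is a minimal Python sketch. `missing_code_url` rebuilds the exact query shown above. `top_facet_values` assumes the JSON response carries a `facets` object that maps a facet name to a list of `{"value": ..., "count": ...}` entries; that shape and both function names are assumptions, so check a real response before relying on them.

```python
from urllib.parse import urlencode

API = "http://api.checklistbank.org/dataset/3/nameusage/search"


def missing_code_url(limit: int = 0) -> str:
    """Build the search URL for name usages without a nomenclatural code."""
    params = [
        ("nomCode", "_NULL"),
        ("facet", "rank"),
        ("facet", "SECTOR_DATASET_KEY"),
        ("limit", str(limit)),  # limit=0 returns only the facets
    ]
    return f"{API}?{urlencode(params)}"


def top_facet_values(response: dict, facet: str, n: int = 5) -> list:
    """Return the n largest (value, count) pairs for one facet.

    ASSUMED shape: response["facets"][facet] is a list of
    {"value": ..., "count": ...} dictionaries.
    """
    entries = response.get("facets", {}).get(facet, [])
    ranked = sorted(entries, key=lambda e: e["count"], reverse=True)
    return [(e["value"], e["count"]) for e in ranked[:n]]
```

Fetching `missing_code_url()` with any HTTP client and passing the decoded JSON to `top_facet_values` should then list the source datasets that contribute the most names without a code.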

mdoering commented 9 months ago

How to assign Code to the nodes in the management classification (there are no sectors there)?

You can define the code in the name editor. There is no bulk tool yet, but I am sure we'll need it. Well, checking the taxon forms there does not seem to be a code field:

image

@thomasstjerne that will be needed in the future. Also extinct and really all the other name & taxon fields too, at least in an advanced section: https://github.com/CatalogueOfLife/checklistbank/issues/1320

TonyRees commented 9 months ago

If anything needs adjusting in IRMNG, I am happy to assist. I think nomenclature Code is added via a script at the data export stage, it is not in my edit interface.

aoern commented 9 months ago

More info about missing code GSDs: There are tens of GSDs that contain taxa without nomenclatural code:

Species Fungorum Plus: 14210 taxa
Systema Dipterorum: 11456
FishBase: 5758
WSC: 4390
CilCat: 1262
and many, many more

yroskov commented 9 months ago

Well, Species Fungorum Plus, Systema Dipterorum, FishBase, WSC, CilCat, all have defined Code of Nomenclature in the Options (and had it before their imports). It means, we are doing Sisyphean labor. The problem is not in data, but in the CLB code.

mdoering commented 9 months ago

Ah, looking at FishBase examples I see that these are all names not present directly as a record of their own in the source, but names with origin=denormalised, i.e. they are found only in the flat, higher classification and need to be extracted. The code doing that probably does not add any default values.

gdower commented 9 months ago

The code doing that probably does not add any default values

Related: I was testing the default values and for ACEF imports with extinct set true in the dataset setting, the importer doesn't set the extinct flag on denormalized records.
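The suspected gap can be shown with a small Python sketch. This is not the CLB backend code; the `Name` class, the helper functions, the record keys and the origin value "source" are all hypothetical, and only origin=denormalised and the code/extinct defaults come from the discussion above. The point is that defaults must also be applied to names extracted from the flat classification columns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Name:
    scientific_name: str
    rank: str
    origin: str  # "source" (hypothetical label) or "denormalised"
    code: Optional[str] = None
    extinct: Optional[bool] = None


def apply_defaults(name: Name, defaults: dict) -> Name:
    """Fill unset fields from the dataset-level defaults."""
    if name.code is None:
        name.code = defaults.get("code")
    if name.extinct is None:
        name.extinct = defaults.get("extinct")
    return name


def names_from_record(record: dict, defaults: dict) -> list:
    """Build the record's own name plus one name per flat classification column."""
    names = [Name(record["scientificName"], record["rank"], "source")]
    for rank in ("family", "genus"):
        if record.get(rank):
            names.append(Name(record[rank], rank, "denormalised"))
    # Applying defaults to every name, including the denormalised ones,
    # is the step suspected to be missing.
    return [apply_defaults(n, defaults) for n in names]
```

If `apply_defaults` were called only on the first name, the family and genus entries would keep `code=None`, which matches what the missing-code report shows for higher ranks.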

mdoering commented 9 months ago

Fixed in code now, but it will only be deployed to prod tomorrow: https://github.com/CatalogueOfLife/backend/commit/435d236048251c8c89a939a5cfb39146f003d22c