CatalogueOfLife / testing

Editorial tests and discussion to prepare for COL releases

WoRMS Foraminifera (id 1157): test report #61

Open yroskov opened 3 years ago

yroskov commented 3 years ago

WoRMS Foraminifera, id 1157 on prod https://data.catalogueoflife.org/catalogue/3/dataset/1157

yroskov commented 3 years ago

ISSUES (selected only) 2021-03-22

yroskov commented 3 years ago

TASKS 2021-03-22

image

https://data.catalogueoflife.org/catalogue/3/dataset/1157/duplicates?authorshipDifferent=false&category=binomial&limit=500&minSize=2&mode=STRICT&offset=0&status=accepted&withDecision=false

image

Resolved 2021-03-22:

image

yroskov commented 3 years ago

Synced 2021-03-19. The sync needs to be repeated after the metadata fixes.

bart-v commented 3 years ago

We are not sending these names/taxa as accepted... i.e. first example Globorotalia acostaensis Blow, 1959 https://data.catalogueoflife.org/dataset/1157/taxon/x3XH http://www.marinespecies.org/aphia.php?p=taxdetails&id=894416

This is something CoL makes up instead of using the existing name from WoRMS: https://data.catalogueoflife.org/dataset/1157/name/x3XG I find this a worrying evolution...

It's caused by an accepted subspecies under an unaccepted species (in this case Globorotalia acostaensis subsp. trochoidea Bizon & Bizon, 1965)

I know CoL does not like this, but still dangerous to just create entries...

yroskov commented 3 years ago

Woops! Indeed, these are species names created by the clearinghouse. OK, they should be blocked in CoL.

For attention of @mdoering

mdoering commented 3 years ago

I am not sure if I fully understand the problem.

There is an accepted subspecies Globorotalia acostaensis subsp. trochoidea Bizon & Bizon, 1965 † under the species synonym Globorotalia acostaensis Blow, 1959 † which has an accepted species Neogloboquadrina acostaensis (Blow, 1959) †.

This indeed is an unsupported tree and against the ColDP or DwC format. Nevertheless ChecklistBank should be able to handle those cases and flag them. I would expect the accepted subspecies to be moved under the accepted species Neogloboquadrina acostaensis.
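The expected handling described above (moving an accepted child of a synonym under that synonym's accepted usage) could be sketched roughly as follows. This is only an illustration, not ChecklistBank's actual code: the dictionary keys `status`, `parent_id` and `accepted_id` and the short IDs are hypothetical.

```python
def reparent_under_accepted(usages):
    """Move accepted children of a synonym under the synonym's accepted usage.

    `usages` maps a usage ID to a dict with the hypothetical keys
    'status' ('accepted' or 'synonym'), 'parent_id' and, for synonyms,
    'accepted_id'. Returns the IDs of the usages that were moved.
    """
    moved = []
    for uid, usage in usages.items():
        parent = usages.get(usage.get("parent_id"))
        if usage["status"] == "accepted" and parent and parent["status"] == "synonym":
            # Re-attach the child to the accepted usage the synonym points to.
            usage["parent_id"] = parent["accepted_id"]
            moved.append(uid)
    return moved


# Hypothetical IDs standing in for the three records discussed above.
usages = {
    "neo": {"status": "accepted", "parent_id": None},  # Neogloboquadrina acostaensis
    "glo": {"status": "synonym", "parent_id": None, "accepted_id": "neo"},  # Globorotalia acostaensis
    "sub": {"status": "accepted", "parent_id": "glo"},  # subsp. trochoidea
}
moved = reparent_under_accepted(usages)
```

After the call, the subspecies record points at the accepted species instead of the synonym, and `moved` lists the records that would need an issue flag.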

The record for species acostaensis has "origin":"denormed classification". That indicates there was species=acostaensis given with that name in the Taxon ColDP file? Checking the Taxon file I can see the species column indeed holds only the species epithet, when it should be the full binomial:

The species binomial the taxon is classified in. If parentID is given this field is ignored.

I would recommend not mixing the flat classification with the parentID terms. Choose one or the other, unless you need the mix to complement missing higher ranks, e.g. you have no kingdom record in your database. If you can provide parentID to build up a parent-child relation, this is the recommended way in ColDP. This is safer and results in quicker imports too.

mdoering commented 3 years ago

As the flat ranks should be ignored if parentID is given you normally do not notice the problem. But the Foraminifera dataset has 7478 invalid parentIDs, in which case the system switches to the flat classification for those records.
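A data provider could check for such records before export with something like the sketch below. It assumes the Taxon rows are already loaded as dictionaries with `ID` and `parentID` keys; the sample IDs `t0` and `t1` are hypothetical, while 520751 is the parent discussed in this thread.

```python
def split_by_parent_validity(taxa):
    """Split Taxon rows into those with a resolvable parentID and those without.

    A row counts as valid when it has no parentID (a root) or when its
    parentID matches the ID of another row in the same Taxon table.
    """
    known_ids = {row["ID"] for row in taxa}
    valid, invalid = [], []
    for row in taxa:
        parent = row.get("parentID")
        if not parent or parent in known_ids:
            valid.append(row)
        else:
            # These are the rows for which the flat classification would be used.
            invalid.append(row)
    return valid, invalid


# Hypothetical sample: 't1' points at a parent that is missing from the Taxon table.
taxa = [
    {"ID": "t0", "parentID": None, "scientificName": "Foraminifera"},
    {"ID": "t1", "parentID": "520751", "scientificName": "Asymetria"},
]
valid, invalid = split_by_parent_validity(taxa)
```

Here `invalid` contains only the `t1` row, because its parent is not present as a Taxon record.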

mdoering commented 3 years ago

There are 2259 usages created because of the flat classification, i.e. origin=denormed classification:

"usagesByOriginCount":{
  "source":73164,
  "denormed classification":2259
}

https://api.catalogueoflife.org/dataset/1157/import

We apparently miss a search filter/facet for origin - we will add that so they become searchable easily.
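To put the two counts in proportion, the fragment quoted above can be parsed as shown in this small sketch. It works on the quoted JSON only and does not call the API; wrapping the fragment in braces to make it a complete JSON object is an assumption made for illustration.

```python
import json

# The fragment quoted above, wrapped in braces so it parses as a JSON object.
fragment = """
{
  "usagesByOriginCount": {
    "source": 73164,
    "denormed classification": 2259
  }
}
"""

counts = json.loads(fragment)["usagesByOriginCount"]
total = sum(counts.values())
denormed_share = counts["denormed classification"] / total

print(f"{total} usages in total, {denormed_share:.1%} created from the flat classification")
```

So roughly 3% of the usages in this import were created from the flat classification rather than taken from the source records.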

bart-v commented 3 years ago

(1) The invalid parentIDs are what they are; this is not something we can easily fix. Plus, Aphia does allow/support this. We will report this to our editors, but it will need time to get fixed.

(2) Regarding parentIDs vs. flat classification: we prefer to keep both, as that allows quick and easy inspection and comparison of records.

mdoering commented 3 years ago

@bart-v in case there is an invalid parentID, how do you extract any classification yourself? Aphia seems to follow a parent child model with a parent column. Should those records that have no parent not be entirely orphaned names?

Example accepted genus Asymetria Georgescu, 2012 with an invalid parentID:

Biota Chromista (Kingdom) Harosa (Subkingdom) Rhizaria (Infrakingdom) Foraminifera (Phylum) Foraminifera incertae sedis (Class) Asymetria † (Genus)

520751 seems to exist in WoRMS, so why is it not part of the ColDP archive?

It is given as a Name record: https://data.catalogueoflife.org/dataset/1157/verbatim?q=520751&type=col%3AName

But not as a Taxon/Usage record. There are just many records that have it as their parent: https://data.catalogueoflife.org/dataset/1157/verbatim?offset=0&q=520751&type=col%3ATaxon

Is it maybe because it is considered a temporary name? We should probably assign all temporary names some status and not just leave a remark. Can that be done? Right now the best match seems to be "nom. inval.", but we might want to add another status for temporary/informal/placeholder names so we can clearly identify them.

bart-v commented 3 years ago

In Aphia, a parentID can perfectly be invalid, we just follow the parents until Biota, accepted or not.

AphiaID 520751 is not in Taxon.txt, because it's not an accepted name, while you require it: https://github.com/CatalogueOfLife/coldp#taxon

Taxon
An accepted name with a taxonomic classification(...)

This is not only about temporary names. There are many parents that are unaccepted...

mdoering commented 3 years ago

What is their status then? You use the name in the classification, so they should be considered a name usage, which is either a Taxon or a Synonym in ColDP. Alternatively you could use the NameUsage table, which does not distinguish between the two "subclasses". A Taxon record is not necessarily a fully accepted name. It can also be provisionally accepted. Maybe there is also a need for a new Taxon status to represent your cases? Simply dropping the record is the worst solution, I would think.

bart-v commented 3 years ago

There is no separate status for this: it can be any. Notice that we don't have the formal taxon concept in Aphia...

I don't think we can easily fix this right away, so I propose we drop them for now. If we inform our editors well, we can slowly tackle these issues...

mdoering commented 3 years ago

Excluding them entirely would be fine. But could we then also remove the name record and relink the parentID of the children to the ID of the next higher parent which is considered fine? That would produce a valid parent child tree structure again.
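The relinking proposed here could look roughly like the sketch below. It assumes the full parent chain (including the records to be excluded) is available as a simple mapping; the function name and the sample IDs are hypothetical.

```python
def relink_to_valid_ancestor(parent_of, excluded):
    """Drop excluded records and attach their children to the next kept ancestor.

    `parent_of` maps every record ID to its parent ID (None for the root),
    including the records that will be excluded. Returns a new mapping that
    contains only the kept records.
    """
    relinked = {}
    for node, parent in parent_of.items():
        if node in excluded:
            continue
        seen = set()
        # Walk upwards until a kept record (or the root) is reached.
        while parent is not None and parent in excluded and parent not in seen:
            seen.add(parent)
            parent = parent_of.get(parent)
        if parent in excluded:
            # Guard against cycles: no kept ancestor could be found.
            parent = None
        relinked[node] = parent
    return relinked


# Hypothetical chain: biota <- incertae_sedis (excluded) <- genus <- species
parent_of = {
    "biota": None,
    "incertae_sedis": "biota",
    "genus": "incertae_sedis",
    "species": "genus",
}
relinked = relink_to_valid_ancestor(parent_of, excluded={"incertae_sedis"})
```

In this sample the genus ends up directly under `biota`, which shows both that the tree becomes valid again and that intermediate ranks can be lost along the way.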

bart-v commented 3 years ago

I think this is not what we want: in theory you could end up with Biota or so as the parent of a genus. It would also involve some work on our side...

Let's just exclude them for now, and focus on getting at least some up-to-date data into CoL

mdoering commented 3 years ago

Not sure if I understand what you mean by "exclude". Do you mean we should exclude them on our side? If so, what should be excluded exactly?

bart-v commented 3 years ago

Just what Yuri did now: exclude or ignore on the CoL side

bart-v commented 3 years ago

The "invalid parent issue" has been fixed now. These will be added as provisional=1. Also, the metadata is now fixed.

Available in next export on 2021-06-01

yroskov commented 3 years ago

TASKS 2021-07-08

image

Resolved: image

yroskov commented 3 years ago

ver 2021-11-01

TASKS - there are changes image

Resolved 2021-11-04 image

yroskov commented 2 years ago

ver 2022-08-01

TASKS image

Resolved: image

Re-synced 2022-08-03

yroskov commented 1 year ago

ver 2023-05-01

TASKS

image

Resolved 2023-05-08

image

Re-synced 2023-05-08

yroskov commented 1 year ago

ver 2023-06-01

TASKS

image

Resolved 2023-06-05:

image

Re-synced 2023-06-05

yroskov commented 1 year ago

It is caused by triple duplicates like this: image

Only one name is flagged if the button Most Recent Name is used. The button All Except Oldest should be used instead.

= FIXED 2023-06-09

Re-synced 2023-06-09
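The difference between the two buttons for a triple duplicate can be illustrated with the sketch below. The selection logic is an assumption about what the buttons do, inferred from their labels, and the `id` and `year` values are hypothetical.

```python
def most_recent(group):
    """Assumed behaviour of 'Most Recent Name': flag only the newest record."""
    return [max(group, key=lambda name: name["year"])]


def all_except_oldest(group):
    """Assumed behaviour of 'All Except Oldest': flag everything but the oldest."""
    oldest = min(group, key=lambda name: name["year"])
    return [name for name in group if name is not oldest]


# Hypothetical triple duplicate: the same binomial published three times.
triple = [
    {"id": "a", "year": 1900},
    {"id": "b", "year": 1950},
    {"id": "c", "year": 2000},
]

flagged_recent = most_recent(triple)
flagged_all = all_except_oldest(triple)
unflagged_after_recent = len(triple) - len(flagged_recent)
unflagged_after_all = len(triple) - len(flagged_all)
```

Under these assumptions the first button leaves two unflagged names in the group, so the duplicate persists, while the second leaves exactly one.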

yroskov commented 6 months ago

ver 2024-05-01

TASKS

image

Resolved 2024-05-07:

image

Re-synced 2024-05-07

yroskov commented 5 months ago

ver 2024-06-01

TASKS

image

Resolved 2024-06-03:

image

Synced 2024-06-03

yroskov commented 1 month ago

2024-09-23: ranks "subvariety" & "subform" excluded from accepted taxa in the sector. 15 names affected. Re-synced 2024-09-23