Open yroskov opened 3 years ago
ISSUES (selected only) 2021-03-22
TASKS 2021-03-22
Resolved 2021-03-22:
Synced 2021-03-19. Sync needs to be repeated after metadata fixes.
We are not sending these names/taxa as accepted... i.e. first example Globorotalia acostaensis Blow, 1959 https://data.catalogueoflife.org/dataset/1157/taxon/x3XH http://www.marinespecies.org/aphia.php?p=taxdetails&id=894416
This is something CoL makes up...instead of using the existing name from WoRMS
https://data.catalogueoflife.org/dataset/1157/name/x3XG
placeholder
I find this a worrying evolution...
It's caused by an accepted subspecies under an unaccepted species (in this case Globorotalia acostaensis subsp. trochoidea Bizon & Bizon, 1965)
I know CoL does not like this, but still dangerous to just create entries...
Woops! Indeed, these are species names created by the clearinghouse. OK, they should be blocked in CoL.
For attention of @mdoering
I am not sure if I fully understand the problem.
There is an accepted subspecies Globorotalia acostaensis subsp. trochoidea Bizon & Bizon, 1965 † under the species synonym Globorotalia acostaensis Blow, 1959 †, which has an accepted species Neogloboquadrina acostaensis (Blow, 1959) †.
This indeed is an unsupported tree and against the ColDP or DwC format. Nevertheless ChecklistBank should be able to handle those cases and flag them. I would expect the accepted subspecies to be moved under the accepted species Neogloboquadrina acostaensis.
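To make the expected behaviour concrete, here is a minimal sketch of moving an accepted child from a synonym to that synonym's accepted usage and flagging it. The record layout, the keys (`status`, `parent_id`, `accepted_id`) and the short IDs are hypothetical and chosen only for illustration; this is not how ChecklistBank implements it.

```python
def reparent_under_accepted(usages):
    """Move accepted children of a synonym under that synonym's accepted usage.

    Returns the IDs of the usages that were moved, so they can be flagged.
    """
    by_id = {u["id"]: u for u in usages}
    flagged = []
    for u in usages:
        parent = by_id.get(u.get("parent_id"))
        if u["status"] == "accepted" and parent and parent["status"] == "synonym":
            u["parent_id"] = parent["accepted_id"]
            flagged.append(u["id"])
    return flagged


# Hypothetical IDs; names taken from the example in this thread.
usages = [
    {"id": "N", "name": "Neogloboquadrina acostaensis",
     "status": "accepted", "parent_id": None},
    {"id": "G", "name": "Globorotalia acostaensis",
     "status": "synonym", "accepted_id": "N", "parent_id": None},
    {"id": "T", "name": "Globorotalia acostaensis subsp. trochoidea",
     "status": "accepted", "parent_id": "G"},
]
flagged = reparent_under_accepted(usages)
```

After the call, the subspecies record points at the accepted species and its ID is in `flagged`.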
The record for species acostaensis has "origin":"denormed classification". Does that indicate that species=acostaensis was given with that name in the Taxon ColDP file? Checking the Taxon file, I can see the species column indeed holds only the species epithet, when it should be the full binomial:
The species binomial the taxon is classified in. If parentID is given this field is ignored.
I would recommend not mixing the flat classification with the parentID terms. Choose one or the other, unless you need the mix to complement missing higher ranks, e.g. you have no kingdom record in your database. If you can provide parentID to build up a parent-child relation, this is the recommended way in ColDP. It is safer and results in quicker imports too.
As the flat ranks should be ignored if parentID is given you normally do not notice the problem. But the Foraminifera dataset has 7478 invalid parentIDs, in which case the system switches to the flat classification for those records.
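A rough sketch of how such records can be found before export: a parentID counts as invalid here when it does not resolve to any ID in the Taxon file. The dictionaries and the IDs `"1"`, `"2"` and `"A"` are made up for illustration; only 520751 comes from this thread, and the real importer may apply further checks.

```python
def split_by_parent_validity(taxa):
    """Separate Taxon rows whose parentID resolves from those where it does not."""
    taxon_ids = {t["ID"] for t in taxa}
    resolved, fallback = [], []
    for t in taxa:
        pid = t.get("parentID")
        if pid and pid not in taxon_ids:
            # Parent is missing from the Taxon file, so the flat
            # classification columns would be used for this row.
            fallback.append(t["ID"])
        else:
            resolved.append(t["ID"])
    return resolved, fallback


taxa = [
    {"ID": "1", "parentID": None, "scientificName": "Foraminifera"},
    {"ID": "2", "parentID": "1", "scientificName": "Globorotalia"},
    # Hypothetical row whose parent (520751) is absent from the Taxon file.
    {"ID": "A", "parentID": "520751", "scientificName": "Asymetria"},
]
resolved, fallback = split_by_parent_validity(taxa)
```

Rows listed in `fallback` are the ones where the flat rank columns start to matter.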
There are 2259 usages created because of the flat classification, i.e. origin=denormed classification:
"usagesByOriginCount":{
"source":73164,
"denormed classification":2259
}
https://api.catalogueoflife.org/dataset/1157/import
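For scale, the share of generated usages can be worked out from the fragment above. The snippet parses only that fragment wrapped in braces; the rest of the import response is not shown here, so nothing about its shape is assumed.

```python
import json

# Only the fragment quoted above, wrapped in braces to make it valid JSON.
payload = json.loads(
    '{"usagesByOriginCount": {"source": 73164, "denormed classification": 2259}}'
)
counts = payload["usagesByOriginCount"]
total = sum(counts.values())
share = counts["denormed classification"] / total

print(f"{counts['denormed classification']} of {total} usages ({share:.1%})")
```

That is roughly 3% of all usages in the dataset.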
We apparently miss a search filter/facet for origin - we will add that so they become searchable easily.
(1) The invalid parentIDs are what they are; this is not something we can easily fix. Plus, Aphia does allow/support this. We will report this to our editors, but it will need time to get fixed.
(2) Regarding parentIDs vs. flat classification: we prefer to keep both, as that allows quick and easy inspection and comparison of records.
@bart-v in case there is an invalid parentID, how do you extract any classification yourself? Aphia seems to follow a parent child model with a parent column. Should those records that have no parent not be entirely orphaned names?
Example: accepted genus Asymetria Georgescu, 2012 with an invalid parentID:
Biota > Chromista (Kingdom) > Harosa (Subkingdom) > Rhizaria (Infrakingdom) > Foraminifera (Phylum) > Foraminifera incertae sedis (Class) > Asymetria † (Genus)
520751 seems to exist in WoRMS, so why is it not part of the ColDP archive?
It is given as a Name record: https://data.catalogueoflife.org/dataset/1157/verbatim?q=520751&type=col%3AName
But not as a Taxon/Usage record. There are just many records that have it as their parent: https://data.catalogueoflife.org/dataset/1157/verbatim?offset=0&q=520751&type=col%3ATaxon
Is it maybe because it is considered a temporary name? We should probably assign all temporary names some status and not just leave a remark. Can that be done? Right now the best match seems to be "nom. inval.", but we might want to add another status for temporary/informal/placeholder names so we can clearly identify them.
In Aphia, a parentID can perfectly be invalid, we just follow the parents until Biota, accepted or not.
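As an illustration of "follow the parents until Biota, accepted or not", here is a minimal sketch of such a walk. The numeric IDs, the `status` values and the shortened lineage are hypothetical; this is not Aphia code.

```python
def lineage(aphia_id, records):
    """Walk parent links up to the root, ignoring the status of each parent."""
    path = []
    seen = set()
    current = aphia_id
    while current is not None and current not in seen:
        seen.add(current)  # guard against accidental cycles
        rec = records[current]
        path.append(rec["name"])
        current = rec["parent"]
    return list(reversed(path))


# Hypothetical IDs and statuses, with a shortened lineage.
records = {
    1: {"name": "Biota", "parent": None, "status": "accepted"},
    2: {"name": "Foraminifera", "parent": 1, "status": "accepted"},
    3: {"name": "Foraminifera incertae sedis", "parent": 2, "status": "unaccepted"},
    4: {"name": "Asymetria", "parent": 3, "status": "accepted"},
}
path = lineage(4, records)
```

The walk never inspects `status`, which is why an unaccepted parent does not break the classification on the Aphia side.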
AphiaID 520751 is not in Taxon.txt, because it's not an accepted name, while you require it: https://github.com/CatalogueOfLife/coldp#taxon
Taxon
An accepted name with a taxonomic classification(...)
This is not only about temporary names. There are many parents that are unaccepted...
What is their status then? You use the name in the classification, so they should be considered a name usage, which is either a Taxon or Synonym in ColDP. Alternatively you could use the NameUsage table which does not differ between the two "subclasses". A Taxon record is not necessarily a fully accepted name. It can also be provisionally accepted. Maybe there is also the need to have a new Taxon status to represent your cases? Simply dropping the record is the worst solution I would think.
There is no separate status for this: it can be any. Notice that we don't have the formal taxon concept in Aphia...
I don't think we can easily fix this right away, so I propose we drop them for now. If we inform our editors well, we can slowly tackle these issues...
Excluding them entirely would be fine. But could we then also remove the name record and relink the parentID of the children to the ID of the next higher parent which is considered fine? That would produce a valid parent child tree structure again.
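A minimal sketch of that relinking, assuming a plain child-to-parent mapping and a set of excluded IDs. Both structures and the short names used as IDs are hypothetical, and the mapping is assumed to be free of cycles.

```python
def relink_to_valid_ancestor(parent_of, excluded):
    """Drop excluded records and attach their children to the nearest kept ancestor."""
    relinked = {}
    for child, parent in parent_of.items():
        if child in excluded:
            continue  # the record itself is removed
        while parent in excluded:
            # Skip over excluded ancestors until a kept one (or the root) is reached.
            parent = parent_of.get(parent)
        relinked[child] = parent
    return relinked


parent_of = {
    "Biota": None,
    "Foraminifera": "Biota",
    "incertae sedis": "Foraminifera",
    "Asymetria": "incertae sedis",
}
relinked = relink_to_valid_ancestor(parent_of, excluded={"incertae sedis"})
```

Note the side effect: if several consecutive ancestors are excluded, a child can end up directly under a much higher rank.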
I think this is not what we want: in theory you could end up with Biota or similar as the parent of a genus. It would also involve some work on our side...
Let's just exclude them for now, and focus on getting at least some up-to-date data into CoL
Not sure if I understand what you mean by exclude. Do you mean we should exclude them on our side? If so, what should be excluded exactly?
Just what Yuri did now: exclude or ignore on the CoL side
The "invalid parent issue" has been fixed now. These will be added as provisional=1. Also, the metadata is now fixed.
Available in next export on 2021-06-01
TASKS 2021-07-08
[x] Broken decisions: 708; re-matched all: 724. Deleted all.
[x] Duplicated uninomials are already marked as acc vs prov.acc
Resolved:
ver 2021-11-01
TASKS - there are changes
[x] Broken decisions, 15; RematchAll = 6 remains; deleted all
[x] Identical superfamily, family, genus = all cases marked as acc - prov acc by the provider. No decisions needed, except one case.
Resolved 2021-11-04
ver 2022-08-01
TASKS
Resolved:
Re-synced 2022-08-03
ver 2023-05-01
TASKS
Resolved 2023-05-08
Re-synced 2023-05-08
ver 2023-06-01
TASKS
Resolved 2023-06-05:
Re-synced 2023-06-05
It is caused by triple duplicates like this
Only one name is flagged if the button Most Recent Name is used. The button All Except Oldest should be used instead.
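A small sketch of why the choice matters for a group of three. The two functions only mimic what the button labels suggest; the actual button logic, and the `year` key used to order the names, are assumptions.

```python
def most_recent_only(group):
    """Flag just the newest name in a duplicate group (assumed button behaviour)."""
    return [max(group, key=lambda n: n["year"])["id"]]


def all_except_oldest(group):
    """Flag every name except the oldest one (assumed button behaviour)."""
    oldest = min(group, key=lambda n: n["year"])
    return [n["id"] for n in group if n is not oldest]


# Hypothetical triple duplicate.
group = [
    {"id": "a", "year": 1900},
    {"id": "b", "year": 1950},
    {"id": "c", "year": 2000},
]
flagged_recent = most_recent_only(group)
flagged_all = all_except_oldest(group)
unflagged_recent = [n["id"] for n in group if n["id"] not in flagged_recent]
unflagged_all = [n["id"] for n in group if n["id"] not in flagged_all]
```

With the first approach two of the three names stay unflagged and still duplicate each other; with the second only one remains.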
= FIXED 2023-06-09
Re-synced 2023-06-09
ver 2024-05-01
TASKS
Resolved 2024-05-07:
Re-synced 2024-05-07
ver 2024-06-01
TASKS
Resolved 2024-06-03:
Synced 2024-06-03
2024-09-23: ranks "subvariety" & "subform" excluded from accepted taxa in the sector. 15 names affected.
Re-synced 2024-09-23
WoRMS Foraminifera, id 1157 on prod https://data.catalogueoflife.org/catalogue/3/dataset/1157