timrobertson100 closed this issue 3 years ago
I checked Ardea herodias with a genusKey issue and there are 3 deleted canonical names in the nub that might partly explain why ids have changed:
id | parent_id | rank | status | origin | scientific_name | deleted | constituent_key
2480935 | 2480922 | SPECIES | DOUBTFUL | SOURCE | Ardea herodias | t | 47f16512-bf31-410f-b272-d151c996b2f6
9493523 | 2480922 | SPECIES | DOUBTFUL | SOURCE | Ardea herodias | t | 47f16512-bf31-410f-b272-d151c996b2f6
9789784 | 2480922 | SPECIES | DOUBTFUL | SOURCE | Ardea herodias | t | 47f16512-bf31-410f-b272-d151c996b2f6
9630752 | 11716345 | SPECIES | ACCEPTED | SOURCE | Ardea herodias Linnaeus, 1758 | f | 7ddf754f-d193-4cc9-b351-99906754a03b
There are 2 copies of the genus, one without authorship directly attached to Animalia, one the real thing being used for the record in question:
id | parent_id | rank | status | origin | scientific_name | deleted | constituent_key
----------+-----------+-------+----------+---------------+-------------------------------------+---------+--------------------------------------
2480922 | 1 | GENUS | DOUBTFUL | IMPLICIT_NAME | Ardea | f | 80b4b440-eaca-4860-aadf-d0dfdd3e856e
11716345 | 3685 | GENUS | ACCEPTED | SOURCE | Ardea Linnaeus, 1758 | f | 7ddf754f-d193-4cc9-b351-99906754a03b
The first genus is implicit, i.e. created from a species, and comes from "Official Lists and Indexes of Names in Zoology"; the 2nd one is from COL. The question is why these are not merged, as they should really be the same. This might be related to the code change that allows for suprageneric homonyms in COL, see also https://github.com/gbif/checklistbank/issues/123. Needs more investigation.
Presumably, it is the same thing that affects these 3000 classification records (many with no classification change, but a genusKey change)
Genus counts by status:
NEW NUB:
count | status
--------+---------------------
243782 | ACCEPTED
146664 | DOUBTFUL
125172 | SYNONYM
354 | HETEROTYPIC_SYNONYM
176 | HOMOTYPIC_SYNONYM
676 | PROPARTE_SYNONYM
1081 | MISAPPLIED
PREVIOUS:
count | status
--------+---------------------
245197 | ACCEPTED
146794 | DOUBTFUL
114028 | SYNONYM
407 | HETEROTYPIC_SYNONYM
42 | HOMOTYPIC_SYNONYM
603 | PROPARTE_SYNONYM
1082 | MISAPPLIED
Roughly the same number of genera as in the previous version. Sounds like we might have had the same problem all along.
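To put a number on "roughly the same", here is a minimal sketch that diffs the two count tables above. The figures are copied from the tables; the variable names are only illustrative.

```python
# Genus counts by status, copied from the NEW NUB and PREVIOUS tables above.
new = {
    "ACCEPTED": 243782, "DOUBTFUL": 146664, "SYNONYM": 125172,
    "HETEROTYPIC_SYNONYM": 354, "HOMOTYPIC_SYNONYM": 176,
    "PROPARTE_SYNONYM": 676, "MISAPPLIED": 1081,
}
previous = {
    "ACCEPTED": 245197, "DOUBTFUL": 146794, "SYNONYM": 114028,
    "HETEROTYPIC_SYNONYM": 407, "HOMOTYPIC_SYNONYM": 42,
    "PROPARTE_SYNONYM": 603, "MISAPPLIED": 1082,
}

# Per-status change and overall totals.
delta = {status: new[status] - previous[status] for status in new}
total_new = sum(new.values())
total_previous = sum(previous.values())

for status, change in delta.items():
    print(f"{status:20} {change:+d}")
print(f"total: {total_previous} -> {total_new}")
```

The total goes from 508153 to 517905 genera, and most of that growth sits in SYNONYM (+11144), while ACCEPTED drops by 1415.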
Looking at Mimus polyglottos, the species with a genusKey change that has the largest occurrence change, it appears that the authorship of the genus has changed between the last and the new version:
NEW
id | rank | status | syn | scientific_name | size | pid | parent | family | order | class | phylum | kingdom
---------+-------+----------+-----+------------------------+------+------+---------------+---------------+---------------+---------------+--------------+----------
9498764 | GENUS | ACCEPTED | f | Mimus | 8 | 4239 | Curculionidae | Curculionidae | Coleoptera | Insecta | Arthropoda | Animalia
1175552 | GENUS | DOUBTFUL | f | Mimus F.Boie, 1826 | 54 | 9321 | Mimidae | Mimidae | Passeriformes | Aves | Chordata | Animalia
PREVIOUS
id | rank | status | syn | scientific_name | size | pid | parent | family | order | class | phylum | kingdom
---------+-------+----------+-----+------------------------+------+------+---------------+---------------+---------------+---------------+--------------+----------
9498764 | GENUS | ACCEPTED | f | Mimus | 8 | 4239 | Curculionidae | Curculionidae | Coleoptera | Insecta | Arthropoda | Animalia
2494919 | GENUS | DOUBTFUL | f | Mimus Boie, 1826 | 60 | 9321 | Mimidae | Mimidae | Passeriformes | Aves | Chordata | Animalia
One thing that explains this somewhat is that COL has changed the exact form of many authorships due to stronger standardisation of authors. This seems to lead to an explosion of new ids for genera where we already had a few IDs to choose from in the past.
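To illustrate why a changed authorship form can break ID reuse, here is a rough sketch of a loose authorship comparison. It is not the comparison used in the backbone build; `author_key` is a hypothetical helper that drops initials and punctuation and compares only surnames and year.

```python
import re


def author_key(authorship):
    """Hypothetical helper: reduce an authorship string to (surnames, year)."""
    year = re.search(r"\b(1[5-9]\d\d|20\d\d)\b", authorship)
    # Remove digits, then single-letter initials such as "F." in "F.Boie".
    text = re.sub(r"\d+", " ", authorship)
    text = re.sub(r"\b[A-Z]\.\s*", " ", text)
    # Keep every remaining word of two or more letters, lowercased.
    surnames = tuple(w.lower() for w in re.findall(r"[^\W\d_]{2,}", text))
    return surnames, (year.group(1) if year else None)


# The two Mimus authorship forms from the tables above compare as equal.
print(author_key("F.Boie, 1826") == author_key("Boie, 1826"))
# Abbreviated authors still differ, so this naive key would not be enough.
print(author_key("Gleditsch.") == author_key("Gled. ex Scop."))
```

A key like this would treat `Boie, 1826` and `F.Boie, 1826` as the same author, but it would still miss abbreviations, which need a proper author dictionary or parser.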
In the case of Mimus the tests work as expected, but there is another record, Pseudofentonia gen. Mimus from The National Checklist of Taiwan, that is causing the trouble!
Not sure if there is a single systematic problem here...
There were various reasons for keys changing. The main one is that only the primary match for a given name was reusing old ids; in case there were several ids for the same canonical name & kingdom, the others never got used and a new id was issued instead. This is especially the case for genera, which are the most likely to be homonyms.
This was always like this, so the genusKey change is something that we should have seen in all other builds before. Nevertheless I think I have improved it now in various places so all old keys can be reused. Building a new nub now to see how that goes.
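The idea of reusing every old id, not just the primary match, can be sketched roughly as below. This is a simplified illustration and not the actual backbone build code: `issue_ids`, the row layout and the starting value for new ids are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import count


def issue_ids(old_rows, new_rows, first_new_id):
    """Hypothetical sketch: assign an id to each new row, reusing old ids first.

    old_rows: (id, canonical_name, kingdom, scientific_name) from the last build
    new_rows: (canonical_name, kingdom, scientific_name) from the new build
    """
    # Every old id with the same canonical name and kingdom is a candidate.
    candidates = defaultdict(list)
    for old_id, canonical, kingdom, sci_name in old_rows:
        candidates[(canonical, kingdom)].append((old_id, sci_name))

    fresh = count(first_new_id)
    assigned, used = {}, set()

    # Pass 1: exact match on the full scientific name, authorship included.
    for canonical, kingdom, sci_name in new_rows:
        for old_id, old_name in candidates[(canonical, kingdom)]:
            if old_name == sci_name and old_id not in used:
                assigned[(kingdom, sci_name)] = old_id
                used.add(old_id)
                break

    # Pass 2: hand out leftover candidates before issuing a brand new id.
    for canonical, kingdom, sci_name in new_rows:
        key = (kingdom, sci_name)
        if key in assigned:
            continue
        spare = [i for i, _ in candidates[(canonical, kingdom)] if i not in used]
        assigned[key] = spare[0] if spare else next(fresh)
        used.add(assigned[key])
    return assigned


old = [(9498764, "Mimus", "Animalia", "Mimus"),
       (2494919, "Mimus", "Animalia", "Mimus Boie, 1826")]
new = [("Mimus", "Animalia", "Mimus"),
       ("Mimus", "Animalia", "Mimus F.Boie, 1826")]
ids = issue_ids(old, new, first_new_id=12000000)  # starting id is made up
print(ids)
```

In this toy run the changed authorship form still picks up the leftover old id instead of a new one. A real matcher would also have to compare authorship and classification before reusing a leftover id, since homonyms in different families must not share ids.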
Well, the new code in the latest build did not change too much. The build logs track all deleted, resurrected (formerly deleted) and newly created ids.
15:48:13 UTC backbonebuild-vh ~/logs/2021-01-27 $ wc -l *.txt
1047570 created.txt
1017804 deleted.txt
84876 resurrected.txt
2150250 total
15:48:20 UTC backbonebuild-vh ~/logs/2021-01-21 $ wc -l *.txt
1059908 created.txt
1020725 deleted.txt
74488 resurrected.txt
2155121 total
15:48:37 UTC backbonebuild-vh ~/logs/2020-12-19 $ wc -l *.txt
954704 created.txt
1014373 deleted.txt
52038 resurrected.txt
2021115 total
13:27:54 UTC static-vh /var/www/html/hosted-datasets.gbif.org/datasets/backbone/2019-09-06 $ wc -l *.txt
844773 created.txt
186188 deleted.txt
33679 resurrected.txt
1064640 total
13:32:49 UTC static-vh /var/www/html/hosted-datasets.gbif.org/datasets/backbone/2018-06-20/test $ wc -l *.txt
180017 created.txt
140324 deleted.txt
19687 resurrected.txt
340028 total
The 2019 & 2018 backbones deleted some 140 to 186 thousand IDs each. This time we are about to delete over a million! Something is not right...
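For a quick comparison across builds, the `wc -l` output above can be parsed and compared as in this small sketch. The numbers are copied from the logs above; `parse_wc` is just an illustrative helper.

```python
def parse_wc(output):
    """Turn `wc -l *.txt` output into {"created": n, "deleted": n, ...}."""
    counts = {}
    for line in output.strip().splitlines():
        number, name = line.split()
        counts[name.rsplit(".", 1)[0]] = int(number)
    return counts


build_2021 = parse_wc("""
1047570 created.txt
1017804 deleted.txt
84876 resurrected.txt
2150250 total
""")
build_2019 = parse_wc("""
844773 created.txt
186188 deleted.txt
33679 resurrected.txt
1064640 total
""")

# How many times more ids the 2021-01-27 build deletes than the 2019 one.
ratio = build_2021["deleted"] / build_2019["deleted"]
print(f"deleted ids: {ratio:.1f}x the 2019 build")
```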
In 2019 we deleted the following numbers of taxa by rank:
CLASS 7
ORDER 5
FAMILY 67
GENUS 16.504
SPECIES 81.857
SUBSPECIES 11.426
VARIETY 6.931
FORM 2.252
UNRANKED 78.565
The latest build deletes:
PHYLUM 5
CLASS 31
ORDER 100
FAMILY 199
GENUS 8.309
SPECIES 317.426
SUBSPECIES 47.182
VARIETY 46.663
FORM 11.878
UNRANKED 633.193
The vast majority in the latest build are unranked and species. This could very well be OTUs... Lots of the higher ranks are Bacteria, which is also not surprising with GTDB added now.
The ID reporting is partly wrong in the latest builds. It reports duplicates and also IDs that have been deleted before, e.g.:
8434341 GENUS Abramidopsis Siebold 1863
8713183 GENUS Abramidopsis Siebold 1863
9231386 GENUS Abramidopsis Siebold 1863
8434341 GENUS Abramidopsis Siebold 1863
8713183 GENUS Abramidopsis Siebold 1863
10669563 GENUS Abramidopsis Siebold 1863
There are only 2 real ones out of the 6 reported:
https://www.gbif.org/species/8434341 https://www.gbif.org/species/8713183 https://www.gbif.org/species/9231386 https://www.gbif.org/species/10669563
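A small sketch for spotting the duplicated lines in such a report, using the six Abramidopsis lines above as input. Whether a remaining id was already deleted in an earlier build would still need a lookup against the backbone, which this example does not do.

```python
from collections import Counter

reported = """8434341 GENUS Abramidopsis Siebold 1863
8713183 GENUS Abramidopsis Siebold 1863
9231386 GENUS Abramidopsis Siebold 1863
8434341 GENUS Abramidopsis Siebold 1863
8713183 GENUS Abramidopsis Siebold 1863
10669563 GENUS Abramidopsis Siebold 1863"""

# The id is the first whitespace-separated token on every line.
counts = Counter(int(line.split()[0]) for line in reported.splitlines())
unique_ids = sorted(counts)
duplicated = sorted(i for i, n in counts.items() if n > 1)

print(len(unique_ids), "unique ids in", sum(counts.values()), "lines")
print("reported more than once:", duplicated)
```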
With the OTU / unparsed name fix for stable IDs things look much better (down by 60%), but there are still more than twice as many deleted IDs as in previous builds. This still needs to be investigated:
466248 created.txt
411382 deleted.txt
82654 resurrected.txt
960284 total
With the new fix that passes the deletion date to the stable ID generator, the number of deleted IDs drops again:
475938 created.txt
382481 deleted.txt
90756 resurrected.txt
949175 total
With the 3-pass id issuing we snap to the right genus keys and get this:
472410 created.txt
382559 deleted.txt
94481 resurrected.txt
949450 total
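The effect of the successive fixes on the deleted count can be summarised as below. This is only a sketch: it assumes the 2021-01-27 build is the baseline for the "down by 60%" figure, and the step labels are paraphrased from the comments above.

```python
# deleted.txt line counts after each step, copied from the logs above.
deleted_by_step = [
    ("2021-01-27 build (assumed baseline)", 1_017_804),
    ("OTU / unparsed name fix", 411_382),
    ("deletion date passed to ID generator", 382_481),
    ("3-pass id issuing", 382_559),
]

baseline = deleted_by_step[0][1]
# Percentage reduction of deleted ids relative to the baseline, rounded.
reduction = {label: round(100 * (1 - n / baseline)) for label, n in deleted_by_step}

for label, pct in reduction.items():
    print(f"{label:40} -{pct}%")
```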
What does deleted mean here please? Does it really mean that 40% of IDs in the backbone are removed?
It means they existed in the last version and were removed in this version. But that is not 40% of all ids: we have 6.586.621 IDs in the current version and 1.151.430 previously deleted IDs. So 5.8% of all current IDs have been deleted.
Still 382.559 is a large number, also when comparing to previous deletions which were at 140k to 180k (see above). I will investigate if there is a clear pattern what was removed.
We did remove the GBIF Type Specimen dataset this time, which accounted for 229.155 taxa in the last Backbone! That leaves 153.404 deletions, which is in the same range as before.
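The arithmetic behind these two figures, as a short worked example. All numbers are taken from the comments above; the 5.8% corresponds to dividing by the number of IDs in the current version.

```python
deleted_now = 382_559          # deleted.txt in the latest build
current_ids = 6_586_621        # IDs in the current version
type_specimen_taxa = 229_155   # taxa from the removed GBIF Type Specimen dataset

# Share of current IDs that were deleted in this build.
share = deleted_now / current_ids
# Deletions left once the removed dataset is accounted for.
remaining = deleted_now - type_specimen_taxa

print(f"{share:.1%} of current IDs deleted")
print(f"{remaining} deletions not explained by the removed dataset")
```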
PHYLUM 5
CLASS 31
ORDER 101
FAMILY 155
GENUS 5.949
SPECIES 268.647
SUBSPECIES 44.549
VARIETY 43.573
FORM 11.372
UNRANKED 52.726
Unranked contains both BOLD (3736) and SH (141) ids, but the vast majority (48.862) are basionym placeholder names with a question mark in front, like ? welleri Girty 1909. Looks like we can easily improve the id matching there, but it should not affect occurrence matching as I doubt any of those are used.
Indeed, some IDs are not kept stable among the basionym placeholders:
NEW BUILD:
clb=> select id,rank,status,origin,del,scientific_name from nub2 where scientific_name ~ '^\? aberrans' order by scientific_name limit 10;
id | rank | status | origin | del | scientific_name
----------+----------+-------------------+----------------------+-----+------------------------------
8515032 | UNRANKED | HOMOTYPIC_SYNONYM | BASIONYM_PLACEHOLDER | t | ? aberrans Dautzenberg, 1910
10958716 | UNRANKED | DOUBTFUL | BASIONYM_PLACEHOLDER | f | ? aberrans Fontaine, 1961
9618184 | UNRANKED | DOUBTFUL | BASIONYM_PLACEHOLDER | t | ? aberrans Fontaine, 1961
10973045 | UNRANKED | DOUBTFUL | BASIONYM_PLACEHOLDER | f | ? aberrans Gravenhorst, 1829
8398445 | UNRANKED | SYNONYM | BASIONYM_PLACEHOLDER | t | ? aberrans Gravenhorst, 1829
10939764 | UNRANKED | DOUBTFUL | BASIONYM_PLACEHOLDER | f | ? aberrans Koken, 1889
9418777 | UNRANKED | DOUBTFUL | BASIONYM_PLACEHOLDER | t | ? aberrans Koken, 1889
10995018 | UNRANKED | DOUBTFUL | BASIONYM_PLACEHOLDER | f | ? aberrans Kossmat, 1895
9528308 | UNRANKED | DOUBTFUL | BASIONYM_PLACEHOLDER | t | ? aberrans Kossmat, 1895
9477272 | UNRANKED | DOUBTFUL | BASIONYM_PLACEHOLDER | t | ? aberrans Martynov, 1938
(10 rows)
OLD BUILD:
clb_src=> select id,rank,status,origin,del,scientific_name from nub2 where scientific_name ~ '^\? aberrans' order by scientific_name limit 10;
id | rank | status | origin | del | scientific_name
----------+----------+-------------------+----------------------+-----+-------------------------------------
8515032 | UNRANKED | HOMOTYPIC_SYNONYM | BASIONYM_PLACEHOLDER | t | ? aberrans Dautzenberg, 1910
9618184 | UNRANKED | DOUBTFUL | BASIONYM_PLACEHOLDER | f | ? aberrans Fontaine, 1961
8398445 | UNRANKED | SYNONYM | BASIONYM_PLACEHOLDER | f | ? aberrans Gravenhorst, 1829
9418777 | UNRANKED | DOUBTFUL | BASIONYM_PLACEHOLDER | f | ? aberrans Koken, 1889
9528308 | UNRANKED | DOUBTFUL | BASIONYM_PLACEHOLDER | f | ? aberrans Kossmat, 1895
9477272 | UNRANKED | DOUBTFUL | BASIONYM_PLACEHOLDER | f | ? aberrans Martynov, 1938
9516073 | UNRANKED | DOUBTFUL | BASIONYM_PLACEHOLDER | f | ? aberrans Meunier, 1904
10169333 | UNRANKED | DOUBTFUL | BASIONYM_PLACEHOLDER | f | ? aberrans Miller & Unklesbay, 1942
7535769 | UNRANKED | HOMOTYPIC_SYNONYM | BASIONYM_PLACEHOLDER | t | ? aberrans Oudemans, 1932
7858545 | UNRANKED | DOUBTFUL | BASIONYM_PLACEHOLDER | t | ? aberrans Peck
(10 rows)
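The pattern in the two result sets can be picked out mechanically. The sketch below uses the ten rows of the new build and the live ids of the old build shown above, and flags names where an id that was live before is now deleted while a different id is live for the same name.

```python
from collections import defaultdict

# (id, deleted, scientific_name) rows from the NEW BUILD query above.
new_build = [
    (8515032, True, "? aberrans Dautzenberg, 1910"),
    (10958716, False, "? aberrans Fontaine, 1961"),
    (9618184, True, "? aberrans Fontaine, 1961"),
    (10973045, False, "? aberrans Gravenhorst, 1829"),
    (8398445, True, "? aberrans Gravenhorst, 1829"),
    (10939764, False, "? aberrans Koken, 1889"),
    (9418777, True, "? aberrans Koken, 1889"),
    (10995018, False, "? aberrans Kossmat, 1895"),
    (9528308, True, "? aberrans Kossmat, 1895"),
    (9477272, True, "? aberrans Martynov, 1938"),
]
# Ids that were not deleted in the OLD BUILD query above.
old_live = {9618184, 8398445, 9418777, 9528308, 9477272, 9516073, 10169333}

by_name = defaultdict(lambda: {"live": [], "deleted": []})
for id_, deleted, name in new_build:
    by_name[name]["deleted" if deleted else "live"].append(id_)

# Names whose formerly live id was deleted and replaced by another live id.
replaced = {
    name: (ids["deleted"], ids["live"])
    for name, ids in by_name.items()
    if ids["live"] and any(i in old_live for i in ids["deleted"])
}
for name, (old_ids, new_ids) in replaced.items():
    print(name, old_ids, "->", new_ids)
```

Within these ten rows, four names (Fontaine, Gravenhorst, Koken, Kossmat) got a new id although the name itself did not change. The Martynov row is cut off by the `limit 10`, so nothing can be said about it from this sample.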
The 4 largest genera that still have this issue are in Aves, of course:
Nycticorax should be Nycticorax T.Forster which is correct in this nub, but wrong in our previous! So the genusKey change makes sense here as we actually changed the authorship for good.
Numenius should be Numenius Brisson, 1760 which we have in the latest build. It was wrong with Numenius Moehring, 1758 from COL in the previous!
Tadorna should be Tadorna F.Boie also good. It was wrongly Tadorna Von Oken before.
Phalaropus should be Phalaropus Brisson, 1760, good. Was badly Phalaropus Latham before!
=> All changes make sense and are for good!
same with Elymus (plants).
Pteridium (fern) doesn't look great though:
NOW 4 genera Pteridium, Polypodiales
WAS 7 genera !!
COL has it like it is in the latest build. The type genus of Pteridaceae is not Pteridium as one could think, but Pteris which is still in Pteridaceae.
The previous Pteridium (Dennstaedtiaceae) & Pteridium Gleditsch. should be the same. The new build uses Pteridium Gled. ex Scop. instead of Gleditsch, maybe the author comparison did not work here as we get a new ID.
Not ideal, but better than before I would say. NOW:
clb=> select id,rank,status,scientific_name,size,family,phylum,kingdom from nubcl where scientific_name ~ '^Pteridium' and rank ='GENUS' order by kingdom, scientific_name;
id | rank | status | scientific_name | size | family | phylum | kingdom
---------+-------+----------+------------------------------------------+------+------------------+--------------+----------
6786839 | GENUS | DOUBTFUL | Pteridium De Filippi & Vérany, 1857 | 0 | Bythitidae | Chordata | Animalia
3243087 | GENUS | SYNONYM | Pteridium Gürich, 1930 | 0 | Pteridiniidae | | Animalia
2661727 | GENUS | SYNONYM | Pteridium (Kützing, 1843) J.Agardh, 1898 | 0 | Delesseriaceae | Rhodophyta | Plantae
5275011 | GENUS | ACCEPTED | Pteridium Gled. ex Scop. | 39 | Dennstaedtiaceae | Tracheophyta | Plantae
8091908 | GENUS | DOUBTFUL | Pteridium Gleditsch | 0 | Lindsaeaceae | Tracheophyta | Plantae
6008853 | GENUS | SYNONYM | Pteridium Raf. | 0 | Pteridaceae | Tracheophyta | Plantae
(6 rows)
WAS:
clb_src=> select id,rank,status,scientific_name,size,family,phylum,kingdom from nubcl where scientific_name ~ '^Pteridium' and rank ='GENUS' order by kingdom, scientific_name;
id | rank | status | scientific_name | size | family | phylum | kingdom
----------+-------+----------+-------------------------------------+------+------------------+--------------+----------
6786839 | GENUS | DOUBTFUL | Pteridium De Filippi & Vérany, 1857 | 0 | Bythitidae | Chordata | Animalia
3243087 | GENUS | SYNONYM | Pteridium Gürich, 1930 | 0 | Pteridiniidae | | Animalia
2345776 | GENUS | SYNONYM | Pteridium Scopoli, 1777 | 0 | Bramidae | Chordata | Animalia
10634127 | GENUS | DOUBTFUL | Pteridium | 2 | Dennstaedtiaceae | Tracheophyta | Plantae
8091908 | GENUS | ACCEPTED | Pteridium Gleditsch. | 41 | Dennstaedtiaceae | Tracheophyta | Plantae
2661727 | GENUS | DOUBTFUL | Pteridium J.Agardh, 1898 | 0 | Delesseriaceae | Rhodophyta | Plantae
6008853 | GENUS | SYNONYM | Pteridium Raf. | 0 | Pteridaceae | Tracheophyta | Plantae
(7 rows)
There were also 3 more deleted records before, one of which carries the ID for the newly accepted Pteridium Gled. ex Scop.:
clb_src=> select id,rank,status,origin,del,scientific_name from nub2 where scientific_name ~ '^Pteridium' and rank ='GENUS' order by id;
id | rank | status | origin | del | scientific_name
----------+-------+----------+---------------+-----+-------------------------------------
2345776 | GENUS | SYNONYM | SOURCE | f | Pteridium Scopoli, 1777
2661727 | GENUS | DOUBTFUL | SOURCE | f | Pteridium J.Agardh, 1898
3243087 | GENUS | SYNONYM | SOURCE | f | Pteridium Gürich, 1930
5275011 | GENUS | ACCEPTED | SOURCE | t | Pteridium Gled. ex Scop.
6008853 | GENUS | SYNONYM | SOURCE | f | Pteridium Raf.
6786839 | GENUS | DOUBTFUL | SOURCE | f | Pteridium De Filippi & Vérany, 1857
8091908 | GENUS | ACCEPTED | SOURCE | f | Pteridium Gleditsch.
9744831 | GENUS | ACCEPTED | SOURCE | t | Pteridium
9794466 | GENUS | DOUBTFUL | SOURCE | t | Pteridium
10634127 | GENUS | DOUBTFUL | IMPLICIT_NAME | f | Pteridium
(10 rows)
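Using the id sets from the three Pteridium queries above, the created / deleted / resurrected split for this genus name can be worked out with plain set operations. This is a sketch of the bookkeeping only; it assumes the build logs classify ids the same way.

```python
# Genus ids for Pteridium, copied from the query results above.
old_live = {2345776, 2661727, 3243087, 6008853, 6786839, 8091908, 10634127}
old_deleted = {5275011, 9744831, 9794466}
new_live = {6786839, 3243087, 2661727, 5275011, 8091908, 6008853}

resurrected = new_live & old_deleted          # deleted before, live again now
deleted = old_live - new_live                 # live before, gone now
created = new_live - old_live - old_deleted   # never seen before

print("resurrected:", sorted(resurrected))
print("deleted:", sorted(deleted))
print("created:", sorted(created))
```

So for this genus name no new id was created: the accepted Pteridium Gled. ex Scop. reuses the previously deleted id 5275011, and two old ids (2345776 and 10634127) drop out.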
This is one example of many, where a change in higher taxon results in all new keys being issued to sub-taxa. Is this expected?
To find other examples, select the "genusKey" changed in the "changes" column, and then on the "genus" column select "name has not changed" on this console