gbif / checklistbank

GBIF Checklist Bank
Apache License 2.0

nub genusKey changes - duplicate genera #154

Closed timrobertson100 closed 3 years ago

timrobertson100 commented 3 years ago

This is one example of many, where a change in higher taxon results in all new keys being issued to sub-taxa. Is this expected?

To find other examples, select "genusKey" in the "changes" column, and then select "name has not changed" in the "genus" column on this console.

{
  "count": 1155,
  "verbatim_kingdom": "null",
  "verbatim_phylum": "Annelida",
  "verbatim_class": "Clitellata",
  "verbatim_order": "Haplotaxida",
  "verbatim_family": "Rhinodrilidae",
  "verbatim_genus": "Pontoscolex",
  "verbatim_species": "null",
  "verbatim_infra": "null",
  "verbatim_rank": "null",
  "verbatim_verbatimRank": "null",
  "verbatim_scientificName": "Bold:aaf0317",
  "verbatim_generic": "null",
  "verbatim_author": "null",
  "current_kingdom": "Animalia",
  "current_phylum": "Annelida",
  "current_class": "Clitellata",
  "current_order": "Haplotaxida",
  "current_family": "Glossoscolecidae",
  "current_genus": "Pontoscolex",
  "current_subGenus": "null",
  "current_species": "Pontoscolex corethrurus",
  "current_scientificName": "BOLD:AAF0317",
  "current_acceptedScientificName": "BOLD:AAF0317",
  "current_kingdomKey": 1,
  "current_phylumKey": 42,
  "current_classKey": 255,
  "current_orderKey": 472,
  "current_familyKey": 10227222,
  "current_genusKey": 2308740,
  "current_subGenusKey": "null",
  "current_speciesKey": 2308741,
  "current_taxonKey": 9845214,
  "current_acceptedTaxonKey": 9845214,
  "proposed_kingdom": "Animalia",
  "proposed_phylum": "Annelida",
  "proposed_class": "Clitellata",
  "proposed_order": "Opisthopora",
  "proposed_family": "Glossoscolecidae",
  "proposed_genus": "Pontoscolex",
  "proposed_subGenus": "null",
  "proposed_species": "Pontoscolex corethrurus",
  "proposed_scientificName": "BOLD:AAF0317",
  "proposed_acceptedScientificName": "BOLD:AAF0317",
  "proposed_kingdomKey": 1,
  "proposed_phylumKey": 42,
  "proposed_classKey": 255,
  "proposed_orderKey": 7190133,
  "proposed_familyKey": 10805791,
  "proposed_genusKey": 11630359,
  "proposed_subGenusKey": "null",
  "proposed_speciesKey": 2308741,
  "proposed_taxonKey": 11247452,
  "proposed_acceptedTaxonKey": 11247452,
  "_key": 15093,
  "changes": {
    "order": true,
    "orderKey": true,
    "familyKey": true,
    "genusKey": true,
    "taxonKey": true
  }
}
mdoering commented 3 years ago

I checked Ardea herodias with a genusKey issue and there are 3 deleted canonical names in the nub that might partly explain why ids have changed:

 2480935 |   2480922 | SPECIES    | DOUBTFUL | SOURCE | Ardea herodias                             | t       | 47f16512-bf31-410f-b272-d151c996b2f6
 9493523 |   2480922 | SPECIES    | DOUBTFUL | SOURCE | Ardea herodias                             | t       | 47f16512-bf31-410f-b272-d151c996b2f6
 9789784 |   2480922 | SPECIES    | DOUBTFUL | SOURCE | Ardea herodias                             | t       | 47f16512-bf31-410f-b272-d151c996b2f6
 9630752 |  11716345 | SPECIES    | ACCEPTED | SOURCE | Ardea herodias Linnaeus, 1758              | f       | 7ddf754f-d193-4cc9-b351-99906754a03b
mdoering commented 3 years ago

There are 2 copies of the genus, one without authorship directly attached to Animalia, one the real thing being used for the record in question:

    id    | parent_id | rank  |  status  |    origin     |           scientific_name           | deleted |           constituent_key            
----------+-----------+-------+----------+---------------+-------------------------------------+---------+--------------------------------------
  2480922 |         1 | GENUS | DOUBTFUL | IMPLICIT_NAME | Ardea                               | f       | 80b4b440-eaca-4860-aadf-d0dfdd3e856e
 11716345 |      3685 | GENUS | ACCEPTED | SOURCE        | Ardea Linnaeus, 1758                | f       | 7ddf754f-d193-4cc9-b351-99906754a03b
mdoering commented 3 years ago

The first genus is implicit, i.e. created from a species, and comes from "Official Lists and Indexes of Names in Zoology"; the 2nd one is from COL. The question is why these are not merged, as they should really be the same. It might be related to the code change that allows for suprageneric homonyms in COL, see also https://github.com/gbif/checklistbank/issues/123. Needs more investigation.

timrobertson100 commented 3 years ago

Presumably, it is the same thing that affects these 3000 classification records (many with no classification change, but a genusKey change)


mdoering commented 3 years ago

Genus counts by status:

NEW NUB:

 count  |       status        
--------+---------------------
 243782 | ACCEPTED
 146664 | DOUBTFUL
 125172 | SYNONYM
    354 | HETEROTYPIC_SYNONYM
    176 | HOMOTYPIC_SYNONYM
    676 | PROPARTE_SYNONYM
   1081 | MISAPPLIED

PREVIOUS:

 count  |       status        
--------+---------------------
 245197 | ACCEPTED
 146794 | DOUBTFUL
 114028 | SYNONYM
    407 | HETEROTYPIC_SYNONYM
     42 | HOMOTYPIC_SYNONYM
    603 | PROPARTE_SYNONYM
   1082 | MISAPPLIED

Roughly the same number of genera as in the previous version. Sounds like we might have had the same problem all along.
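
To make the comparison concrete, here is a small sketch that computes the per-status difference between the two tables above. The numbers are copied from this comment; the function name `status_deltas` is only illustrative.

```python
# Genus counts by status, copied from the NEW NUB and PREVIOUS tables above.
new_nub = {
    "ACCEPTED": 243782,
    "DOUBTFUL": 146664,
    "SYNONYM": 125172,
    "HETEROTYPIC_SYNONYM": 354,
    "HOMOTYPIC_SYNONYM": 176,
    "PROPARTE_SYNONYM": 676,
    "MISAPPLIED": 1081,
}
previous = {
    "ACCEPTED": 245197,
    "DOUBTFUL": 146794,
    "SYNONYM": 114028,
    "HETEROTYPIC_SYNONYM": 407,
    "HOMOTYPIC_SYNONYM": 42,
    "PROPARTE_SYNONYM": 603,
    "MISAPPLIED": 1082,
}


def status_deltas(new, old):
    """Return {status: new - old} for every status present in either build."""
    return {s: new.get(s, 0) - old.get(s, 0) for s in sorted(set(new) | set(old))}


deltas = status_deltas(new_nub, previous)
total_new = sum(new_nub.values())
total_old = sum(previous.values())

for status, delta in deltas.items():
    print(f"{status:20} {delta:+d}")
print(f"{'TOTAL':20} {total_new - total_old:+d}")
```

The total grows by 9752 genera (under 2%), and almost all of the net change sits in SYNONYM (+11144), while ACCEPTED drops by 1415.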

mdoering commented 3 years ago

Looking at the species with the largest occurrence change and a genusKey change, Mimus polyglottos, it appears that the authorship of the genus has changed between the last and the new version:

NEW

   id    | rank  |  status  | syn |    scientific_name     | size | pid  |    parent     |    family     |     order     |     class     |    phylum    | kingdom  
---------+-------+----------+-----+------------------------+------+------+---------------+---------------+---------------+---------------+--------------+----------
 9498764 | GENUS | ACCEPTED | f   | Mimus                  |    8 | 4239 | Curculionidae | Curculionidae | Coleoptera    | Insecta       | Arthropoda   | Animalia
 1175552 | GENUS | DOUBTFUL | f   | Mimus F.Boie, 1826     |   54 | 9321 | Mimidae       | Mimidae       | Passeriformes | Aves          | Chordata     | Animalia

PREVIOUS

   id    | rank  |  status  | syn |    scientific_name     | size | pid  |    parent     |    family     |     order     |     class     |    phylum    | kingdom  
---------+-------+----------+-----+------------------------+------+------+---------------+---------------+---------------+---------------+--------------+----------
 9498764 | GENUS | ACCEPTED | f   | Mimus                  |    8 | 4239 | Curculionidae | Curculionidae | Coleoptera    | Insecta       | Arthropoda   | Animalia
 2494919 | GENUS | DOUBTFUL | f   | Mimus Boie, 1826       |   60 | 9321 | Mimidae       | Mimidae       | Passeriformes | Aves          | Chordata     | Animalia
mdoering commented 3 years ago

One thing that explains this somewhat is that COL has changed the exact form of many authorships due to stronger standardisation of authors. This seems to lead to an explosion of new ids for genera where we already had a few IDs to choose from in the past.
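
As an illustration of why a changed authorship form can defeat ID matching, here is a minimal sketch of a lenient authorship comparison. It is not the checklistbank implementation; `author_key` and `same_authorship` are hypothetical helpers that only strip leading initials and compare surname plus year.

```python
import re


def author_key(authorship):
    """Reduce an authorship string to (surname, year) for lenient comparison.

    Hypothetical helper: drops initials such as "F." and keeps the last
    4-digit year if one is present. Abbreviated surnames are NOT expanded.
    """
    years = re.findall(r"\b(\d{4})\b", authorship)
    year = years[-1] if years else ""
    # Remove the year together with a preceding comma.
    name = re.sub(r",?\s*\d{4}", "", authorship)
    # Drop initials like "F." or "J.A." in front of the surname.
    name = re.sub(r"\b(?:[A-Z]\.\s*)+", "", name)
    return (name.strip().lower(), year)


def same_authorship(a, b):
    """True if both strings reduce to the same (surname, year) key."""
    return author_key(a) == author_key(b)


# The Mimus case from the tables above: only the initial differs.
print(same_authorship("F.Boie, 1826", "Boie, 1826"))
# Abbreviations such as "Gled." are not resolved by this sketch.
print(same_authorship("Gleditsch.", "Gled. ex Scop."))
```

A comparison like this would treat `Mimus F.Boie, 1826` and `Mimus Boie, 1826` as the same name, but it would still miss abbreviated author forms, which need a lookup of standard author abbreviations.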

mdoering commented 3 years ago

In the case of Mimus the tests work as expected, but there is another record Pseudofentonia gen. Mimus from The National Checklist of Taiwan that is causing the trouble!

Not sure if there is a single systematic problem here...

mdoering commented 3 years ago

There were various reasons for keys changing. The main one was that only the primary match for a given name reused old ids; when there were several ids for the same canonical name & kingdom, the others never got used and a new id was issued instead. This is especially the case for genera which are most likely homonyms.

It was always like this, so the genusKey change is something that we should have seen in all previous builds. Nevertheless, I think I have now improved it in various places so that all old keys can be reused. Building a new nub now to see how that goes.
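
The idea of reusing every old key before minting a new one can be sketched as follows. This is only an illustration under simplifying assumptions, not the actual builder code: `assign_ids` is a hypothetical function, names are matched on canonical name and kingdom only, and a real implementation would also have to check the classification before handing a leftover id to a homonym.

```python
from collections import defaultdict


def assign_ids(old, new_names, next_new_id):
    """Assign ids to new_names, reusing old ids where possible.

    old:       list of (id, canonical, kingdom, authorship) from the last build
    new_names: list of (canonical, kingdom, authorship) in the new build
    Returns a list of ids aligned with new_names.
    """
    unused = defaultdict(list)
    for id_, canonical, kingdom, authorship in old:
        unused[(canonical, kingdom)].append((id_, authorship))

    assigned = [None] * len(new_names)

    # Pass 1: names with an exact authorship match claim their old id first.
    for i, (canonical, kingdom, authorship) in enumerate(new_names):
        pool = unused[(canonical, kingdom)]
        for j, (id_, old_author) in enumerate(pool):
            if old_author == authorship:
                assigned[i] = id_
                del pool[j]
                break

    # Pass 2: remaining names reuse any leftover id for the same canonical
    # name and kingdom; only when none is left is a new id minted.
    for i, (canonical, kingdom, _) in enumerate(new_names):
        if assigned[i] is None:
            pool = unused[(canonical, kingdom)]
            if pool:
                assigned[i] = pool.pop(0)[0]
            else:
                assigned[i] = next_new_id
                next_new_id += 1
    return assigned


# Mimus rows from the PREVIOUS table; the authorship form changed in the new build.
old = [
    (9498764, "Mimus", "Animalia", ""),
    (2494919, "Mimus", "Animalia", "Boie, 1826"),
]
new = [
    ("Mimus", "Animalia", "F.Boie, 1826"),
    ("Mimus", "Animalia", ""),
]
print(assign_ids(old, new, next_new_id=20000000))
```

Running exact matches for all names before the leftover pass keeps a name without an exact match from taking an id that another name matches exactly.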

mdoering commented 3 years ago

Well, the new code in the latest build did not change too much. The build logs track all deleted, resurrected (formerly deleted) and newly created ids.

15:48:13 UTC backbonebuild-vh ~/logs/2021-01-27 $ wc -l *.txt 
 1047570 created.txt
 1017804 deleted.txt
   84876 resurrected.txt
 2150250 total

15:48:20 UTC backbonebuild-vh ~/logs/2021-01-21 $ wc -l *.txt 
 1059908 created.txt
 1020725 deleted.txt
   74488 resurrected.txt
 2155121 total

15:48:37 UTC backbonebuild-vh ~/logs/2020-12-19 $ wc -l *.txt 
  954704 created.txt
 1014373 deleted.txt
   52038 resurrected.txt
 2021115 total

13:27:54 UTC static-vh /var/www/html/hosted-datasets.gbif.org/datasets/backbone/2019-09-06 $ wc -l *.txt
  844773 created.txt
  186188 deleted.txt
   33679 resurrected.txt
 1064640 total

13:32:49 UTC static-vh /var/www/html/hosted-datasets.gbif.org/datasets/backbone/2018-06-20/test $ wc -l *.txt
  180017 created.txt
  140324 deleted.txt
   19687 resurrected.txt
  340028 total
mdoering commented 3 years ago

The last 2019 & 2018 backbones deleted some 140 - 186 thousand IDs. This time we are about to delete over a million! Something is not right...

mdoering commented 3 years ago

In 2019 we deleted this number of taxa by rank:

CLASS 7
ORDER 5
FAMILY 67
GENUS 16.504
SPECIES 81.857
SUBSPECIES 11.426
VARIETY 6.931
FORM 2.252
UNRANKED 78.565

The latest build deletes:

PHYLUM 5
CLASS 31
ORDER 100
FAMILY 199
GENUS 8.309
SPECIES 317.426
SUBSPECIES 47.182
VARIETY 46.663
FORM 11.878
UNRANKED 633.193

The vast majority in the latest build is unranked and species. This could very well be OTUs... Lots of the higher ranks are Bacteria, which is also not surprising with GTDB added now.
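
To put a number on "the vast majority", here is a short sketch that parses the two rank lists above (which use "." as the thousands separator) and computes the share of unranked and species deletions. The helper name `parse_rank_counts` is only illustrative.

```python
def parse_rank_counts(block):
    """Parse lines like 'GENUS 16.504' where '.' is a thousands separator."""
    counts = {}
    for line in block.strip().splitlines():
        rank, number = line.split()
        counts[rank] = int(number.replace(".", ""))
    return counts


deleted_2019 = parse_rank_counts("""
CLASS 7
ORDER 5
FAMILY 67
GENUS 16.504
SPECIES 81.857
SUBSPECIES 11.426
VARIETY 6.931
FORM 2.252
UNRANKED 78.565
""")

deleted_latest = parse_rank_counts("""
PHYLUM 5
CLASS 31
ORDER 100
FAMILY 199
GENUS 8.309
SPECIES 317.426
SUBSPECIES 47.182
VARIETY 46.663
FORM 11.878
UNRANKED 633.193
""")

total_latest = sum(deleted_latest.values())
unranked_and_species = deleted_latest["UNRANKED"] + deleted_latest["SPECIES"]
share = unranked_and_species / total_latest

print(f"{unranked_and_species} of {total_latest} reported deletions "
      f"({share:.0%}) are unranked or species")
for rank, count in deleted_latest.items():
    print(f"{rank:10} {deleted_2019.get(rank, 0):>7} -> {count:>7}")
```

Unranked and species together account for about 89% of the deletions reported for the latest build.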

mdoering commented 3 years ago

The ID reporting is partly wrong in the latest builds. It reports duplicates and also IDs that have been deleted before, e.g.:

8434341 GENUS   Abramidopsis Siebold 1863
8713183 GENUS   Abramidopsis Siebold 1863
9231386 GENUS   Abramidopsis Siebold 1863
8434341 GENUS   Abramidopsis Siebold 1863
8713183 GENUS   Abramidopsis Siebold 1863
10669563        GENUS   Abramidopsis Siebold 1863

There are only 2 real ones out of the 6 reported:

https://www.gbif.org/species/8434341 https://www.gbif.org/species/8713183 https://www.gbif.org/species/9231386 https://www.gbif.org/species/10669563
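
A minimal sketch of how such a report could be cleaned up, assuming each line has the form `id rank name` separated by whitespace. The function name `distinct_report_ids` and the `already_deleted` parameter are hypothetical; which IDs were deleted in earlier builds would have to come from the previous build's data.

```python
def distinct_report_ids(report_lines, already_deleted=frozenset()):
    """Return (id, rank, name) tuples with duplicates removed, in first-seen order.

    IDs listed in already_deleted (deleted in an earlier build) are skipped.
    """
    seen = set()
    result = []
    for line in report_lines:
        if not line.strip():
            continue
        id_text, rank, name = line.split(None, 2)
        taxon_id = int(id_text)
        if taxon_id in seen or taxon_id in already_deleted:
            continue
        seen.add(taxon_id)
        result.append((taxon_id, rank, name))
    return result


report = """\
8434341 GENUS   Abramidopsis Siebold 1863
8713183 GENUS   Abramidopsis Siebold 1863
9231386 GENUS   Abramidopsis Siebold 1863
8434341 GENUS   Abramidopsis Siebold 1863
8713183 GENUS   Abramidopsis Siebold 1863
10669563        GENUS   Abramidopsis Siebold 1863
""".splitlines()

for taxon_id, rank, name in distinct_report_ids(report):
    print(taxon_id, rank, name)
```

Removing the duplicates alone leaves 4 distinct IDs out of the 6 lines; passing the previously deleted IDs as `already_deleted` would narrow that further.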

mdoering commented 3 years ago

With the OTU / unparsed name fix for stable IDs things look much better (down by 60%), but there are still more than twice as many deleted IDs as in previous builds. This still needs to be investigated:

  466248 created.txt
  411382 deleted.txt
   82654 resurrected.txt
  960284 total
mdoering commented 3 years ago

With the new fix of passing the deletion date to the stable ID generator, the number of deleted IDs drops again:

  475938 created.txt
  382481 deleted.txt
   90756 resurrected.txt
  949175 total
mdoering commented 3 years ago

With the 3-pass ID issuing we snap to the right genus keys and get this:

  472410 created.txt
  382559 deleted.txt
   94481 resurrected.txt
  949450 total
timrobertson100 commented 3 years ago

What does deleted mean here please? Does it really mean that 40% of IDs in the backbone are removed?

mdoering commented 3 years ago

It means they existed in the last version and were removed in this version. But that is not 40% of all ids: we have 6.586.621 IDs in the current version and 1.151.430 previously deleted IDs. So about 5.8% of the IDs in the current version have been deleted.

Still, 382.559 is a large number, also compared to previous deletions, which were at 140k to 180k (see above). I will investigate whether there is a clear pattern in what was removed.

mdoering commented 3 years ago

We did remove the GBIF Type Specimen dataset this time which accounted for 229,155 taxa in the last Backbone! That leaves 153.404 deletions which are in the same range as before.
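
The arithmetic behind these figures, as a short sketch using only the numbers quoted in this thread:

```python
# Numbers quoted in the comments above.
current_ids = 6_586_621         # IDs in the current backbone version
previously_deleted = 1_151_430  # IDs already deleted in earlier builds
deleted_now = 382_559           # lines in deleted.txt for the latest build
type_specimen_taxa = 229_155    # taxa from the removed GBIF Type Specimen dataset

# Share of deletions relative to the current version.
share_of_current = deleted_now / current_ids

# Share relative to every ID ever issued (current plus previously deleted).
share_of_all_issued = deleted_now / (current_ids + previously_deleted)

# Deletions left once the removed dataset is accounted for.
remaining = deleted_now - type_specimen_taxa

print(f"{share_of_current:.1%} of current IDs")
print(f"{share_of_all_issued:.1%} of all IDs ever issued")
print(f"{remaining} deletions not explained by the removed dataset")
```

The 5.8% figure corresponds to `share_of_current`, and `remaining` reproduces the 153.404 deletions mentioned above.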

mdoering commented 3 years ago
PHYLUM 5
CLASS 31
ORDER 101
FAMILY 155
GENUS 5.949
SPECIES 268.647
SUBSPECIES 44.549
VARIETY 43.573
FORM 11.372
UNRANKED 52.726
mdoering commented 3 years ago

Unranked contains both BOLD (3736) and SH (141) ids, but the vast majority (48.862) are basionym placeholder names with a question mark in front, like ? welleri Girty 1909. It looks like we can easily improve the id matching there, but it should not affect occurrence matching, as I doubt any of those are used.

mdoering commented 3 years ago

Indeed, some IDs were not kept stable among those basionym placeholder names:

NEW BUILD:
clb=> select id,rank,status,origin,del,scientific_name from nub2 where scientific_name ~ '^\? aberrans' order by scientific_name limit 10;
    id    |   rank   |      status       |        origin        | del |       scientific_name        
----------+----------+-------------------+----------------------+-----+------------------------------
  8515032 | UNRANKED | HOMOTYPIC_SYNONYM | BASIONYM_PLACEHOLDER | t   | ? aberrans Dautzenberg, 1910
 10958716 | UNRANKED | DOUBTFUL          | BASIONYM_PLACEHOLDER | f   | ? aberrans Fontaine, 1961
  9618184 | UNRANKED | DOUBTFUL          | BASIONYM_PLACEHOLDER | t   | ? aberrans Fontaine, 1961
 10973045 | UNRANKED | DOUBTFUL          | BASIONYM_PLACEHOLDER | f   | ? aberrans Gravenhorst, 1829
  8398445 | UNRANKED | SYNONYM           | BASIONYM_PLACEHOLDER | t   | ? aberrans Gravenhorst, 1829
 10939764 | UNRANKED | DOUBTFUL          | BASIONYM_PLACEHOLDER | f   | ? aberrans Koken, 1889
  9418777 | UNRANKED | DOUBTFUL          | BASIONYM_PLACEHOLDER | t   | ? aberrans Koken, 1889
 10995018 | UNRANKED | DOUBTFUL          | BASIONYM_PLACEHOLDER | f   | ? aberrans Kossmat, 1895
  9528308 | UNRANKED | DOUBTFUL          | BASIONYM_PLACEHOLDER | t   | ? aberrans Kossmat, 1895
  9477272 | UNRANKED | DOUBTFUL          | BASIONYM_PLACEHOLDER | t   | ? aberrans Martynov, 1938
(10 rows)

OLD BUILD:
clb_src=> select id,rank,status,origin,del,scientific_name from nub2 where scientific_name ~ '^\? aberrans' order by scientific_name limit 10;
    id    |   rank   |      status       |        origin        | del |           scientific_name           
----------+----------+-------------------+----------------------+-----+-------------------------------------
  8515032 | UNRANKED | HOMOTYPIC_SYNONYM | BASIONYM_PLACEHOLDER | t   | ? aberrans Dautzenberg, 1910
  9618184 | UNRANKED | DOUBTFUL          | BASIONYM_PLACEHOLDER | f   | ? aberrans Fontaine, 1961
  8398445 | UNRANKED | SYNONYM           | BASIONYM_PLACEHOLDER | f   | ? aberrans Gravenhorst, 1829
  9418777 | UNRANKED | DOUBTFUL          | BASIONYM_PLACEHOLDER | f   | ? aberrans Koken, 1889
  9528308 | UNRANKED | DOUBTFUL          | BASIONYM_PLACEHOLDER | f   | ? aberrans Kossmat, 1895
  9477272 | UNRANKED | DOUBTFUL          | BASIONYM_PLACEHOLDER | f   | ? aberrans Martynov, 1938
  9516073 | UNRANKED | DOUBTFUL          | BASIONYM_PLACEHOLDER | f   | ? aberrans Meunier, 1904
 10169333 | UNRANKED | DOUBTFUL          | BASIONYM_PLACEHOLDER | f   | ? aberrans Miller & Unklesbay, 1942
  7535769 | UNRANKED | HOMOTYPIC_SYNONYM | BASIONYM_PLACEHOLDER | t   | ? aberrans Oudemans, 1932
  7858545 | UNRANKED | DOUBTFUL          | BASIONYM_PLACEHOLDER | t   | ? aberrans Peck
(10 rows)
mdoering commented 3 years ago

The 4 largest genera that still have this issue are in Aves, of course:

Nycticorax should be Nycticorax T.Forster, which is correct in this nub but was wrong in our previous one! So the genusKey change makes sense here, as we actually changed the authorship for the better.

Numenius should be Numenius Brisson, 1760 which we have in the latest build. It was wrong with Numenius Moehring, 1758 from COL in the previous!

Tadorna should be Tadorna F.Boie also good. It was wrongly Tadorna Von Oken before.

Phalaropus should be Phalaropus Brisson, 1760, good. Was badly Phalaropus Latham before!

=> All changes make sense and are for the better!

mdoering commented 3 years ago

Same with Elymus (plants).

Pteridium (fern) doesn't look great though:

COL has it like it is in the latest build. The type genus of Pteridaceae is not Pteridium as one could think, but Pteris which is still in Pteridaceae.

The previous Pteridium (Dennstaedtiaceae) & Pteridium Gleditsch. should be the same. The new build uses Pteridium Gled. ex Scop. instead of Gleditsch. Maybe the author comparison did not work here, as we get a different ID.

Not ideal, but better than before I would say. NOW:

clb=> select id,rank,status,scientific_name,size,family,phylum,kingdom from nubcl where scientific_name ~ '^Pteridium' and rank ='GENUS' order by kingdom, scientific_name;
   id    | rank  |  status  |             scientific_name              | size |      family      |    phylum    | kingdom  
---------+-------+----------+------------------------------------------+------+------------------+--------------+----------
 6786839 | GENUS | DOUBTFUL | Pteridium De Filippi & Vérany, 1857      |    0 | Bythitidae       | Chordata     | Animalia
 3243087 | GENUS | SYNONYM  | Pteridium Gürich, 1930                   |    0 | Pteridiniidae    |              | Animalia
 2661727 | GENUS | SYNONYM  | Pteridium (Kützing, 1843) J.Agardh, 1898 |    0 | Delesseriaceae   | Rhodophyta   | Plantae
 5275011 | GENUS | ACCEPTED | Pteridium Gled. ex Scop.                 |   39 | Dennstaedtiaceae | Tracheophyta | Plantae
 8091908 | GENUS | DOUBTFUL | Pteridium Gleditsch                      |    0 | Lindsaeaceae     | Tracheophyta | Plantae
 6008853 | GENUS | SYNONYM  | Pteridium Raf.                           |    0 | Pteridaceae      | Tracheophyta | Plantae
(6 rows)

WAS:

clb_src=> select id,rank,status,scientific_name,size,family,phylum,kingdom from nubcl where scientific_name ~ '^Pteridium' and rank ='GENUS' order by kingdom, scientific_name;
    id    | rank  |  status  |           scientific_name           | size |      family      |    phylum    | kingdom  
----------+-------+----------+-------------------------------------+------+------------------+--------------+----------
  6786839 | GENUS | DOUBTFUL | Pteridium De Filippi & Vérany, 1857 |    0 | Bythitidae       | Chordata     | Animalia
  3243087 | GENUS | SYNONYM  | Pteridium Gürich, 1930              |    0 | Pteridiniidae    |              | Animalia
  2345776 | GENUS | SYNONYM  | Pteridium Scopoli, 1777             |    0 | Bramidae         | Chordata     | Animalia
 10634127 | GENUS | DOUBTFUL | Pteridium                           |    2 | Dennstaedtiaceae | Tracheophyta | Plantae
  8091908 | GENUS | ACCEPTED | Pteridium Gleditsch.                |   41 | Dennstaedtiaceae | Tracheophyta | Plantae
  2661727 | GENUS | DOUBTFUL | Pteridium J.Agardh, 1898            |    0 | Delesseriaceae   | Rhodophyta   | Plantae
  6008853 | GENUS | SYNONYM  | Pteridium Raf.                      |    0 | Pteridaceae      | Tracheophyta | Plantae
(7 rows)

There were also 3 more deleted records before, one of which carries the ID for the newly accepted Pteridium Gled. ex Scop.:

clb_src=> select id,rank,status,origin,del,scientific_name from nub2 where scientific_name ~ '^Pteridium' and rank ='GENUS' order by id;
    id    | rank  |  status  |    origin     | del |           scientific_name           
----------+-------+----------+---------------+-----+-------------------------------------
  2345776 | GENUS | SYNONYM  | SOURCE        | f   | Pteridium Scopoli, 1777
  2661727 | GENUS | DOUBTFUL | SOURCE        | f   | Pteridium J.Agardh, 1898
  3243087 | GENUS | SYNONYM  | SOURCE        | f   | Pteridium Gürich, 1930
  5275011 | GENUS | ACCEPTED | SOURCE        | t   | Pteridium Gled. ex Scop.
  6008853 | GENUS | SYNONYM  | SOURCE        | f   | Pteridium Raf.
  6786839 | GENUS | DOUBTFUL | SOURCE        | f   | Pteridium De Filippi & Vérany, 1857
  8091908 | GENUS | ACCEPTED | SOURCE        | f   | Pteridium Gleditsch.
  9744831 | GENUS | ACCEPTED | SOURCE        | t   | Pteridium
  9794466 | GENUS | DOUBTFUL | SOURCE        | t   | Pteridium
 10634127 | GENUS | DOUBTFUL | IMPLICIT_NAME | f   | Pteridium
(10 rows)