gbif / name-parser

The core GBIF scientific name parser library
Apache License 2.0

Avoid canonical parsing to just "spec." #6

Open mdoering opened 6 years ago

mdoering commented 6 years ago

While investigating gbif/checklistbank#45 we found 693,344 unique name records in CLB that all have the canonical name "spec.". An initial scan of the names suggests this is both a name parsing issue and a problem with how the canonical name is built.

Here are a few:

prod_checklistbank=> SELECT * FROM name n WHERE lower(n.canonical_name) = lower('spec.') limit 50;
   id    |                  scientific_name                  | canonical_name |   type   | genus_or_above | infra_generic | specific_epithet | infra_specific_epithet | cultivar_epithet | notho_type | authors_parsed | autho
---------+---------------------------------------------------+----------------+----------+----------------+---------------+------------------+------------------------+------------------+------------+----------------+------
 4912135 | Gyrocarpus sp. Chase 317                          | spec.          | INFORMAL | Gyrocarpus     | Chase         |                  |                        |                  |            | f              |      
 4912136 | Gyrocarpus sp. DES-2011                           | spec.          | INFORMAL | Gyrocarpus     | Des-          |                  |                        |                  |            | t              |      
 4912338 | Gyrodactylus sp. AGV-2009a                        | spec.          | INFORMAL | Gyrodactylus   | Agv-          |                  |                        |                  |            | t              |      
 4912339 | Gyrodactylus sp. AGV-2009b                        | spec.          | INFORMAL | Gyrodactylus   | Agv-          |                  |                        |                  |            | t              |      
 4912343 | Gyrodactylus sp. Chile                            | spec.          | INFORMAL | Gyrodactylus   | Chile         |                  |                        |                  |            | t              |      
 4912363 | Gyrodactylus sp. HSS-2009                         | spec.          | INFORMAL | Gyrodactylus   | Hss-          |                  |                        |                  |            | t              |      
 4912385 | Gyrodactylus sp. Ladoga                           | spec.          | INFORMAL | Gyrodactylus   | Ladoga        |                  |                        |                  |            | t              |      
 4912386 | Gyrodactylus sp. Ladoga x Gyrodactylus pannonicus | spec.          | HYBRID   |                |               |                  |                        |                  |            | f              |      
 4912388 | Gyrodactylus sp. MBS-2014                         | spec.          | INFORMAL | Gyrodactylus   | Mbs-          |                  |                        |                  |            | t              |      
 4912390 | Gyrodactylus sp. MPV-2015                         | spec.          | INFORMAL | Gyrodactylus   | Mpv-          |                  |                        |                  |            | t              |      
 4912394 | Gyrodactylus sp. NKA-2015                         | spec.          | INFORMAL | Gyrodactylus   | Nka-          |                  |                        |                  |            | t              |      
 4912396 | Gyrodactylus sp. North Sea                        | spec.          | INFORMAL | Gyrodactylus   | North         |                  |                        |                  |            | t              | Sea  
 4912398 | Gyrodactylus sp. Norway-HH-2003                   | spec.          | INFORMAL | Gyrodactylus   | Norway-       |                  |                        |                  |            | f              |      
 4912407 | Gyrodactylus sp. Poland-MZ-2003                   | spec.          | INFORMAL | Gyrodactylus   | Poland-       |                  |                        |                  |            | f              |      
 4912413 | Gyrodactylus sp. Zimbabwe                         | spec.          | INFORMAL | Gyrodactylus   | Zimbabwe      |                  |                        |                  |            | t              |      
 4912428 | Gyrodactylus spec. Nordmann, 1832                 | spec.          | INFORMAL | Gyrodactylus   | Nordmann      |                  |                        |                  |            | t              |      
 4912565 | Gyrocotyle sp. Tasmania                           | spec.          | INFORMAL | Gyrocotyle     | Tasmania      |                  |                        |                  |            | t              |      
 4913863 | Gyrodinium sp. GeoB 231                           | spec.          | INFORMAL | Gyrodinium     | Geo           |                  |                        |                  |            | f              |      
 4913996 | Gyromitra sp. Gyr3                                | spec.          | INFORMAL | Gyromitra      | Gyr           |                  |                        |                  |            | f              |      
 4914171 | Gyrodactylus pannonicus X Gyrodactylus sp. Ladoga | spec.          | HYBRID   |                |               |                  |                        |                  |            | f              |      
 4914296 | Gyrodactylus pomeraniae x Gyrodactylus lavareti   | spec.          | HYBRID   |                |               |                  |                        |                  |            | f              |      
 4914360 | Gyroneuron sp. BOLD:AAI1989                       | spec.          | INFORMAL | Gyroneuron     | Bold          |                  |                        |                  |            | t              | Aai  
 4914380 | Gyroneuronella sp. AZR-2008                       | spec.          | INFORMAL | Gyroneuronella | Azr-          |                  |                        |                  |            | t              |      
 4914558 | Gyrodontium sp. BAB-5180                          | spec.          | INFORMAL | Gyrodontium    | Bab-          |                  |                        |                  |            | f              |      
 4915887 | Gyrostemon sp. Cranfield 02068672                 | spec.          | INFORMAL | Gyrostemon     | Cranfield     |                  |                        |                  |            | f              |      
 4916067 | Gyroporus sp. AWW-2009a                           | spec.          | INFORMAL | Gyroporus      | Aww-          |                  |                        |                  |            | t              |      
 4916070 | Gyroporus sp. Arora 00-429                        | spec.          | INFORMAL | Gyroporus      | Arora         |                  |                        |                  |            | f              |      
 4916072 | Gyroporus sp. Arora00-429                         | spec.          | INFORMAL | Gyroporus      | Arora         |                  |                        |                  |            | f              |      
 4916945 | Gyrovirus 4                                       | spec.          | VIRUS    |                |               |                  |                        |                  |            | f              |      
 4916951 | Gyrovirus GyV3                                    | spec.          | VIRUS    |                |               |                  |                        |                  |            | f              |      
 4916954 | Gyrovirus GyV7-SF                                 | spec.          | VIRUS    |                |               |                  |                        |                  |            | f              |      
 4916956 | Gyrovirus GyV8                                    | spec.          | VIRUS    |                |               |                  |                        |                  |            | f              |      
 4916959 | Gyrovirus GyV9                                    | spec.          | VIRUS    |                |               |                  |                        |                  |            | f              |      
 4916962 | Gyrovirus Tu243                                   | spec.          | VIRUS    |                |               |                  |                        |                  |            | f              |      
 4916965 | Gyrovirus Tu789                                   | spec.          | VIRUS    |                |               |                  |                        |                  |            | f              |      
 4916977 | Gyrovirus: Chicken anemia virus ICTV              | spec.          | VIRUS    |                |               |                  |                        |                  |            | f              |      
 4916980 | Gyrovirus: chicken anemia virus Ictv              | spec.          | VIRUS    |                |               |                  |                        |                  |            | f              |      
 4917282 | Gyrophyllum sp. NTM-C014392                       | spec.          | INFORMAL | Gyrophyllum    | Ntm-          |                  |                        |                  |            | f              |      
 4917368 | Gyrtona sp. BOLD:AAI6410                          | spec.          | INFORMAL | Gyrtona        | Bold          |                  |                        |                  |            | f              |      
 4917371 | Gyrtona sp. Gyrt                                  | spec.          | INFORMAL | Gyrtona        | Gyrt          |                  |                        |                  |            | t              |      
 4917446 | Gyrtothripa sp. BOLD:AAH4619                      | spec.          | INFORMAL | Gyrtothripa    | Bold          |                  |                        |                  |            | f              |      
 4917650 | H-1 parvovirus                                    | spec.          | VIRUS    |                |               |                  |                        |                  |            | f              |      
 4917660 | H-Pelican lacZ transformation vector              | spec.          | VIRUS    |                |               |                  |                        |                  |            | f              |      
 4917662 | H-Stinger GFP transformation vector               | spec.          | VIRUS    |                |               |                  |                        |                  |            | f              |      
 4918192 | HCBI8.215 virus                                   | spec.          | VIRUS    |                |               |                  |                        |                  |            | f              |      

The full list of all spec. names is attached.

mdoering commented 6 years ago

Many of those names are from NCBI. "Genus sp. XYZ" is a very common structure that we should detect and mark.
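To illustrate the kind of detection meant here, below is a minimal sketch in Java. It is not the name-parser's actual code: the class `InformalNameDetector`, the holder `InformalName`, and the regular expression are all hypothetical and only cover the simple "Genus sp. XYZ" / "Genus spec. XYZ" shapes seen in the sample rows above. Hybrid formulas are skipped on the assumption that they need separate handling.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of gbif/name-parser. */
final class InformalNameDetector {

    /** Hypothetical holder for the genus and the free-text phrase after "sp."/"spec.". */
    static final class InformalName {
        final String genus;
        final String phrase;

        InformalName(String genus, String phrase) {
            this.genus = genus;
            this.phrase = phrase;
        }
    }

    // A capitalised genus, then "sp" or "spec" with an optional dot,
    // then an optional free-text remainder (strain, locality, voucher, ...).
    private static final Pattern GENUS_SP = Pattern.compile(
            "^\\s*([A-Z][a-z]+)\\s+sp(?:ec)?\\.?(?:\\s+(.+?))?\\s*$");

    /** Returns the genus and verbatim phrase if the name looks like "Genus sp. XYZ". */
    static Optional<InformalName> detect(String name) {
        if (name == null) {
            return Optional.empty();
        }
        // Assumption: hybrid formulas such as "A sp. X x B y" are handled elsewhere.
        if (name.contains(" x ") || name.contains(" X ") || name.contains(" × ")) {
            return Optional.empty();
        }
        Matcher m = GENUS_SP.matcher(name);
        if (!m.matches()) {
            return Optional.empty();
        }
        String phrase = m.group(2) == null ? "" : m.group(2);
        return Optional.of(new InformalName(m.group(1), phrase));
    }

    public static void main(String[] args) {
        String[] samples = {
            "Gyrocarpus sp. Chase 317",
            "Gyrodactylus spec. Nordmann, 1832",
            "Gyrodactylus sp. Ladoga x Gyrodactylus pannonicus",
            "Gyrovirus 4"
        };
        for (String s : samples) {
            String result = detect(s)
                    .map(n -> n.genus + " | " + n.phrase)
                    .orElse("(not a Genus sp. XYZ name)");
            System.out.println(s + " -> " + result);
        }
    }
}
```

Keeping the phrase verbatim, rather than splitting it into an infrageneric name or an author, would let a canonical name such as "Gyrocarpus sp. Chase 317" be rebuilt instead of collapsing every such record to "spec.". Whether the parser should expose it this way is a design question for this issue, not something the sketch settles.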