gbif / backbone-feedback

Capitalizing the first letter in a name leads to 25K - 30K more unique name strings matching to the backbone #236

Open jhnwllr opened 2 years ago

jhnwllr commented 2 years ago

For example,

Simply changing "acerates viridiflora (raf.) eaton" (https://api.gbif.org/v1/species/match?name=acerates%20viridiflora%20(raf.)%20eaton) to "Acerates viridiflora (raf.) eaton" (https://api.gbif.org/v1/species/match?name=Acerates%20viridiflora%20(raf.)%20eaton) leads to an exact match.
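The comparison above could be scripted roughly as follows. This is a minimal sketch, not GBIF's own tooling: the helper names `capitalize_first`, `match_url`, and `match_type` are hypothetical, and it assumes the JSON response carries the `matchType` field that is referred to later in this thread.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Endpoint taken from the URLs quoted in this issue.
MATCH_ENDPOINT = "https://api.gbif.org/v1/species/match"


def capitalize_first(name: str) -> str:
    """Upper-case only the first character and leave the rest untouched.

    str.capitalize() is avoided because it would also lower-case the
    remainder of the string (e.g. "rasahus SpMEL" -> "Rasahus spmel").
    """
    return name[:1].upper() + name[1:]


def match_url(name: str) -> str:
    """Build the match URL; parentheses stay literal, as in the URLs above."""
    return f"{MATCH_ENDPOINT}?name={quote(name, safe='()')}"


def match_type(name: str, timeout: float = 10.0) -> str:
    """Query the matcher (network call) and return the assumed matchType field."""
    with urlopen(match_url(name), timeout=timeout) as response:
        return json.load(response).get("matchType", "NONE")
```

Calling `match_type(name)` and `match_type(capitalize_first(name))` for the same string should reproduce the two responses linked above, network access permitting.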

Running all of GBIF's unique higherrank-flagged v_scientificname values whose first letter is not capitalized (with a few other filters) through the name matcher produces the following table.

| v_scientificname (first letter not capitalized in original) | first letter capitalized | unique name count | interpretation |
| --- | --- | --- | --- |
| HIGHERRANK | EXACT | 25115 | 25K names moved to exact match from higherrank with capitalization of first letter in name |
| HIGHERRANK | FUZZY | 5041 | 5K names moved to FUZZY match with capitalization of first letter in name |
| HIGHERRANK | HIGHERRANK | 68150 | makes no difference |
| EXACT | EXACT | 3230 | makes no difference |
| HIGHERRANK | NONE | 136 | |
| EXACT | HIGHERRANK | 7 | 7 names moved to higherrank from exact with capitalization of first letter in name |
| NONE | HIGHERRANK | 7 | |
| NONE | EXACT | 1 | |
| NONE | FUZZY | 3 | |
| FUZZY | FUZZY | 257 | |
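A table like this can be produced by matching each unique name twice and counting the (before, after) pairs. The sketch below only illustrates that tallying step; `tally_transitions` is a hypothetical helper, and `fake_matcher` is a stand-in for the real name matcher so the example runs without network access.

```python
from collections import Counter
from typing import Callable, Iterable


def tally_transitions(names: Iterable[str], matcher: Callable[[str], str]) -> Counter:
    """Count (matchType before, matchType after) pairs over unique names."""
    counts: Counter = Counter()
    for name in set(names):
        capped = name[:1].upper() + name[1:]  # capitalize the first letter only
        counts[(matcher(name), matcher(capped))] += 1
    return counts


def fake_matcher(name: str) -> str:
    """Stand-in for the real matcher, only here to keep the sketch runnable."""
    return "EXACT" if name[:1].isupper() else "HIGHERRANK"


sample = ["lycopodium serpens", "stelis tridentata", "Arabis grandiflora"]
for (before, after), n in tally_transitions(sample, fake_matcher).most_common():
    print(before, after, n)
```

Passing a function that calls the real matcher in place of `fake_matcher` would yield counts in the same shape as the table.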

269,924 occurrences flagged as HIGHERRANK get moved to EXACT matchType when simply capitalizing the first letter of the name. 17,964 occurrences flagged as HIGHERRANK get moved to FUZZY matchType.

It appears that capitalizing the first letter of v_scientificname by default might lead to many more matches to the GBIF backbone.
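To put the unique-name counts in proportion, the HIGHERRANK rows of the table above can be summed directly. This is plain arithmetic on the numbers already reported; the variable names are only for illustration.

```python
# (matchType before, matchType after capitalization) -> unique name count,
# copied from the HIGHERRANK rows of the table above.
higherrank_rows = {
    ("HIGHERRANK", "EXACT"): 25115,
    ("HIGHERRANK", "FUZZY"): 5041,
    ("HIGHERRANK", "HIGHERRANK"): 68150,
    ("HIGHERRANK", "NONE"): 136,
}

total = sum(higherrank_rows.values())
moved = higherrank_rows[("HIGHERRANK", "EXACT")] + higherrank_rows[("HIGHERRANK", "FUZZY")]
share = moved / total

print(f"{moved} of {total} HIGHERRANK names move to EXACT or FUZZY ({share:.1%})")
# -> 30156 of 98442 HIGHERRANK names move to EXACT or FUZZY (30.6%)
```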

Some examples in the table below.

| v_scientificname | api_response |
| --- | --- |
| brugmansia arborea (l.) steud. | link |
| elephantopus scaber var. tomentosus (l.) sch. bip. | link |
| nicotiana langsdorffii weinmann | link |
| conceveiba ptariana (steyerm.) jabl. | link |
| thelypteris gigantea (mett.) tryon | link |
| cinnamomum stenophyllum (meisn.) vattimo | link |
| cassia cathartica var. paucijuga h. s. irwin & barneby | link |
| mimosa regnellii var. grossiseta | link |
| camponotus yogi_cf | link |
| tetramorium mgm39 | link |
| lastreopsis effusa (sw.) tindale. | link |
| tetramorium fhg063 | link |
| acianthera aveniformis (hoehne) c. n. gonç. & waechter | link |
| lycopodium serpens | link |
| camponotus afr011 | link |
| goeppertia grandis (petersen) borchsenius & suárez | link |
| wissadula gymnanthemum var. subtomentosa r.e.fr. | link |
| bulbostylis capillaris var. tenuifolia (rudge) c.b. clarke | link |
| pollalesta niceforoi (cuatrec.) aristeg. | link |
| coussarea leptoloba (benth. & hook.f.) müll.arg. | link |
| eleocharis interstincta (vahl) r. & s. | link |
| cuphea ferrisiae var. rosea s.a. graham | link |
| cuphea parietarioides (a.st.-hil.) koehne | link |
| sida alpestris a.st.-hil. | link |
| bulbostylis scabra cogn. | link |
| rasahus SpMEL | link |
| duranta buxifolia | link |
| aphaenogaster umphreyi | link |
| hymenostegia floribunda (benth.) harms | link |
| bursera lancifolia (schltdl.) engl. | link |
| anemopaegma cf. chamberlaynii (sims) bureau & k.schum. | link |
| justicia procumbens l. | link |
| myracrodruon urundeuva allemão & m.allemão | link |
| stromanthe schottiana (koern.)eichler | link |
| cuphea arenarioides var. muscosa | link |
| candidate division_WS6_bacterium_GW2011_GWF2_39_15 | link |
| triumfetta nemoralis a. st.-hil. | link |
| colanthelia cingulata (mcclure & l.b.smith) mcclure | link |
| hypoponera cg01 | link |
| hypoponera ug12 | link |
| cryptantha confertiflora (greene) pays. | link |
| gamma proteobacterium_symbiont_of_Piezodorus_hybneri | link |
| pothomorphe umbellata (l.)miq. | link |
| protium confusum (rose) pittier | link |
| irlbachia alata (aubl) maas | link |
| pueraria phaseoloides (roxb.)benth. | link |
| arabis grandiflora | link |
| leandra trauninensis var. major | link |
| chamaecrista glandulosa var. brasiliensis (vogelvogelvogelvogel) irwinirwinirwinirwin & barneby & barn | link |
| rhododendron viscosum var. serrulatum (small) ahles | link |
| ocotea lobbii (meissn.)rower | link |
| anochetus ug05 | link |
| pheidole fae01 | link |
| pheidole gf037 | link |
| solenopsis jtl003 | link |
| pieris napi/balcana | link |
| camponotus jdm1158_rubiginosus_group | link |
| hornungia petraea (l.) rchb. | link |
| stelis tridentata | link |
| eurhopalothrix id03 | link |
| herposiphonia tenella (agardh, c.) . | link |
| melpomene moniliformis var. adnata (lag. ex sw.) a.r.sm. & r.c.moran; (kunze) m.lehnert | link |
| baccharis trinervis (lam) pers. | link |
| convolvulus fastigiatus | link |

@mdoering

dshorthouse commented 2 years ago

A very large source of these is AntWeb, https://www.gbif.org/dataset/13b70480-bd69-11dd-b15f-b8a03c50a862. Has anyone ever reached out to Brian Fisher and Jon Fong to inquire why so many of their scientific names are expressed like this https://www.gbif.org/occurrence/3501417310?

jhnwllr commented 2 years ago

These are the top datasets that do not capitalize the first letter of their v_scientificname, which leads to higherrank matches to the GBIF backbone.

| datasettitle | num unique names | num occ | link |
| --- | --- | --- | --- |
| MBM herbarium - Museu Botânico Municipal \ Curitiba - Herbário Virtual REFLORA | 3970 | 14626 | link |
| AntWeb | 3108 | 33319 | link |
| B herbarium - Botanischer Garten und Botanisches Museum Berlin-Dahlem Herbarium - Herbário Virtual REFLORA | 1695 | 3168 | link |
| CEPEC herbarium - Centro de Pesquisas do Cacau - Herbário Virtual REFLORA | 1644 | 5201 | link |
| P herbarium - Muséum national d’histoire naturelle, Paris - Amostras Brasileiras Repatriadas - Herbário Virtual REFLORA | 1509 | 4768 | link |
| IAN herbarium - Embrapa Amazônia Oriental - Herbário Virtual REFLORA | 982 | 3198 | link |
| NY herbarium - The New York Botanical Garden - Amostras Brasileiras Repatriadas - Herbário Virtual REFLORA | 852 | 4695 | link |
| US herbarium - Smithsonian Institute - Amostras Brasileiras Repatriadas - Herbário Virtual REFLORA | 815 | 3042 | link |
| ALCB herbarium - Universidade Federal da Bahia - Herbário Virtual REFLORA | 684 | 2929 | link |
| SINGER Coordinator | 557 | 62543 | link |
| HUCP herbarium - Pontífica Universidade Católica do Paraná PUC - Herbário Virtual Reflora | 513 | 2976 | link |
| ESA herbarium - Universidade de São Paulo - Herbário Virtual REFLORA | 510 | 1398 | link |
| MG herbarium - Museu Paraense Emílio Goeldi - Herbário Virtual REFLORA | 465 | 1742 | link |
| SPF herbarium - Universidade de são Paulo - Herbário Virtual REFLORA | 437 | 2499 | link |
| MO herbarium - Missouri Botanical Garden - Amostras Brasileiras Repatriadas - Herbário Virtual REFLORA | 421 | 2840 | link |
| CEN herbarium - Embrapa Recursos Genéticos e Biotecnologia - Herbário Virtual REFLORA | 394 | 1915 | link |
| HCF herbarium - Universidade Tecnológica Federal do Paraná - Campus Campo Mourão - Herbário Virtual REFLORA | 382 | 2369 | link |
| HUEFS herbarium - Universidade Estadual de Feira de Santana - Herbário Virtual REFLORA | 377 | 1394 | link |
| MBML Herbarium - Museu de Biologia Mello Leitão - Herbário Virtual Reflora | 370 | 1866 | link |
| ICN herbarium - Universidade Federal do Rio Grande Do Sul - Herbário Virtual REFLORA | 354 | 1607 | link |
| K herbarium - Royal Botanic Gardens, Kew - Amostras Brasileiras Repatriadas - Herbário Virtual REFLORA | 342 | 1107 | link |
| GH herbarium - Harvard University Herbarium - Herbário Virtual Reflora | 330 | 442 | link |
| The System-wide Information Network for Genetic Resources (SINGER) | 324 | 50306 | link |
| S herbarium - Naturhistoriska Riksmuseet - Amostras Brasileiras Repatriadas - Herbário Virtual REFLORA | 273 | 1480 | link |
| W herbarium - Naturhistorisches Museum Wien - Amostras Brasileiras Repatriadas - Herbário Virtual REFLORA | 269 | 1150 | link |
| Marine metagenomes Metagenome | 243 | 88447 | link |
| marine metagenome Metagenome | 187 | 78804 | link |
| INSDC Environment Sample Sequences | 166 | 733 | link |
| UPCB herbarium - Universidade Federal de Paraná - Herbário Virtual REFLORA | 165 | 573 | link |
| PC herbarium - Cryptogamy Collection at the Muséum National d'Histoire Naturelle - Herbário Virtual REFLORA | 163 | 454 | link |
| SJRP herbarium - Universidade Estadual Paulista Júlio de Mesquita Filho - Herbário Virtual REFLORA | 152 | 662 | link |
| INSDC Sequences | 142 | 226 | link |
| EAC herbarium - Universidade Federal do Ceará - Herbário Virtual REFLORA | 127 | 783 | link |
| HEPH herbarium - Jardim Botânico de Brasília - Herbário Virtual REFLORA | 124 | 684 | link |
| International Barcode of Life project (iBOL) | 119 | 271 | link |
| Nahant Collection | 109 | 56222 | link |
| sediment metagenome Metagenome | 105 | 59980 | link |
| Microbial communities associated with Eurasian watermilfoil, water and sediment | 102 | 55052 | link |
| HUFU herbarium - Universidade Federal de Uberlândia - Herbário Virtual REFLORA | 100 | 416 | link |
| Temporal effect of plant diversity and oiling on nitrogen cycling in marsh sediments | 97 | 53601 | link |
| Sediment Metagenome Raw sequence reads | 97 | 56217 | link |
| HUEM herbarium - Universidade Estadual de Maringá - Herbário Virtual REFLORA | 92 | 332 | link |
| Soil marker gene sequences across the Nutrient Network | 92 | 53167 | link |
| HUEMG herbarium - Universidade do Estado de Minas Gerais - Campus Carangola - Herbário Virtual REFLORA | 91 | 223 | link |
| Microbial community structure is affected by cropping sequences and bio-covers under long-term no-tillage | 87 | 50103 | link |
| Response of soil bacteria to anthropogenic soil variables at large spatial scales | 87 | 51489 | link |
| Soil microbial distribution | 85 | 51243 | link |
| RB - Rio de Janeiro Botanical Garden Herbarium Collection | 83 | 109 | link |
| UB herbarium - Universidade de Brasília - Herbário Virtual REFLORA | 82 | 352 | link |
| Abundance, diversity and distribution of Legionellales in wet environments in Sweden | 81 | 48849 | link |

jhnwllr commented 2 years ago

Contacted AntWeb by email.

I also noticed that many AntWeb names are corrupted with an ID number or some other suffix, so the number of fixable names is probably fewer than 300.
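One rough way to separate such names from the fixable ones is a pattern check on the tokens after the genus. The rule below (any digit or underscore marks a code) is an assumption made for illustration, not a rule used by AntWeb or GBIF, and `looks_like_code_name` is a hypothetical helper. The sample names come from the example table earlier in this thread.

```python
import re

# Heuristic (an assumption): a token after the genus that contains a digit
# or an underscore is treated as a collection code, not an epithet.
CODE_TOKEN = re.compile(r"[0-9_]")


def looks_like_code_name(name: str) -> bool:
    """Return True if any token after the first looks like a code."""
    tokens = name.split()
    return any(CODE_TOKEN.search(token) for token in tokens[1:])


examples = [
    "tetramorium mgm39",
    "camponotus yogi_cf",
    "aphaenogaster umphreyi",
    "hypoponera cg01",
]
fixable = [name for name in examples if not looks_like_code_name(name)]
print(fixable)  # -> ['aphaenogaster umphreyi']
```

The check is deliberately crude: a name such as "rasahus SpMEL" has no digit or underscore and would slip through, so the result is at best an upper bound on fixable names.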

jhnwllr commented 2 years ago

Contacted Reflora network.

jhnwllr commented 2 years ago

These are the remaining datasets that have this issue. Almost all of them are from MGnify. @thomasstjerne

| datasettitle | publishingOrganizationTitle | n_names | n_occ | link |
| --- | --- | --- | --- | --- |
| SINGER Coordinator | Bioversity International | 76 | 8050 | link |
| The System-wide Information Network for Genetic Resources (SINGER) | Bioversity International | 61 | 11738 | link |
| Marine metagenomes Metagenome | MGnify | 55 | 18531 | link |
| marine metagenome Metagenome | MGnify | 52 | 20659 | link |
| Temporal effect of plant diversity and oiling on nitrogen cycling in marsh sediments | MGnify | 30 | 11598 | link |
| Microbial community structure is affected by cropping sequences and bio-covers under long-term no-tillage | MGnify | 28 | 10923 | link |
| Soil marker gene sequences across the Nutrient Network | MGnify | 27 | 12088 | link |
| Response of soil bacteria to anthropogenic soil variables at large spatial scales | MGnify | 24 | 10644 | link |
| Nahant Collection | MGnify | 24 | 9967 | link |
| Bacerial and archaeal diversity in Central Park | MGnify | 23 | 10641 | link |
| sediment metagenome Metagenome | MGnify | 23 | 10000 | link |
| Organic agricultural field soil Raw sequence reads | MGnify | 23 | 8976 | link |
| estuary metagenome Raw sequence reads | MGnify | 23 | 9989 | link |
| Soil microbial distribution | MGnify | 23 | 9670 | link |
| Sediment Metagenome Raw sequence reads | MGnify | 23 | 8819 | link |
| MeHg Coal Ash | MGnify | 21 | 7038 | link |
| Who eats the tough stuff DNA stable isotope probing (SIP) of bacteria and fungi degrading 13C-labelled lignin and cellulose in forest soils | MGnify | 20 | 9882 | link |
| Microbial communities associated with Eurasian watermilfoil, water and sediment | MGnify | 20 | 8037 | link |
| Sediment surface microbial community changes due to phytoplankton addition | MGnify | 20 | 8128 | link |
| marine metagenome Metagenome | MGnify | 20 | 9387 | link |
| Bacteria exposure experiment North Sea | MGnify | 20 | 8945 | link |
| Comparative 16S analysis of hydrothermal vent samples from the Mid-Atlantic Ridge (MAR) | MGnify | 19 | 10121 | link |
| soil bacteria and fungi Targeted loci environmental | MGnify | 18 | 9793 | link |
| Compost microbe establishment and growth in agricultural soils | MGnify | 17 | 9138 | link |
| EMOSE (2017) Inter-Comparison of Marine Plankton Metagenome Analysis Methods | MGnify | 17 | 8107 | link |
| Po river Prodelta and Mar Piccolo of Taranto surface sediments bacterial communities targeted loci | MGnify | 17 | 7755 | link |
| Raw sequence reads from soil relic DNA study | MGnify | 17 | 9334 | link |
| uncultured prokaryote Targeted loci environmental | MGnify | 17 | 8414 | link |
| Abundance, diversity and distribution of Legionellales in wet environments in Sweden | MGnify | 17 | 7194 | link |
| Non-target effects of Metarhizium brunneum on soil microbial communities | MGnify | 17 | 9007 | link |
| Queensland Marine Sediment | MGnify | 16 | 6214 | link |
| Coastal Sediment Bacterial Community Alterations in Association with Sudden Vegetation Dieback | MGnify | 16 | 7089 | link |
| SCHEMA_Sediment | MGnify | 16 | 4233 | link |
| 16S rRNA genes Random survey | MGnify | 16 | 7928 | link |
| Coral | MGnify | 16 | 5161 | link |
| Microbial diversity in the Benguela coastal upwelling system as derived from 16S rRNA sequencing and RNA Stable Isotope Probing (SIP) | MGnify | 15 | 6324 | link |
| Lake sediment sequencing | MGnify | 15 | 8235 | link |
| Effects of organic matter manipulation on archaeal, bacterial, and fungal community assembly | MGnify | 15 | 3087 | link |
| soil metagenome Raw sequence reads | MGnify | 14 | 7073 | link |
| Chandeleur Island 2016 Amplicon Study Raw sequence reads | MGnify | 14 | 5738 | link |
| Microbial diversity in benthic stream sediment | MGnify | 14 | 7839 | link |
| PAH-contaminated sediment from the Lagos Lagoon, Nigeria | MGnify | 14 | 6060 | link |
| Time course change of microbial community in tsunami sediment caused by the Great East Japan Earthquake | MGnify | 14 | 8569 | link |
| Marine sediment microbial communities in the presence of macrophytes | MGnify | 13 | 7073 | link |
| Assessing the microbial diversity in Cape Comorin ocean water | MGnify | 13 | 4593 | link |
| Temporal variation in pesticide biodegradation | MGnify | 13 | 7251 | link |
| Soil microbial diversity in the Maintenance of Exotic vs. Native Diversity experiment (depth study) | MGnify | 13 | 7795 | link |
| Camargue soil and sediment 16S community profiling | MGnify | 13 | 5782 | link |
| 16S amplicons New Zealand agricultural soils | MGnify | 13 | 5373 | link |
| Development and validation of a multi-trophic metabarcoding biotic index for benthic organic enrichment biomonitoring using a salmon farm case-study. | MGnify | 13 | 6604 | link |

jhnwllr commented 2 years ago

It appears the two Bioversity International datasets with this problem are orphans, so I am marking this as pending.

timhirsch commented 2 years ago

@jhnwllr fyi I have a pending task to review our connections with Bioversity International, as they recently merged with CIAT in Colombia (see https://alliancebioversityciat.org/), and it could be an opportunity for them to refresh/replace these old datasets.