Open jhnwllr opened 2 years ago
A very large source of these is AntWeb, https://www.gbif.org/dataset/13b70480-bd69-11dd-b15f-b8a03c50a862. Has anyone ever reached out to Brian Fisher and Jon Fong to inquire why so many of their scientific names are expressed like this https://www.gbif.org/occurrence/3501417310?
These are the top datasets that do not capitalize the first letter of their v_scientificname, which leads to HIGHERRANK matches against the GBIF backbone instead of exact ones.
datasettitle | num unique names | num occ | link |
---|---|---|---|
MBM herbarium - Museu Botânico Municipal \ Curitiba - Herbário Virtual REFLORA | 3970 | 14626 | link |
AntWeb | 3108 | 33319 | link |
B herbarium - Botanischer Garten und Botanisches Museum Berlin-Dahlem Herbarium - Herbário Virtual REFLORA | 1695 | 3168 | link |
CEPEC herbarium - Centro de Pesquisas do Cacau - Herbário Virtual REFLORA | 1644 | 5201 | link |
P herbarium - Muséum national d’histoire naturelle, Paris - Amostras Brasileiras Repatriadas - Herbário Virtual REFLORA | 1509 | 4768 | link |
IAN herbarium - Embrapa Amazônia Oriental - Herbário Virtual REFLORA | 982 | 3198 | link |
NY herbarium - The New York Botanical Garden - Amostras Brasileiras Repatriadas - Herbário Virtual REFLORA | 852 | 4695 | link |
US herbarium - Smithsonian Institute - Amostras Brasileiras Repatriadas - Herbário Virtual REFLORA | 815 | 3042 | link |
ALCB herbarium - Universidade Federal da Bahia - Herbário Virtual REFLORA | 684 | 2929 | link |
SINGER Coordinator | 557 | 62543 | link |
HUCP herbarium - Pontífica Universidade Católica do Paraná PUC - Herbário Virtual Reflora | 513 | 2976 | link |
ESA herbarium - Universidade de São Paulo - Herbário Virtual REFLORA | 510 | 1398 | link |
MG herbarium - Museu Paraense Emílio Goeldi - Herbário Virtual REFLORA | 465 | 1742 | link |
SPF herbarium - Universidade de são Paulo - Herbário Virtual REFLORA | 437 | 2499 | link |
MO herbarium - Missouri Botanical Garden - Amostras Brasileiras Repatriadas - Herbário Virtual REFLORA | 421 | 2840 | link |
CEN herbarium - Embrapa Recursos Genéticos e Biotecnologia - Herbário Virtual REFLORA | 394 | 1915 | link |
HCF herbarium - Universidade Tecnológica Federal do Paraná - Campus Campo Mourão - Herbário Virtual REFLORA | 382 | 2369 | link |
HUEFS herbarium - Universidade Estadual de Feira de Santana - Herbário Virtual REFLORA | 377 | 1394 | link |
MBML Herbarium - Museu de Biologia Mello Leitão - Herbário Virtual Reflora | 370 | 1866 | link |
ICN herbarium - Universidade Federal do Rio Grande Do Sul - Herbário Virtual REFLORA | 354 | 1607 | link |
K herbarium - Royal Botanic Gardens, Kew - Amostras Brasileiras Repatriadas - Herbário Virtual REFLORA | 342 | 1107 | link |
GH herbarium - Harvard University Herbarium - Herbário Virtual Reflora | 330 | 442 | link |
The System-wide Information Network for Genetic Resources (SINGER) | 324 | 50306 | link |
S herbarium - Naturhistoriska Riksmuseet - Amostras Brasileiras Repatriadas - Herbário Virtual REFLORA | 273 | 1480 | link |
W herbarium - Naturhistorisches Museum Wien - Amostras Brasileiras Repatriadas - Herbário Virtual REFLORA | 269 | 1150 | link |
Marine metagenomes Metagenome | 243 | 88447 | link |
marine metagenome Metagenome | 187 | 78804 | link |
INSDC Environment Sample Sequences | 166 | 733 | link |
UPCB herbarium - Universidade Federal de Paraná - Herbário Virtual REFLORA | 165 | 573 | link |
PC herbarium - Cryptogamy Collection at the Muséum National d'Histoire Naturelle - Herbário Virtual REFLORA | 163 | 454 | link |
SJRP herbarium - Universidade Estadual Paulista Júlio de Mesquita Filho - Herbário Virtual REFLORA | 152 | 662 | link |
INSDC Sequences | 142 | 226 | link |
EAC herbarium - Universidade Federal do Ceará - Herbário Virtual REFLORA | 127 | 783 | link |
HEPH herbarium - Jardim Botânico de Brasília - Herbário Virtual REFLORA | 124 | 684 | link |
International Barcode of Life project (iBOL) | 119 | 271 | link |
Nahant Collection | 109 | 56222 | link |
sediment metagenome Metagenome | 105 | 59980 | link |
Microbial communities associated with Eurasian watermilfoil, water and sediment | 102 | 55052 | link |
HUFU herbarium - Universidade Federal de Uberlândia - Herbário Virtual REFLORA | 100 | 416 | link |
Temporal effect of plant diversity and oiling on nitrogen cycling in marsh sediments | 97 | 53601 | link |
Sediment Metagenome Raw sequence reads | 97 | 56217 | link |
HUEM herbarium - Universidade Estadual de Maringá - Herbário Virtual REFLORA | 92 | 332 | link |
Soil marker gene sequences across the Nutrient Network | 92 | 53167 | link |
HUEMG herbarium - Universidade do Estado de Minas Gerais - Campus Carangola - Herbário Virtual REFLORA | 91 | 223 | link |
Microbial community structure is affected by cropping sequences and bio-covers under long-term no-tillage | 87 | 50103 | link |
Response of soil bacteria to anthropogenic soil variables at large spatial scales | 87 | 51489 | link |
Soil microbial distribution | 85 | 51243 | link |
RB - Rio de Janeiro Botanical Garden Herbarium Collection | 83 | 109 | link |
UB herbarium - Universidade de Brasília - Herbário Virtual REFLORA | 82 | 352 | link |
Abundance, diversity and distribution of Legionellales in wet environments in Sweden | 81 | 48849 | link |
Contacted AntWeb by email.
I also noticed that many AntWeb names are corrupted with an id number or something else, so the number of fixable names is probably fewer than 300.
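A rough way to separate fixable names from corrupted ones is sketched below. It assumes, purely for illustration, that a name is "fixable" when it starts with a lowercase letter and contains no digits. The function name `is_fixable` and the sample names are hypothetical, and real AntWeb names may be corrupted in ways this check misses.

```python
def is_fixable(name: str) -> bool:
    """Heuristic (assumption): lowercase first letter and no digits anywhere."""
    stripped = name.strip()
    if not stripped:
        return False
    starts_lower = stripped[0].isalpha() and stripped[0].islower()
    has_digit = any(ch.isdigit() for ch in stripped)
    return starts_lower and not has_digit


# Hypothetical sample values, not taken from AntWeb.
names = ["camponotus sp.", "camponotus mg123", "Camponotus maculatus", ""]
fixable = [n for n in names if is_fixable(n)]
print(fixable)
```

A digit check is only a first pass; names corrupted by non-numeric suffixes would still need manual review.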
Contacted Reflora network.
These are the remaining datasets that have this issue. Almost all of them are from MGnify. @thomasstjerne
datasettitle | publishingOrganizationTitle | n_names | n_occ | link |
---|---|---|---|---|
SINGER Coordinator | Bioversity International | 76 | 8050 | link |
The System-wide Information Network for Genetic Resources (SINGER) | Bioversity International | 61 | 11738 | link |
Marine metagenomes Metagenome | MGnify | 55 | 18531 | link |
marine metagenome Metagenome | MGnify | 52 | 20659 | link |
Temporal effect of plant diversity and oiling on nitrogen cycling in marsh sediments | MGnify | 30 | 11598 | link |
Microbial community structure is affected by cropping sequences and bio-covers under long-term no-tillage | MGnify | 28 | 10923 | link |
Soil marker gene sequences across the Nutrient Network | MGnify | 27 | 12088 | link |
Response of soil bacteria to anthropogenic soil variables at large spatial scales | MGnify | 24 | 10644 | link |
Nahant Collection | MGnify | 24 | 9967 | link |
Bacerial and archaeal diversity in Central Park | MGnify | 23 | 10641 | link |
sediment metagenome Metagenome | MGnify | 23 | 10000 | link |
Organic agricultural field soil Raw sequence reads | MGnify | 23 | 8976 | link |
estuary metagenome Raw sequence reads | MGnify | 23 | 9989 | link |
Soil microbial distribution | MGnify | 23 | 9670 | link |
Sediment Metagenome Raw sequence reads | MGnify | 23 | 8819 | link |
MeHg Coal Ash | MGnify | 21 | 7038 | link |
Who eats the tough stuff DNA stable isotope probing (SIP) of bacteria and fungi degrading 13C-labelled lignin and cellulose in forest soils | MGnify | 20 | 9882 | link |
Microbial communities associated with Eurasian watermilfoil, water and sediment | MGnify | 20 | 8037 | link |
Sediment surface microbial community changes due to phytoplankton addition | MGnify | 20 | 8128 | link |
marine metagenome Metagenome | MGnify | 20 | 9387 | link |
Bacteria exposure experiment North Sea | MGnify | 20 | 8945 | link |
Comparative 16S analysis of hydrothermal vent samples from the Mid-Atlantic Ridge (MAR) | MGnify | 19 | 10121 | link |
soil bacteria and fungi Targeted loci environmental | MGnify | 18 | 9793 | link |
Compost microbe establishment and growth in agricultural soils | MGnify | 17 | 9138 | link |
EMOSE (2017) Inter-Comparison of Marine Plankton Metagenome Analysis Methods | MGnify | 17 | 8107 | link |
Po river Prodelta and Mar Piccolo of Taranto surface sediments bacterial communities targeted loci | MGnify | 17 | 7755 | link |
Raw sequence reads from soil relic DNA study | MGnify | 17 | 9334 | link |
uncultured prokaryote Targeted loci environmental | MGnify | 17 | 8414 | link |
Abundance, diversity and distribution of Legionellales in wet environments in Sweden | MGnify | 17 | 7194 | link |
Non-target effects of Metarhizium brunneum on soil microbial communities | MGnify | 17 | 9007 | link |
Queensland Marine Sediment | MGnify | 16 | 6214 | link |
Coastal Sediment Bacterial Community Alterations in Association with Sudden Vegetation Dieback | MGnify | 16 | 7089 | link |
SCHEMA_Sediment | MGnify | 16 | 4233 | link |
16S rRNA genes Random survey | MGnify | 16 | 7928 | link |
Coral | MGnify | 16 | 5161 | link |
Microbial diversity in the Benguela coastal upwelling system as derived from 16S rRNA sequencing and RNA Stable Isotope Probing (SIP) | MGnify | 15 | 6324 | link |
Lake sediment sequencing | MGnify | 15 | 8235 | link |
Effects of organic matter manipulation on archaeal, bacterial, and fungal community assembly | MGnify | 15 | 3087 | link |
soil metagenome Raw sequence reads | MGnify | 14 | 7073 | link |
Chandeleur Island 2016 Amplicon Study Raw sequence reads | MGnify | 14 | 5738 | link |
Microbial diversity in benthic stream sediment | MGnify | 14 | 7839 | link |
PAH-contaminated sediment from the Lagos Lagoon, Nigeria | MGnify | 14 | 6060 | link |
Time course change of microbial community in tsunami sediment caused by the Great East Japan Earthquake | MGnify | 14 | 8569 | link |
Marine sediment microbial communities in the presence of macrophytes | MGnify | 13 | 7073 | link |
Assessing the microbial diversity in Cape Comorin ocean water | MGnify | 13 | 4593 | link |
Temporal variation in pesticide biodegradation | MGnify | 13 | 7251 | link |
Soil microbial diversity in the Maintenance of Exotic vs. Native Diversity experiment (depth study) | MGnify | 13 | 7795 | link |
Camargue soil and sediment 16S community profiling | MGnify | 13 | 5782 | link |
16S amplicons New Zealand agricultural soils | MGnify | 13 | 5373 | link |
Development and validation of a multi-trophic metabarcoding biotic index for benthic organic enrichment biomonitoring using a salmon farm case-study. | MGnify | 13 | 6604 | link |
It appears the two Bioversity International datasets with this problem are orphans, so I am marking this as pending.
@jhnwllr fyi I have a pending task to review our connections with Bioversity International as they recently merged with CIAT in Colombia (see https://alliancebioversityciat.org/ ) and it could be an opportunity for them to refresh/replace these old datasets.
For example, simply changing `acerates viridiflora (raf.) eaton` https://api.gbif.org/v1/species/match?name=acerates%20viridiflora%20(raf.)%20eaton to `Acerates viridiflora (raf.) eaton` https://api.gbif.org/v1/species/match?name=Acerates%20viridiflora%20(raf.)%20eaton leads to an exact match.
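The comparison can be reproduced with a short Python sketch. The helper names (`capitalize_first`, `match_url`, `match_type`) are made up for this illustration. The endpoint and the `matchType` response field are the ones used by the species match API linked above. Only the URL building runs offline; `match_type` needs network access.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

MATCH_ENDPOINT = "https://api.gbif.org/v1/species/match"


def capitalize_first(name: str) -> str:
    """Uppercase only the first character; leave the rest untouched."""
    return name[:1].upper() + name[1:]


def match_url(name: str) -> str:
    """Build the match URL, keeping parentheses unescaped as in the links above."""
    return f"{MATCH_ENDPOINT}?name={quote(name, safe='()')}"


def match_type(name: str) -> str:
    """Query the API (network access required) and return its matchType field."""
    with urlopen(match_url(name), timeout=30) as response:
        return json.load(response)["matchType"]


verbatim = "acerates viridiflora (raf.) eaton"
print(match_url(verbatim))
print(match_url(capitalize_first(verbatim)))
```

Calling `match_type(verbatim)` and `match_type(capitalize_first(verbatim))` should then show the two match types side by side. `str.capitalize()` is avoided on purpose because it would also lowercase the rest of the string.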
Taking all of GBIF's unique v_scientificname values that are flagged HIGHERRANK and start with a lowercase letter (with a few other filters), capitalizing the first letter, and running them through the name matcher gives the following results.
269,924 occurrences flagged as HIGHERRANK get moved to EXACT matchType when simply capitalizing the first letter of the name. 17,964 occurrences flagged as HIGHERRANK get moved to FUZZY matchType.
It appears that capitalizing the first letter of v_scientificname by default might lead to many more matches to the GBIF backbone.
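A minimal sketch of how such a before/after tally could be computed is shown below. It is not the script used for the numbers above: `tally_changes` and `toy_matcher` are hypothetical names, and the toy matcher only stands in for a real call to the GBIF name matcher. It also counts unique names, so reproducing occurrence totals would additionally require weighting each name by its occurrence count.

```python
from collections import Counter
from typing import Callable, Iterable


def capitalize_first(name: str) -> str:
    """Uppercase only the first character; leave the rest untouched."""
    return name[:1].upper() + name[1:]


def tally_changes(names: Iterable[str], matcher: Callable[[str], str]) -> Counter:
    """Count (matchType before, matchType after) pairs, one per name."""
    changes = Counter()
    for name in names:
        before = matcher(name)
        after = matcher(capitalize_first(name))
        changes[(before, after)] += 1
    return changes


# Stand-in matcher for illustration only; a real run would query the name matcher.
def toy_matcher(name: str) -> str:
    return "EXACT" if name[:1].isupper() else "HIGHERRANK"


result = tally_changes(["acerates viridiflora", "quercus alba"], toy_matcher)
print(result)
```

Passing the matcher in as a function keeps the tally logic testable offline and lets the same code run against the live API or a cached lookup.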
Some examples in the table below.
@mdoering