ARGA-Genomes / arga-data

ARGA
Mozilla Public License 2.0

GBIF taxonomic backbone #24

Open nickdos opened 2 years ago

nickdos commented 2 years ago

https://www.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c

Download the DwCA version, create a name-matching index from it, re-run the pipelines processing, and compare the counts for the match-type field before and after, to assess whether the GBIF backbone is a better source than the ALA one.

GBIF's ChecklistBank site (https://www.checklistbank.org/) allows individual taxonomy datasets to be downloaded as DwCA files, so we could pick only the sources we need and use those instead of the complete (huge) GBIF taxonomy.

nickdos commented 2 years ago

Before stats - https://nectar-arga-dev-1.ala.org.au/api/select?q=*:*&facet=true&facet.field=matchType&rows=0


{
  "matchType": [
    "exactMatch",1018593,
    "higherMatch",326999,
    "canonicalMatch",41855,
    "fuzzyMatch",302,
    "phraseMatch",21,
    "taxonIdMatch",4
  ]
}
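For the before/after comparison, the flat facet list above (alternating value and count) can be turned into a mapping and diffed. This is only a rough sketch: the helper names `facet_counts` and `compare` are made up for illustration, and the "after" counts would have to come from re-running the same facet query once the new index is in place.

```python
def facet_counts(flat):
    """Turn a flat [value, count, value, count, ...] facet list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


def compare(before, after):
    """Return after-minus-before for every match type seen in either run."""
    keys = sorted(set(before) | set(after))
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}


# "Before" counts copied from the facet response above.
before = facet_counts([
    "exactMatch", 1018593,
    "higherMatch", 326999,
    "canonicalMatch", 41855,
    "fuzzyMatch", 302,
    "phraseMatch", 21,
    "taxonIdMatch", 4,
])

# Show each match type with its share of all records.
total = sum(before.values())
for match_type, count in before.items():
    print(f"{match_type:15} {count:>9} {count / total:7.2%}")
```

Calling `compare(before, after)` with the post-rerun counts then gives the per-type change, which is probably easier to read than two raw facet dumps side by side.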
nickdos commented 2 years ago

I attempted to load the GBIF backbone into the names index via a merge, but the process ran out of memory on my machine. See https://github.com/AtlasOfLivingAustralia/ala-name-matching/issues/162.

GBIF provides the individual name sources via the download tool at https://www.checklistbank.org/dataset/2169/download, so as a first try I'm attempting to merge in the NCBI DwCA from there.