beehind / beehind.github.io

Beehind: pilot workflows to capture prominent bee specimen and their historic and ecological associates
https://beehind.org
Creative Commons Zero v1.0 Universal

Type Specimen CASTYPE1652 found via filtered query https://doi.org/10.15468/dl.xf6ahb, but not in open-access GBIF data product https://doi.org/10.15468/dl.pk3trq #5

Closed jhpoelen closed 1 year ago

jhpoelen commented 1 year ago

as posted in

https://discourse.gbif.org/t/type-specimen-castype1652-found-in-via-filtered-query-https-doi-org-10-15468-dl-xf6ahb-but-not-in-open-access-gbif-data-product-https-doi-org-1

on 2023-03-24


Hi!

First, thanks for providing this open discussion forum in addition to maintaining GBIF's expansive biodiversity data universe.

Second, apologies in advance for the long and rather detailed post below.

The executive summary is that I am trying to figure out why I can find type specimen CASTYPE1652 via filtered query https://doi.org/10.15468/dl.xf6ahb, but not in open-access GBIF data product https://doi.org/10.15468/dl.pk3trq .

The text below describes how I got to the datasets, and ends with specific questions.

As I am tracking (versioned) digital traces associated with type specimen CASTYPE1652 (see https://beehind.org), I downloaded the open access data product (all :

GBIF.org (01 March 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.pk3trq

via https://api.gbif.org/v1/occurrence/download/request/0015281-230224095556074.zip to produce ~260G of digital content with id hash://sha256/c8bac8acb28c8524c53589b3a40e322dbbbdadf5689fef2e20266fbf6ddf6b97 .

Then, I used a streaming query to count all lines in the "simple" table that was included in the file. In addition, I attempted to filter the data to include only records with collectionCode CASTYPE , the collection code of the collection that keeps the type specimen with catalog number CASTYPE1652 .

After 5h15m of processing at a rate of about 100k lines/s , I counted 2.07 billion lines. Also, I found no records with collectionCode CASTYPE.
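A minimal sketch of such a streaming count and filter is shown below. It is not necessarily the set of commands used here. It assumes the "simple" table is tab-separated with a header row that names a `collectionCode` column; the function name `count_and_filter` and the output file name in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch only: count data lines and keep rows whose collectionCode matches.
# Assumes tab-separated input on stdin with a header row that names a
# "collectionCode" column (an assumption about the "simple" table layout).
count_and_filter() {
  # $1: collection code to match exactly, e.g. CASTYPE
  awk -F '\t' -v code="$1" '
    NR == 1 {
      # locate the collectionCode column by name in the header
      for (i = 1; i <= NF; i++) if ($i == "collectionCode") col = i
      if (!col) { print "no collectionCode column" > "/dev/stderr"; exit 1 }
      print   # keep the header in the filtered output
      next
    }
    { total++ }
    $col == code { print }
    END { if (col) print "total data lines: " total + 0 > "/dev/stderr" }
  '
}

# Hypothetical usage: stream the table out of the zip without unpacking it.
# unzip -p 0015281-230224095556074.zip | count_and_filter CASTYPE > castype-simple.tsv
```

Matching on the whole field, rather than grepping for the string anywhere in the line, avoids counting records that merely mention CASTYPE in another column. The line count goes to stderr so that stdout carries only the header and the matching rows.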

To confirm that the collectionCode CASTYPE was actually used in associated records, and existed on and prior to 1 March 2023, I verified that the 1 March 2023 (https://linker.bio/zip:hash://sha256/ffffe616beab7b4a04e46162cdbd2584f986e3f5f5b56258f9737ee31f36b6b6!/occurrence.txt) and 1 January 2023 (https://linker.bio/zip:hash://sha256/110f398aa4c8a4be870c7b3c1d698c32eb2c8dad878b614fe8e8f7a153251a43!/occurrence.txt) versions of the Darwin Core archive provided by the California Academy of Sciences via http://ipt.calacademy.org:8080/archive.do?r=type included records with collection code CASTYPE.
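One way to run this kind of check, sketched under assumptions: the helper below counts lines in which any tab-separated field equals the given code, so it does not depend on where the collection code column sits in occurrence.txt. The function name `count_field_matches` is hypothetical, and the commented usage assumes the linker.bio URLs above can be streamed with `curl`.

```shell
#!/bin/sh
# Sketch only: count lines on stdin in which at least one tab-separated
# field is exactly equal to $1 (e.g. CASTYPE). Prints 0 when none match.
count_field_matches() {
  awk -F '\t' -v code="$1" '
    { for (i = 1; i <= NF; i++) if ($i == code) { n++; break } }
    END { print n + 0 }
  '
}

# Hypothetical usage against the two archive versions cited above:
# for hash in \
#   ffffe616beab7b4a04e46162cdbd2584f986e3f5f5b56258f9737ee31f36b6b6 \
#   110f398aa4c8a4be870c7b3c1d698c32eb2c8dad878b614fe8e8f7a153251a43
# do
#   curl -sL "https://linker.bio/zip:hash://sha256/${hash}!/occurrence.txt" \
#     | count_field_matches CASTYPE
# done
```

A non-zero count for both versions would indicate that the collection code was present in the source archive on both dates.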

Also, I logged in to the GBIF web portal and created a "download" with citation:

GBIF.org (24 March 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.xf6ahb

This download included a filter to only include records associated with GBIF dataset https://www.gbif.org/dataset/6ec3c7f5-6233-48f6-b36a-06b867edbadd associated with the CASTYPE collection.

Using the same methods as earlier, I selected records including mention of collectionCode CASTYPE . Contrary to the earlier results, records with CASTYPE collectionCode now appeared, including CASTYPE1652.

So, given the contradictory results, I was wondering:

  1. Can anybody confirm that CASTYPE records (including CASTYPE1652) do not appear in GBIF.org (01 March 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.pk3trq ?
  2. Can someone explain why the GBIF front page claims to have over 2.2 billion records indexed, whereas GBIF.org (01 March 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.pk3trq appears to include about 200M fewer records ?

Most likely, I don't fully understand what to expect to be included in GBIF.org (01 March 2023) GBIF Occurrence Download https://doi.org/10.15468/dl.pk3trq , so I very much appreciate your insights to better understand these valuable datasets.

Again, apologies for the long and detailed post, and I am curious to hear anyone's thoughts on how I should proceed.

thx, -jorrit

https://jhpoelen.nl

PS. The overarching use case is to document associations between GBIF occurrence identifiers and their associated institution code, collection code, and catalog number. I need this to establish links between CASTYPE1652 (or other specimens) and their digital traces in GBIF , and, indirectly, to Bionomia. Because Bionomia uses GBIF identifiers to link people to their associated records, I need to "speak" GBIF identifiers to resolve the wealth of knowledge of the people behind collections as facilitated/enriched by @dshorthouse https://bionomia.net . fyi @Debbie @seltmann

jhpoelen commented 1 year ago

via -

Poelen, Jorrit. (2023). Global Biodiversity Informatics Facility (GBIF): an exhaustive list of gbif record ids, dataset keys, and their associated Occurrence IDs, Institution Code, Collection Codes and Catalog Numbers (0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7789866

I was able to extract GBIF record ids associated with CASTYPE identifiers -

preston cat --remote https://zenodo.org/record/7789866/files,https://linker.bio hash://sha256/a339e32e10edaad585f61f2ded06cbb23e0618c65a6360db18d7d729054940a8\
 | gunzip\
 | grep -E "CASTYPE[0-9]+"\
 | tee castype.tsv.txt

first 10 rows being:

2275276454  db6a16a6-0fe8-4987-8ad5-42223825fcd7    03E987E2FE8B2B6EFF3ED117FB5AFBDC.mc.3B283CA9FE8A2B6DFDBDD12EFD82F845    CAS     CASTYPE19452, MA-02-14A-35
2275275513  db6a16a6-0fe8-4987-8ad5-42223825fcd7    03E987E2FE7D2B9BFF3ED39FFB57FE0C.mc.3B283CA9FE7C2B9BFDA8D6F3FD25FE98    CAS     CASTYPE19463
2275274939  db6a16a6-0fe8-4987-8ad5-42223825fcd7    03E987E2FDBD285BFF3ED282FA68FD74.mc.3B283CA9FDBC285BFDB3D7F4FCE1FD9C    CAS     CASTYPE19467, MA-02-08A-16
2275275452  db6a16a6-0fe8-4987-8ad5-42223825fcd7    03E987E2FE692B8FFF3ED056FA71FD2C.mc.3B283CA9FE682B8FFE56D793FC02FDB8    CAS     CASTYPE19451
2611461322  654c9653-fbef-4d0d-9e07-619084ac1162    039887DCFFB5A4512B10A4FCFA8BA2DB.mc.3B593C97FFB5A45129C7A254FDCFA0B0        CASC    CASTYPE13390
2611461316  654c9653-fbef-4d0d-9e07-619084ac1162    039887DCFF9CA4782B1DA716FC8FA1A4.mc.3B593C97FF9CA47829C7A55AFD7EA7B2        CASC    CASTYPE13386
2238764399  6ec3c7f5-6233-48f6-b36a-06b867edbadd    urn:catalog:CAS:TYPE:19153  CASC    CASTYPE CASTYPE19153
3110317301  6ec3c7f5-6233-48f6-b36a-06b867edbadd    urn:catalog:CAS:TYPE:20142  CASC    CASTYPE CASTYPE20142
3110317302  6ec3c7f5-6233-48f6-b36a-06b867edbadd    urn:catalog:CAS:TYPE:20143  CASC    CASTYPE CASTYPE20143
2238762805  6ec3c7f5-6233-48f6-b36a-06b867edbadd    urn:catalog:CAS:TYPE:19154  CASC    CASTYPE CASTYPE19154

with castype.tsv.txt attached -

castype.tsv.txt
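As a follow-up sketch, a single specimen can be looked up in the resulting file and turned into GBIF occurrence links. This assumes that castype.tsv.txt is tab-separated, that the first column is the GBIF record id, and that the last column holds the catalog number(s), as the rows above suggest. The function name `gbif_urls_for` is hypothetical, and the `https://www.gbif.org/occurrence/<id>` URL pattern is an assumption about how record ids map to occurrence pages.

```shell
#!/bin/sh
# Sketch only: print a GBIF occurrence URL for each row whose last column
# contains the given catalog number as an exact, comma-separated entry.
gbif_urls_for() {
  # $1: catalog number, e.g. CASTYPE1652; $2: path to the TSV file
  awk -F '\t' -v want="$1" '
    {
      # the last column may list several identifiers, e.g. "CASTYPE19452, MA-02-14A-35"
      n = split($NF, ids, /, */)
      for (i = 1; i <= n; i++)
        if (ids[i] == want) { print "https://www.gbif.org/occurrence/" $1; break }
    }
  ' "$2"
}

# Hypothetical usage:
# gbif_urls_for CASTYPE1652 castype.tsv.txt
```

Exact matching matters here: a plain `grep CASTYPE1652` would also return rows for longer catalog numbers that start with the same digits, such as CASTYPE16520.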