Closed jhpoelen closed 2 years ago
This enables streaming of all of GBIF by saying:
$ preston ls --remote https://deeplinker.bio | preston dwc-stream | gzip > gbif.json.gz
where gbif.json.gz is a very large file containing all records.
Some more specific examples -
extract all scientificNames
preston ls --remote https://raw.githubusercontent.com/bio-guoda/preston-amazon/master/data/\
| preston dwc-stream --remote https://raw.githubusercontent.com/bio-guoda/preston-amazon/master/data/\
| jq --raw-output '.["http://rs.tdwg.org/dwc/terms/scientificName"]'\
| grep -v null\
> names.txt
The resulting names.txt is attached.
First 10 lines of names.txt:
$ head names.txt
Chalceus guaporensis Zanata & Toledo-Piza 2004
Acestrorhynchus gr. lacustris (Lütken 1875)
Acestrorhynchus microlepis (Jardine 1841)
Laemolyta proxima (Garman 1890)
Pseudanos trimaculatus (Kner 1858)
Schizodon fasciatus Spix & Agassiz 1829
Cynopotamus gouldingi Menezes 1987
Moenkhausia lepidura (Kner 1858)
Tetragonopterus argenteus Cuvier 1816
Astyanax abramis (Jenyns 1842)
Then combine with Nomer or other taxonomic name-matching tools to do name alignment.
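Before handing the list to a name matcher, it may help to collapse duplicates and see which names occur most often. A minimal sketch using only standard tools; `count_names` is a hypothetical helper, and the inline sample merely stands in for names.txt:

```shell
# count_names: hypothetical helper (not part of preston).
# Reads one name per line on stdin and prints "<count><TAB><name>",
# most frequent first.
count_names() {
  LC_ALL=C sort \
    | uniq -c \
    | LC_ALL=C sort -k1,1nr -k2 \
    | awk '{ n = $1; sub(/^ *[0-9]+ /, ""); printf "%s\t%s\n", n, $0 }'
}

# Inline sample standing in for names.txt
printf '%s\n' \
  'Astyanax abramis (Jenyns 1842)' \
  'Chalceus guaporensis Zanata & Toledo-Piza 2004' \
  'Astyanax abramis (Jenyns 1842)' \
  | count_names
```

With the real file this would be `count_names < names.txt | head`.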
Generating names together with the exact location of each in its original source:
preston ls --remote https://raw.githubusercontent.com/bio-guoda/preston-amazon/master/data/\
| preston dwc-stream --remote https://raw.githubusercontent.com/bio-guoda/preston-amazon/master/data/\
| jq -c '. | select(has("http://rs.tdwg.org/dwc/terms/scientificName")) | { "http://rs.tdwg.org/dwc/terms/scientificName": ."http://rs.tdwg.org/dwc/terms/scientificName", contentId: .contentId } ' | head
{"http://rs.tdwg.org/dwc/terms/scientificName":"Chalceus guaporensis Zanata & Toledo-Piza 2004","contentId":"line:zip:hash://sha256/97cbeae429fbc95d1859f7afa28b33f08ac64125ba72511c49c4b77ca66d2d66!/occurrence.txt!/L2"}
{"http://rs.tdwg.org/dwc/terms/scientificName":"Acestrorhynchus gr. lacustris (Lütken 1875)","contentId":"line:zip:hash://sha256/97cbeae429fbc95d1859f7afa28b33f08ac64125ba72511c49c4b77ca66d2d66!/occurrence.txt!/L3"}
{"http://rs.tdwg.org/dwc/terms/scientificName":"Acestrorhynchus microlepis (Jardine 1841)","contentId":"line:zip:hash://sha256/97cbeae429fbc95d1859f7afa28b33f08ac64125ba72511c49c4b77ca66d2d66!/occurrence.txt!/L4"}
{"http://rs.tdwg.org/dwc/terms/scientificName":"Laemolyta proxima (Garman 1890)","contentId":"line:zip:hash://sha256/97cbeae429fbc95d1859f7afa28b33f08ac64125ba72511c49c4b77ca66d2d66!/occurrence.txt!/L5"}
{"http://rs.tdwg.org/dwc/terms/scientificName":"Pseudanos trimaculatus (Kner 1858)","contentId":"line:zip:hash://sha256/97cbeae429fbc95d1859f7afa28b33f08ac64125ba72511c49c4b77ca66d2d66!/occurrence.txt!/L6"}
{"http://rs.tdwg.org/dwc/terms/scientificName":"Schizodon fasciatus Spix & Agassiz 1829","contentId":"line:zip:hash://sha256/97cbeae429fbc95d1859f7afa28b33f08ac64125ba72511c49c4b77ca66d2d66!/occurrence.txt!/L7"}
{"http://rs.tdwg.org/dwc/terms/scientificName":"Cynopotamus gouldingi Menezes 1987","contentId":"line:zip:hash://sha256/97cbeae429fbc95d1859f7afa28b33f08ac64125ba72511c49c4b77ca66d2d66!/occurrence.txt!/L8"}
{"http://rs.tdwg.org/dwc/terms/scientificName":"Moenkhausia lepidura (Kner 1858)","contentId":"line:zip:hash://sha256/97cbeae429fbc95d1859f7afa28b33f08ac64125ba72511c49c4b77ca66d2d66!/occurrence.txt!/L9"}
{"http://rs.tdwg.org/dwc/terms/scientificName":"Tetragonopterus argenteus Cuvier 1816","contentId":"line:zip:hash://sha256/97cbeae429fbc95d1859f7afa28b33f08ac64125ba72511c49c4b77ca66d2d66!/occurrence.txt!/L10"}
{"http://rs.tdwg.org/dwc/terms/scientificName":"Astyanax abramis (Jenyns 1842)","contentId":"line:zip:hash://sha256/97cbeae429fbc95d1859f7afa28b33f08ac64125ba72511c49c4b77ca66d2d66!/occurrence.txt!/L11"}
with
$ preston cat 'line:zip:hash://sha256/97cbeae429fbc95d1859f7afa28b33f08ac64125ba72511c49c4b77ca66d2d66!/occurrence.txt!/L2'
1-37067-988 This work is licensed under a Creative Commons CCZero 1.0 License http://creativecommons.org/publicdomain/zero/1.0/legalcode and available according to the community norms at http://www.canadensys.net/norms. The data provider requests to be informed of publication 45 days in advance and can object to the use of the dataset within 30 days. CIRA The data provider requests to be informed of publication 45 days in advance and can object to the use of the dataset within 30 days. BioFresh (publisher), Yukoni, T. and Torres L. V. (provider), Universidad Autónoma del Beni 'José Ballivian' (CIRA-UAB) (owner) (2015). Bolivian Amazon lowland fish metacommunity data. doi:10.13148/bfe105 Published on http://data.freshwaterbiodiversity.eu, accessed on [date]. doi:10.13148/bfe105 CIRA-UNAN CIRA PreservedSpecimen waterBodyType=lake 1-37067-988 988 2 ML1-37067 25/06/01 2001 Site was sampled on 2 consecutive days. Date reported in the eventDate field is the first day of the visit. ML1 Manuripi Bolivia Manuripi lake -11.95204 -68.65672 Chalceus guaporensis Zanata & Toledo-Piza 2004 Animalia Chalceidae Chalceus guaporensis species Zanata & Toledo-Piza 2004
returns the original line containing scientificName "Chalceus guaporensis Zanata & Toledo-Piza 2004".
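The contentId values above appear to combine an archive hash, a file inside the archive, and a line number. A rough sketch for splitting them apart, for instance to group records by source archive; `split_content_id` is a hypothetical helper, and the layout it assumes is inferred from the sample output only, not from a specification:

```shell
# split_content_id: hypothetical helper (not part of preston).
# Assumed layout, inferred from the sample output:
#   line:zip:<archive hash>!/<file in archive>!/L<line number>
# Prints "<archive hash><TAB><file in archive><TAB><line number>".
split_content_id() {
  printf '%s\n' "$1" \
    | awk -F'!/' '{
        sub(/^line:zip:/, "", $1)   # drop the scheme prefixes
        sub(/^L/, "", $3)           # keep only the number
        print $1 "\t" $2 "\t" $3
      }'
}

split_content_id 'line:zip:hash://sha256/97cbeae429fbc95d1859f7afa28b33f08ac64125ba72511c49c4b77ca66d2d66!/occurrence.txt!/L2'
```

The single quotes around the argument matter in an interactive bash session, where an unquoted `!` can trigger history expansion.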
Example for extracting image URLs for UC Santa Barbara's @seltmann invertebrate zoology collection:
preston track "https://serv.biokic.asu.edu/ecdysis/content/dwca/UCSB-IZC_DwC-A.zip"\
| preston dwc-stream\
| grep "http://rs.tdwg.org/ac/terms/Multimedia"\
| jq --raw-output '.["http://rs.tdwg.org/ac/terms/goodQualityAccessURI"], .["http://rs.tdwg.org/ac/terms/accessURI"]'\
| sort\
| uniq\
> ucsb-izc-image-urls.txt
with
$ cat ucsb-izc-image-urls.txt | wc -l
45530
$ head ucsb-izc-image-urls.txt
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC_00000001.jpg
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC00000001.jpg
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC_00000002.jpg
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC00000002.jpg
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC_00000003.jpg
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC00000003.jpg
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC_00000003_lg.jpg
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC_00000004.jpg
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC00000004.jpg
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC_00000004_lg.jpg
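When a record lacks one of the two terms, jq prints the literal `null` for it, which is why the scientificName example filters with `grep -v null`. A small sketch of a stricter filter that keeps only lines that look like web URLs; `only_urls` is a hypothetical helper and the sample lines are made up for illustration:

```shell
# only_urls: hypothetical helper (not part of preston).
# Keeps lines starting with http:// or https://, dropping "null"
# and blank lines.
only_urls() {
  grep -E '^https?://'
}

# Made-up sample input
printf '%s\n' \
  'null' \
  'https://example.org/a.jpg' \
  '' \
  | only_urls
```

It could be placed right after the `jq` step, before `sort`.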
Now, tracking all image URLs would be:
cat ucsb-izc-image-urls.txt\
| xargs -L100 preston track
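`xargs -L100` starts one command per batch of up to 100 input lines rather than one per URL. To see the batching without touching the network, `echo` can stand in for `preston track`; `batch_count` is a hypothetical helper and the 250 numbered lines are placeholders for URLs:

```shell
# batch_count: hypothetical helper (not part of preston).
# Reports how many commands xargs -L100 would start for the lines on
# stdin, using echo as a stand-in for the real command.
batch_count() {
  xargs -L100 echo | wc -l | tr -d ' '
}

# 250 placeholder lines: batches of 100, 100 and 50
seq 1 250 | batch_count   # prints 3
```

By the same arithmetic, the 45530 URLs above would be spread over 456 invocations.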
So, putting it together, you'd be able to track the UCSB-IZC and its images using:
preston track "https://serv.biokic.asu.edu/ecdysis/content/dwca/UCSB-IZC_DwC-A.zip"\
| preston dwc-stream\
| grep "http://rs.tdwg.org/ac/terms/Multimedia"\
| jq --raw-output '.["http://rs.tdwg.org/ac/terms/goodQualityAccessURI"], .["http://rs.tdwg.org/ac/terms/accessURI"]'\
| sort\
| uniq\
| xargs -L100 preston track
Initial support was introduced in release 0.3.5 (https://github.com/bio-guoda/preston/releases/tag/0.3.5). Suggest reporting future improvements in separate issues.
To help increase access to biodiversity data, suggest making Preston stream DwC-A records as JSON, line by line.
with