Open mielliott opened 2 years ago
@JimmyFromRobotics @mielliott very neat to see the distribution of zip files names.
Am I correct in saying that the file names occurring ~36,758 times originate from Plazi?
According to https://gbif.org (see attached screenshot), there are said to be 63k datasets. However, only about 60k meta.xml files were found. Assuming every DwC-A has a meta.xml file, roughly 3k datasets are unaccounted for, invalid, or duplicates originating from different dataset endpoints.
This raises further questions: how many unique URLs are associated with the inspected zip hashes? How many DwC-As were not zips but tarballs, or some other kind of format?
So many questions . . . I wonder what would be a good way to visualize this at GBIF / iDigBio scales.
A grid of 1024 x 768 pixels would give about 786k / 60k, i.e. more than 10 pixels per dataset as defined by its hash. Big enough to be a perceivable dot on a screen. Then show the pattern for GBIF, iDigBio, and both combined, giving a visual comparison. Color could also be used to encode frequency of occurrence. Probably something you already thought about.
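The pixel budget can be sanity-checked with quick shell arithmetic (the dataset count is the ~60,497 distinct zip hashes reported later in this thread):

```shell
# A 1024x768 grid shared across ~60,497 dataset hashes leaves
# just under 13 pixels per dataset (12 with integer division).
pixels=$((1024 * 768))   # 786,432 pixels total
datasets=60497
echo "$((pixels / datasets)) pixels per dataset"
```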
Curious to hear your thoughts!
> Am I correct if I say that the files with ~36758 frequency of occurrence originate from Plazi?
I'm guessing you're looking at the number of zips containing "distribution.txt"? Let's check!
$ preston alias -l tsv --remote file:///mnt/preston.acis.ufl.edu_data/gbif-idigbio-biocase/data/ | cut -f1,3 | sort -u -k2 > urlVersions
$ join urlVersions -1 2 <(grep "^distribution.txt" zipFileNames | cut -f2) -2 1 | sed 's/ /\t/' | grep plazi | cut -f1 | sort -u | wc -l
36136
Looks like 36,136 out of the 36,758 (98.3%) have been observed at Plazi URLs.
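For the record, the quoted percentage can be double-checked with a one-liner (counts taken from the figures above):

```shell
# Fraction of distribution.txt zips observed at Plazi URLs.
awk 'BEGIN { printf "%.1f%%\n", 100 * 36136 / 36758 }'
```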
> how many unique urls are associated with the inspected hash zips?
$ cat zipFileNames | cut -f2 | sort -u | wc -l
60497
$ cat urlVersions | cut -f1 | sort -u | wc -l
103948
$ join urlVersions -1 2 zipFileNames -2 2 | sed 's/ /\t/' > hashUrlFiles
$ cut -f2 hashUrlFiles | sort -u | wc -l
62324
Out of the 103,948 URLs seen since Sept. 2018, 62,324 are associated with 60,497 inspected zips.
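For anyone unfamiliar with the join incantations above: join matches sorted key columns across two files. A toy version of the same hash-to-URL pattern, using made-up sample data rather than the real urlVersions:

```shell
# join matches rows by key: here field 2 of "urls" (the hash)
# against field 1 of "hashes". Both inputs must be sorted on the key.
# Sample data only -- not the real urlVersions file.
printf 'http://a.example\thash1\nhttp://b.example\thash2\n' | sort -k2 > urls
printf 'hash1\nhash2\n' > hashes
join -1 2 urls -2 1 hashes
```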
More numbers coming soon! The poor computer can only crunch so fast
while we wait...
$ cat urlVersions | cut -f2 | grep -v "well-known/genid" | sort -u | wc -l
543955
We're up to 543,955 content versions observed in GBIF, iDigBio, and BioCASe (this includes registries and such)
> How many dwc-a were not zips, but tar balls, or some other kind of format?
I used the `file` command to check the file types of ~~all our stuff we've collected~~ the stuff we collected on 2021-09-30:
$ time cat hashes | xargs -L 1 -I "hash" bash -c "echo hash | ./getFileTypes" > hashFileTypes
real 40m59.704s
user 8m58.900s
sys 3m13.106s
Here's the scoop:
$ cat hashFileTypes | cut -f2 | sort | uniq -c | sort -rn
60497 application/zip; charset=binary
9998 text/html; charset=utf-8
3500 text/html; charset=us-ascii
2761 text/plain; charset=utf-8
849 text/xml; charset=us-ascii
441 text/plain; charset=us-ascii
145 text/xml; charset=utf-8
3 application/gzip; charset=binary
1 text/xml; charset=utf-16le
1 inode/x-empty; charset=binary
1 application/octet-stream; charset=binary
Getting rid of that "charset" nonsense,
$ cat hashFileTypes | cut -f2 | cut -f1 -d";" | sort | uniq -c | sort -rn
60497 application/zip
13498 text/html
3202 text/plain
995 text/xml
3 application/gzip
1 inode/x-empty
1 application/octet-stream
We have a grand total of 3 gzips. Almost every archive is a zip!
$ join urlVersions -1 2 <(grep "gzip" hashFileTypes | cut -f1) -2 1
hash://sha256/905075cc27ba080137b79298a32e12b03223de77f7bb3f652bc64ff625cc09c7 https://hosted-datasets.gbif.org/datasets/fauna_europaea-lepidoptera.tar.gz
hash://sha256/ad022203de7b432a798ac18e64f1913be5b5c5e813f6ceabedf4e48416dd6c81 https://download.catalogueoflife.org/col/latest_dwca.zip
hash://sha256/bedcc1f122d59ec002e0e6d2802c0e422eadf6208669fff141a895bd3ed15d4a https://hosted-datasets.gbif.org/datasets/fauna_europaea.tar.gz
And those 3 gzips are indeed DwC-As.
The octet-stream is a GBIF registry page, really JSON; maybe the `file` command just went nuts:
$ join urlVersions -1 2 <(grep "octet-stream" hashFileTypes | cut -f1) -2 1
hash://sha256/b3c0d1b95b945cb0fee24f4586ee8beadf0e28871792c19bb078d7d4696d5f24 https://api.gbif.org/v1/dataset?offset=53940&limit=20
$ preston get hash://sha256/b3c0d1b95b945cb0fee24f4586ee8beadf0e28871792c19bb078d7d4696d5f24 --remote https://preston.guoda.bio/ | cut -b-100
{"offset":53940,"limit":20,"endOfRecords":false,"count":62389,"results":[{"key":"7fe1bc0e-f762-11e1-
And the inode/x-empty is just an empty file, which can be confirmed by sha256sum-ing an empty stream:
$ join urlVersions -1 2 <(grep "x-empty" hashFileTypes | cut -f1) -2 1
hash://sha256/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 http://bdj.pensoft.net/lib/ajax_srv/archive_download.php?archive_type=2&document_id=6313
$ echo -n "" | sha256sum
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
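That constant is worth recognizing on sight; as a small scripted variant of the same check (printf sidesteps echo -n portability quirks across shells):

```shell
# Verify the well-known SHA-256 of empty input matches the
# hash of the empty file observed above.
empty_hash="$(printf '' | sha256sum | cut -d' ' -f1)"
expected="e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
[ "$empty_hash" = "$expected" ] && echo "confirmed: empty input"
```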
> A grid of 1024 x 768 pixels would give about 786k / 60k > 10 pixels per dataset as defined by their hash. Big enough to be a perceivable dot on a screen . . . Then show the pattern for gbif, idigbio and combined. This would give a visual comparison. Also color may be used to code the frequency of occurrence. Probably something you already thought about.
> Curious to hear your thoughts!
Love it! A timelapse animation of that would be especially fun.... @JimmyFromRobotics :eyes:
Using provenance hash://sha256/49e079a0bac47ca17c0b14fa711b7742b9332ac64e1866adf13d294692720f9f with timestamp 2021-09-30 at https://preston.acis.ufl.edu/ (observatory of GBIF, iDigBio, and BioCASe), @JimmyFromRobotics and I took a peek at the contents of all the zip files observed in the latest preston crawl.
where zipFileNames is a list of zip files (right column) and the files they contain (left column).
Count zips:
Count the number of times each file name appears in the zip files:
Count files in the zips:
Count zips containing eml files:
Count zips containing occurrence tables:
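The code blocks that originally followed these captions didn't survive. As a rough sketch, assuming zipFileNames is a two-column TSV of file name and zip hash (consistent with the pipelines earlier in this thread, but still an assumption), the counts could look something like this, shown against a tiny made-up sample:

```shell
# Hypothetical reconstruction -- the original commands were not preserved.
# A tiny sample stands in for the real zipFileNames:
printf 'meta.xml\thashA\noccurrence.txt\thashA\neml.xml\thashB\nmeta.xml\thashB\n' > zipFileNames

cut -f2 zipFileNames | sort -u | wc -l                      # count zips
cut -f1 zipFileNames | sort | uniq -c | sort -rn            # file-name frequencies
wc -l < zipFileNames                                        # count files in the zips
grep 'eml\.xml' zipFileNames | cut -f2 | sort -u | wc -l    # zips containing eml files
grep 'occurrence' zipFileNames | cut -f2 | sort -u | wc -l  # zips containing occurrence tables
```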
@jhpoelen in case you were ever curious...