Counting files inside zips

mielliott commented 2 years ago

Using provenance hash://sha256/49e079a0bac47ca17c0b14fa711b7742b9332ac64e1866adf13d294692720f9f with timestamp 2021-09-30 at https://preston.acis.ufl.edu/ (observatory of GBIF, iDigBio, and BioCASe)

$ preston history -l tsv --remote https://preston.acis.ufl.edu/ | tail -n1 | cut -f1
hash://sha256/49e079a0bac47ca17c0b14fa711b7742b9332ac64e1866adf13d294692720f9f

@JimmyFromRobotics and I took a peek at the contents of all the zip files observed in the latest preston crawl.

$ preston get hash://sha256/49e079a0bac47ca17c0b14fa711b7742b9332ac64e1866adf13d294692720f9f --remote https://preston.acis.ufl.edu/ \
| grep hasVersion \
| sed -r 's_<[^>]*> <http://purl.org/pav/hasVersion> <(hash://sha256/.{64})>.*_\1_' \
| grep -v '^<' > hashes
$ cat hashes | xargs -L 1 -I "hash" bash -c "echo hash | ./getZipFileNames" > zipFileNames

where zipFileNames is a list of zip files (right column) and the files they contain (left column):

$ cat zipFileNames | head
eml.xml     hash://sha256/000268e00b4ad0b90fb5f6c272b95e0a8c2fd4ed4c5201c36d8c67af0cf412a3
media.csv   hash://sha256/000268e00b4ad0b90fb5f6c272b95e0a8c2fd4ed4c5201c36d8c67af0cf412a3
meta.xml    hash://sha256/000268e00b4ad0b90fb5f6c272b95e0a8c2fd4ed4c5201c36d8c67af0cf412a3
occurrences.csv hash://sha256/000268e00b4ad0b90fb5f6c272b95e0a8c2fd4ed4c5201c36d8c67af0cf412a3
references.csv  hash://sha256/000268e00b4ad0b90fb5f6c272b95e0a8c2fd4ed4c5201c36d8c67af0cf412a3
taxa.csv    hash://sha256/000268e00b4ad0b90fb5f6c272b95e0a8c2fd4ed4c5201c36d8c67af0cf412a3
eml.xml     hash://sha256/0003b19d9177032d8adbf5def577904ac0f84b5b88533e1415605caccedc8df4
media.csv   hash://sha256/0003b19d9177032d8adbf5def577904ac0f84b5b88533e1415605caccedc8df4
meta.xml    hash://sha256/0003b19d9177032d8adbf5def577904ac0f84b5b88533e1415605caccedc8df4
occurrences.csv hash://sha256/0003b19d9177032d8adbf5def577904ac0f84b5b88533e1415605caccedc8df4
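
To see what the hash-extraction step above does, here is a minimal sketch run on two made-up provenance statements. The example.org URLs and the UUID are placeholders, not real crawl output, and the stand-in hash is the SHA-256 of an empty stream. It uses `sed -E`, which both GNU and BSD sed accept, in place of `-r`.

```shell
# Stand-in content hash: SHA-256 of an empty stream (64 hex characters).
h=$(printf '' | sha256sum | cut -d' ' -f1)

# Two made-up statements: one points at a content hash, the other at a
# placeholder object that the sed expression should leave untouched.
statements=$(printf '%s\n' \
  "<https://example.org/dwca.zip> <http://purl.org/pav/hasVersion> <hash://sha256/$h> <urn:uuid:00000000-0000-0000-0000-000000000000> ." \
  "<https://example.org/gone.zip> <http://purl.org/pav/hasVersion> <https://example.org/.well-known/genid/1> <urn:uuid:00000000-0000-0000-0000-000000000000> .")

# Rewrite matching lines to the bare hash, then drop anything that still
# starts with "<" (that is, lines the sed expression did not match).
extracted=$(printf '%s\n' "$statements" \
  | grep hasVersion \
  | sed -E 's_<[^>]*> <http://purl.org/pav/hasVersion> <(hash://sha256/.{64})>.*_\1_' \
  | grep -v '^<')

printf '%s\n' "$extracted"
```

Only the first statement survives, reduced to its `hash://sha256/...` identifier.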

Count zips:

$ cat zipFileNames | cut -f2 | sort | uniq | wc -l
60497

Count the number of times each file name appears in the zip files:

$ cat zipFileNames | cut -f1 | sort | uniq -c | sort -rn | head -n50
  60119 meta.xml
  60114 eml.xml
  36758 distribution.txt
  36258 description.txt
  36139 taxa.txt
  36139 occurrences.txt
  36138 references.txt
  36138 media.txt
  36136 vernaculars.txt
  35759 multimedia.txt
  14536 occurrence.txt
   1885 event.txt
   1225 occurrences.csv
    965 taxon.txt
    648 images.csv
    647 identifications.csv
    468 extendedmeasurementorfact.txt
    430 measurementorfact.txt
    380 media.csv
    355 speciesprofile.txt
    350 taxa.csv
    350 references.csv
    269 resourcerelationship.txt
    247 occurrence.csv
    135 response.00001.xml
    107 typesandspecimen.txt
    106 response.00002.xml
    102 reference.txt
    100 response.00003.xml
    100 identification.txt
     95 response.00004.xml
     87 response.00005.xml
     80 response.00006.xml
     79 DarwinCore.txt
     78 response.00007.xml
     77 image.csv
     75 response.00008.xml
     73 vernacularname.txt
     73 response.00009.xml
     72 response.00010.xml
     70 response.00011.xml
     68 response.00012.xml
     67 response.00013.xml
     65 response.00014.xml
     63 dwca/vernacularnames.csv
     63 dwca/specimen.csv
     63 dwca/reference.csv
     63 dwca/meta.xml
     63 dwca/gbif-dwca.csv
     63 dwca/eml.xml
... [thousands more]
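
The two counts above can be reproduced on a toy stand-in for zipFileNames. This is only a sketch: the shortened hashes and the five rows are invented for illustration.

```shell
TAB=$(printf '\t')

# Toy stand-in for zipFileNames: file name <TAB> zip hash (hashes shortened).
toy=$(printf '%s\n' \
  "eml.xml${TAB}hash://sha256/aaa" \
  "meta.xml${TAB}hash://sha256/aaa" \
  "eml.xml${TAB}hash://sha256/bbb" \
  "occurrence.txt${TAB}hash://sha256/bbb" \
  "meta.xml${TAB}hash://sha256/bbb")

# Distinct zips: unique values in column 2.
zip_count=$(printf '%s\n' "$toy" | cut -f2 | sort -u | wc -l | tr -d ' ')

# How often each file name occurs, most frequent first.
name_counts=$(printf '%s\n' "$toy" | cut -f1 | sort | uniq -c | sort -rn)

echo "zips: $zip_count"
printf '%s\n' "$name_counts"
```

Note that the second pipeline counts rows per file name, so it equals "zips containing that name" only if no zip lists the same name twice.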

Count files in the zips:

$ cat zipFileCounts | cut -b-7 | paste -sd+ | bc
492307
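
The thread does not show how zipFileCounts was created; presumably it is the full `uniq -c` tally saved to a file (an assumption). `cut -b-7` then appears to rely on GNU `uniq -c` padding counts to a 7-character column. A sketch of an equivalent sum with awk, which depends on neither the column width nor `bc`:

```shell
# Toy tally in `uniq -c` layout (count, then file name). These are the top
# three rows from the listing above, not the full file.
tally=$(printf '%s\n' \
  "  60119 meta.xml" \
  "  60114 eml.xml" \
  "  36758 distribution.txt")

# Sum the first whitespace-separated field of every row.
total=$(printf '%s\n' "$tally" | awk '{ sum += $1 } END { print sum }')
echo "$total"
```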

Count zips containing eml files:

$ cat zipFileNames | grep "eml" | cut -f2 | sort | uniq | wc -l
60341

$ cat zipFileCounts | grep "eml" | cut -b-7 | paste -sd+ | bc
60349

Count zips containing occurrence tables:

$ cat zipFileNames | grep "occur" | cut -f2 | sort | uniq | wc -l
52227

$ cat zipFileCounts | grep "occur" | cut -b-7 | paste -sd+ | bc
52227
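
`grep "eml"` matches anywhere on the line. That is harmless here because the hash column is hexadecimal, but a more defensive variant matches on the file-name column only. The sketch below uses invented rows and also shows one possible reason the file count (60349) can exceed the zip count (60341): a zip may hold more than one matching file name.

```shell
TAB=$(printf '\t')

# Toy rows: file name <TAB> zip hash (hashes shortened, data invented).
toy=$(printf '%s\n' \
  "eml.xml${TAB}hash://sha256/aaa" \
  "dwca/eml.xml${TAB}hash://sha256/aaa" \
  "eml.xml${TAB}hash://sha256/bbb" \
  "meta.xml${TAB}hash://sha256/ccc")

# Rows whose file name (column 1 only) contains "eml".
eml_rows=$(printf '%s\n' "$toy" | awk -F '\t' '$1 ~ /eml/')

# Matching files versus distinct zips that contain at least one match.
eml_files=$(printf '%s\n' "$eml_rows" | wc -l | tr -d ' ')
eml_zips=$(printf '%s\n' "$eml_rows" | cut -f2 | sort -u | wc -l | tr -d ' ')
echo "files: $eml_files, zips: $eml_zips"
```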

@jhpoelen in case you were ever curious...

jhpoelen commented 2 years ago

@JimmyFromRobotics @mielliott very neat to see the distribution of zip file names.

Am I correct if I say that the files with ~36758 frequency of occurrence originate from Plazi?

jhpoelen commented 2 years ago

From https://gbif.org (see attached screenshot), there's a claim that there are 63k datasets. However, only about 60k meta.xml files were found. Assuming that every dwc-a has a meta.xml file, this means that about 3k datasets are unaccounted for, invalid, or duplicates originating from different dataset endpoints.

With this, questions come up like: how many unique urls are associated with the inspected hash zips? How many dwc-a were not zips, but tar balls, or some other kind of format?

So many questions . . . I wonder what would be a good way to visualize this at GBIF / iDigBio scales.

Screenshot from 2021-10-25 15-23-04

jhpoelen commented 2 years ago

A grid of 1024 x 768 pixels would give about 786k / 60k > 10 pixels per dataset as defined by their hash. Big enough to be a perceivable dot on a screen . . . Then show the pattern for gbif, idigbio and combined. This would give a visual comparison. Also color may be used to code the frequency of occurrence. Probably something you already thought about.

Curious to hear your thoughts!
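
A quick check of that arithmetic, assuming one dataset per distinct zip hash (the 60497 counted earlier in the thread) rather than the rounded 60k:

```shell
# Pixels available per zip on a 1024 x 768 grid.
pixels=$((1024 * 768))
per_dataset=$(LC_ALL=C awk -v p="$pixels" -v n=60497 'BEGIN { printf "%.1f", p / n }')
echo "$pixels pixels / 60497 zips = $per_dataset pixels per zip"
```

That leaves room for roughly a 3 x 4 pixel block per zip.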

mielliott commented 2 years ago

> Am I correct if I say that the files with ~36758 frequency of occurrence originate from Plazi?

I'm guessing you're looking at the number of zips containing "distribution.txt"? Let's check!

$ preston alias -l tsv --remote file:///mnt/preston.acis.ufl.edu_data/gbif-idigbio-biocase/data/ | cut -f1,3 | sort -u -k2 > urlVersions
$ join urlVersions -1 2 <(grep "^distribution.txt" zipFileNames | cut -f2) -2 1 | sed 's/ /\t/' | grep plazi | cut -f1 | sort -u | wc -l
36136

Looks like 36,136 out of the 36,758 (98.3%) have been observed at Plazi URLs.
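
The join step is easy to get wrong because `join` expects both inputs sorted on the join field in the same collation. Here is a minimal sketch on invented data: the `.example` URLs and shortened hashes are placeholders, and temp files stand in for the process substitution used above.

```shell
export LC_ALL=C   # sort and join must agree on collation
TAB=$(printf '\t')
tmp=$(mktemp -d)

# Toy urlVersions: URL <TAB> hash (made-up URLs), sorted on the hash column.
printf '%s\n' \
  "https://tb.plazi.example/a.zip${TAB}hash://sha256/aaa" \
  "https://mirror.plazi.example/a.zip${TAB}hash://sha256/aaa" \
  "https://ipt.example/b.zip${TAB}hash://sha256/bbb" \
  "https://tb.plazi.example/c.zip${TAB}hash://sha256/ccc" \
  | sort -t "$TAB" -k2,2 > "$tmp/urlVersions"

# Toy list of zips that contain distribution.txt, sorted.
printf '%s\n' "hash://sha256/aaa" "hash://sha256/bbb" | sort > "$tmp/distHashes"

# Join on the hash (output: hash <TAB> URL), keep rows with a Plazi-looking
# URL, then count distinct hashes.
plazi_zips=$(join -t "$TAB" -1 2 -2 1 "$tmp/urlVersions" "$tmp/distHashes" \
  | grep plazi | cut -f1 | sort -u | wc -l | tr -d ' ')

echo "$plazi_zips"
rm -r "$tmp"
```

The zip seen at two Plazi URLs is counted once, and the Plazi zip without distribution.txt drops out of the join.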

> how many unique urls are associated with the inspected hash zips?

$ cat zipFileNames | cut -f2 | sort -u | wc -l
60497
$ cat urlVersions | cut -f1 | sort -u | wc -l
103948
$ join urlVersions -1 2 zipFileNames -2 2 | sed 's/ /\t/' > hashUrlFiles
$ cut -f2 hashUrlFiles | sort -u | wc -l
62324

Out of the 103,948 URLs seen since Sept. 2018, 62,324 are associated with 60,497 inspected zips.

More numbers coming soon! The poor computer can only crunch so fast

mielliott commented 2 years ago

while we wait...

$ cat urlVersions | cut -f2 | grep -v "well-known/genid" | sort -u | wc -l
543955

We're up to 543,955 content versions observed in GBIF, iDigBio, and BioCASe (this includes registries and such).

mielliott commented 2 years ago

> How many dwc-a were not zips, but tar balls, or some other kind of format?

I used the file command to check the file types of the stuff we collected on 2021-09-30:

$ time cat hashes | xargs -L 1 -I "hash" bash -c "echo hash | ./getFileTypes" > hashFileTypes
real    40m59.704s
user    8m58.900s
sys     3m13.106s

Here's the scoop:

$ cat hashFileTypes | cut -f2 | sort | uniq -c | sort -rn
  60497 application/zip; charset=binary
   9998 text/html; charset=utf-8
   3500 text/html; charset=us-ascii
   2761 text/plain; charset=utf-8
    849 text/xml; charset=us-ascii
    441 text/plain; charset=us-ascii
    145 text/xml; charset=utf-8
      3 application/gzip; charset=binary
      1 text/xml; charset=utf-16le
      1 inode/x-empty; charset=binary
      1 application/octet-stream; charset=binary

Getting rid of that "charset" nonsense,

$ cat hashFileTypes | cut -f2 | cut -f1 -d";" | sort | uniq -c | sort -rn
  60497 application/zip
  13498 text/html
   3202 text/plain
    995 text/xml
      3 application/gzip
      1 inode/x-empty
      1 application/octet-stream

We have a grand total of 3 gzips. Almost every archive is a zip!
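
As a cross-check, the charset-free tally can be recomputed from the first tally alone. This sketch pastes the numbers above as input and sums counts per MIME type with awk.

```shell
# First tally from above (count, MIME type, charset).
with_charset=$(printf '%s\n' \
  "  60497 application/zip; charset=binary" \
  "   9998 text/html; charset=utf-8" \
  "   3500 text/html; charset=us-ascii" \
  "   2761 text/plain; charset=utf-8" \
  "    849 text/xml; charset=us-ascii" \
  "    441 text/plain; charset=us-ascii" \
  "    145 text/xml; charset=utf-8" \
  "      3 application/gzip; charset=binary" \
  "      1 text/xml; charset=utf-16le" \
  "      1 inode/x-empty; charset=binary" \
  "      1 application/octet-stream; charset=binary")

# Strip the trailing ";" from the MIME type and add up counts per type.
by_type=$(printf '%s\n' "$with_charset" \
  | awk '{ sub(/;$/, "", $2); n[$2] += $1 } END { for (t in n) print n[t], t }' \
  | sort -rn)

printf '%s\n' "$by_type"
```

The per-type sums agree with the second listing, for example 9998 + 3500 = 13498 for text/html.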

$ join urlVersions -1 2 <(grep "gzip" hashFileTypes | cut -f1) -2 1
hash://sha256/905075cc27ba080137b79298a32e12b03223de77f7bb3f652bc64ff625cc09c7 https://hosted-datasets.gbif.org/datasets/fauna_europaea-lepidoptera.tar.gz
hash://sha256/ad022203de7b432a798ac18e64f1913be5b5c5e813f6ceabedf4e48416dd6c81 https://download.catalogueoflife.org/col/latest_dwca.zip
hash://sha256/bedcc1f122d59ec002e0e6d2802c0e422eadf6208669fff141a895bd3ed15d4a https://hosted-datasets.gbif.org/datasets/fauna_europaea.tar.gz

And those 3 gzips are indeed DwC-As.

The octet-stream is a GBIF registry page, really a json; maybe the file command just went nuts:

$ join urlVersions -1 2 <(grep "octet-stream" hashFileTypes | cut -f1) -2 1
hash://sha256/b3c0d1b95b945cb0fee24f4586ee8beadf0e28871792c19bb078d7d4696d5f24 https://api.gbif.org/v1/dataset?offset=53940&limit=20
$ preston get hash://sha256/b3c0d1b95b945cb0fee24f4586ee8beadf0e28871792c19bb078d7d4696d5f24 --remote https://preston.guoda.bio/ | cut -b-100
{"offset":53940,"limit":20,"endOfRecords":false,"count":62389,"results":[{"key":"7fe1bc0e-f762-11e1-

and the inode/x-empty is just an empty file. Can confirm by sha256suming an empty stream:

$ join urlVersions -1 2 <(grep "x-empty" hashFileTypes | cut -f1) -2 1
hash://sha256/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 http://bdj.pensoft.net/lib/ajax_srv/archive_download.php?archive_type=2&document_id=6313
$ echo -n "" | sha256sum
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

mielliott commented 2 years ago

> A grid of 1024 x 768 pixels would give about 786k / 60k > 10 pixels per dataset as defined by their hash. Big enough to be a perceivable dot on a screen . . . Then show the pattern for gbif, idigbio and combined. This would give a visual comparison. Also color may be used to code the frequency of occurrence. Probably something you already thought about.
>
> Curious to hear your thoughts!

Love it! A timelapse animation of that would be especially fun.... @JimmyFromRobotics :eyes:

bio-linker / us-dataset-finder
