joelnitta opened 2 years ago
Sorry, that was a bit premature: I just tried `curl`-ing the file, then `register()`-ing the downloaded file, and it was clear that the vast majority of the time is spent downloading, not calculating the hash. Still, I encountered this warning message; not sure if it's something to worry about:
> register("temp/uniprot_trembl.fasta.gz", registries = "data/local.tsv")
[1] "hash://sha256/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1"
Warning message:
In data.frame(identifier = as_hashuri(id), source = as_chr(source), :
NAs introduced by coercion to integer range
@joelnitta Thanks. The warning isn't that important; it suggests that it got a status code or file size back that it couldn't coerce into an integer, though I'm not sure why.
Yup, large files can be a challenge, though `contentid` already does a few things that may be relevant:

- When using a URL, `register()` with a local registry (or merely `content_id()`) will stream the file, meaning it computes the hash as it downloads; you never have to store the whole file to disk first. This can technically hash files much larger than the available disk space.
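To make the streaming point concrete, here is a minimal shell sketch (file names are illustrative stand-ins for a large remote object) showing that hashing a stream yields the same digest as hashing the stored file:

```shell
# Hashing piped bytes gives the same digest as hashing the stored file,
# so no intermediate copy is needed. /tmp/demo_payload is a stand-in.
printf 'some large payload' > /tmp/demo_payload
HASH_FILE=$(sha256sum /tmp/demo_payload | cut -d' ' -f1)
HASH_STREAM=$(cat /tmp/demo_payload | sha256sum | cut -d' ' -f1)
# For a real remote object, the same idea applies without touching disk:
#   curl -sL "$URL" | sha256sum
[ "$HASH_FILE" = "$HASH_STREAM" ] && echo "match"
```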
- We use the openssl bindings for the hash algorithms, which support assembly-level code for hardware-specific acceleration. The SHA family of algorithms is so essential to modern computing that newer CPUs include dedicated acceleration for them built into the chip, meaning SHA-256 can actually be faster than older, simpler algorithms like MD5 on that hardware. High-performance network connections can still transfer faster than even hardware-accelerated algorithms can hash, though, so hashing can still be a bottleneck.
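If you want to check the hardware-acceleration claim on your own machine, openssl's built-in benchmark can compare the two algorithms directly (throughput numbers are entirely machine-specific):

```shell
# Benchmark EVP-optimized SHA-256 vs MD5 throughput for 1 second per
# block size; on CPUs with SHA extensions, sha256 often matches or
# beats md5.
openssl speed -seconds 1 -evp sha256
openssl speed -seconds 1 -evp md5
```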
There are faster, non-cryptographic-quality algorithms that could be used as hash IDs, like crc32, but I haven't added those. Obviously it's less convenient when a given object is known by a lot of different identifiers determined by different hashes. `contentid` (like hash-archive.org) does have the ability to simultaneously compute multiple hashes (md5, sha1, sha256, sha512), and thus resolve any content via any one of those; so maybe it is worth adding support for a few fast, simple hashes like crc32.
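The "multiple hashes in one pass" idea can be sketched in plain shell with `tee` fanning the bytes out to several hashers via named pipes, so the data is read only once. This is just an illustration of the idea, not how `contentid` implements it; paths are throwaway temp files:

```shell
# /tmp/demo_obj stands in for a large file.
printf 'hello' > /tmp/demo_obj
rm -f /tmp/p_md5 /tmp/p_sha1
mkfifo /tmp/p_md5 /tmp/p_sha1
# Start one hasher per pipe in the background...
md5sum  < /tmp/p_md5  > /tmp/demo.md5  &
sha1sum < /tmp/p_sha1 > /tmp/demo.sha1 &
# ...then tee the data to both pipes plus stdout in a single read.
tee /tmp/p_md5 /tmp/p_sha1 < /tmp/demo_obj | sha256sum > /tmp/demo.sha256
wait
cat /tmp/demo.md5 /tmp/demo.sha1 /tmp/demo.sha256
rm -f /tmp/p_md5 /tmp/p_sha1
```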
All the same, for really, really large data, downloading at all is often not an ideal workflow; we can be better off computing against remote objects in cloud storage directly (see #82).
@joelnitta @cboettig great discussion on large files!
I tried downloading and hashing the file with a different tool over a consumer-grade 500 Mb fiber internet connection from Minneapolis and found:
$ time preston track https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz
<https://preston.guoda.bio> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#SoftwareAgent> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<https://preston.guoda.bio> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Agent> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<https://preston.guoda.bio> <http://purl.org/dc/terms/description> "Preston is a software program that finds, archives and provides access to biodiversity datasets."@en <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> <http://purl.org/dc/terms/description> "A crawl event that discovers biodiversity archives."@en <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> <http://www.w3.org/ns/prov#startedAtTime> "2022-04-19T21:46:07.751Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> <http://www.w3.org/ns/prov#wasStartedBy> <https://preston.guoda.bio> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<https://doi.org/10.5281/zenodo.1410543> <http://www.w3.org/ns/prov#usedBy> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<https://doi.org/10.5281/zenodo.1410543> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/dc/dcmitype/Software> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<https://doi.org/10.5281/zenodo.1410543> <http://purl.org/dc/terms/bibliographicCitation> "Jorrit Poelen, Icaro Alzuru, & Michael Elliott. 2021. Preston: a biodiversity dataset tracker (Version 0.0.1-SNAPSHOT) [Software]. Zenodo. http://doi.org/10.5281/zenodo.1410543"@en <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/dc/terms/description> "A biodiversity dataset graph archive."@en <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<hash://sha256/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1> <http://www.w3.org/ns/prov#wasGeneratedBy> <urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> <urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> .
<hash://sha256/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1> <http://www.w3.org/ns/prov#qualifiedGeneration> <urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> <urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> .
<urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> <http://www.w3.org/ns/prov#generatedAtTime> "2022-04-19T23:18:24.963Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> .
<urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> <urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> .
<urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> <http://www.w3.org/ns/prov#wasInformedBy> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> <urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> .
<urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> <http://www.w3.org/ns/prov#used> <https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz> <urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> .
<https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz> <http://purl.org/pav/hasVersion> <hash://sha256/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1> <urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> .
real 92m18.167s
user 40m39.826s
sys 3m33.369s
I believe this means that it took 92 minutes to retrieve the asset. More later, about to have dinner ; )
@cboettig Thanks for the info. Actually, this is tangentially related to #82 as well, because this file is on an FTP server and I was unable to use hash-archive. The workaround is to use a local registry and include it in my code repo, but that is slightly less portable because it requires an additional file to go along with the code.
Sounds like crc32 might be worth it for really big files, although a few minutes for calculating SHA-256 of a 50 GB file isn't too bad.
@jhpoelen That sounds about right. `curl` took 2 hr for me, whereas the hash calculation took a few minutes.
uniprot_trembl.fasta.gz 100%[=====>] 52.75G 13.9MB/s in 1h 54m
Just finished dinner . . .
$ preston cat hash://sha256/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1 | pv > /dev/null
52.8GiB 0:04:31 [ 198MiB/s]
A little faster without preston, using vanilla `cat` instead:
$ cat data/e3/90/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1 | pv > /dev/null
52.8GiB 0:03:32 [ 254MiB/s]
now calculating sha256
$ cat data/e3/90/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1 | pv | sha256sum
52.8GiB 0:06:43 [ 133MiB/s] [ <=> ]
e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1 -
Probably nothing new for you, but as I was waiting for my poor laptop to crunch those numbers, the following thoughts crossed my mind:
$ time preston cat 'line:gz:hash://sha256/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1!/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1!/L10'
YYGMDLIKSKYKVDWKNPNPKDKVKCTQEIVNEISTEVALNAMENYEMFPTALEDHFGGS
real 0m0.756s
user 0m1.200s
sys 0m0.107s
With this, you can imagine trusted CDNs (content delivery networks) strategically storing/caching whole "big" files, or parts of them, to make the data a little easier to handle.
Related issues include https://github.com/bio-guoda/preston/issues/163.
fyi @mielliott
fyi @pgasu
@jhpoelen but this is brilliant! it never occurred to me to use range requests here.
Hmmm.... as you point out, it would do little good to add a faster hashing algorithm like crc32 in most cases, since, as in this example, the vast majority of the time is spent downloading.
Using range requests when you know what portion of the data you want makes lots of sense, though I guess we often don't know that a priori.
However, if we merely need to compute a reliable identifier for an object, it is tempting to consider a convention like "the hash of the first 1024 bytes" or something like that as the file's identifier, e.g. something like
curl -r 0-1024 https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz | sha256sum
Maybe this would require a different syntax to distinguish it from a content-id based on the full string, e.g. maybe we call this:
bytes:1024:hash://sha256/04355287211f548988a1671d08c3946a7c3a03ce1d6a351a0df4f207a78d5a13
or
hash://sha256/04355287211f548988a1671d08c3946a7c3a03ce1d6a351a0df4f207a78d5a13?bytes=1024
Obviously this has clear failure modes, like when a document is sharded or extended (e.g. by adding additional rows). And repositories like Software Heritage and DataONE would never be queryable by partial hashes, since such hashes are of no use as checksums for the integrity of the whole file. Still, it seems like there would be ample cases with very large files where such identifiers would be fit for purpose? Or is that just asking for trouble?
@cboettig Neat idea to also calculate (partial) hashes. I'd be curious to see how many files share the same first 1k bytes. Surely, someone somewhere did some research on that.
Some rambling thoughts:

And . . . I imagine a two-tier search would help speed up content discovery. Similar to how Bloom filters help exclude true negative matches, the header hashes could serve as a quick-and-dirty search-space reduction or validation.

Ideally, 1./2. would eliminate a ton of CPU/bandwidth otherwise spent on searching an entire index or downloading a giant file that doesn't actually contain what you are looking for.
btw, a notation we've been using (introduced by @mielliott) is:
$ preston cat 'cut:hash://sha256/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1!/b1-100' | sha256sum
373a15f28581f5932da9ed6f2cc654cd2a115355e6970328ab0306210bab14ad -
uses POSIX/Linux `cut` notation for byte ranges. I'm getting a ton of mileage out of these ancient and powerful Unix tools. (Reminds me of our experimental workshop at https://www.idigbio.org/wiki/index.php/Digital_Data_in_Biodiversity_Research_Conference,_Berkeley#Monday.2C_4_June_2018 https://www.idigbio.org/wiki/images/a/a1/1-Poelen-frugal_tools_2018-06-04.pdf )
Thanks @jhpoelen, really cool stuff here. I like your prefix notation; I'd recently encountered that in GDAL's virtual filesystems: https://gdal.org/user/virtual_file_systems.html, but hadn't seen the Apache Commons VFS you mentioned (perhaps GDAL's version is based on that as well). A bit of a tangent, but I've been relying on these virtual-filesystem approaches in GDAL and Apache Arrow quite a bit recently to access pieces of large remote files, and have been struggling to envision a way of making that more `contentid`-compatible (e.g. in #82).
If I understand the two-step search, the idea would be to speed up the case where the URL returned for some hash happens to have different content, allowing `resolve()` to skip it and go on to try the next source (if one is available) more quickly? That's a nice example, though it does feel a bit like an edge case.
Yes, you're probably right that there must be some theory on shared bytes. Using the header may be particularly fragile; there's probably some much more clever way to construct a range request to reduce collision probability. (I'm reminded of how the `readr` library switched its algorithm for guessing the data type of csv columns from looking at the first 1000 lines to sampling 1000 lines throughout the document, e.g. https://github.com/tidyverse/vroom/issues/352.) Obviously still not infallible, but it would be much more robust than looking at headers....
p.s. Thanks for the note about the long int on file size; yeah, that's a great point. I'll drop it into a separate issue.
I like the conversation on progressive reduction of the search space based on the contentid of the first n bytes. This reminds me of the space-filling-curve approach in geohash and S2 Geometry geospatial grid search, where the hash value is topologically linked to spatial location, such that any two hashes that share a prefix are "close" to each other in geo-space; that is, they preserve locality of reference. For example, the two geohash values `u4pruydqqvj` and `u4pruydvqvj` are within the region specified by the prefix `u4pruyd`, and so this allows very rapid reductions in search space based on simple string indexes. Is there an equivalent to a space-filling Hilbert curve that we could use for content-based "proximity"?
@mbjones We've (@mielliott) experimented with "semantic" hashes that allow for calculating distances. There are actually some features built into Preston that help calculate these hashes (e.g., Bloom filters, TLSH hashes) and perform some set operations. I guess the trick is to balance computational efficiency with partitioning all possible content into a uniformly distributed search space.
Oh, and a more simplistic approach we use to store hashes on disk is a `[a-f0-9]{2}/[a-f0-9]{2}/` prefix layout, e.g.,
$ preston track "https://duckduckgo.com"
[...]
$ find data -type f
data/f4/dc/f4dc427b098916734b6a3a5a6b805c87e87e2cb48e629262230c4bec3b273369
data/2a/5d/2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a
data/84/73/847391d8641c5a3d26bc218069b018f0f2c13b7508eb1cd5750321d8419c97ca
to randomly partition content by their hashes. This distribution is expected to be uniform, but similar content will probably not be partitioned together.
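That layout is easy to reproduce in shell; a sketch of deriving the storage path from a digest (the hash is the uniprot one from earlier in the thread), where the first two hex byte-pairs become directory names, spreading content over 65,536 roughly uniformly filled buckets:

```shell
# Derive the two-level data/xx/yy/<hash> path from a sha256 hex digest.
HASH=e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1
P1=$(printf '%s' "$HASH" | cut -c1-2)   # first byte pair  -> "e3"
P2=$(printf '%s' "$HASH" | cut -c3-4)   # second byte pair -> "90"
echo "data/$P1/$P2/$HASH"
```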
I call these hash coordinates the "GPS" coordinates of the content-based universe, or the Teddy-verse (named after Ted Nelson, who had a more content-based take on linking data than Tim Berners-Lee's name-based universe, aka the Timmy-verse).
accidentally closed - apologies - I am getting too excited about this topic ; )
Just revisiting this. To summarise:

In many use cases/contexts, it looks like the download speed is rate-limiting, while hashing the local object is quite fast, even for sha256 on large objects (particularly given modern components like NVMe drives and recent CPUs with hardware support for SHA). So while this doesn't help with cases like #82, where the workflow is not intended to download the entire asset, under any workflow involving full downloads the additional cost of a hash isn't that much. Additionally, by using a local content store (e.g. the `resolve(id, store=TRUE)` option), `contentid` can easily avoid unnecessary downloads when the object is already available in the local store.
There are obviously other cases where hashing is rate-limiting, e.g. generating hashes for a large number of very large objects for metadata purposes only, to determine hash IDs for those objects. However, re-computing hashes for stored objects is already a routine part of file-integrity checks in large object stores (and while md5 is still seen as a cheaper mechanism for this, on modern hardware it's not always faster anyway).
There are probably still many good use cases where a fast hash of the first n bytes makes a good ID. We'd need a syntax to flag this in the identifier; I don't think such a thing is part of the named-information or hash-uri spec so far.
I think we resolved the OP in that the slow speed observed by @joelnitta was merely due to slow download, independent of `contentid`. In workflows like this, which already involve downloading all the bytes, the existing `contentid` implementation should be sufficient, in that (a) it isn't adding significantly more overhead to the runtime, and (b) it can help avoid re-downloading if code uses embedded contentid references.
Checking the first 100 bytes as a signature still seems applicable when combined with the workflows discussed in #82, which will rely on range requests anyway, but it is probably unwise to rely on the first 100 bytes alone. Some simple heuristic (first 100 + last 100 bytes?) would probably be better, though an agreed-upon standard, rather than an arbitrary mechanism, would clearly be preferred for anything that wants to count as an 'identifier' of the data.
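A rough sketch of such a hypothetical first+last-bytes signature, using a local stand-in file (for a remote object, the two ranges could come from HTTP range requests, e.g. `curl -r 0-99` and `curl -r -100`):

```shell
# Build a stand-in for a large object, then hash its first and last 100
# bytes as a cheap, order-sensitive signature. This is NOT a full-file
# checksum; it only serves as a quick discriminator.
seq 1 100000 > /tmp/big_stand_in
SIG=$({ head -c 100 /tmp/big_stand_in; tail -c 100 /tmp/big_stand_in; } | sha256sum | cut -d' ' -f1)
echo "signature: $SIG"
```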
Will add some further thoughts specific to workflows that never download the full objects over in #82.
I am trying to register some large files, for example uniprot_trembl.fasta.gz at https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete (53 GB).
It is taking a long time (no surprise), but I wonder whether this is due more to downloading the file or to calculating the hash. If the latter, might it be possible to speed things up with some sort of faster algorithm or parallelization? (I have basically no knowledge of hash algorithms; I just wanted to see if this might be a possibility...)
Any other tips on handling large files with {contentid} would also be appreciated if you have any ideas. Thanks!