cboettig / contentid

:package: R package for working with Content Identifiers
http://cboettig.github.io/contentid

Handling large files #86

Open joelnitta opened 2 years ago

joelnitta commented 2 years ago

I am trying to register some large files, for example uniprot_trembl.fasta.gz (53 GB) at https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete.

It is taking a long time (no surprise), but I wonder whether this is due more to downloading the file or to calculating the hash. If the latter, might it be possible to speed things up with a faster algorithm or with parallelization? (I have basically no knowledge of hash algorithms; I just wanted to see if this might be a possibility.)

Any other tips on handling large files with {contentid} would also be appreciated. Thanks!

joelnitta commented 2 years ago

Sorry, that was a bit premature: I just tried curling the file and then register()ing the downloaded copy, and it was clear that the vast majority of the time is spent downloading, not calculating the hash. Still, I encountered this warning message and am not sure whether it is something to worry about:

> register("temp/uniprot_trembl.fasta.gz", registries = "data/local.tsv")
[1] "hash://sha256/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1"
Warning message:
In data.frame(identifier = as_hashuri(id), source = as_chr(source),  :
  NAs introduced by coercion to integer range
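
For completeness, a minimal sketch of that two-step workflow (the shell curl step could equally be curl::curl_download() from R; paths are the ones from the session above):

library(contentid)

url  <- "https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz"
dest <- "temp/uniprot_trembl.fasta.gz"

dir.create("temp", showWarnings = FALSE)
curl::curl_download(url, dest)                      # the slow part (~2 h on this connection)

# hashing the local copy takes only a few minutes
register(dest, registries = "data/local.tsv")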
cboettig commented 2 years ago

@joelnitta Thanks. The warning isn't that important; it suggests that register() got back a status code or file size that it couldn't coerce into an integer, though I'm not sure why.
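
For what it's worth, that same warning appears whenever a value larger than R's 32-bit integer maximum (about 2.1e9) is coerced to integer, which would be consistent with a ~53 GB file size:

> as.integer(53 * 1024^3)
[1] NA
Warning message:
NAs introduced by coercion to integer range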

Yup, large files can be a challenge, though contentid already does a few things that may be relevant here.

All the same, for really, really large data, downloading at all is often not an ideal workflow; we can be better off computing against remote objects in cloud storage directly (see #82).

jhpoelen commented 2 years ago

@joelnitta @cboettig great discussion on large files!

I tried to download and calculate the hash of the file with a different tool (preston), over a consumer-grade 500 Mbps fiber connection from Minneapolis, and found:

$ time preston track https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz
<https://preston.guoda.bio> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#SoftwareAgent> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<https://preston.guoda.bio> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Agent> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<https://preston.guoda.bio> <http://purl.org/dc/terms/description> "Preston is a software program that finds, archives and provides access to biodiversity datasets."@en <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> <http://purl.org/dc/terms/description> "A crawl event that discovers biodiversity archives."@en <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> <http://www.w3.org/ns/prov#startedAtTime> "2022-04-19T21:46:07.751Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> <http://www.w3.org/ns/prov#wasStartedBy> <https://preston.guoda.bio> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<https://doi.org/10.5281/zenodo.1410543> <http://www.w3.org/ns/prov#usedBy> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<https://doi.org/10.5281/zenodo.1410543> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/dc/dcmitype/Software> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<https://doi.org/10.5281/zenodo.1410543> <http://purl.org/dc/terms/bibliographicCitation> "Jorrit Poelen, Icaro Alzuru, & Michael Elliott. 2021. Preston: a biodiversity dataset tracker (Version 0.0.1-SNAPSHOT) [Software]. Zenodo. http://doi.org/10.5281/zenodo.1410543"@en <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<urn:uuid:0659a54f-b713-4f86-a917-5be166a14110> <http://purl.org/dc/terms/description> "A biodiversity dataset graph archive."@en <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> .
<hash://sha256/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1> <http://www.w3.org/ns/prov#wasGeneratedBy> <urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> <urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> .
<hash://sha256/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1> <http://www.w3.org/ns/prov#qualifiedGeneration> <urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> <urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> .
<urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> <http://www.w3.org/ns/prov#generatedAtTime> "2022-04-19T23:18:24.963Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> .
<urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Generation> <urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> .
<urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> <http://www.w3.org/ns/prov#wasInformedBy> <urn:uuid:2fe6cf39-d75e-4cea-b390-d4c4c37b9bc5> <urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> .
<urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> <http://www.w3.org/ns/prov#used> <https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz> <urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> .
<https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz> <http://purl.org/pav/hasVersion> <hash://sha256/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1> <urn:uuid:fb8595c3-45dc-4f7d-8b03-38baa9ec745f> .

real    92m18.167s
user    40m39.826s
sys 3m33.369s

I believe this means that it took 92 minutes to retrieve the asset. More later, about to have dinner ; )

joelnitta commented 2 years ago

@cboettig Thanks for the info. Actually, this is tangentially related to #82 as well, because this file is on an FTP server and I was unable to use Hash Archive. The workaround is to use a local registry and include it in my code repo, but that is slightly less portable because it requires an additional file to travel along with the code.

Sounds like crc32 might not be worth it for really big files. A few minutes for calculating the sha-256 of a 50 GB file isn't too bad.

@jhpoelen That sounds about right. curl took 2 hr for me, whereas the hash calculation was a few minutes.

uniprot_trembl.fasta.gz       100%[=====>]  52.75G  13.9MB/s    in 1h 54m
jhpoelen commented 2 years ago

Just finished dinner . . .

$ preston cat hash://sha256/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1 | pv  > /dev/null
52.8GiB 0:04:31 [ 198MiB/s] 

A little faster without preston, using vanilla cat instead:

$ cat data/e3/90/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1 | pv > /dev/null
52.8GiB 0:03:32 [ 254MiB/s]

now calculating sha256

$ cat data/e3/90/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1 | pv | sha256sum
52.8GiB 0:06:43 [ 133MiB/s] [      <=>                                                                         ]
e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1  -

Probably nothing new for you, but as I was waiting for my poor laptop to crunch those numbers, the following thoughts crossed my mind:

  1. the integer overflow is probably due to hitting the ~2 GB (2^31 - 1) limit of a 32-bit integer; many folks use longs or bigints instead of ints.
  2. calculating a hash is the price you pay for verifying / storing; perhaps a way to think of this is as proof-of-work to show you are serious about processing the data and . . . an incentive to share data in bite-size chunks.
  3. I suspect that I/O (network or disk) is typically the limiting factor, and a content-id-based system can help alleviate this load by distributing content closer to where it is used and/or chunking the data to enable parallel I/O streams.
  4. in the content-based world, addressing ranges can be done reliably, so if you need only the first 10 lines, you can generate something like the following in less than a second:
$ time preston cat 'line:gz:hash://sha256/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1!/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1!/L10'
YYGMDLIKSKYKVDWKNPNPKDKVKCTQEIVNEISTEVALNAMENYEMFPTALEDHFGGS
real    0m0.756s
user    0m1.200s
sys 0m0.107s

With this, you can imagine trusted CDNs (content delivery networks) strategically storing/caching whole or partial copies of the "big" file, making the data a little easier to handle.

related issues include https://github.com/bio-guoda/preston/issues/163 .

jhpoelen commented 2 years ago

fyi @mielliott

jhpoelen commented 2 years ago

fyi @pgasu

cboettig commented 2 years ago

@jhpoelen but this is brilliant! it never occurred to me to use range requests here.

Hmmm.... as you point out, it would do little good to add a faster hashing algorithm like crc32 in most cases, since, as in this example, the vast majority of the time is spent downloading.

Using range requests when you know what portion of the data you want makes lots of sense, though I guess we often don't know that a priori.

However, if we merely need to compute a reliable identifier for an object, it is tempting to consider a convention like "the hash of the first 1024 bytes" or something like that as the file's identifier, e.g. something like

curl -r 0-1024 https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz | sha256sum

Maybe this would require a different syntax to distinguish it from a content id based on the full file, e.g. maybe we call this:

bytes:1024:hash://sha256/04355287211f548988a1671d08c3946a7c3a03ce1d6a351a0df4f207a78d5a13

or

hash://sha256/04355287211f548988a1671d08c3946a7c3a03ce1d6a351a0df4f207a78d5a13?bytes=1024

Obviously this has clear failure modes, like when a document is sharded or extended (e.g. by adding additional rows). And repositories like Software Heritage and DataONE would never be queryable by partial hashes, since such hashes are of no use as checksums for the integrity of the whole file. Still, it seems like there would be ample cases with very large files where such identifiers would be fit for purpose. Or is that just asking for trouble?
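
For what it's worth, a minimal R sketch of that partial-hash idea, using an HTTP range request so only the first 1024 bytes are transferred (this assumes the server honors Range requests; the curl and openssl packages are just one way to do it):

library(curl)
library(openssl)

url <- "https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz"

h <- new_handle()
handle_setopt(h, range = "0-1023")           # byte ranges are inclusive, so this is 1024 bytes
resp <- curl_fetch_memory(url, handle = h)

# hex digest of the partial content, analogous to `curl -r 0-1024 ... | sha256sum`
paste(as.character(sha256(resp$content)), collapse = "")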

jhpoelen commented 2 years ago

@cboettig Neat idea to also calculate (partial) hashes. I'd be curious to see how many files share the same first 1k bytes. Surely someone somewhere has done some research on that.

Some rambling thoughts:

And . . . I imagine a two-tier search would help speed up content discovery.

Similar to how bloom filters cheaply rule out definite non-matches, the header hashes could serve as a quick-and-dirty search-space reduction or validation:

  1. first content id resolve question: do you have anything with a header hash of X?
  2. first verification question: do the first 100 bytes I received match the known header hash?

Ideally, 1./2. would eliminate a ton of CPU / bandwidth otherwise spent on searching an entire index or downloading a giant file that actually does not contain what you are looking for.
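
A rough sketch of step 2 on the client side (header_matches() is a hypothetical helper, not a contentid or preston function, and it assumes url and a 100-byte header hash are already known from a registry):

library(curl)
library(openssl)

# hypothetical helper: compare a cheap header hash before committing to a
# full download and full-content verification
header_matches <- function(url, expected_header_sha256, n_bytes = 100) {
  h <- new_handle()
  handle_setopt(h, range = paste0("0-", n_bytes - 1))
  resp <- curl_fetch_memory(url, handle = h)
  header_hash <- paste(as.character(sha256(resp$content)), collapse = "")
  identical(header_hash, expected_header_sha256)
}

# only fetch and verify the whole file if the header hash already matches
if (header_matches(url, expected_header_sha256)) {
  # proceed with the full download and full sha256 verification
}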

jhpoelen commented 2 years ago

btw, a notation we've been using (introduced by @mielliott) is:

$ preston cat 'cut:hash://sha256/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1!/b1-100' | sha256sum
373a15f28581f5932da9ed6f2cc654cd2a115355e6970328ab0306210bab14ad  -

This uses the POSIX/Linux cut notation for byte ranges. I'm getting a ton of mileage out of these ancient and powerful Unix tools (reminds me of our experimental workshop at https://www.idigbio.org/wiki/index.php/Digital_Data_in_Biodiversity_Research_Conference,_Berkeley#Monday.2C_4_June_2018 and https://www.idigbio.org/wiki/images/a/a1/1-Poelen-frugal_tools_2018-06-04.pdf).
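
The same first-100-bytes digest can be reproduced in plain R against the locally stored copy (same stored file as in the cat examples above), which should give the digest shown above:

con <- file("data/e3/90/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1", "rb")
first_100 <- readBin(con, what = "raw", n = 100)    # byte range b1-100
close(con)
paste(as.character(openssl::sha256(first_100)), collapse = "")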

see also https://github.com/bio-guoda/preston/issues/82

cboettig commented 2 years ago

Thanks @jhpoelen, really cool stuff here. I like your prefix notation; I'd recently encountered something similar in GDAL's virtual filesystems (https://gdal.org/user/virtual_file_systems.html), but hadn't seen the Apache Commons VFS you mentioned (perhaps GDAL's version is based on that as well). A bit of a tangent, but I've been relying on these virtual-filesystem approaches in GDAL and Apache Arrow quite a bit recently to access pieces of large remote files, and I have been struggling to envision a way of making that more contentid-compatible (e.g. in #82).

If I understand the two-step search correctly, the idea is to speed up the case where the URL returned for some hash happens to have different content: it would allow resolve() to skip that source and go on to try the next one (if available) more quickly. That's a nice example, though it does feel a bit like an edge case.

Yes, you're probably right that there must be some theory on shared byte prefixes. Using the header may be particularly fragile; there's probably a much more clever way to construct a range request to reduce collision probability (I'm reminded of how the readr library switched its algorithm for guessing the data type of CSV columns from looking at the first 1000 lines to sampling 1000 lines throughout the document, e.g. https://github.com/tidyverse/vroom/issues/352). Obviously still not infallible, but it would be much more robust than looking at headers alone.
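
As a sketch of that sample-throughout-the-file idea (purely illustrative; sample_hash() is a hypothetical helper, the offsets and concatenation scheme are arbitrary, and url is the uniprot URL from earlier):

library(curl)
library(openssl)

# hash the concatenation of a few scattered windows instead of only the header
sample_hash <- function(url, offsets, width = 100) {
  chunks <- lapply(offsets, function(start) {
    h <- new_handle()
    handle_setopt(h, range = sprintf("%.0f-%.0f", start, start + width - 1))
    curl_fetch_memory(url, handle = h)$content
  })
  paste(as.character(sha256(do.call(c, chunks))), collapse = "")
}

# e.g. windows at 0, 1 GB, 10 GB and 50 GB into the file
sample_hash(url, offsets = c(0, 1e9, 10e9, 50e9))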

p.s. Thanks for the note about the long int for file size; yeah, that's a great point which I'll drop into a separate issue.

mbjones commented 2 years ago

I like the conversation on progressive reduction of the search space based on the contentid of the first n bytes. This reminds me of the space-filling-curve approach in geohash and S2Geometry geospatial grid searches, where the hash value is topologically linked to spatial location such that any two hashes that share a prefix are "close" to each other in geo-space; that is, they preserve locality of reference. For example, the two geohash values u4pruydqqvj and u4pruydvqvj are both within the region specified by the prefix u4pruyd, which allows very rapid reductions in search space based on simple string indexes. Is there an equivalent to a space-filling Hilbert curve that we could use for content-based "proximity"?
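
By contrast, a cryptographic hash deliberately destroys that locality, so prefix proximity works for geohashes but not for sha256 content ids; a quick illustration:

library(openssl)

# geohash-style prefixes preserve proximity; cryptographic hashes do not
sha256("u4pruydqqvj")
sha256("u4pruydvqvj")
# the two inputs differ by one character, yet the digests share no useful prefix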

jhpoelen commented 2 years ago

@mbjones We (mostly @mielliott) have experimented with "semantic" hashes that allow for calculating distances. There are actually some features built into Preston that help calculate these hashes (e.g., bloom filters, tlsh hashes) and perform some set operations. I guess the trick is to balance computational efficiency with partitioning all possible content into a uniformly distributed search space.

jhpoelen commented 2 years ago

Oh, and a simpler approach we use to store hashes on disk:

[a-f0-9]{2}/[a-f0-9]{2}/

e.g.,

$ preston track "https://duckduckgo.com"
[...]
$ find data -type f
data/f4/dc/f4dc427b098916734b6a3a5a6b805c87e87e2cb48e629262230c4bec3b273369
data/2a/5d/2a5de79372318317a382ea9a2cef069780b852b01210ef59e06b640a3539cb5a
data/84/73/847391d8641c5a3d26bc218069b018f0f2c13b7508eb1cd5750321d8419c97ca

This randomly partitions content by its hash. The distribution is expected to be uniform, but similar content will generally not be partitioned together.
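
A minimal sketch of that layout rule (hash_to_path() is just an illustrative helper, not a preston or contentid function):

# the first two hex pairs become nested directories, as in data/e3/90/e390...
hash_to_path <- function(sha256_hex, root = "data") {
  file.path(root, substr(sha256_hex, 1, 2), substr(sha256_hex, 3, 4), sha256_hex)
}

hash_to_path("e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1")
# "data/e3/90/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1"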

jhpoelen commented 2 years ago

I call these hash coordinates the "GPS" coordinates of the content-based universe, or the Teddy-verse (named after Ted Nelson, who had a more content-based take on linking data than Tim Berners-Lee's name-based universe, aka the Timmy-verse).

jhpoelen commented 2 years ago

accidentally closed - apologies - I am getting too excited about this topic ; )

cboettig commented 1 year ago

Just revisiting this. To summarise:

cboettig commented 1 year ago

I think we resolved the OP in that the slow speed observed by @joelnitta was merely due to the slow download and was independent of contentid. In workflows like this, which already involve downloading all the bytes, the existing contentid implementation should be sufficient, in that (a) it doesn't add significant overhead to the runtime, and (b) it can help avoid re-downloading if the code uses embedded contentid references.
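
For example, the usual pattern for (b), where a script refers to the data by hash and only downloads when the object is not already in the local store, looks roughly like:

library(contentid)

id <- "hash://sha256/e390d25b32712772d7e4a0c7f104160b7e9d851073f842ccf1b19e79f34697b1"

# returns a local path, downloading into the local content store only if
# the object is not already cached there
path <- resolve(id, store = TRUE)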

Checking the first 100 bytes as a signature still seems like it could be applicable when combined with the workflows discussed in #82, which will rely on range requests anyway, but it is probably unwise to rely on the first 100 bytes alone. Some simple heuristic (first 100 + last 100 bytes?) would probably be better, though an agreed-upon standard rather than an arbitrary mechanism would clearly be preferred for anything that wants to count as an 'identifier' of the data.

Will add some further thoughts specific to workflows that never download the full objects over in #82