dandi / dandi-archive

DANDI API server and Web app
https://dandiarchive.org

Decide on the exact "algorithm" (and thus the value) and name for `etag` blob "identity" #157

Closed: yarikoptic closed this issue 3 years ago

yarikoptic commented 3 years ago

Follow up/extract from previous discussion: https://github.com/dandi/dandi-api/issues/146#issuecomment-791850364 (and elsewhere).

#156 is reworking upload and blob identification. ATM blob identification relies on Amazon's ETag generation algorithm with our own (but still outside of our code base) custom decision on the "blocksize" used to produce parts. I think #156 should be deployed in our beta instance as-is first, so we could verify correct operation etc., and not have that PR blocked by our discussion/decision making here.

In general, ATM (as of the #150 and #156 implementation) our blob identity is the etag (AKA dandi:dandi-s3-etag), which is either the md5 checksum of the content (single part) or the md5 of the concatenated md5sums of the individual chunks (multipart).
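
For reference, a minimal sketch (assuming a hypothetical part size; the real blocksize decision currently lives in dandi-cli) of how such an S3-style ETag is computed:

```python
import hashlib

# Hypothetical part size for illustration; the actual "blocksize" decision
# currently lives in dandi-cli, outside this code base.
PART_SIZE = 64 * 1024 * 1024  # 64 MiB


def s3_style_etag(path, part_size=PART_SIZE):
    """MD5 of the content for a single part; for multipart, the MD5 of the
    concatenated per-part MD5 digests plus "-<number of parts>"."""
    part_md5s = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            part_md5s.append(hashlib.md5(chunk))
    if not part_md5s:                       # empty file
        return hashlib.md5(b"").hexdigest()
    if len(part_md5s) == 1:                 # single part: plain MD5 of the content
        return part_md5s[0].hexdigest()
    combined = hashlib.md5(b"".join(m.digest() for m in part_md5s))
    return f"{combined.hexdigest()}-{len(part_md5s)}"
```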

identity value

There were concerns that md5, and thus the etag, might have an insufficient collision-avoidance "guarantee". One simple and likely sufficient (for the rare possibility of a collision) strategy suggested in #146 and elsewhere: make a composite with the file size and maybe even the block size used to compute that etag. So it could be <size>-<blocksize>-<etag>.

FWIW, some possible other benefits (and my take on them) of such an identity:

So, it seems that "collision-avoidance" is the primary purpose of making such an identity; and since, given the same size, the blocksize would be the same, it would not add any additional collision protection. So <size>-<etag> should be just as good, and stronger than a pure <etag>.
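
Purely for illustration, the two candidate composites would be assembled like so (hypothetical helper, not an agreed-upon interface):

```python
def composite_identity(size, etag, blocksize=None):
    """Hypothetical: render "<size>-<blocksize>-<etag>" or the simpler
    "<size>-<etag>" (blocksize is derived from size, so it adds nothing)."""
    if blocksize is not None:
        return f"{size}-{blocksize}-{etag}"
    return f"{size}-{etag}"


# e.g. composite_identity(1048576, "3858f62230ac3c915f300c664312c63f-2")
# -> "1048576-3858f62230ac3c915f300c664312c63f-2"
```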

A notable problem with such an approach:

identity name

etag becomes a table column name, but that name is different from what we use in metadata records (dandi-s3-etag). Ideally we should have a name consistent across API "terms" and metadata, or have it be completely abstract (such as identity) on the API side.

With the above change (adding the <size>-), the etag name might become inappropriate/confusing, so we might want to use a more specific name. I guess having - in the name introduces difficulty in mapping to column names, etc. With that in mind I can offer two approaches

TODOs

satra commented 3 years ago

to me, key is not digest, and etag should never be used in the blob store (i thought we decided on uuid there). so let's not mix the digest etag with the key in the blob store.

yarikoptic commented 3 years ago

@satra -- thanks, you are 100% correct. I will blame git-annex's notion of a "key" (FTR, something like <BACKEND>-s<SIZE>--<CHECKSUM>[.EXTENSION]) for my confusion.
key is out of the question here (we indeed settled on a UUID), and this issue is about the blob's "identity" (AKA digest, AKA etag).

PS I have adjusted the title and initial description to use "identity" instead of "key".

dchiquito commented 3 years ago

> we would need to decide NOW on maximal file and block size to decide on the target length of such an identity to become usable in a DB (correct me if I am wrong).

S3 already has those maximums: 5TB max file size, 10,000 max number of parts, 5GB max part size.
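
Given those limits, the worst-case length of a <size>-<etag> identity can be bounded up front; a quick back-of-the-envelope check (the decimal size and 32-hex-character md5 are assumptions about the eventual format, not a settled decision):

```python
MAX_FILE_SIZE = 5 * 1024**4      # 5 TiB
MAX_PARTS = 10_000

size_digits = len(str(MAX_FILE_SIZE))           # 13 decimal digits
etag_len = 32 + 1 + len(str(MAX_PARTS))         # md5 hex + "-" + part count = 38
max_identity_len = size_digits + 1 + etag_len   # "<size>-<etag>"
print(size_digits, etag_len, max_identity_len)  # 13 38 52
```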

yarikoptic commented 3 years ago

But AWS is not in the "carving in stone" business, so what if, upon popular demand, they beef up the limits, e.g. 10-fold? (Although 5TB files are already an obnoxious use case, so IMHO that is unlikely to happen.) Sure, we could aim for the 5TB file size limit and then reconsider if the limits ever do increase.

satra commented 3 years ago

@yarikoptic - i thought you also wanted to have any logic on chunking and size be inside the algorithm? so that if aws changed things in the future, it would mostly be adding more to the algorithm rather than parameters to the algorithm.

yarikoptic commented 3 years ago

yes. "algorithm" encapsulates everything needed to get from file to the target value of "identity" str. But here I was concerned about "upper limit" on that identity str length, since it determines the necessary string size for DB column to contain the identity. So if S3 increases the limit, and our max length changes (due to the changes in the algorithm) - it would either require DB migration etc (not fun) or declaring that despite S3 supports larger files, we support only up to (but not including) 10TB

satra commented 3 years ago

is the fixed length thing about the etag part number? what if you only store the md5 without the part number? then your algorithm always outputs an md5 (fixed string) according to some deterministic process. obviously content size and block size would be nice, but are they necessary?

yarikoptic commented 3 years ago

I thought we wanted to add the file size (not just the number of parts) to provide better collision avoidance. I still think it is a good idea.

satra commented 3 years ago

could we assume that we will not exceed a petabyte in size per file, and if we do, we are in trouble?

if so, 1PB = 1e+15 = 62^N => N = 15/log10(62) = 8.3..., which means that with 9 base62 characters you can store any size up to 1PB (and a bit more) in a fixed-length string. that would allow storing size + md5 in a fixed length string.
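
a quick sanity check of that arithmetic (sketch only; the alphabet and zero-padding are just one possible choice):

```python
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 symbols


def base62(n, width=9):
    """encode a non-negative integer as a fixed-width base62 string"""
    out = ""
    while n:
        n, r = divmod(n, 62)
        out = ALPHABET[r] + out
    return out.rjust(width, ALPHABET[0])


assert 62**9 > 10**15        # 9 base62 characters cover sizes a bit beyond 1 PB
print(len(base62(10**15)))   # -> 9
```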

waxlamp commented 3 years ago

I'd like to de-emphasize deciding internal table names in the DB in this discussion--that's something best left to the engineering team as we implement the necessary semantics. What's important here is to figure out what the API surface will look like, which can (and should) be totally independent of internal implementation choices like DB table names, etc.

@yarikoptic, could you provide some detail on what dandi-cli will need, in API terms?

waxlamp commented 3 years ago

Also, a general question: why are we so worried about md5 collisions? According to this article, the probability of two md5 hashes colliding is ~1.5e-29. I don't know how many coin flips coming up heads in a row that equates to, but I think it's a lot 😸. Furthermore, if Amazon doesn't worry about this, why should we?

And going a bit further, I guess I am curious what the threat model is for such a collision. I would bet a lot of money that cosmic rays will never in our lifetimes produce a rogue md5 aliasing event. It is possible for a malicious agent with time and resources to do it, but even then, I am curious what the threat is: evil scientists hacking dandi-cli to cause other scientists to upload bad data to DANDI?

(I realize this comes off as a bit tongue-in-cheek, but I am earnestly curious about the danger here--if it is minimal, then we could dump a lot of this discussion and use the stock algorithms and move on to other challenges.)

satra commented 3 years ago

Amazon doesn't care about it, because they are not using the ETag as an identity provider; we are. We are using that identity to return back the uuid and a notion of whether a blob is already in storage or not. adding content size would simply make sure that the probability of collision is minimized even more.

also, in DANDI, if we move to zarr, we may be adding 100s to 100k blobs per asset. so having something that protects the key that's used for identity against collisions would be helpful. also the addition proposed here for our algorithm is a very easy check.

yarikoptic commented 3 years ago

> I'd like to de-emphasize deciding internal table names in the DB in this discussion--that's something best left to the engineering team as we implement the necessary semantics

aren't those the ones users would see reflected e.g. in https://api.dandiarchive.org/swagger/ ?

wouldn't it be less confusing for engineers (non-authors, and authors after a week or two away from the code) if there is a clear relationship between column names and what they contain?

yarikoptic commented 3 years ago

> if so, 1PB = 1e+15 = 62^N => N = 15/log10(62) = 8.3..., which means that with 9 base62 characters you can store any size up to 1PB (and a bit more) in a fixed-length string. that would allow storing size + md5 in a fixed length string.

I would stick to hex, as that is what md5 itself uses, so we do not fall into some pitfall of mixed/changed casing, etc. So it would add 13 characters:

```
In [5]: '%x' % int(1e+15)
Out[5]: '38d7ea4c68000'

In [6]: len('%x' % int(1e+15))
Out[6]: 13
```

@waxlamp Thanks for the link. Interesting example there where a collision could be crafted with a careful change of a few bytes while keeping the size intact, so adding size as part of the "identity" would in no way protect against malicious users. But the probability of a random collision is indeed VERY low, as you pointed out. Whether reducing it another ~1e10 times (ballpark from adding 13 digits to the 32) would be critical is hard to tell -- most likely not, but the idea was "better safe than sorry" ;-)

But maybe we should indeed just "prevent" it and not "avoid" it, i.e. clients will provide sizes and we will know sizes (from the DB), so we could always check and error out (and send ourselves a notification) whenever an upload is attempted with a colliding etag but a different size. And only if such a collision happens would we then consider "what should we do about it"?
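
A sketch of what that check-and-error-out could look like on the server side (helper names here are hypothetical, just to illustrate the idea):

```python
def check_etag_size(etag, size, find_blob_by_etag, notify_admins):
    """Hypothetical server-side check: if a blob with this etag already exists
    but with a different size, treat it as a (possible) collision."""
    existing = find_blob_by_etag(etag)   # e.g. a DB lookup; hypothetical helper
    if existing is not None and existing.size != size:
        notify_admins(
            f"Possible ETag collision: {etag} seen with sizes "
            f"{existing.size} and {size}"
        )
        raise ValueError("ETag/size mismatch - refusing to reuse the blob")
    return existing
```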

dchiquito commented 3 years ago

> aren't those the ones users would see reflected e.g. in https://api.dandiarchive.org/swagger/ ?

The only DB information that is "leaked" is URL parameters, like POST /dandisets/{versions__dandiset__pk}/versions/{versions__version}/assets/{uuid}/. Those URL parameters are automatically used to select a specific row in a table, so Django needs to know which columns to use to do the SELECT. The swagger generator we are using incidentally uses that information to describe the URL.

I would consider that to be a minor documentation detail, not anything to do with the API surface. We can define the request/response bodies to be basically anything reasonable.
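
To make that concrete, a generic DRF-flavored sketch (not the actual dandi-api code) of how such lookup kwargs drive the SELECT and, incidentally, end up in the generated swagger description:

```python
from rest_framework import viewsets


class AssetViewSet(viewsets.ReadOnlyModelViewSet):  # generic example, not dandi-api code
    # The URL kwarg names (e.g. {uuid}, {versions__version}) are just the
    # lookup/filter expressions Django uses for the SELECT; the swagger
    # generator picks the same names up when describing the URL.
    lookup_field = "uuid"
    lookup_url_kwarg = "uuid"
```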

yarikoptic commented 3 years ago

ok, having deliberated on this, I decided to take the path of least resistance ;)

I have submitted https://github.com/dandi/dandi-cli/pull/474 to provide the implementation/interface; please chime in on how to RF/improve it so we could "live with it" and have it merged/released.

yarikoptic commented 3 years ago

As for the name - let's call it just dandietag (or dandi:dandietag with the domain in metadata, and DANDIEtag for the helper class).

dchiquito commented 3 years ago

I personally object to dandietag/DANDIEtag because my brain parses it as dandie tag/DANDIE tag.

I liked dandi:dandi-s3-etag as it conveyed that it was based on the S3 ETag, but was only meaningful in a DANDI context. Would a separator (dandi-etag) work?

yarikoptic commented 3 years ago

Sure - let's make it dandi-etag! For some reason I thought that - isn't ok in the names in https://github.com/multiformats/multicodec/blob/master/table.csv, but I was wrong - it is full of names with -, sometimes with multiple. So later we might have dandi-etag-2 ;-)

yarikoptic commented 3 years ago

Ok, work is in progress, and a dedicated issue https://github.com/dandi/dandi-api/issues/160 has been filed within the dandi-api project, so I will just close this "discussion" issue and thank all the participants ;-)