yarikoptic commented 3 years ago

Follow up/extract from previous discussion: https://github.com/dandi/dandi-api/issues/146#issuecomment-791850364 (and elsewhere).

156 is reworking upload and blob-identification. ATM blob identification relies on amazon's ETag generation algorithm with our (but still outside of our code base) custom decision on "blocksize" to produce parts. I think #156 should be deployed in our beta instance as-is first so we could verify correct operation etc and to not have that PR blocked by our discussion/decision making here.

In general ATM (as of #150 and #156 implementation) our blob identity is the etag (AKA dandi:dandi-s3-etag) which is an md5 checksum of content or md5sums of individual chunks.

identity value

There were concerns in md5 and thus etag might have insufficient collision-avoidance "guarantee". One simple and likely to be sufficient (for a rare possibility of a collision) strategy which was suggested in #146 and elsewhere: make a composite with the file size and may be even a block size used to compute that etag. So it could be <size>-<blocksize>-<etag>.

FWIW, just possible other benefits (and my take on them) from such an identity:

knowing blocksize is sufficient to recompute etag following the simple algorithm of "md5 of md5s": nice, but not really solving any problem really
knowing size allows to estimate (or check) blocksize if our "blocksize logic" is not available: also not really that important
knowing size from the identity gives immediate information about the size of the blob: nice to have but not anyhow critical.

So, it seems that "collision-avoidance" is the primary purpose of making such an identity; and since given the same size, blocksize would be the same - it would not add any additional collision protection. So <size>-<etag> should be as good and stronger than a pure <etag>.

A notable problem with such approach:

we would need to decide NOW on maximal file and block size to decide on the target length of such an identity to become usable in a DB (correct me if I am wrong). Workaround - <size> could become <size%somemaxsize>. That would obliterate the minor benefits stated above, and to avoid confusion might need to add some syntax to state that it was modulo and thus real value is bigger (could be + instead of - as a separator)
I don't see any other major problem, but there might be an additional issue or this is not a problem at all?

identity name

etag becomes a table column name, but that name is different from what we use in metadata records (dandi-s3-etag). Ideally we should have name consistent across API "terms", and metadata; or have it completely abstract (such as identity) on the API side.

With the above change (adding the <size>-), etag name might become inappropriate/confusing, so we might either use more specific name. I guess having - in the name introduces difficulty in mapping to column names etc. With that in mind I can offer two approaches

name it just identity, as we had in the design document (#150), which internal logic would maps to whatever algorithm we use now (or upon some migration in the future?)
given it a new unique name such as dandidigest1 (as for our custom etag of first version) which would correspond to whatever we settle on for this round, and if later we come up with a new one, it would become dandidigest2. Then we 1) use it in the schema as well (would become dandi:dandidigest1 in the digests for "algorithm" with "dandi" namespace); 2) might register multihash code for it if we decide to use multihashes anywhere.

TODOs

[x] decide on value
[x] decide on name(s)
- [ ] make those name(s) used by dandi-api and dandi-cli
[ ] provide singular implementation (within dandi-cli; ref: https://github.com/dandi/dandi-cli/issues/464) and regex
- [ ] make that implementation used by dandi-api and dandi-cli

satra commented 3 years ago

to me key is not digest and etag should never be used in the blob store (i thought we decided on uuid there). so let's not mix digest etag with key in blob store.

yarikoptic commented 3 years ago

@satra -- thanks, you are 100% correct. I will blame git-annex's notion of a "key" (FTR, alike <BACKEND>-s<SIZE>--<CHECKSUM">[.EXTENSION]) for my confusion.
key is out of question here (we indeed settled on a UUID) and this issue about blob's "identity" (AKA digest, AKA etag)

PS I have adjusted title and initial description to use "identity" instead of key

dchiquito commented 3 years ago

we would need to decide NOW on maximal file and block size to decide on the target length of such an identity to become usable in a DB (correct me if I am wrong).

S3 already has those maximums: 5TB max file size, 10,000 max number of parts, 5GB max part size.

yarikoptic commented 3 years ago

But AWS are not into "carving in stone" business, so what if upon popular demand they beef up limits e.g. x10 fold? (although 5TB files are already an obnoxious use case so IMHO unlikely to happen). Sure we could aim for 5TB file size limit and then reconsider if limits would ever happen to increase.

satra commented 3 years ago

@yarikoptic - i thought you also wanted to have any logic on chunking and size be inside the algorithm? so that if aws changed things in the future, it would mostly be adding more to the algorithm rather than parameters to the algorithm.

yarikoptic commented 3 years ago

yes. "algorithm" encapsulates everything needed to get from file to the target value of "identity" str. But here I was concerned about "upper limit" on that identity str length, since it determines the necessary string size for DB column to contain the identity. So if S3 increases the limit, and our max length changes (due to the changes in the algorithm) - it would either require DB migration etc (not fun) or declaring that despite S3 supports larger files, we support only up to (but not including) 10TB

satra commented 3 years ago

is the fixed length thing about etag part number? what if you only store the md5 without partnumber? then your algorithm always outputs an md5 (fixed string) according to some deterministic process. obviously content size and block would be nice, but are they necessary.

yarikoptic commented 3 years ago

I thought we wanted to add file size (not just number of parts) to provide better collision avoidance. I still think it is a good idea

satra commented 3 years ago

could we assume that we will not exceed a petabyte in size per file, and if you do we are in trouble?

if so 1PB = 1e+15 = 62^N => N = 15/log10(62) = 8.3... which means if you use 9 base62 characters you can store any size upto 1PB + a bit more in a fixed byte string. that would allow storing size + md5 in a fixed length string.

waxlamp commented 3 years ago

I'd like to de-emphasize deciding internal table names in the DB in this discussion--that's something best left to the engineering team as we implement the necessary semantics. What's important here is to figure out what the API surface will look like, which can (and should) be totally independent of internal implementation choices like DB table names, etc.

@yarikoptic, could you provide some detail on what dandi-cli will need, in API terms?

waxlamp commented 3 years ago

Also, a general question: why are we so worried about md5 collisions? According to this article the probability of two md5 is ~1.5e-29. I don't know how many heads in a row that equates to, but I think it's a lot 😸. Furthermore, if Amazon doesn't worry about this, why should we?

And going a bit further, I guess I am curious what the threat model is for such a collision. I would bet a lot of money that cosmic rays will never in our lifetimes produce a rogue md5 aliasing event. It is possible for a malicious agent with time and resources to do it, but even then, I am curious what the threat is: evil scientists hacking dandi-cli to cause other scientists to upload bad data to DANDI?

(I realize this comes off as a bit tongue-in-cheek, but I am earnestly curious about the danger here--if it is minimal, then we could dump a lot of this discussion and use the stock algorithms and move on to other challenges.)

satra commented 3 years ago

Amazon doesn't care about it, because they are not using the ETag as an identity provider, we are. We are using that identity provision to return back uuid and a notion of whether a blob is already in storage or not. adding contentsize would simply make sure that probability of collision is minimized even more.

also, in DANDI, if we move to zarr, we may be adding 100s to 100k blobs per asset. so having something that protects against a key that's use for identity would be helpful. also the addition proposed here for our algorithm is a very easy check.

yarikoptic commented 3 years ago

I'd like to de-emphasize deciding internal table names in the DB in this discussion--that's something best left to the engineering team as we implement the necessary semantics

aren't those the ones users would see reflected e.g. in https://api.dandiarchive.org/swagger/ ?

wouldn't it be less confusing for other (non-authors and authors but after a week or two of time away from the code) engineers whenever there is a clear relationship between column names and what they contain?

yarikoptic commented 3 years ago

if so 1PB = 1e+15 = 62^N => N = 15/log10(62) = 8.3... which means if you use 9 base62 characters you can store any size upto 1PB + a bit more in a fixed byte string. that would allow storing size + md5 in a fixed length string.

I would stick to hex as the one used by md5 itself so we do not fall into some culprit of mixed/changed casing etc. so it would add 13 characters

In [5]: '%x' % int(1e+15)
Out[5]: '38d7ea4c68000'

In [6]: len('%x' % int(1e+15))
Out[6]: 13

@waxlamp Thanks for the link. Interesting example there where collision could be crafted with a careful change of a few bytes while keeping the size intact. So adding size as part of the "identity" would nohow protect against malicious users. But probability of random collision is indeed VERY low as you pointed out. Either reducing it 1e10 times lower (ballpark from adding 13 digits to 32) would be critical -- hard to tell, and most likely not, but the idea was "better be safe than sorry" ;-)

But may be we should indeed just "prevent" it and not "avoid it", i.e. clients will provide sizes, we will know sizes (from DB) so we could always check and error out (and send us notification) that upload was attempted for a colliding etag for a different size. And only if such collision happens we then consider "what should we do about it"?

dchiquito commented 3 years ago

aren't those the ones users would see reflected e.g. in https://api.dandiarchive.org/swagger/ ?

The only DB information that is "leaked" are URL parameters, like POST /dandisets/{versions__dandiset__pk}/versions/{versions__version}/assets/{uuid}/. Those URL parameters are automatically used to select a specific row in a table, so Django needs to know which columns to use to do the SELECT. The swagger generator we are using incidentally uses that information to describe the URL.

I would consider that to be a minor documentation detail, not anything to do with the API surface. We can define the request/response bodies to be basically anything reasonable.

yarikoptic commented 3 years ago

ok, having deliberated on this decided to take the path of the least resistance ;)

we take current amazon's etag which is <md5ofparts>-<number of parts> as an md5 with additional collision avoidance (due to that -<number of parts>
- cons: not as much of collision avoidance as with the full size incorporated, no benefit of knowing the size from etag alone, if limits are raised in number of parts upload - size of it could increase as well
- pros: ready to be matched to etag provided directly from S3 (and all prior consideration like possibility for parallization etc)
dandi-api should (if not already) verify size matching and if not matches provided that etags match -- error out and so we get alerted (sentry?)

I have submitted https://github.com/dandi/dandi-cli/pull/474 to provide implementation/interface, please chime in on how to RF/improve so we could "live with it" having it merged/released

yarikoptic commented 3 years ago

as the name - let's call it just dandietag (or dandi:dandietag if with the domain in metadata and DANDIEtag for the helper class) .

dchiquito commented 3 years ago

I personally object to dandietag/DANDIEtag because my brain parses it as dandie tag/DANDIE tag.

I liked dandi:dandi-s3-etag as it conveyed that it was based on the S3 ETag, but was only meaningful in a DANDI context. Would a separator (dandi-etag) work?

yarikoptic commented 3 years ago

Sure - let's have it dandi-etag ! For some reason I thought that - isn't ok in the names for https://github.com/multiformats/multicodec/blob/master/table.csv but I was wrong - it is full of the ones with - and some times with multiple. So later we might have dandi-etag-2 ;-)

yarikoptic commented 3 years ago

Ok, work in progress is going on, and dedicated https://github.com/dandi/dandi-api/issues/160 filed within the project for dandi-api, so I will just close this "discussion" issue and thank all participants ;-)

dandi / dandi-archive

Decide on the exact "algorithm" (and thus the value) and name for `etag` blob "identity" #157

identity value

identity name