to me `key` is not digest, and `etag` should never be used in the blob store (i thought we decided on uuid there). so let's not mix `digest`/`etag` with `key` in the blob store.
@satra -- thanks, you are 100% correct. I will blame git-annex's notion of a "key" (FTR, alike `<BACKEND>-s<SIZE>--<CHECKSUM>[.EXTENSION]`) for my confusion. `key` is out of the question here (we indeed settled on a UUID), and this issue is about a blob's "identity" (AKA `digest`, AKA `etag`).

PS I have adjusted the title and initial description to use "identity" instead of `key`.
we would need to decide NOW on the maximal file and block size in order to fix the target length of such an identity so it is usable in a DB (correct me if I am wrong).
S3 already has those maximums: 5TB max file size, 10,000 max number of parts, 5GB max part size.
But AWS is not in the "carving in stone" business, so what if upon popular demand they beef up the limits, e.g. 10-fold? (Although 5TB files are already an obnoxious use case, so IMHO that is unlikely to happen.) Sure, we could aim for the 5TB file size limit and then reconsider if the limits ever do increase.
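For illustration, here is a minimal sketch (a hypothetical helper, not the actual dandi-cli logic) of picking a part size that stays within those S3 constraints:

```python
import math

# S3 limits as cited above (subject to change by AWS)
MAX_FILE_SIZE = 5 * 1024**4   # 5 TiB per object
MAX_PARTS = 10_000
MIN_PART_SIZE = 5 * 1024**2   # 5 MiB minimum for all but the last part
MAX_PART_SIZE = 5 * 1024**3   # 5 GiB per part

def choose_part_size(file_size: int) -> int:
    """Pick the smallest allowed part size that keeps us within MAX_PARTS."""
    if file_size > MAX_FILE_SIZE:
        raise ValueError("file exceeds S3 maximum object size")
    part_size = max(MIN_PART_SIZE, math.ceil(file_size / MAX_PARTS))
    if part_size > MAX_PART_SIZE:
        raise ValueError("cannot fit file within part-size/part-count limits")
    return part_size
```

Note that the chosen part size feeds directly into the resulting ETag, so any such "algorithm" has to fix this choice deterministically.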
@yarikoptic - i thought you also wanted to have any logic on chunking and size be inside the algorithm? so that if aws changed things in the future, it would mostly be adding more to the algorithm rather than parameters to the algorithm.
yes. "algorithm" encapsulates everything needed to get from file to the target value of "identity" str
. But here I was concerned about "upper limit" on that identity str
length, since it determines the necessary string
size for DB column to contain the identity. So if S3 increases the limit, and our max length changes (due to the changes in the algorithm) - it would either require DB migration etc (not fun) or declaring that despite S3 supports larger files, we support only up to (but not including) 10TB
is the fixed-length thing about the etag part number? what if you only store the md5 without the part number? then your algorithm always outputs an md5 (fixed-length string) according to some deterministic process. obviously content size and block size would be nice, but are they necessary?
I thought we wanted to add file size (not just number of parts) to provide better collision avoidance. I still think it is a good idea.
could we assume that we will not exceed a petabyte in size per file, and that if we do, we are in trouble?

if so, 1PB = 1e+15 <= 62^N => N = 15/log10(62) ≈ 8.4, which means if you use 9 base62 characters you can store any size up to 1PB + a bit more in a fixed-length byte string. that would allow storing size + md5 in a fixed-length string.
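A quick sketch of that arithmetic (the base62 alphabet ordering here is an arbitrary illustrative choice):

```python
import math
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 symbols

def base62(n: int, width: int = 9) -> str:
    """Encode a non-negative integer as a fixed-width base62 string."""
    chars = []
    while n:
        n, r = divmod(n, 62)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars)).rjust(width, "0")

print(math.ceil(15 / math.log10(62)))  # 9 -- characters needed for 1e15
print(base62(10**15))                  # a 9-character string
```

Since 62^9 ≈ 1.35e16, nine characters actually cover up to ~13.5PB.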
I'd like to de-emphasize deciding internal table names in the DB in this discussion--that's something best left to the engineering team as we implement the necessary semantics. What's important here is to figure out what the API surface will look like, which can (and should) be totally independent of internal implementation choices like DB table names, etc.
@yarikoptic, could you provide some detail on what dandi-cli will need, in API terms?
Also, a general question: why are we so worried about md5 collisions? According to this article, the probability of two md5 hashes colliding is ~1.5e-29. I don't know how many heads in a row that equates to, but I think it's a lot 😸. Furthermore, if Amazon doesn't worry about this, why should we?
And going a bit further, I guess I am curious what the threat model is for such a collision. I would bet a lot of money that cosmic rays will never in our lifetimes produce a rogue md5 aliasing event. It is possible for a malicious agent with time and resources to do it, but even then, I am curious what the threat is: evil scientists hacking dandi-cli to cause other scientists to upload bad data to DANDI?
(I realize this comes off as a bit tongue-in-cheek, but I am earnestly curious about the danger here--if it is minimal, then we could dump a lot of this discussion and use the stock algorithms and move on to other challenges.)
Amazon doesn't care about it because they are not using the ETag as an identity provider; we are. We are using that identity provision to return back a `uuid` and a notion of whether a blob is already in storage or not. adding content size would simply make sure that the probability of collision is minimized even further.

also, in DANDI, if we move to zarr, we may be adding 100s to 100k blobs per asset. so having something that protects a key that's used for identity would be helpful. also, the addition proposed here to our algorithm is a very easy check.
> I'd like to de-emphasize deciding internal table names in the DB in this discussion--that's something best left to the engineering team as we implement the necessary semantics

aren't those the ones users would see reflected e.g. in https://api.dandiarchive.org/swagger/ ?

wouldn't it be less confusing for other engineers (non-authors, and authors after a week or two away from the code) whenever there is a clear relationship between column names and what they contain?
> if so, 1PB = 1e+15 <= 62^N => N = 15/log10(62) ≈ 8.4, which means if you use 9 base62 characters you can store any size up to 1PB + a bit more in a fixed-length byte string. that would allow storing size + md5 in a fixed-length string.

I would stick to hex, as that is what md5 itself uses, so we do not fall into some pitfall of mixed/changed casing etc. So it would add 13 characters:

```python
In [5]: '%x' % int(1e+15)
Out[5]: '38d7ea4c68000'

In [6]: len('%x' % int(1e+15))
Out[6]: 13
```
@waxlamp Thanks for the link. Interesting example there, where a collision could be crafted with a careful change of a few bytes while keeping the size intact. So adding size as part of the "identity" would in no way protect against malicious users. But the probability of a random collision is indeed VERY low, as you pointed out. Whether reducing it another ~1e10 times (ballpark, from adding 13 digits to the 32) would be critical -- hard to tell, and most likely not, but the idea was "better safe than sorry" ;-)

But maybe we should indeed just "prevent" it and not "avoid" it, i.e. clients will provide sizes, and we will know sizes (from the DB), so we could always check and error out (and send ourselves a notification) whenever an upload is attempted for a colliding etag with a different size. And only if such a collision ever happens would we then consider "what should we do about it"?
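A rough sketch of that "detect and error out" check (hypothetical names; the real logic would live in dandi-api's Django models rather than a dict):

```python
from typing import Dict, Optional, Tuple

class EtagCollisionError(Exception):
    """Raised when an upload claims an existing etag but a different size."""

def check_blob_identity(known: Dict[str, Tuple[str, int]],
                        etag: str, size: int) -> Optional[str]:
    """Return the existing blob uuid if (etag, size) is already known,
    None if the blob is new, or raise if the etag collides across sizes."""
    existing = known.get(etag)  # maps etag -> (uuid, size)
    if existing is None:
        return None             # new blob; proceed with the upload
    uuid_, known_size = existing
    if known_size != size:
        # same etag, different size: refuse and alert (e.g. via sentry)
        raise EtagCollisionError(
            f"etag {etag} seen with sizes {known_size} and {size}")
    return uuid_                # blob already in storage
```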
> aren't those the ones users would see reflected e.g. in https://api.dandiarchive.org/swagger/ ?

The only DB information that is "leaked" are URL parameters, like `POST /dandisets/{versions__dandiset__pk}/versions/{versions__version}/assets/{uuid}/`. Those URL parameters are automatically used to select a specific row in a table, so Django needs to know which columns to use to do the SELECT. The swagger generator we are using incidentally uses that information to describe the URL.
I would consider that to be a minor documentation detail, not anything to do with the API surface. We can define the request/response bodies to be basically anything reasonable.
ok, having deliberated on this, I decided to take the path of least resistance ;)

- use `<md5ofparts>-<number of parts>` as an md5 with additional collision avoidance (due to that `-<number of parts>`)
- `dandi-api` should (if not already) verify size matching; if sizes do not match while etags do -- error out, so we get alerted (sentry?)

I have submitted https://github.com/dandi/dandi-cli/pull/474 to provide the implementation/interface; please chime in on how to RF/improve it so we could "live with it" having it merged/released.

As for the name -- let's call it just `dandietag` (or `dandi:dandietag` with the domain in metadata, and `DANDIEtag` for the helper class).
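For reference, a minimal sketch of that `<md5ofparts>-<number of parts>` scheme, following the S3 multipart ETag convention (the fixed part size here is purely illustrative; the actual dandi-cli implementation is in the PR above):

```python
import hashlib

def dandi_etag(path: str, part_size: int = 64 * 1024**2) -> str:
    """md5 of the concatenated binary md5 digests of the parts,
    suffixed with the part count, as in S3 multipart ETags."""
    part_digests = []
    with open(path, "rb") as f:
        while chunk := f.read(part_size):
            part_digests.append(hashlib.md5(chunk).digest())
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```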
I personally object to `dandietag`/`DANDIEtag` because my brain parses it as `dandie tag`/`DANDIE tag`.
I liked `dandi:dandi-s3-etag` as it conveyed that it was based on the S3 ETag but was only meaningful in a DANDI context. Would a separator (`dandi-etag`) work?
Sure -- let's have it be `dandi-etag`! For some reason I thought that `-` wasn't ok in the names for https://github.com/multiformats/multicodec/blob/master/table.csv, but I was wrong: it is full of names with `-`, sometimes with multiple. So later we might have `dandi-etag-2` ;-)
Ok, work is in progress, and a dedicated issue (https://github.com/dandi/dandi-api/issues/160) has been filed within the dandi-api project, so I will just close this "discussion" issue and thank all participants ;-)
Follow-up/extract from a previous discussion: https://github.com/dandi/dandi-api/issues/146#issuecomment-791850364 (and elsewhere).
#156 is reworking upload and blob identification. ATM blob identification relies on Amazon's ETag generation algorithm, with our (but still outside of our code base) custom decision on "blocksize" to produce the parts. I think #156 should be deployed on our beta instance as-is first, so we could verify correct operation etc., and so that PR is not blocked by our discussion/decision-making here.
In general, ATM (as of the #150 and #156 implementation) our blob identity is the `etag` (AKA `dandi:dandi-s3-etag`), which is an md5 checksum of the content, or an md5 of the md5sums of the individual chunks.

**identity value**

There were concerns that md5, and thus the `etag`, might have an insufficient collision-avoidance "guarantee". One simple and likely sufficient (for a rare possibility of a collision) strategy, suggested in #146 and elsewhere: make a composite with the file size and maybe even the block size used to compute that etag. So it could be `<size>-<blocksize>-<etag>`.

FWIW, just possible other benefits (and my take on them) from such an identity:
So, it seems that "collision-avoidance" is the primary purpose of making such an identity; and since, given the same `size`, `blocksize` would be the same, it would not add any additional collision protection. So `<size>-<etag>` should be just as good, and stronger than a pure `<etag>`.

A notable problem with such an approach: `<size>` could become `<size % somemaxsize>`. That would obliterate the minor benefits stated above, and to avoid confusion we might need to add some syntax to state that it was taken `modulo` and thus the real value is bigger (could be `+` instead of `-` as a separator).
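To make the composite concrete, a small illustrative sketch (hypothetical helpers, not dandi-api code) of forming and splitting such a `<size>-<etag>` identity, with the size hex-encoded and zero-padded to the 13 characters discussed above:

```python
from typing import Tuple

def make_identity(size: int, etag: str) -> str:
    """Compose <size>-<etag>; 13 hex chars cover sizes up to 16**13 - 1 bytes (~4.5PB)."""
    if size >= 16**13:
        raise ValueError("size exceeds the assumed fixed-width ceiling")
    return f"{size:013x}-{etag}"

def parse_identity(identity: str) -> Tuple[int, str]:
    """Split a composite identity back into (size, etag)."""
    size_hex, _, etag = identity.partition("-")
    return int(size_hex, 16), etag
```

A fixed-width size field keeps the overall identity length constant, which is exactly what the DB-column-length concern above was about.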
**identity name**

`etag` becomes a table column name, but that name is different from what we use in metadata records (`dandi-s3-etag`). Ideally we should have the name consistent across API "terms" and metadata, or have it completely abstract (such as `identity`) on the API side.

With the above change (adding the `<size>-`), the `etag` name might become inappropriate/confusing, so we might want to use a more specific name. I guess having `-` in the name introduces difficulty in mapping to column names etc. With that in mind I can offer two approaches:

- `identity`, as we had in the design document (#150), whose internal logic would map to whatever algorithm we use now (or upon some migration in the future?)
- `dandidigest1` (as for the first version of our custom etag), which would correspond to whatever we settle on for this round; if later we come up with a new one, it would become `dandidigest2`. Then we 1) use it in the schema as well (it would become `dandi:dandidigest1` in the `digests` for "algorithm" with the "dandi" namespace); 2) might register a multihash code for it if we decide to use multihashes anywhere.

**TODOs**
- (`dandi-cli`; ref: https://github.com/dandi/dandi-cli/issues/464) and regex