@yarikoptic
number and size of added, changed (assuming path identity), or removed assets
Do you want the total sizes of all added/changed/removed assets, or a list of each asset's size, or a mapping from path to size, or what?
if no changes since the last published version
I assume that by "last published" you mean "last published between --from and --to". What if there is no published version within the given timeframe?
@yarikoptic Also, should commits containing old-format .dandi/asset.json files be ignored by this script?
Re asset.json "format": I forgot that things might have changed. Maybe we should just use the file listing from git/annex -- it has checksums (identity), so we can also identify renames. We should not claim deleted/added for those, but rather just report them as renamed (if the checksum stays the same).
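As a rough illustration of relying on the git/annex listing for both identity and size, here is a minimal sketch. It assumes the usual git-annex key layout, `BACKEND-sSIZE--NAME`, in which the size field is optional; `annex_key_size` is a hypothetical helper name, not something from the thread or from git-annex itself, and non-annexed files would need their size taken from elsewhere.

```python
import re
from typing import Optional

# Assumption: the optional size field appears as "-s<digits>" in the part
# of the key that precedes the first "--" separator.
_SIZE_FIELD = re.compile(r"-s(\d+)")


def annex_key_size(key: str) -> Optional[int]:
    """Return the size recorded in an annex key, or None if it has none."""
    fields, sep, _name = key.partition("--")
    if not sep:
        # No "--" separator: not shaped like an annex key.
        return None
    match = _SIZE_FIELD.search(fields)
    return int(match.group(1)) if match else None


example_key = "MD5E-s1024--0123456789abcdef0123456789abcdef.nwb"
example_size = annex_key_size(example_key)
```

Since the key doubles as the content identity, two paths with the same key can be treated as the same content without looking at the paths at all.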
it should be a summary -- so no individual file names in the report -- just total counts/sizes.
If there is no published version in the period, just state that, since all changes reported above would then apply to the "draft".
@yarikoptic Should this script only count NWB files or all annexed files? Or all files, both annexed and non-annexed (excluding those under dot directories)?
Exclude .datalad, count everything else. Later we might decide to provide per file type stats but might not be needed etc
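Taken literally, that rule could be sketched as the filter below. The helper name `is_counted` and the sample paths are made up for illustration, and treating `.datalad` as excluded only when it is the top-level directory is an assumption.

```python
from pathlib import PurePosixPath


def is_counted(path: str) -> bool:
    """Return True unless the path lies under a top-level .datalad directory."""
    parts = PurePosixPath(path).parts
    return bool(parts) and parts[0] != ".datalad"


# Hypothetical repository-relative paths.
paths = [
    ".datalad/config",
    "sub-01/sub-01.nwb",
    "dandiset.yaml",
    ".dandi/assets.json",
]
counted = [p for p in paths if is_counted(p)]
```

Note that under this reading other dot directories, such as `.dandi`, are still counted.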
@yarikoptic
I put us in the corner of the --sync again (https://github.com/dandi/dandi-cli/pull/635) ;) but here it is at least simpler, since we can completely ignore file paths, and operate on checksums/annex keys of the files. So, let's forget about "renames" (path dependent), and just report on number/amount of added/removed assets (as content), and only report # and size of duplicates (paths reusing the same keys). So:
@yarikoptic
- If an asset is renamed and a different asset is added to the original path, how should that be reported?
renamed -- changes nothing unless it replaced some existing one (then it is effectively 'delete' of that being replaced). added -- if content is new -- add to "added". If not new -- add to "duplicates"
- How should multiple copies of the same asset in the same revision be handled?
In particular, if a copy of an asset is made between revisions, how should that be reported? What if two copies are made and the original is deleted — is that a rename and a copy, two copies and a removal, or something else?
copy -- adds to duplicates.
1(original) +2 (copies) - 1 (original) = 2 copies. so nothing new added, duplicates increased.
I hope this makes sense now.
@yarikoptic
only report # and size of duplicates (paths reusing the same keys)
@yarikoptic Also, if multiple copies of an asset are added/removed, should that be counted as one addition/removal or multiple, and should the total size of added/removed assets count the copies once or for each path changed?
yet another good point. Let's just report duplicates "changes" as "diff" between numbers of how many duplicates were present in first and 2nd state. So duplicates could be added and removed. could also state how many duplicates are actually left.
- If the same key occurs at n>1 paths, is that 1 duplicate, or n duplicates, or n-1 duplicates?
n-1.
Also, if multiple copies of an asset are added/removed, should that be counted as one addition/removal or multiple, and should the total size of added/removed assets count the copies once or for each path changed?
sorry, don't grasp this one fully, but yes -- if multiple copies added/removed -- count only once. Once again -- operate on lists of checksums/keys not paths. Pretty much len(duplicates) = len(all_keys) - len(set(all_keys)); all the stats for diff come from set(all_keys) in both states, so you ignore that they could have been "multiple" - we just report gross for them in "duplicates"
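To make the counting rule concrete, here is a small sketch applied to made-up key lists (the key names are placeholders, and `count_duplicates` is a hypothetical helper). It also replays the earlier "two copies made, original deleted" scenario: no new content appears, and the duplicate count goes from 0 to 1.

```python
from collections import Counter


def count_duplicates(all_keys):
    """Extra paths beyond the first for each key: n paths give n - 1 duplicates."""
    return len(all_keys) - len(set(all_keys))


# One entry per path; the same key at several paths appears several times.
all_keys = ["k1", "k2", "k2", "k3", "k3", "k3"]
total = count_duplicates(all_keys)  # 6 paths - 3 distinct keys

# The same number, computed per key with a Counter.
per_key_total = sum(n - 1 for n in Counter(all_keys).values())

# Scenario: content "A" is copied to two new paths and the original path is deleted.
earlier = ["A"]
later = ["A", "A"]
new_content = set(later) - set(earlier)
dups_before = count_duplicates(earlier)
dups_after = count_duplicates(later)
```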
@yarikoptic So, to be clear, given a collections.Counter instance keys1 counting all the occurrences of keys in the earlier commit and an instance keys2 for the later commit, the difference between the commits' distinct assets is reported as something like:
{
    "added": len(keys2.keys() - keys1.keys()),
    "added_size": sum(asset_size(k) for k in keys2.keys() - keys1.keys()),
    "removed": len(keys1.keys() - keys2.keys()),
    "removed_size": sum(asset_size(k) for k in keys1.keys() - keys2.keys()),
}
But are the duplicates reported as just a delta between the total number of duplicates for each commit, like so:
duplicates1 = Counter({k: n-1 for k, n in keys1.items() if n > 1})
duplicates2 = Counter({k: n-1 for k, n in keys2.items() if n > 1})
return {
    "delta": sum(duplicates2.values()) - sum(duplicates1.values()),
    "delta_size": sum(asset_size(k) * n for k, n in duplicates2.items())
                  - sum(asset_size(k) * n for k, n in duplicates1.items()),
    "remaining": sum(duplicates2.values()),
}
or as the sums of the number of increases & decreases for each key, like so:
added_duplicates = Counter({k: d for k, n in keys2.items() if (d := n - keys1[k]) > 0})
removed_duplicates = Counter({k: d for k, n in keys1.items() if (d := n - keys2[k]) > 0})
return {
    "added": sum(added_duplicates.values()),
    "added_size": sum(asset_size(k) * n for k, n in added_duplicates.items()),
    "removed": sum(removed_duplicates.values()),
    "removed_size": sum(asset_size(k) * n for k, n in removed_duplicates.items()),
    "remaining": sum(n-1 for n in keys2.values() if n > 1),
}
?
seems to be right, and I think the former one would suffice for duplicates.
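Putting the agreed pieces together (distinct-asset diff plus the former, delta-style duplicates report), a self-contained sketch might look like the following. The function name `compare_states`, the `sizes` mapping, and the sample data are all assumptions for illustration; this is not the script that was eventually written.

```python
from collections import Counter


def compare_states(keys1, keys2, sizes):
    """Summarize content changes between two commits.

    keys1, keys2: iterables of content keys, one entry per path, for the
                  earlier and later commit respectively.
    sizes:        mapping from key to size in bytes (hypothetical input).
    """
    c1, c2 = Counter(keys1), Counter(keys2)
    added = c2.keys() - c1.keys()
    removed = c1.keys() - c2.keys()

    def dup_stats(counter):
        # A key present at n paths contributes n - 1 duplicates.
        count = sum(n - 1 for n in counter.values())
        size = sum(sizes[k] * (n - 1) for k, n in counter.items())
        return count, size

    d1_count, d1_size = dup_stats(c1)
    d2_count, d2_size = dup_stats(c2)
    return {
        "added": len(added),
        "added_size": sum(sizes[k] for k in added),
        "removed": len(removed),
        "removed_size": sum(sizes[k] for k in removed),
        "duplicates": {
            "delta": d2_count - d1_count,
            "delta_size": d2_size - d1_size,
            "remaining": d2_count,
        },
    }


# Made-up example: B (stored twice) disappears, A gains two copies, C is new.
sizes = {"A": 100, "B": 20, "C": 3}
earlier = ["A", "B", "B"]
later = ["A", "A", "A", "C"]
report = compare_states(earlier, later, sizes)
```

Because added and removed are computed from the distinct keys only, multiple copies of the same content are counted once there, and the extra copies show up solely in the duplicates section.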
Purpose: produce reports for dandi archive dataset contributors, for reporting back to funders. Reporting periods could differ between datasets, thus just having a generic tool would be the best initial step to implement it.
Attn @satra, please provide your feedback on the following design -- anything to add?
Statistics should include
All of that information could be easily harvested by comparing the .dandi/assets dumps of two commits and the git history (published versions in between those commits/dates, diff from the latest published version to the current "draft").