dandi / dandisets

749 Dandisets, 813.7 TB total. DataLad super-dataset of all Dandisets from https://github.com/dandisets

A script to produce the "diff report" #83

Closed yarikoptic closed 3 years ago

yarikoptic commented 3 years ago

Purpose: produce reports for dandi archive dataset contributors, for reporting back to funders. Reporting periods could differ between datasets, so a generic tool would be the best first step.

Attn @satra, please provide your feedback on the following design -- anything to add?


get-upload-stats [--from DATE1] [--to DATE2] [-o FILE] [-f FORMAT] DANDISET

Produce change stats for a given dandiset, optionally within a given period of time.

--from DATE1 - the earliest commit in the history with AuthorDate at/after DATE1 is taken as the beginning of the history to consider. If not specified, consider from the first commit.
--to DATE2 - the latest commit in the history with AuthorDate before DATE2 is taken as the end of the history to consider. If not specified, take the latest commit (HEAD).
-o FILE - file in which to store the report. If not specified, stdout.
-f FORMAT - output format. Default: markdown. Other options: json, yaml.
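The command line proposed above could be wired up roughly as follows. This is only a sketch: the `dest` names, the use of `-` to mean stdout, and the assumption that dates arrive in ISO 8601 form are illustrative choices, not something the proposal specifies.

```python
import argparse
from datetime import datetime


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the proposed get-upload-stats interface."""
    parser = argparse.ArgumentParser(
        prog="get-upload-stats",
        description="Produce change stats for a given dandiset, "
        "optionally within a given period of time.",
    )
    # Assumption: dates are given as ISO 8601 strings, e.g. 2021-01-31.
    parser.add_argument("--from", dest="from_date", metavar="DATE1",
                        type=datetime.fromisoformat, default=None,
                        help="start at the earliest commit with AuthorDate at/after DATE1")
    parser.add_argument("--to", dest="to_date", metavar="DATE2",
                        type=datetime.fromisoformat, default=None,
                        help="end at the latest commit with AuthorDate before DATE2")
    # Assumption: "-" stands for stdout when no file is given.
    parser.add_argument("-o", dest="outfile", metavar="FILE", default="-",
                        help="file to store the report in (default: stdout)")
    parser.add_argument("-f", dest="format", metavar="FORMAT",
                        choices=["markdown", "json", "yaml"], default="markdown",
                        help="output format (default: markdown)")
    parser.add_argument("dandiset", metavar="DANDISET")
    return parser
```

For example, `build_parser().parse_args(["--from", "2021-01-01", "-f", "json", "000003"])` yields a namespace with `from_date` set, `to_date` left as `None`, and `format` equal to `"json"`.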

Statistics should include

All of that information could be easily harvested by comparing the .dandi/assets dumps of two commits and the git history (published versions in between those commits/dates, diff from latest published to current "draft").
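The --from/--to rules for picking the two commits to compare could be sketched like this. It assumes the history has already been read into a list of `(commit hash, AuthorDate)` pairs ordered oldest to newest; the `select_range` name and that input shape are hypothetical.

```python
from datetime import datetime
from typing import List, Optional, Tuple

Commit = Tuple[str, datetime]  # (commit hash, AuthorDate)


def select_range(
    commits: List[Commit],
    from_date: Optional[datetime] = None,
    to_date: Optional[datetime] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (start, end) commit hashes; `commits` is ordered oldest first."""
    if not commits:
        return None, None
    # Default start: the first commit in the history.
    start: Optional[str] = commits[0][0]
    if from_date is not None:
        # Earliest commit in history order with AuthorDate at/after DATE1.
        start = next((sha for sha, date in commits if date >= from_date), None)
    # Default end: the latest commit (HEAD).
    end: Optional[str] = commits[-1][0]
    if to_date is not None:
        # Latest commit in history order with AuthorDate before DATE2.
        end = next((sha for sha, date in reversed(commits) if date < to_date), None)
    return start, end
```

A `None` in either position means no commit satisfied that bound, which the caller would have to report rather than diff.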

jwodder commented 3 years ago

@yarikoptic

number and size of added, changed (assuming path identity), or removed assets

Do you want the total sizes of all added/changed/removed assets, or a list of each asset's size, or a mapping from path to size, or what?

if no changes since the last published version

I assume that by "last published" you mean "last published between --from and --to". What if there is no published version within the given timeframe?

jwodder commented 3 years ago

@yarikoptic Also, should commits containing old-format .dandi/asset.json files be ignored by this script?

yarikoptic commented 3 years ago

Re asset.json "format": I forgot that things might have changed. Maybe we should just use the file listing from git/annex -- it has checksums (identity), so we can also identify renames. We should not claim deleted/added for those, but rather just report them as renamed (if the checksum stays the same).
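If the annex key is used as the identity, the size can often be read from the key itself. The sketch below assumes keys follow the usual git-annex layout `BACKEND-sSIZE--NAME`, where the size field is optional; the helper name `annex_key_size` is made up for illustration.

```python
from typing import Optional


def annex_key_size(key: str) -> Optional[int]:
    """Return the size recorded in a git-annex key, or None if it has none."""
    # Everything before the first "--" is the backend plus optional fields.
    prefix, sep, _name = key.partition("--")
    if not sep:
        return None
    # Skip the backend name; look for the "s<digits>" size field.
    for field in prefix.split("-")[1:]:
        if field.startswith("s") and field[1:].isdigit():
            return int(field[1:])
    return None
```

For instance, `annex_key_size("SHA256E-s1024--abc123.nwb")` gives `1024`, while a key without a size field gives `None`.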

it should be a summary -- so no individual file names in the report -- just total counts/sizes.

if there is no published version in the period, just state that, since all changes reported above would then apply to "draft".

jwodder commented 3 years ago

@yarikoptic Should this script only count NWB files or all annexed files? Or all files, both annexed and non-annexed (excluding those under dot directories)?

yarikoptic commented 3 years ago

Exclude .datalad, count everything else. Later we might decide to provide per file type stats but might not be needed etc
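Read literally, that rule amounts to a one-line path filter. The sketch assumes repository-relative POSIX paths (as a git file listing would give) and a hypothetical helper name `is_counted`.

```python
from pathlib import PurePosixPath


def is_counted(path: str) -> bool:
    """True unless the path lies under the top-level .datalad directory."""
    parts = PurePosixPath(path).parts
    return bool(parts) and parts[0] != ".datalad"
```

With this, `sub-01/sub-01.nwb` and `dandiset.yaml` are counted, and `.datalad/config` is not.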

jwodder commented 3 years ago

@yarikoptic

yarikoptic commented 3 years ago

I put us in the corner of the --sync again (https://github.com/dandi/dandi-cli/pull/635) ;) but here it is at least simpler, since we can completely ignore file paths, and operate on checksums/annex keys of the files. So, let's forget about "renames" (path dependent), and just report on number/amount of added/removed assets (as content), and only report # and size of duplicates (paths reusing the same keys). So:

@yarikoptic

  • If an asset is renamed and a different asset is added to the original path, how should that be reported?

renamed -- changes nothing unless it replaced some existing one (then it is effectively 'delete' of that being replaced). added -- if content is new -- add to "added". If not new -- add to "duplicates"

  • How should multiple copies of the same asset in the same revision be handled?
    In particular, if a copy of an asset is made between revisions, how should that be reported? What if two copies are made and the original is deleted — is that a rename and a copy, two copies and a removal, or something else?

copy -- adds to duplicates.

1(original) +2 (copies) - 1 (original) = 2 copies. so nothing new added, duplicates increased.

I hope this makes sense now.

jwodder commented 3 years ago

@yarikoptic

only report # and size of duplicates (paths reusing the same keys)

jwodder commented 3 years ago

@yarikoptic Also, if multiple copies of an asset are added/removed, should that be counted as one addition/removal or multiple, and should the total size of added/removed assets count the copies once or for each path changed?

yarikoptic commented 3 years ago

yet another good point. Let's just report duplicates "changes" as "diff" between numbers of how many duplicates were present in first and 2nd state. So duplicates could be added and removed. could also state how many duplicates are actually left.

  • If the same key occurs at n>1 paths, is that 1 duplicate, or n duplicates, or n-1 duplicates?

n-1.

Also, if multiple copies of an asset are added/removed, should that be counted as one addition/removal or multiple, and should the total size of added/removed assets count the copies once or for each path changed?

sorry, don't grasp this one fully, but yes -- if multiple copies added/removed -- count only once. Once again -- operate on lists of checksums/keys, not paths. Pretty much len(duplicates) = len(all_keys) - len(set(all_keys)); all the stats for diff come from set(all_keys) in both states, so you ignore that they could have been "multiple" - we just report gross for them in "duplicates"
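As a worked example of that rule, take the earlier "two copies made, original deleted" case. The key names and the `count_duplicates` helper are invented for illustration.

```python
def count_duplicates(all_keys):
    """Number of duplicates: each key at n paths contributes n - 1."""
    return len(all_keys) - len(set(all_keys))


# One entry per path; the value is the content key found at that path.
state1 = ["K1"]          # the original file only
state2 = ["K1", "K1"]    # original deleted, two copies present

added = set(state2) - set(state1)      # content that is new in state2
removed = set(state1) - set(state2)    # content that is gone in state2
duplicates_delta = count_duplicates(state2) - count_duplicates(state1)
```

Here both `added` and `removed` are empty, since no content appeared or disappeared, and `duplicates_delta` is 1, because the key went from one path to two.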

jwodder commented 3 years ago

@yarikoptic So, to be clear, given a collections.Counter instance keys1 counting all the occurrences of keys in the earlier commit and an instance keys2 for the later commit, the difference between the commits' distinct assets is reported as something like:

{
    "added": len(keys2.keys() - keys1.keys()),
    "added_size": sum(asset_size(k) for k in keys2.keys() - keys1.keys()),
    "removed": len(keys1.keys() - keys2.keys()),
    "removed_size": sum(asset_size(k) for k in keys1.keys() - keys2.keys()),
}

But are the duplicates reported as just a delta between the total number of duplicates for each commit, like so:

duplicates1 = Counter({k: n-1 for k, n in keys1.items() if n > 1})
duplicates2 = Counter({k: n-1 for k, n in keys2.items() if n > 1})

return {
    "delta": sum(duplicates2.values()) - sum(duplicates1.values()),
    "delta_size": sum(asset_size(k) * n for k, n in duplicates2.items())
                - sum(asset_size(k) * n for k, n in duplicates1.items()),
    "remaining": sum(duplicates2.values()),
}

or as the sums of the number of increases & decreases for each key, like so:

added_duplicates = Counter({k: d for k, n in keys2.items() if (d := n - keys1[k]) > 0})
removed_duplicates = Counter({k: d for k, n in keys1.items() if (d := n - keys2[k]) > 0})

return {
    "added": sum(added_duplicates.values()),
    "added_size": sum(asset_size(k) * n for k, n in added_duplicates.items()),
    "removed": sum(removed_duplicates.values()),
    "removed_size": sum(asset_size(k) * n for k, n in removed_duplicates.items()),
    "remaining": sum(n-1 for n in keys2.values() if n > 1),
}

?

yarikoptic commented 3 years ago

seems to be right, and I think the former one would suffice for duplicates.
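Putting the agreed pieces together (distinct-key diff plus the former, delta-style duplicates report) might look like the sketch below. The `diff_stats` name, the nested result layout, and passing `asset_size` in as a callable are assumptions made for the example, not the script's actual interface.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def diff_stats(
    keys1: Iterable[str],
    keys2: Iterable[str],
    asset_size: Callable[[str], int],
) -> Dict[str, Dict[str, int]]:
    """Summarize changes between two commits given their per-path key lists."""
    c1, c2 = Counter(keys1), Counter(keys2)
    added = c2.keys() - c1.keys()      # content only in the later commit
    removed = c1.keys() - c2.keys()    # content only in the earlier commit

    def dup_count(counts: Counter) -> int:
        # A key present at n paths counts as n - 1 duplicates.
        return sum(n - 1 for n in counts.values())

    def dup_size(counts: Counter) -> int:
        return sum(asset_size(k) * (n - 1) for k, n in counts.items())

    return {
        "assets": {
            "added": len(added),
            "added_size": sum(asset_size(k) for k in added),
            "removed": len(removed),
            "removed_size": sum(asset_size(k) for k in removed),
        },
        "duplicates": {
            "delta": dup_count(c2) - dup_count(c1),
            "delta_size": dup_size(c2) - dup_size(c1),
            "remaining": dup_count(c2),
        },
    }
```

With sizes `{"K1": 10, "K2": 20, "K3": 30}`, going from `["K1", "K2", "K2"]` to `["K1", "K1", "K1", "K3"]` reports one asset added (30 bytes), one removed (20 bytes), a duplicates delta of 1 with a size delta of 0, and 2 duplicates remaining.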