datalad / datalad-metalad

Next generation metadata handling

add annex sizes information to `annex` metadata #180

Open yarikoptic opened 5 years ago

yarikoptic commented 5 years ago

It could be quite useful to know how big a dataset is without installing it.

  1. We could easily include the output of the `git annex info` call, such as (remotes removed here):

    $> git annex info --json --bytes | jq .
    {
      "local annex size": "12795203",
      "size of annexed files in working tree": "39538339451",
      ...
      "backend usage": {
        "SHA256E": 1445
      },
      "local annex keys": 3,
      "available local disk space": "101327969728 (+1000000 reserved)",
      "annexed files in working tree": 1445
    }
  2. The size most often of interest is that of the dataset together with all of its subdatasets, so we should aggregate that information, either during metadata aggregation or "dynamically" somehow, since the size information for the subdatasets would already be available.

  3. This could be relevant to https://github.com/datalad/datalad/issues/2403. At the moment the web UI does a similar size extraction, but does it for each file and directory as well. If we also maintained size information per file (for annexed files it can typically be extracted from the key, which we already carry; for files in git we would need to record it anyway), it could be used to estimate directory sizes "on the fly" (unless we eventually start providing metadata at the directory level, which is feasible in many cases, such as subject info in BIDS per `sub-` directory).
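
To make points 2 and 3 a bit more concrete, here is a minimal Python sketch of how per-file sizes might be read from annex keys and rolled up into directory sizes "on the fly". It assumes the usual git-annex key layout, in which an optional `s<bytes>` field sits between the backend name and the `--` separator. The helper names `key_size` and `directory_sizes` are made up for illustration and are not part of datalad or git-annex.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def key_size(key):
    """Return the size in bytes encoded in an annex key, or None if absent.

    Assumes the key looks like BACKEND[-s<bytes>][-other fields]--NAME.
    """
    head = key.split("--", 1)[0]
    # Skip the backend name; the remaining fields are single-letter tagged.
    for field in head.split("-")[1:]:
        if field.startswith("s") and field[1:].isdigit():
            return int(field[1:])
    return None


def directory_sizes(file_sizes):
    """Sum file sizes into every parent directory ("." is the dataset root).

    `file_sizes` maps POSIX-style relative paths to sizes in bytes.
    """
    totals = defaultdict(int)
    for path, size in file_sizes.items():
        for parent in PurePosixPath(path).parents:
            totals[str(parent)] += size
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical keys and paths, for illustration only.
    keys = {
        "sub-01/anat/T1w.nii.gz": "SHA256E-s100--aaaa.nii.gz",
        "sub-01/func/bold.nii.gz": "SHA256E-s200--bbbb.nii.gz",
    }
    sizes = {path: key_size(key) for path, key in keys.items()}
    sizes["README"] = 5  # a file kept in git: its size has to be recorded separately
    print(directory_sizes(sizes))
```

Keys without a size field (some URL keys, for example) yield `None` here, so a real implementation would have to decide how to count them.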

mih commented 5 years ago

https://github.com/datalad/datalad-revolution/pull/84 now yields this for a complete dataset, and each file result also has size info.

...
    "datalad_core": {
      "@id": "d97455f592d5ad610efc3701d1479aff3513452b",
      "authors": [
        ...
        "Michael Hanke <michael.hanke@gmail.com>",
        ....
      ],
      "contentbytesize": 14511881778,
      "dateCreated": "2015-10-01T13:37:24+02:00",
      "dateModified": "2018-05-11T09:23:33+02:00",
      "distribution": [
        {
          "name": "mddatasrc",
          "url": "http://psydata.ovgu.de/studyforrest/phase2/.git"
        },
        {
          "name": "origin",
          "url": "https://github.com/psychoinformatics-de/studyforrest-data-phase2"
        }
      ],
      "hasPart": [
        {
          "name": "src/lab-eyetracking",
          "type": "Dataset"
        }
      ],
      "identifier": "5eaff716-54eb-11e8-803d-a0369f7c647e",
      "version": "0-75-ge9f5a08"
    },
...
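
Given records like the one above, the aggregation over subdatasets asked for in the original post could be sketched roughly as follows. This assumes that each record's `contentbytesize` covers only that dataset's own content and that subdataset records are available keyed by their path relative to the top-level dataset. Neither assumption is confirmed in this thread, and the `records` layout and the `total_size` helper are hypothetical.

```python
def total_size(records, path=""):
    """Sum `contentbytesize` of the dataset at `path` and all its subdatasets.

    `records` maps a dataset path (relative to the superdataset, "" for the
    superdataset itself) to its `datalad_core`-style metadata dict.
    Subdatasets without a record contribute 0.
    """
    record = records.get(path)
    if record is None:
        return 0
    total = record.get("contentbytesize", 0)
    for part in record.get("hasPart", []):
        if part.get("type") == "Dataset":
            sub_path = f"{path}/{part['name']}" if path else part["name"]
            total += total_size(records, sub_path)
    return total


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    records = {
        "": {
            "contentbytesize": 100,
            "hasPart": [{"name": "src/lab-eyetracking", "type": "Dataset"}],
        },
        "src/lab-eyetracking": {"contentbytesize": 50},
    }
    print(total_size(records))
```

If the reported `contentbytesize` already includes subdataset content, this would double count, so that needs checking first.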