datalad / datalad-metalad

Next generation metadata handling

Extend `metalad_runprov` extractor with all run command properties #264

Open jsheunis opened 2 years ago

jsheunis commented 2 years ago

Unless I'm using this extractor incorrectly, it looks like metalad_runprov currently does not output properties related to the actual run command (such as cmd, inputs, outputs, etc.).

E.g. if we use this dataset: hcp_wm_preprocessed, with an example RUNCMD commit:

>> datalad clone https://github.com/datalad-datasets/hcp_wm_preprocessed.git
[INFO   ] scanning for annexed files (this may take some time)
[INFO   ] Remote origin not usable by git-annex; setting annex-ignore
[INFO   ] https://github.com/datalad-datasets/hcp_wm_preprocessed.git/config download failed: Not Found
install(ok): /Users/jsheunis/Documents/psyinf/Data/hcp_wm_preprocessed (dataset)
>> cd hcp_wm_preprocessed
>> datalad meta-extract -d . metalad_runprov | jq .
{
  "type": "dataset",
  "dataset_id": "5a70db74-bd40-11ea-b9e8-a0369f287950",
  "dataset_version": "121398dfb72a87c093ac821179a5ba91cf955d0a",
  "extractor_name": "metalad_runprov",
  "extractor_version": "---",
  "extraction_parameter": {},
  "extraction_time": 1657655418.684576,
  "agent_name": "Stephan Heunis",
  "agent_email": "s.heunis@fz-juelich.de",
  "extracted_metadata": {
    "@context": "http://openprovenance.org/prov.jsonld",
    "@graph": [
      {
        "@id": "6d5f5a3e11b399a2988ca24657dad65ceb6d382d",
        "@type": "activity",
        "atTime": "2022-02-28T03:57:03+01:00",
        "prov:wasAssociatedWith": {
          "@id": "d3765bf6e3a68497b42584fa0774695e"
        },
        "rdfs:comment": "[DATALAD RUNCMD] Assemble HCP dataset subset for preprocessed task fMRI for the working memory task data.\n\nSpecifically, these are the files:\n\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/EVs/*.txt\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/Movement_Regressors.txt\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_??.nii.gz\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_??_SBRef.nii.gz\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_*.fsf\n\nfor each participant. The structure of the directory tree and file names are kept identical to the full HCP dataset"
      },
      {
        "@id": "737fecac13767696d0aed0c864170b423cde3b15",
        "@type": "activity",
        "atTime": "2020-07-04T14:36:51+02:00",
        "prov:wasAssociatedWith": {
          "@id": "d3765bf6e3a68497b42584fa0774695e"
        },
        "rdfs:comment": "[DATALAD RUNCMD] Assemble HCP dataset subset for preprocessed task fMRI for the working memory task data.\n\nSpecifically, these are the files:\n\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/EVs/*.txt\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/Movement_Regressors.txt\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_??.nii.gz\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_??_SBRef.nii.gz\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_*.fsf\n\nfor each participant. The structure of the directory tree and file names are kept identical to the full HCP dataset"
      },
      {
        "@id": "9292fb25fcf95abce0ec04c8a2e576750318a494",
        "@type": "activity",
        "atTime": "2020-07-04T19:38:17+02:00",
        "prov:wasAssociatedWith": {
          "@id": "d3765bf6e3a68497b42584fa0774695e"
        },
        "rdfs:comment": "[DATALAD RUNCMD] Import documentation from HCP dataset"
      },
      {
        "@id": "d3765bf6e3a68497b42584fa0774695e",
        "@type": "agent",
        "name": "Adina Wagner",
        "email": "adina.wagner@t-online.de"
      }
    ]
  }
}

This shows that the extracted information includes the @id (shasum), the @id of the agent associated with the commit (prov:wasAssociatedWith), and the commit message (rdfs:comment). But no information about the run command, inputs and outputs (see the full run command info in the example below) is included in the extracted metadata:

[DATALAD RUNCMD] Assemble HCP dataset subset for preprocessed task fMRI for the working memory task data.

Specifically, these are the files:

- <sub>/MNINonLinear/Results/tfMRI_WM_*/EVs/*.txt
- <sub>/MNINonLinear/Results/tfMRI_WM_*/Movement_Regressors.txt
- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_??.nii.gz
- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_??_SBRef.nii.gz
- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_*.fsf

for each participant. The structure of the directory tree and file names are kept identical to the full HCP dataset

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "find .hcp/HCP1200/ -maxdepth 6 -path '*/MNINonLinear/Results/tfMRI_WM_*/EVs/*' -name '*.txt' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_*.fsf' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_??.nii.gz' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_??_SBRef.nii.gz' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_*.fsf' -o  -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'Movement_Regressors.txt' | sed -e 's#\\(\\.hcp/HCP1200\\)\\(.*\\)#\\1\\2\\x00.\\2#' |  datalad copy-file -r --specs-from -",
 "dsid": "5a70db74-bd40-11ea-b9e8-a0369f287950",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [
  "[0-9]*"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^

For this extracted metadata to be maximally useful for datalad-catalog, the extended info would ideally also be included in the extractor output.

From a cursory glance at the code, it looks like some of this might already be parsed, or was intended to be parsed but is not yet implemented, and is therefore not included in the extractor output.
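For illustration, the run record shown above sits between two marker lines in the commit message, so it can be separated from the human-readable part with plain string handling. The following is a minimal sketch, not the extractor's actual implementation: the function name `split_run_record` is hypothetical, and the sample message is an abbreviated, made-up one that only mimics the layout of the example above.

```python
import json

# Marker lines as they appear in the commit message shown above.
START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def split_run_record(message):
    """Return (human_message, run_record) for a commit message.

    run_record is None when the message carries no complete marker pair.
    (Hypothetical helper, for illustration only.)
    """
    head, sep, rest = message.partition(START)
    if not sep:
        return message.strip(), None
    body, sep, _ = rest.partition(END)
    if not sep:
        return message.strip(), None
    return head.strip(), json.loads(body)


# Abbreviated, made-up message that follows the layout shown above.
sample = """[DATALAD RUNCMD] Copy some files

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "cp {inputs} .",
 "exit": 0,
 "inputs": ["src/*.md"],
 "outputs": ["*.md"],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""

text, record = split_run_record(sample)
print(text)
print(record["cmd"], record["inputs"], record["outputs"])
```

With something along these lines, `cmd`, `inputs` and `outputs` become available as regular dictionary entries that an extractor could pass on.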

jsheunis commented 2 years ago

Looked into the runprov.py code and saw this comment:

                    # TODO extend message with formatted run record
                    # targeted for human consumption (but consider
                    # possible leakage of information from sidecar
                    # runrecords)

This would explain the absence of run record information. I'm interested in understanding the threat w.r.t. information leakage.

The run record info could easily be added to the graph item (here assuming no intermediate formatting; see code), e.g.:

                graph.append({
                    '@id': actsha,
                    '@type': 'activity',
                    'atTime': rec['commit_date'],
                    'prov:wasAssociatedWith': {
                        '@id': agent_id,
                    },
                    # TODO extend message with formatted run record
                    # targeted for human consumption (but consider
                    # possible leakage of information from sidecar
                    # runrecords)
                    'rdfs:comment': rec['message'],
                    'run_record': rec['run_record'],
                })

and this then outputs the expected information when the extractor is run with metalad:

{
    "type": "dataset",
    "dataset_id": "5a70db74-bd40-11ea-b9e8-a0369f287950",
    "dataset_version": "121398dfb72a87c093ac821179a5ba91cf955d0a",
    "extractor_name": "metalad_runprov",
    "extractor_version": "---",
    "extraction_parameter": {},
    "extraction_time": 1661890701.939269,
    "agent_name": "Stephan Heunis",
    "agent_email": "s.heunis@fz-juelich.de",
    "extracted_metadata": {
        "@context": "http://openprovenance.org/prov.jsonld",
        "@graph": [
            {
                "@id": "6d5f5a3e11b399a2988ca24657dad65ceb6d382d",
                "@type": "activity",
                "atTime": "2022-02-28T03:57:03+01:00",
                "prov:wasAssociatedWith": {
                    "@id": "d3765bf6e3a68497b42584fa0774695e"
                },
                "rdfs:comment": "[DATALAD RUNCMD] Assemble HCP dataset subset for preprocessed task fMRI for the working memory task data.\n\nSpecifically, these are the files:\n\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/EVs/*.txt\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/Movement_Regressors.txt\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_??.nii.gz\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_??_SBRef.nii.gz\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_*.fsf\n\nfor each participant. The structure of the directory tree and file names are kept identical to the full HCP dataset",
                "run_record": {
                    "chain": [
                        "737fecac13767696d0aed0c864170b423cde3b15"
                    ],
                    "cmd": "find .hcp/HCP1200/ -maxdepth 6 -path '*/MNINonLinear/Results/tfMRI_WM_*/EVs/*' -name '*.txt' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_*.fsf' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_??.nii.gz' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_??_SBRef.nii.gz' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_*.fsf' -o  -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'Movement_Regressors.txt' | sed -e 's#\\(\\.hcp/HCP1200\\)\\(.*\\)#\\1\\2\\x00.\\2#' |  datalad copy-file -r --specs-from -",
                    "dsid": "5a70db74-bd40-11ea-b9e8-a0369f287950",
                    "exit": 0,
                    "extra_inputs": [],
                    "inputs": [],
                    "outputs": [
                        "[0-9]*"
                    ],
                    "pwd": "."
                }
            },
            {
                "@id": "737fecac13767696d0aed0c864170b423cde3b15",
                "@type": "activity",
                "atTime": "2020-07-04T14:36:51+02:00",
                "prov:wasAssociatedWith": {
                    "@id": "d3765bf6e3a68497b42584fa0774695e"
                },
                "rdfs:comment": "[DATALAD RUNCMD] Assemble HCP dataset subset for preprocessed task fMRI for the working memory task data.\n\nSpecifically, these are the files:\n\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/EVs/*.txt\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/Movement_Regressors.txt\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_??.nii.gz\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_??_SBRef.nii.gz\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_*.fsf\n\nfor each participant. The structure of the directory tree and file names are kept identical to the full HCP dataset",
                "run_record": {
                    "chain": [],
                    "cmd": "find .hcp/HCP1200/ -maxdepth 6 -path '*/MNINonLinear/Results/tfMRI_WM_*/EVs/*' -name '*.txt' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_*.fsf' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_??.nii.gz' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_??_SBRef.nii.gz' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_*.fsf' -o  -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'Movement_Regressors.txt' | sed -e 's#\\(\\.hcp/HCP1200\\)\\(.*\\)#\\1\\2\\x00.\\2#' |  datalad copy-file -r --specs-from -",
                    "dsid": "5a70db74-bd40-11ea-b9e8-a0369f287950",
                    "exit": 0,
                    "extra_inputs": [],
                    "inputs": [],
                    "outputs": [
                        "[0-9]*"
                    ],
                    "pwd": "."
                }
            },
            {
                "@id": "9292fb25fcf95abce0ec04c8a2e576750318a494",
                "@type": "activity",
                "atTime": "2020-07-04T19:38:17+02:00",
                "prov:wasAssociatedWith": {
                    "@id": "d3765bf6e3a68497b42584fa0774695e"
                },
                "rdfs:comment": "[DATALAD RUNCMD] Import documentation from HCP dataset",
                "run_record": {
                    "chain": [],
                    "cmd": "datalad copy-file -t . {inputs}",
                    "dsid": "5a70db74-bd40-11ea-b9e8-a0369f287950",
                    "exit": 0,
                    "extra_inputs": [],
                    "inputs": [
                        ".hcp/*.md"
                    ],
                    "outputs": [
                        "*.md"
                    ],
                    "pwd": "."
                }
            },
            {
                "@id": "d3765bf6e3a68497b42584fa0774695e",
                "@type": "agent",
                "name": "Adina Wagner",
                "email": "adina.wagner@t-online.de"
            }
        ]
    }
}

This would be useful for the catalog, since there I'd like to have a visual representation of which command was run on which input state of a dataset to generate which output state of said dataset. But the question is: would adding this info to the extractor open up some security issue that I am not fully appreciating?

PS: we should probably upgrade the runprov extractor to make use of the more recent DatasetMetadataExtractor class.

mih commented 2 years ago

I think the comment was motivated by the fact that run records could have been intentionally placed into sidecar files, because these can be annex'ed (rather than being unconditionally included in the repo history) and can therefore live in a protected-access zone. If a generic extractor unconditionally pulled them out, it would torpedo such a setup.

We presently have no concept of "this is sensitive information" in metalad. If we did, that could be the condition. Until then, I think such a feature needs to be documented and made configurable.
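One way to read "configurable" is as an explicit opt-in: the run record is left out by default and only added when the caller asks for it. The sketch below is an assumption about how that could look, not existing metalad behaviour. The function `build_activity` and the flag `include_run_record` are hypothetical names; the dictionary keys are the ones from the snippet quoted earlier in this thread, and how the flag would reach the extractor (for example as an extraction parameter) is left open.

```python
def build_activity(actsha, rec, agent_id, include_run_record=False):
    """Build one 'activity' graph node for a run commit.

    The run record is only attached when include_run_record is True,
    so records kept in protected sidecar files are not exposed by default.
    (Hypothetical helper and flag, for illustration only.)
    """
    node = {
        '@id': actsha,
        '@type': 'activity',
        'atTime': rec['commit_date'],
        'prov:wasAssociatedWith': {'@id': agent_id},
        'rdfs:comment': rec['message'],
    }
    if include_run_record and rec.get('run_record') is not None:
        node['run_record'] = rec['run_record']
    return node


# Made-up record with the same keys as in the quoted snippet.
rec = {
    'commit_date': '2020-07-04T19:38:17+02:00',
    'message': '[DATALAD RUNCMD] Import documentation from HCP dataset',
    'run_record': {'cmd': 'datalad copy-file -t . {inputs}', 'exit': 0},
}

default_node = build_activity('9292fb25', rec, 'agent-1')
detailed_node = build_activity('9292fb25', rec, 'agent-1',
                               include_run_record=True)
print('run_record' in default_node, 'run_record' in detailed_node)
```

Defaulting to the current output keeps existing setups unchanged, and the detailed record only appears when whoever runs the extraction has decided that exposing it is acceptable.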

jsheunis commented 2 years ago

Thanks, this made me learn about sidecar files for run-records :)

Thinking about metalad usage for provenance record extraction from commit messages and/or sidecar files, it will be difficult to derive what security concerns the run-record creator had in mind (if any) when creating the commit (e.g. did they annex it for a reason, or just because they annex everything in the dataset?). So my guess is that the instruction of whether or not to extract detailed run records would need to be specified by the user/process running the extraction (as opposed to some guesstimate logic). IMO this could be done with a new metalad parameter or with a new, separate extractor.

Thoughts? Also tagging in @christian-monch

jsheunis commented 2 years ago

Another property that would be useful for the catalog is the SHA of the commit preceding the run commit, i.e. its parent. This is not currently provided by the runprov extractor, and a few lines would have to be added to obtain and output it.
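As a rough sketch of how the parent could be obtained, `git rev-list --parents -n 1 <sha>` prints the commit followed by its parent SHAs on one line. The helper names below (`parse_parents`, `get_parents`) are hypothetical, the SHAs in the example are shortened placeholders, and this is not how the extractor currently works; it only shows that the lookup itself is small.

```python
import subprocess


def parse_parents(rev_list_line):
    """Split one line of `git rev-list --parents` output.

    Returns (commit_sha, [parent_sha, ...]); the parent list is empty
    for a root commit and has more than one entry for a merge.
    """
    commit, *parents = rev_list_line.split()
    return commit, parents


def get_parents(repo_path, sha):
    """Ask git for the parents of `sha` in the repository at `repo_path`."""
    out = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--parents", "-n", "1", sha],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_parents(out)[1]


# Parsing example with placeholder SHAs (no repository needed).
commit, parents = parse_parents("9292fb2 737feca\n")
print(commit, parents)
```

Returning a list rather than a single SHA keeps merge commits and root commits representable without special cases.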

yarikoptic commented 1 year ago

ATM I see no run command properties exposed at all:

 ~datalad/datalad-extensions  master ▓▒░─
❯ datalad meta-extract metalad_runprov README.md | jq .
{
  "type": "file",
  "dataset_id": "6b923cfa-a6c6-4bae-941d-e92f6afd5fcb",
  "dataset_version": "8d68e91bf1a65e529c6084c6a3f80dd624106c60",
  "path": "README.md",
  "extractor_name": "metalad_runprov",
  "extractor_version": "---",
  "extraction_parameter": {},
  "extraction_time": 1682451663.8101132,
  "agent_name": "Yaroslav Halchenko",
  "agent_email": "debian@onerussian.com",
  "extracted_metadata": {
    "@id": "datalad:SHA1-s26373--a763b07b43fb2670ca5174e27025a506a9f11877",
    "@type": "entity",
    "prov:wasGeneratedBy": {
      "@id": "6af1c9b88222b5ae91d9c091952083a27813272a"
    }
  }
}

whenever the run record has inputs specified. But I also wondered: shouldn't this extractor also follow the inputs and somehow extract, include, or associate their PROV, since that is what establishes their full provenance?