Open jsheunis opened 2 years ago
Looked into the runprov.py code and saw this comment:
# TODO extend message with formatted run record
# targeted for human consumption (but consider
# possible leakage of information from sidecar
# runrecords)
This would explain the absence of run record information. I'm interested in understanding the threat w.r.t. information leakage.
The run record info could be added easily (here assuming no intermediate formatting) to the graph item (see code), e.g.:
graph.append({
    '@id': actsha,
    '@type': 'activity',
    'atTime': rec['commit_date'],
    'prov:wasAssociatedWith': {
        '@id': agent_id,
    },
    # TODO extend message with formatted run record
    # targeted for human consumption (but consider
    # possible leakage of information from sidecar
    # runrecords)
    'rdfs:comment': rec['message'],
    'run_record': rec['run_record'],
})
and this then outputs the expected information when the extractor is run with metalad:
{
"type": "dataset",
"dataset_id": "5a70db74-bd40-11ea-b9e8-a0369f287950",
"dataset_version": "121398dfb72a87c093ac821179a5ba91cf955d0a",
"extractor_name": "metalad_runprov",
"extractor_version": "---",
"extraction_parameter": {},
"extraction_time": 1661890701.939269,
"agent_name": "Stephan Heunis",
"agent_email": "s.heunis@fz-juelich.de",
"extracted_metadata": {
"@context": "http://openprovenance.org/prov.jsonld",
"@graph": [
{
"@id": "6d5f5a3e11b399a2988ca24657dad65ceb6d382d",
"@type": "activity",
"atTime": "2022-02-28T03:57:03+01:00",
"prov:wasAssociatedWith": {
"@id": "d3765bf6e3a68497b42584fa0774695e"
},
"rdfs:comment": "[DATALAD RUNCMD] Assemble HCP dataset subset for preprocessed task fMRI for the working memory task data.\n\nSpecifically, these are the files:\n\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/EVs/*.txt\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/Movement_Regressors.txt\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_??.nii.gz\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_??_SBRef.nii.gz\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_*.fsf\n\nfor each participant. The structure of the directory tree and file names are kept identical to the full HCP dataset",
"run_record": {
"chain": [
"737fecac13767696d0aed0c864170b423cde3b15"
],
"cmd": "find .hcp/HCP1200/ -maxdepth 6 -path '*/MNINonLinear/Results/tfMRI_WM_*/EVs/*' -name '*.txt' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_*.fsf' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_??.nii.gz' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_??_SBRef.nii.gz' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_*.fsf' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'Movement_Regressors.txt' | sed -e 's#\\(\\.hcp/HCP1200\\)\\(.*\\)#\\1\\2\\x00.\\2#' | datalad copy-file -r --specs-from -",
"dsid": "5a70db74-bd40-11ea-b9e8-a0369f287950",
"exit": 0,
"extra_inputs": [],
"inputs": [],
"outputs": [
"[0-9]*"
],
"pwd": "."
}
},
{
"@id": "737fecac13767696d0aed0c864170b423cde3b15",
"@type": "activity",
"atTime": "2020-07-04T14:36:51+02:00",
"prov:wasAssociatedWith": {
"@id": "d3765bf6e3a68497b42584fa0774695e"
},
"rdfs:comment": "[DATALAD RUNCMD] Assemble HCP dataset subset for preprocessed task fMRI for the working memory task data.\n\nSpecifically, these are the files:\n\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/EVs/*.txt\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/Movement_Regressors.txt\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_??.nii.gz\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_??_SBRef.nii.gz\n- <sub>/MNINonLinear/Results/tfMRI_WM_*/tfMRI_WM_*.fsf\n\nfor each participant. The structure of the directory tree and file names are kept identical to the full HCP dataset",
"run_record": {
"chain": [],
"cmd": "find .hcp/HCP1200/ -maxdepth 6 -path '*/MNINonLinear/Results/tfMRI_WM_*/EVs/*' -name '*.txt' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_*.fsf' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_??.nii.gz' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_??_SBRef.nii.gz' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'tfMRI_WM_*.fsf' -o -path '*/MNINonLinear/Results/tfMRI_WM_*/*' -name 'Movement_Regressors.txt' | sed -e 's#\\(\\.hcp/HCP1200\\)\\(.*\\)#\\1\\2\\x00.\\2#' | datalad copy-file -r --specs-from -",
"dsid": "5a70db74-bd40-11ea-b9e8-a0369f287950",
"exit": 0,
"extra_inputs": [],
"inputs": [],
"outputs": [
"[0-9]*"
],
"pwd": "."
}
},
{
"@id": "9292fb25fcf95abce0ec04c8a2e576750318a494",
"@type": "activity",
"atTime": "2020-07-04T19:38:17+02:00",
"prov:wasAssociatedWith": {
"@id": "d3765bf6e3a68497b42584fa0774695e"
},
"rdfs:comment": "[DATALAD RUNCMD] Import documentation from HCP dataset",
"run_record": {
"chain": [],
"cmd": "datalad copy-file -t . {inputs}",
"dsid": "5a70db74-bd40-11ea-b9e8-a0369f287950",
"exit": 0,
"extra_inputs": [],
"inputs": [
".hcp/*.md"
],
"outputs": [
"*.md"
],
"pwd": "."
}
},
{
"@id": "d3765bf6e3a68497b42584fa0774695e",
"@type": "agent",
"name": "Adina Wagner",
"email": "adina.wagner@t-online.de"
}
]
}
}
This would be useful for the catalog, since there I'd like to have a visual representation of what command was run on which input state of a dataset to generate which output state of said dataset. But the question is: would adding this info to the extractor open up some security issue that I am not fully appreciating?
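To make the catalog use case concrete, the output above could be reduced to one row per run commit. This is only a minimal sketch: it assumes the extractor output carries the run_record key proposed above (which it currently does not), and summarize_runs is a hypothetical helper, not part of metalad or datalad-catalog.

```python
import json


def summarize_runs(record):
    """Return one summary dict per activity that carries a run record.

    `record` is a parsed metalad_runprov result. The 'run_record' key is
    the addition proposed in this issue, not current extractor output.
    """
    graph = record.get("extracted_metadata", {}).get("@graph", [])
    summaries = []
    for item in graph:
        run = item.get("run_record")
        if item.get("@type") != "activity" or run is None:
            continue  # skip agents and activities without a run record
        summaries.append({
            "commit": item["@id"],
            "time": item.get("atTime"),
            "cmd": run.get("cmd"),
            "inputs": run.get("inputs", []),
            "outputs": run.get("outputs", []),
        })
    return summaries


# Abbreviated from the extractor output shown above
example = json.loads("""
{
  "extracted_metadata": {
    "@graph": [
      {"@id": "9292fb25fcf95abce0ec04c8a2e576750318a494",
       "@type": "activity",
       "atTime": "2020-07-04T19:38:17+02:00",
       "run_record": {"cmd": "datalad copy-file -t . {inputs}",
                      "inputs": [".hcp/*.md"],
                      "outputs": ["*.md"]}},
      {"@id": "d3765bf6e3a68497b42584fa0774695e", "@type": "agent"}
    ]
  }
}
""")

for row in summarize_runs(example):
    print(row["commit"][:8], row["cmd"], row["inputs"], "->", row["outputs"])
```

Agent items and activities without a run record are skipped, so the same helper would also work on today's output (it would simply return an empty list).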
PS, we should probably upgrade the runprov extractor to make use of the more recent DatasetMetadataExtractor class
I think the comment was motivated by the fact that run records could have been intentionally placed into sidecar files, because they can be annex'ed (rather than being unconditionally included in the repo history), and therefore live in a protected-access zone. If a generic extractor unconditionally pulled them out, it would torpedo such a setup.
We presently have no concept of "this is sensitive information" in metalad. If we did, that could be the condition. Until then, I think such a feature needs to be documented and made configurable.
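One way "configurable" could look is an explicit opt-in that defaults to off. The following is a rough sketch only: build_activity and its include_run_record flag are hypothetical names, and how such a flag would be passed in from metalad (e.g. via an extraction parameter) is left open here.

```python
def build_activity(actsha, agent_id, rec, include_run_record=False):
    """Build one PROV activity item for a run commit.

    The detailed run record is attached only on explicit opt-in, so that
    records kept in access-protected (annexed) sidecar files are not
    exported by default.
    """
    activity = {
        '@id': actsha,
        '@type': 'activity',
        'atTime': rec['commit_date'],
        'prov:wasAssociatedWith': {'@id': agent_id},
        'rdfs:comment': rec['message'],
    }
    # Hypothetical opt-in: only expose the record when asked to
    if include_run_record and rec.get('run_record') is not None:
        activity['run_record'] = rec['run_record']
    return activity


# Values taken from the example output earlier in this thread
rec = {
    'commit_date': '2020-07-04T19:38:17+02:00',
    'message': '[DATALAD RUNCMD] Import documentation from HCP dataset',
    'run_record': {'cmd': 'datalad copy-file -t . {inputs}'},
}
actsha = '9292fb25fcf95abce0ec04c8a2e576750318a494'
agent_id = 'd3765bf6e3a68497b42584fa0774695e'

default_item = build_activity(actsha, agent_id, rec)
detailed_item = build_activity(actsha, agent_id, rec, include_run_record=True)
print('run_record' in default_item, 'run_record' in detailed_item)
```

With the default, the output stays identical to what the extractor produces today, so existing setups that rely on annexed sidecar files would not be affected.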
Thanks, this made me learn about sidecar files for run-records :)
Thinking about metalad usage for provenance record extraction from commit messages AND/OR sidecar files, it will be difficult to derive what security concerns the run-record creator had in mind (if any) when creating the commit (e.g. did they annex it for a reason, or just because they annex everything in the dataset?). So my guess is that the decision whether or not to extract detailed run records would need to be specified by the user/process running the extraction (as opposed to some guesstimate logic). IMO this could be done with a new metalad parameter or with a new, separate extractor.
Thoughts? also tagging in @christian-monch
Another property that would be useful for the catalog is the SHA of the commit preceding the run-commit, i.e. its parent. This is not currently provided by the runprov extractor and a few lines would have to be added to get hold of and output that.
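Getting hold of the parent could look roughly like the sketch below. It shells out to git rev-list --parents, which prints the commit followed by its parent SHAs; parse_parents and get_parents are hypothetical helper names, and how the extractor would actually access the repository is an assumption.

```python
import subprocess


def parse_parents(rev_list_line):
    """Split one line of `git rev-list --parents` output.

    The line has the form '<commit> <parent1> [<parent2> ...]'; a root
    commit has no parents, a merge commit has more than one.
    """
    commit, *parents = rev_list_line.split()
    return commit, parents


def get_parents(repo_path, sha):
    """Return the parent SHAs of `sha` in the repository at `repo_path`."""
    out = subprocess.run(
        ['git', '-C', repo_path, 'rev-list', '--parents', '-n', '1', sha],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_parents(out.strip())[1]
```

A list is returned rather than a single SHA because a commit can have zero or several parents; for an ordinary run commit the list would have one entry.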
ATM I see no run command properties exposed at all:
~datalad/datalad-extensions master ▓▒░─
❯ datalad meta-extract metalad_runprov README.md | jq .
{
"type": "file",
"dataset_id": "6b923cfa-a6c6-4bae-941d-e92f6afd5fcb",
"dataset_version": "8d68e91bf1a65e529c6084c6a3f80dd624106c60",
"path": "README.md",
"extractor_name": "metalad_runprov",
"extractor_version": "---",
"extraction_parameter": {},
"extraction_time": 1682451663.8101132,
"agent_name": "Yaroslav Halchenko",
"agent_email": "debian@onerussian.com",
"extracted_metadata": {
"@id": "datalad:SHA1-s26373--a763b07b43fb2670ca5174e27025a506a9f11877",
"@type": "entity",
"prov:wasGeneratedBy": {
"@id": "6af1c9b88222b5ae91d9c091952083a27813272a"
}
}
}
whenever the run record has inputs specified. But I also wondered: shouldn't this extractor also follow the inputs and somehow extract/include/associate their PROV, since that is what establishes their full provenance?
Unless I'm using this extractor incorrectly, it looks like metalad_runprov currently does not output properties related to the actual run command (such as cmd, input, output, etc.).
E.g. if we use this dataset: hcp_wm_preprocessed, with an example RUNCMD commit:
This shows that the relevant information extracted includes the @id (shasum), the @id of the agent that is associated with the commit (prov:wasAssociatedWith), and the commit message (rdfs:comment). But no information about the run command, inputs and outputs (see example full run command info below) is included in the extracted metadata:
The ideal scenario for this extracted metadata to be maximally useful for datalad-catalog would be for the extended info to also be included in the extractor output.
From a cursory glance at the code, it looks like some of this might be parsed, or was intended to be parsed, but is not yet implemented and not included in the extractor output.