An attribute (or dataset?) list software/library which produced that file/dataset etc

yarikoptic commented 4 years ago

I run into situations where it would have been useful to know which tool, and of what version, has produced a given .nwb file. I have looked into a h5dump of files and into the schema and found no indication that there is any field to store such information ATM. But such information could be very useful in many use cases. E.g.

to troubleshoot detected problems with the file,
to reproduce some a analysis (if data stored in .nwb is the result of running an analysis script)
to guarantee (where possible/needed) that the same version(s) of the tool(s) used to save the file so to not introduce tool/version specific changes.

E.g., in BIDS (neuroimaging world)

sidecar files produced by dcm2niix (upon conversion from DICOM) carry its name/version:

http://datasets.datalad.org/dbic/QA/sub-qa/ses-20180910/anat/sub-qa_ses-20180910_acq-MPRAGE_T1w.json

  "ConversionSoftware": "dcm2niix",
  "ConversionSoftwareVersion": "v1.0.20171215 (OpenJPEG build) GCC6.3.0",

and I have relied on that information already multiple times to assure that I do convert using the same version of dcm2niix in a given dataset. Note: this field is not standardized (yet) within BIDS itself, but it might be eventually superseded by the below:

in BIDS common derivatives enhancement proposal there is a provision for "PipelineDescription" record which would include "Description" and "Version". Note: I am yet to resurrect my claim that such information is of importance for BIDS raw datasets as well since they are also not "given by God" and some tool was used to produce them.

I do appreciate the fact that it would be difficult if not impossible to provide a complete and exhaustive provenance record for all software/versions which ever touched a file. But that is where IMHO some information would be better than none, and where nwb standard should provision at least some way to store such information in a standard way.

Edit (20210218):

I think this field/structure (or an additional one) should also incorporate analogous information on extension libraries used to produce this file and embed their namespaces.

bendichter commented 4 years ago

We have an optional /general/source_script, though it is rarely used.

https://nwb-schema.readthedocs.io/en/latest/format.html#id153

https://pynwb.readthedocs.io/en/stable/pynwb.file.html#pynwb.file.NWBFile

oruebel commented 4 years ago

@yarikoptic thanks for the suggestion. I agree that provenance is an important and non-trivial problem.

...there is a provision for "PipelineDescription" record which would include "D

Do you know whether BIDS is compliant with PROV?

yarikoptic commented 4 years ago

I would say "it is compatible" ;) as BIDS itself largely is not following semantic web standards and only very loosely ontologies, it is not "compliant" per se. But there are now efforts to provide a match from BIDS sidecar fields into proper terms/ontologies. So a machine and human readable BIDS record like following (an example from bids derivatives specification)

    "PipelineDescription": {
        "Name": "FMRIPREP",
        "Version": "1.2.5",
        "DockerHubContainerTag": "poldracklab/fmriprep:1.2.5"
        },

could be converted into the largely machine-only readable proper expanded PROV record. Hence "compatible".

I am yet to grasp PROV better myself, and to find and suggest a better generalization within BIDS itself: such record should be not only for "derivatives" datasets but for regular BIDS "raw" (since they are typically produced by some script/pipeline these days) as well. Also it should be not only at the dataset level since any particular file could be produced by a different tool (hence my example above of dcm2niix). But also, and relevant to nwb, it could be that it is not a single life-time event which produced that dataset/file. It could have been open/modified/saved-in-place. If we were to make it even more "compatible" with PROV, we would like to annotate also with generatedAtTime so overall the provenance record is likely to be a list of items pointing to all software which touched it. For IO libraries such as pynwb it might be useful to have some kind of a composite record which would mention the tandem of "pipeline" (e.g. "BrainStorm" record with its version) and IO library ("matnwb"? with its version), so it would be a list of records, possibly with duplicate information across them (that is where proper PROV graph might eventually even provide a more concise representation).

May be @satra or @dbkeator knows a most convenient, both human and machine readable convention which would be PROV-compatible (or may be somehow even some PROV serialization which could be used as is) which could be adopted here?

One nice IMHO example I found is the ASDF, the Adaptable Seismic Data Format (not sure if URL would work for you, let me know). It uses HDF5 for data storage and uses PROV (stored within HDF5, Figure 1) with their seis_prov ontology (Figure 2) to annotate provenance for all elements. So in that it has a nice analogy in that with neural data types in NWB, and might be worth looking closer.

t-b commented 4 years ago

Just for the record I'm using /general/source_script in one of my projects.

satra commented 4 years ago

@oruebel - unfortunately provenance is a word that's being avoided in the BIDS world right now. this may change, as bids moves from raw data to derived data, but the baggage and complexity of semantics and the lack of tooling is not seen favorably. so instead of using a formal model, they are creating key-value pairs to represent current transformations and their parameters.

in a related set of projects, we are using prov to augment what bids stores, while relying on keys and descriptors they are creating.

in the context of nwb, i can see yarik's use case as what produced this file, but does nwb allow data and derived data to exist in the same file? or would it create a new nwb file.

if the former, then metadata would need to be attached to a specific dataset inside the h5 file rather than at the file level itself.

in fact PROV would account for this quite easily.

dataset1 prov:wasDerivedFrom dataset2
dataset1 prov:wasGeneratedBy software/script

and in some recent work we are thinking of provenance as message passing, which would allow us to add single jsonld messages to a prov section of an h5 file if we wanted to. or track these outside in a prov store.

just as in the neuroimaging world, this will require external software to change, so this will be a slow process. but perhaps we can put some of this in to pynwb itself?

tgbugs commented 4 years ago

Chiming in on the PROV example. I have been using a similar prov:wasGeneratedBy pattern for tracking which scripts generate ontology files, however I recently took a closer look at how those predicates are 'supposed' to be used, and discovered that there isn't any simple way to link source code directly to a file it produced, it has to go through a prov:SoftwareAgent which adds complexity to the model (or you conflate the source code for a script with the process of executing that source). The lack of tooling around prov means that this is an easy conflation to make. I personally have no issue with it, but it could cause tools to choke down the road.

On the question of embedding PROV. There are cases where it is impossible to embed all of the PROV information about a file in the file itself (e.g. if you care about start and end times for the process generating that file). This is not particularly surprising and would be a problem with any system, but does mean that the external prov store really is required if you really want to track all the information since it can't go in the file itself. Ultimately this may not matter if there is a way to attach all the relevant prov to internal structures in the file and treat the final creation of the file as an event that doesn't need an external record.

On Sun, Oct 20, 2019 at 10:10 AM Satrajit Ghosh notifications@github.com wrote:

@oruebel https://github.com/oruebel - unfortunately provenance is a word that's being avoided in the BIDS world right now. this may change, as bids moves from raw data to derived data, but the baggage and complexity of semantics and the lack of tooling is not seen favorably. so instead of using a formal model, they are creating key-value pairs to represent current transformations and their parameters.

in a related set of projects, we are using prov to augment what bids stores, while relying on keys and descriptors they are creating.

in the context of nwb, i can see yarik's use case as what produced this file, but does nwb allow data and derived data to exist in the same file? or would it create a new nwb file.

if the former, then metadata would need to be attached to a specific dataset inside the h5 file rather than at the file level itself.

in fact PROV would account for this quite easily.

dataset1 prov:wasDerivedFrom dataset2 dataset1 prov:wasGeneratedBy software/script

and in some recent work we are thinking of provenance as message passing, which would allow us to add single jsonld messages to a prov section of an h5 file if we wanted to. or track these outside in a prov store.

just as in the neuroimaging world, this will require external software to change, so this will be a slow process. but perhaps we can put some of this in to pynwb itself?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/NeurodataWithoutBorders/nwb-schema/issues/319?email_source=notifications&email_token=ABAZYAH3OUZIUZPCLNQLT7TQPSGJXA5CNFSM4I755I32YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEBYO2AY#issuecomment-544271619, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAZYAB2EYOQ7N2M7P24273QPSGJXANCNFSM4I755I3Q .

satra commented 4 years ago

@tgbugs - yes, if you want the full scope of prov the simple statements only get you part of the way. you will have to insert activities and agents and plans to get to full provenance. this can still be done in a message passing way. we have implemented this in our new workflow engine pydra.

about embedding, it really depends. this is the usual tussle between header and blobs or like trying to store a shasum of a file in a file. thus, if i were a purist, i would store the provenance outside of the file. but as long as the prov records are about datasets inside the h5 file, the prov records could also be in the file. if the prov record was about the file itself, then it can get impossible just like the shasum example!

oruebel commented 4 years ago

Thanks for all the helpful feedback. I'll take a look at this issue next after SfN is over.

yarikoptic commented 4 years ago

Until details for proper provenance recording within NWB worked out, I guess we (users) will try to (ab)use /general/source_script field to populate with sufficient information to possibly later convert to proper provenance record. E.g. see https://github.com/SpikeInterface/spikeextractors/issues/290#issuecomment-549490197 for some ad-hoc ideas

yarikoptic commented 4 years ago

Do you think, something like "NWB:N PROV" would be a worthwhile project to propose for the Allen NWB hackathon? Who would be interested to join the effort to push it forward? ;-)

oruebel commented 4 years ago

I think this is a topic that needs a bit more broader discussion, so a breakout session may be a good start to have at the hackathon. @yarikoptic feel free to create a project page on the hackathon repo (instructions are here https://neurodatawithoutborders.github.io/nwb_hackathons/HCK07_2020_Seattle/projects/ )

EnricoScanta commented 4 years ago

I'd like to resurrect this topic. As mentioned, e.g. for sharing data across labs and working collaboratively, it is important to know whether one is using exactly the same data file or a different revision. As discussed, a difference might come from the version of the package or a different script used, but it might even be that the file has been re-created just to to fix or add some metadata, using the same software. For this reason we store in our (not-yet-nwb2) files, a "data release number". This number corresponds to a specific set of software packages, api, scripts, raw data and fixes used. We agree that is would be better to store all this info extensively in the file, but at least the lab who generated the file would/should keep track of its releases. Thus we suggest to add this data_release_number field also in NWB2 files.

yarikoptic commented 3 years ago

I have added to original description an additional element to consider: listing of extensions (and their versions) pertinent to the file. This should also be useful for client tools to be able to identify that they might need to import some extensions before attempting to load the file (without load_namespaces=True I guess). ref: https://github.com/dandi/dandi-cli/issues/395

I think it is not productive (as time showed) to wait to come up with some ultimate "provenance" solution, and rather provide some pragmatically usable/useful one meanwhile. Pretty much I think we need wasGeneratedBy attribute which would be a list containing records with fields

type: {"io", "extension"}
name: e.g. "pynwb" or "allensdk.brain_observatory.ecephys.nwb"
version: e.g. 1.4.1

any other immediately useful/available information pynwb or extension(s) could add?

oruebel commented 3 years ago

listing of extensions (and their versions) pertinent to the file

Since the schema are cached in the file, this information should already be available.

yarikoptic commented 3 years ago

cool! could you please provide a code snippet to get such extensions listing (ideally with their versions)?

oruebel commented 3 years ago

I think it essentially comes (more-or-less) down to pulling https://github.com/hdmf-dev/hdmf/blob/dev/src/hdmf/backends/hdf5/h5tools.py#L136-L176 into a separate function so that you can read all the namespaces from a file without loading them (i.e., without registering with the TypeMap). @rly said he was going to take a look.

yarikoptic commented 3 years ago

just to make sure -- given a list of namespaces, would I be able to tell which of them are of extensions (and not of pynwb or hdmf), and (ideally) which extensions in particular?

oruebel commented 3 years ago

I believe yes. There are only two namespaces that come from NWB: core and hdmf-common, all other namespaces should be extensions. In addition, all extension namespaces should have the prefix ndx per our nameing conventions.

yarikoptic commented 3 years ago

Dear @rly, let me know what/when I could try ;)

rly commented 3 years ago

Without going into provenance models, I think it would be useful for reproducibility of a file to have a 1-dimensional array attribute on NWB file where software can add a list of all the software and their dependencies used to generate the file. In particular, we can list that PyNWB 1.4.0 was used. But that's enough because PyNWB is not pinned to particular dependency versions. So we should list HDMF 2.4.0, h5py 2.3.0, numpy x.y.z, Python 3.8.5, etc. And probably also some information about the OS in the same or separate field.

This attribute could have shape (N, 2) or have shape (N, 1) with compound data type (name: text, version: text).

rly commented 3 years ago

We should also update the PyNWB version scheme to include the git hash so that data generated off of the dev branch can be traced back to its code and the schema used if it is not cached.

satra commented 3 years ago

can the attribute be a dictionary {"name": version, ...}

rly commented 3 years ago

NWB does not currently support an arbitrary number of arbitarily named attributes on a group/dataset. If the keys and values are all scalar text values, then the dictionary can be schematized as a dataset of dtype string and shape (N, 2), where first column is key and second column is value. It would also be possible to store it as a JSON string scalar attribute or dataset.

t-b commented 3 years ago

This attribute could have shape (N, 2) or have shape (N, 1) with compound data type (name: text, version: text).

In MIES we are storing a custom Nx2 text dataset a la

grafik

yarikoptic commented 3 months ago

Example of what we have for an example asset in 000027

    "wasGeneratedBy": [
        {
            "id": "urn:uuid:e664a279-71dc-4290-9bb1-1c843e2b19cf",
            "name": "Metadata generation",
            "schemaKey": "Activity",
            "description": "Metadata generated by DANDI cli",
            "wasAssociatedWith": [
                {
                    "url": "https://github.com/dandi/dandi-cli",
                    "name": "DANDI Command Line Interface",
                    "version": "0.21.0",
                    "schemaKey": "Software",
                    "identifier": "RRID:SCR_019009"
                }
            ]
        }
    ]

so attn @rly to possibly look into extending schema with supporting metadata records as expressive.

rly commented 2 months ago

@stephprince the TAB would like to prioritize the ability to store basic provenance information about the software/library + versions used to produce an NWB file or data object. This also follows our discussion at the Developer Hackathon. Could you take on proposing a resolution to move this forward before the planned July release?

stephprince commented 2 months ago

Yes, I can take this on. Are there any meeting notes available from the discussion at the Developer Hackathon?

rly commented 2 months ago

@stephprince here are the notes that I took:

Provenance - what is practical? what is reasonable? who is it for?

Computational provenance, e.g., Docker image - machine-readable, used to re-generate the data perfectly, but not necessarily interpretable, and difficult to store with the data

Scientific provenance - what are the inputs and computation at a high level?

Do we want to know the specific hardware used? The high-level packages used and their versions? The packages and all of their dependencies, e.g., the python environment?

Consensus: make it approachable and usable. Not like a complicated PROV graph that may be super descriptive but too complicated for the average user to make use of or store data in.

While methods exist to store provenance thoroughly outside of the data, e.g., through associating the data with docker images or other information in a database, there is value in having an easily shareable representation.

Where do we start? Don't need to get it perfect from the start.

What should be in the standard? Use cases:

Figure out which packages and versions were used to generate a dataset because a bug was found in a particular version of a package and you want to know whether this data was affected and needs to be re-generated. As long as the data is structured, users can write scripts to read it.

Search for all data in an archive that used a particular package, e.g., kilosort

Compare datasets. Apply one computational process on data in a different dataset. This would require computational provenance. Might be too far and not that common of a use case yet.

Ideas:

name of the software package, version, parameters used, the script or function called.

each NWB object can have this optional field

pynwb could call pip freeze automatically or recommend it in the tutorial.

Don't need to be perfect to start, but having something is better than nothing while we iterate.

yarikoptic commented 2 months ago

any my words of wisdom: "baby steps are better than no steps at all" ;-)

NeurodataWithoutBorders / nwb-schema

An attribute (or dataset?) list software/library which produced that file/dataset etc #319