JuliaDynamics / DrWatson.jl

The perfect sidekick to your scientific inquiries
https://juliadynamics.github.io/DrWatson.jl/stable/

Possible data provenance functionality #151

Open Datseris opened 4 years ago

Datseris commented 4 years ago

With @JonasIsensee, @tamasgal and @sebastianpech we discussed that smaller groups of scientists may not find it sensible to opt for large data management software such as CaosDB. But it would still be great to have basic data provenance for output data in forms other than .bson.

.bson and similar formats are covered satisfactorily by DrWatson, due to the automatic addition of git info and of the source file that generated them. This is not possible for e.g. figures or CSV files.

What could be possible is to have a central file, next to Project.toml, that is also .toml or .yml based, and works as a dictionary. It maps unique identifiers to a set of properties, the first of which is file, and it just contains the file path relative to the project main folder. The advantage of using toml is that it is human readable and can be searched with Ctrl+F. Notice that specialized parameter searches are more suited for the result of a function like collect_data and thus do not need to be considered for this functionality.

Other properties could be added, like source file used, date produced, savename of parameters used, author, git commit, etc.
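For illustration, an entry in such a central file could look like the following; the file name provenance.toml and all keys and values shown here are hypothetical, not an agreed-upon format:

```toml
# provenance.toml — hypothetical central file next to Project.toml
[fig_lyapunov_01]                       # unique identifier
file = "plots/lyapunov_b=0.1_n=100.pdf" # path relative to the project folder
source = "scripts/lyapunov.jl"
date = "2020-05-04"
params = "b=0.1_n=100"                  # savename of the parameters used
author = "Datseris"
commit = "v1.10.2-3-gabcdef"
```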

All in all this is a great compromise between the complexity of a full data manager and having data provenance for figures, CSV, etc.

Datseris commented 4 years ago

Actually, I don't immediately see why the mapping should map unique identifiers. Seems to me that the format could directly map a file name (with its relative path) to the dictionary. The file name is also unique, after all.

sebastianpech commented 4 years ago

The advantage of using toml is that it is human readable and can be searched with Ctrl+F

I think one advantage of using a binary file format is that we can attach Julia types as metadata. So I could theoretically directly attach the parameter config dict that led to this specific file, instead of converting it into a string beforehand.

The search functionality must then of course be implemented in DrWatson.

Actually, I don't immediately see why the mapping should map unique identifiers. Seems to me that the format could directly map a file name (with its relative path) to the dictionary. The file name is also unique, after all.

I think using the filenames as identifiers is fine. It also suggests that the database file is only used for storing metadata for files and not for storing arbitrary data entries.

Datseris commented 4 years ago

I think one advantage of using a binary file format is that we can attach Julia types as metadata. So I could theoretically directly attach the parameter config dict that led to this specific file, instead of converting it into a string beforehand.

I thought about this as well, but it has a significant downside: file size will explode quickly...? We should compare. In fact, if we use a central .bson file as the provenance store, we can do this provenance thingy pretty much immediately. Writing a Julia function that does this isn't a big deal...
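A minimal sketch of such a function, assuming a central provenance.bson next to Project.toml; record_provenance! and the database layout are hypothetical, not DrWatson API:

```julia
using DrWatson, BSON, Dates

# Hypothetical helper: store provenance for `file` in a central .bson
# database next to Project.toml, keyed by the project-relative path.
function record_provenance!(file; params = Dict(), source = @__FILE__)
    db_path = projectdir("provenance.bson")
    db = isfile(db_path) ? BSON.load(db_path) : Dict{String,Any}()
    db[relpath(abspath(file), projectdir())] = Dict(
        "source" => source,
        "date"   => string(now()),
        "params" => params,
        "commit" => gitdescribe(projectdir()),  # DrWatson's git info
    )
    BSON.bson(db_path, db)  # rewrites the whole central file on every call
end
```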

The search functionality must then of course be implemented in DrWatson.

This is really hard to do though, and probably not worth the effort. Searching within dictionaries of arbitrary type is also dubious: if a user gives "p" you have to search all keys, and all values that could potentially include "p" or :p, and fields of custom types as well. Too complicated I feel, and it would be a pain in the butt to debug for all possible use cases.

Datseris commented 4 years ago

See #152 for a quick and dirty sketch of the idea.

sebastianpech commented 4 years ago

Writing a Julia function that does this isn't a big deal...

Definitely not too difficult to do. Fitting it into the DrWatson workflow is a little harder. So one fundamental question: Would this functionality replace savename?

Datseris commented 4 years ago

Would this functionality replace savename?

What, never! I use savename for figure titles :D

sebastianpech commented 4 years ago

What never! I use savename for figure titles :D

Clever :)

So we would promote two approaches that have a similar purpose

  1. The current one. With savename, tagsave, ... that only fully works if you can store the metadata alongside your results. (Though, savename is universal, that's what's so nice about it. So storing the parameter set can always be done, no matter the file type)
  2. Storing all info about the simulation in the central database file

Datseris commented 4 years ago

So we would promote two approaches that have a similar purpose

yeap, precisely.

JonasIsensee commented 4 years ago

Before we go off talking about implementation details, I believe we should think clearly about what we actually want from this software and what we need for it to truly add value to the workflow (or the reproducibility). For example: the filename may be unique enough to identify a file, but it can't tell if the file has been modified / overwritten. In that case hash values would be good.
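For reference, a one-liner along these lines would already give such a fingerprint (assuming SHA.jl; any digest would do):

```julia
using SHA  # any cryptographic hash package would do

# Fingerprint a file's contents; if the stored hash no longer matches,
# the file was modified or overwritten after its metadata was recorded.
filehash(path) = open(io -> bytes2hex(sha256(io)), path)
```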

Also - how can we do this to keep it as extensible as possible?

2. Storing all info about the simulation in the central database file

Saving into a central file is dangerous when generating data on multiple workers in parallel. I think collecting the metadata into a database after the fact would be safer.

Datseris commented 4 years ago

I think collecting the metadata into a database after the fact would be safer.

Yeah, but that is why complicated, detached, external-server database software like CaosDB exists. Like you said, we really have to consider what we want to do. I think we all agree that we don't want this to lead to any heavy dependencies....

We should also discuss whether we want to match existing functionalities. Personally, I don't see a reason to try and match the complexity (and capabilities) of those data management tools, as several already provide such options. The same holds for being able to tell whether a file has been modified / overwritten: that again can be managed by such advanced software.

It is also a matter of effort: I definitely can't spend a lot of time on this.

sebastianpech commented 4 years ago

I believe we should think clearly about what we actually want from this software and what we need it to truly add value to the workflow. (or the reproducibility.)

Yes. DrWatson is all about making life easy, so let's start there.

Saving into a central file is dangerous when generating data on multiple workers in parallel. I think collecting the metadata into a database after the fact would be safer.

Good point. Also for me it's not 100% clear when to save to memory and when to actually write the database file. This can become pretty complex. How are we dealing with large IO operations? Can they occur?

Just an idea: what about not having a single file, but one file for each file in the folder structure (so similar to git)? They could be in a folder that's also in the .gitignore. For the user it makes no difference.

Datseris commented 4 years ago

Just an idea: what about not having a single file, but one file for each file in the folder structure (so similar to git)? They could be in a folder that's also in the .gitignore. For the user it makes no difference.

How do you make this user-readable? That is the entire point: you need a format that the user can read, in order to see which commit and/or which parameters led to the creation of the file.

sebastianpech commented 4 years ago

How do you make this user-readable? That is the entire point: you need a format that the user can read, in order to see which commit and/or which parameters led to the creation of the file.

Fair point. So no BSON format then either.

JonasIsensee commented 4 years ago

How do you make this user-readable? That is the entire point: you need a format that the user can read, in order to see which commit and or which parameters lead to the creation of the file.

I'm not sure I agree on this. We have collect_results. I think it would be an option to use collect_results to aggregate the metadata into DataFrames. That should make it searchable and human readable. (And there are multiple Julia packages to help with displaying these in electron windows or the browser) In that case it also shouldn't matter whether we put the metadata right next to the real data or into a separate folder-tree.
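A sketch of what that could look like, assuming the metadata is stored as .bson files in a hypothetical .metadata folder and carries e.g. a commit field:

```julia
using DrWatson, DataFrames

# Aggregate all metadata files into one searchable, human-readable table...
df = collect_results(projectdir(".metadata"))

# ...and then query it like any other DataFrame.
subset = filter(row -> row.commit == "v1.10.2-3-gabcdef", df)
```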

Or, as a completely separate alternative, one could find out whether something like CaosDB can be made much more accessible. Provide a binary wrapper package where you just call caosdb() to start such a container. (Last time I tried it, they had fully functional setups inside Docker.)

sebastianpech commented 4 years ago

So just to wrap this up, what options are we currently talking about?

  1. single file vs. multiple files
  2. plain text vs. binary
  3. external database

single file: As @JonasIsensee pointed out, a single file is tricky when it comes to running simulations in parallel.

multiple files: I kind of like this approach because you don't have to write large chunks of data every time you add new metadata to a file. Also, it supports parallelism. For me it's also currently the only option to store commit info, as I pointed out in https://github.com/JuliaDynamics/DrWatson.jl/issues/153.

plain text: I get that it's nice to be able to search in it with just a text editor. Also, if I have multiple files I can do that using grep, ripgrep, ag, ... or just cat all the files and pipe the output into a new file for searching.

binary file: Has the advantage of not being restricted to string-representable metadata. Searching must be done from within DrWatson (e.g. through collect_results) or any extra piece of software that supports loading that format.

Datseris commented 4 years ago

I thought about this overnight and here is my conclusion:

In many points of the documentation I've actively tried to point out that DrWatson is not a data manager. I honestly think this is a good idea, because there are good, advanced data managers already. What we are talking about here is making DrWatson a data manager. I don't have a problem with that, but we have to be aware that the competition in data management is very high: we would have to work really, really hard to make it as good as other data management software. Of course, we might not care to make it as good. But we would definitely care about making it sufficiently good, and given the existing complexity of data management, this will still be very hard.

There is CaosDB, which is good, and people tried hard to make it good, did research on it, etc. We are also lucky to personally know every member of the dev team. My opinion is to simply integrate CaosDB into Julia (because I don't know if that works now; I don't think so) and make it work well with DrWatson. DrWatson will become a dependency of CaosDB, not the other way around.

This way, if someone wants truly advanced data provenance, etc., they can use CaosDB. It is clear to me that a scientific project manager is always necessary, while a database isn't: you need to have a scientific project to get the data.

The point I am trying to make is that we should be careful not to re-invent the wheel. If you read the CaosDB paper, there are already hundreds of ideas on how to do data management.

Datseris commented 4 years ago

We can contact Alex and ask for help in the integration as well. @salexan2001

salexan2001 commented 4 years ago

Hi, I think that is a good idea and I will definitely help with the integration.

quazgar commented 4 years ago

Hi all!

  • There is also much recent progress on a single-user docker container for CaosDB, so that it is possible to easily run the server on a single machine (which is probably very helpful for numerical simulations). @quazgar is working on this. Is there already a release scheduled?

I am currently working on a Debian package indeed. The package will include a Docker image, sensible default configuration and a daemon script to start up a CaosDB-in-Docker instance in the background. Our rationale: We want to be as independent as possible from specific host system settings. Of course everyone is free to build a leaner package, if they find the time.

As for a tentative release schedule: I hope that we can name a date next week after checking what else is on our agenda. And I must say I am impressed by your plans and looking forward to seeing CaosDB used in DrWatson :smiley:

tamasgal commented 4 years ago

Sorry for coming late to the party. I'd like to show you something which is implemented in a framework I use sometimes, and which is basically the outline of what we are currently aiming for in KM3NeT: https://cta-observatory.github.io/ctapipe/examples/provenance.html

The example above shows the "manual usage". I think such provenance tracking could be easily hooked into existing DrWatson functions.
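For instance, a hook could be as thin as wrapping an existing save call; tracked_save and the toy logger below are illustrative stand-ins, not existing API:

```julia
using DrWatson, Dates

# Toy provenance logger; a real one would append to a provenance store.
log_provenance(path; script) = @info "provenance" path script time=now()

# Hypothetical wrapper: every save is also recorded as an "activity".
function tracked_save(path, data)
    safesave(path, data)                      # existing DrWatson function
    log_provenance(path; script = @__FILE__)  # who/what/when
end
```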

sebastianpech commented 4 years ago

@Datseris @tamasgal @JonasIsensee Out of curiosity, and maybe a bit of boredom, I started coding a simple metadata and parallel simulation extension for DrWatson (https://github.com/sebastianpech/DrWatsonSim.jl).

I explain the two main use cases in the README. Though I kinda don't like the simulation syntax yet, I will give it a try and will likely adapt it. I currently see the project more as a way of checking whether such functionality might improve my workflow.

About the implementation: initially I simply wanted to store bson files with metadata in a folder .metadata. However, as I was aiming to support parallel running jobs, I needed some locking mechanism to generate unique ids and also update the index without race conditions (the index is used to drastically improve querying speed). The locking works surprisingly well. Only one detached process is allowed to update the index and get a new id, and multiple detached processes are allowed to read, unless one process is writing. Even in the worst cases I could produce, I don't have any deadlocks or race conditions.
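One common idiom for this kind of locking, shown here only as a sketch of the general idea (not necessarily what DrWatsonSim.jl actually does), is to exploit the atomicity of mkdir:

```julia
# Directory-as-mutex: `mkdir` succeeds for exactly one process at a time,
# so it can guard index updates across independent Julia processes.
function with_lock(f, lockdir)
    while true
        try
            mkdir(lockdir)   # atomic on most filesystems
            break
        catch e
            e isa Base.IOError || rethrow()
            sleep(0.05)      # lock held by another process; retry
        end
    end
    try
        return f()
    finally
        rm(lockdir)          # release the lock
    end
end

# Usage (update_index! is a hypothetical stand-in):
# with_lock(() -> update_index!(), ".metadata/.lock")
```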

I decided to use the method with incrementing unique ids, because I use those ids in the second scenario for keeping track of simulation runs (e.g. every new run generates a new folder based on the id). Nevertheless, one id is always related to one file only and vice versa. This is just a design decision; theoretically the implementation supports file-independent metadata storage.

In general, the package is built around DrWatson (or a Julia project at least). For example, I only store paths relative to the project directory; this way the metadata folder can be used on other devices as well.

Let me know what you think. I'll keep testing the new workflow, maybe it turns out it's not such a necessary feature after all.

Datseris commented 4 years ago

and maybe a bit of boredom

damn, I have an entire pipeline of projects for JuliaDynamics if you are interested! :D

Jokes aside, thanks a lot for sharing, this seems promising. I will read it in detail and will discuss further on our next meeting! @JonasIsensee , @tamasgal if you guys have some spare time please have a look as well and we can all talk about it! :)

sebastianpech commented 4 years ago

Boredom in a topic-related sense. So quite likely procrastination, actually :)

JonasIsensee commented 4 years ago

Hey @SebastianM-C, this is a really neat idea! I'm definitely going to try this out at some point.

Some questions: Could this be integrated with a cluster queue? How would the queue jobs connect to the metadata guard in that case?

What would happen if the parent process, a.k.a. the one guarding the metadata, dies in the meantime? I guess there should probably be a fallback to save the metadata file in the same folder as the data.

sebastianpech commented 4 years ago

What would happen if the parent process, a.k.a. the one guarding the metadata dies in the meantime?

Once the metadata file is created, reading and writing is no problem. I assume that if you have parallel processes with IO that read and write the same file, i.e. access the same metadata, you have taken some precautions yourself to avoid race conditions.

Could this be integrated with a cluster queue? How would the queue jobs connect to the metadata guard in that case?

I've been thinking about this as well, and in the current implementation it's not possible without limitations. If you can ensure that all jobs access the same folder, it works though.

This makes me wonder whether it might be better to store the metadata with the actual file, e.g. for somefile there is a .somefile.metadata in the same folder. The interface could stay the same, and for the simulation part one would need an alternative method for generating the contiguous ids.

sebastianpech commented 4 years ago

@JonasIsensee @Datseris I changed the method for storing metadata. I thought about cluster computations and the problem of merging the metadata. With incrementing ids this is pretty tough, and one needs an extra step for importing data. What I'm doing now is basically using the .metadata folder as my index. I generate a hash that's unique for every path in the project directory and use that as the name for the metadata file. This way lookup for a known path is very simple, an O(1) operation. Searching for a field value is still O(n).

The huge advantage of this method is that one can now merge two metadata folders just by copying the content over; no need to take care of updating any ids.

To get a unique folder id for a simulation run, I now just consider the one folder that holds all the simulation folders and pick the smallest positive integer that has not been taken by another simulation run.
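A hypothetical reconstruction of that lookup scheme; the exact hashing is an assumption, based on the metadata file names appearing later in this thread:

```julia
using DrWatson

# O(1) lookup: the metadata file name is a hash that is unique per
# project-relative path of the tracked file.
metadata_file(p) = projectdir(".metadata",
    string(hash(relpath(abspath(p), projectdir()))) * ".bson")
```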

Datseris commented 4 years ago

@sebastianpech I've just checked out your code. I think it seems great. I would still like you to explain the function @run live; I had trouble understanding it fully. I hope this functionality can work at the even more fundamental level of having only a single "parameter set", so that dicts are not actually necessary.

I think I understood that all processing is project-directory-relative. However, the end of the README states: If p is a relative path, make it absolute using abspath, otherwise leave p as it is. This confused me a bit and would be another point to clarify.

tamasgal commented 4 years ago

Interesting indeed. I have to look at it closer, although my workflow is quite different. It would probably be cool to demo this in the next call?

sebastianpech commented 4 years ago

@tamasgal

Interesting indeed. I have to look at it closer, although my workflow is quite different. It would probably be cool to demo this in the next call?

@Datseris

I would still like you to explain the function @run live; I had trouble understanding it fully. I hope this functionality can work at the even more fundamental level of having only a single "parameter set", so that dicts are not actually necessary.

Yes, I can show you some stuff tomorrow. @run works for a single parameter also, but I'm not fully satisfied with its flexibility yet; maybe you guys can come up with a better idea.

I think I understood that all processing is project-directory-relative. However, the end of the README states: If p is a relative path, make it absolute using abspath, otherwise leave p as it is. This confused me a bit and would be another point to clarify.

My idea here was that considering paths relative to the cwd allows a workflow where you can quickly look up metadata for a file in the current folder. So if I'm e.g. in the plots directory, I just need to spawn Julia and do:

using DrWatson
@quickactivate
using DrWatsonSim
Metadata("myplot.png") # Aso autocompletion of the path works here

However, to look up the metadata I need the hash of the projectdir-relative path. So I first make the path absolute and then relative again, but now relative to the projectdir.
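In code, that normalization might look like this (project_relative is an illustrative name, not the package's API):

```julia
using DrWatson

# Resolve a user-given path against the cwd first ("if p is a relative
# path, make it absolute using abspath"), then re-express it relative
# to the project root, which is what gets hashed.
project_relative(p) = relpath(isabspath(p) ? p : abspath(p), projectdir())
```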

Tokazama commented 3 years ago

From the above discussion it's still not entirely clear to me what the desired outcome is for this. Is version control at all a part of what you're going for here? Are you simply trying to create a way of intuitively accessing the contents of DrWatson.jl's "data" folder or is it all of the folders that you could end up writing things to (plots, data, notebooks)?

sebastianpech commented 3 years ago

@Tokazama well, it's two parts (managing simulations and storing metadata). Storing the metadata is required for managing the simulations. The thing is, we already store additional data, but this only works if you're saving your data in a format that supports it (bson, json, hdf5, ...). If you store figures or other files which don't support metadata, you can't do it (besides putting everything into the filename).

The metadata interface in DrWatsonSim.jl basically allows storing an additional bson file with any file in your DrWatson.jl project folder and puts all your additional data in there. It also takes care of locking, hash generation, ... (I've been using it for quite some time now and it is pretty robust).

The simulation part then just utilizes this functionality (so the metadata part could actually be a separate package) by creating a folder for each parameter configuration (which works as the working directory for the simulation run) and adding additional info like the used parameters, the script that started the simulation, environment information, ... to the folder's metadata entry.

Tokazama commented 3 years ago

If you store figures or other files which don't support metadata, you can't do it (besides putting everything into the filename).

By this do you mean you store some image but you also want to store some metadata about the date, experiment source, etc.?

The metadata interface in DrWatsonSim.jl basically allows storing an additional bson file with any file in your DrWatson.jl project folder and puts all your additional data in there. It also takes care of locking, hash generation, ... (I've been using it for quite some time now and it is pretty robust).

I think I'm starting to get it, but I'm a bit dense, so you may have to be patient with me. From what you're telling me and what I see in DrWatsonSim.jl, I still have a question about the versioning part. Do you ever overwrite a commit and update/overwrite metadata, or do you just accumulate data with each run?

sebastianpech commented 3 years ago

By this do you mean you store some image but you also want to store some metadata about the date, experiment source, etc.?

Exactly.

I still have a question about the versioning part. Do you ever overwrite a commit and update/overwrite metadata or do you just accumulate data with each run?

Metadata is not checked into version control, and every file (identified by its path) can only have one metadata file. Metadata is stored in a folder in the project directory, which I usually add to .gitignore. The system does throw a warning if you overwrite a file and don't update the metadata.

sebastianpech commented 3 years ago

This is an example with the minimum required configuration. The working directory is empty at the start.

julia> using DrWatsonSim

(@v1.5) pkg> activate . # An env must be activated to define where the metadata folder is created
 Activating new environment at `~/test/metadata/Project.toml`

shell> touch somefile

julia> m = Metadata("somefile") # create or load the metadata entry for "somefile"
[ Info: Metadata directory not found, creating a new one
Metadata()

shell> tree -a
.
├── .metadata
│   └── 15455077044390697181.bson
└── somefile

1 directory, 2 files

julia> using BSON

julia> BSON.load(".metadata/15455077044390697181.bson")
Dict{String,Any} with 3 entries:
  "mtime" => 1.60469e9
  "data"  => Dict{String,Any}() # No data here
  "path"  => "somefile"

julia> m["A"] = 10
10

julia> m["B"] = [1,2,3]
3-element Array{Int64,1}:
 1
 2
 3

julia> BSON.load(".metadata/15455077044390697181.bson")
Dict{String,Any} with 3 entries:
  "mtime" => 1.60469e9
  "data"  => Dict{String,Any}("B"=>[1, 2, 3],"A"=>10)
  "path"  => "somefile"
tamasgal commented 3 years ago

I can quickly explain provenance (as it is currently defined by a small group, including me) for our "largish" experiment. We base our definitions on the W3C recommendations: https://www.w3.org/TR/prov-o/ There is a primer which is a good entrypoint: https://www.w3.org/TR/2013/NOTE-prov-primer-20130430/

The rough idea of data provenance is keeping track of the agent, entity and action throughout a processing chain. An actor is something which does an action and creates an entity, so to say. Everything is uniquely identifiable (via a UUID) and you store the whole provenance in a centrally accessible database.

Maybe a small example of a workflow explains it a bit:

A script is launched: this is done by an agent, e.g. a user or a process which was spawned in an automatic processing pipeline. This is already an activity and gets a UUID assigned. The script itself can now produce some data (entities) and also use other data as input (also entities). Each produced entity again gets a UUID, etc. The provenance information then records all this and stores it in an XML/JSON/YouNameIt file.

In the end, you will end up with a bunch of new entities which can be shared.

The idea behind this is quite obvious: you receive a file the_mysterious_file.bson and want to know its history, basically how it ended up on your machine, back to its very first fobj = open("the_mysterious_file.bson", "w"). You now need the UUID of this file. This UUID can be stored externally, in a provenance file (which is obviously also needed then), or it is stored inside the file. You may or may not also have access to some, or the full, provenance information inside the file or in the external provenance file. That's up to the implementation. In the case of a publicly available provenance database, you could simply query it and ask for the provenance information of a specific UUID. It should then give you all the information it has related to that, including parent activities and other entities.
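A toy sketch of such a record in Julia; the field names are illustrative, not the W3C vocabulary verbatim:

```julia
using UUIDs, Dates

# Minimal PROV-style bookkeeping: one activity that uses an input
# entity and generates an output entity, all identified by UUIDs.
entity(path) = Dict("id" => string(uuid4()), "path" => path)

input  = entity("raw_data.csv")               # hypothetical input file
output = entity("the_mysterious_file.bson")

activity = Dict(
    "id"        => string(uuid4()),
    "agent"     => get(ENV, "USER", "unknown"),
    "started"   => string(now()),
    "used"      => [input["id"]],
    "generated" => [output["id"]],
)
# Serialize `input`, `output` and `activity` to XML/JSON/YouNameIt
# with a package of your choice (e.g. JSON3.jl).
```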

Tokazama commented 3 years ago

So what we really need is a package that provides a generic PROV-O API that all these other file formats (OWL, RDF, XML, etc.) could take advantage of?