Workflow for managing multiple output files, or files with too many parameters, and metadata

stillyslalom commented 2 years ago

I've been using DrWatson.jl to organize my preprocessing code/analysis for an ensemble of experiments performed in a large, complex facility. I've struggled to find an ergonomic workflow that captures the entire pipeline. Issues include:

- Too many experiment-relevant settings to capture in savename format without exceeding filesystem length limits
- Experimental ensembles are collected using the same nominal conditions for each data series, but contain idiosyncrasies (inoperable instruments, flubs, sensor/timing tweaks) that may need to be handled on a case-by-case basis
- External tools (ImageJ, Matlab) and user input fit awkwardly in the middle of a produce_or_load data-processing pipeline that assumes hands-off, end-to-end Julia code
- Data in instrument-specific formats is best kept in the filesystem where it can be handled with specialized external software rather than packed into JLD

This issue is partly a reminder to myself to write suitable documentation once I arrive at a good workflow.

sebastianpech commented 2 years ago

That sounds very familiar to me. You might want to take a look at https://github.com/sebastianpech/DrWatsonSim.jl. It's kind of a spin-off of DrWatson and supports storing metadata, which I use instead of savename. Besides that, I guess it's almost impossible to find a project-independent approach for capturing your whole workflow.

What I do is export all parameter configurations for all my simulations to a note-taking app and add additional annotations and documentation there. This works quite well and can mostly be automated. As DrWatsonSim works with simulation IDs instead of unique savenames, I can always find my notes and simulation results by this ID.

Datseris commented 2 years ago

@stillyslalom can you try this out and let us know how it works for you? @sebastianpech I suggest that we start with writing a "real world example" in the DrWatson docs that showcases (briefly of course) your workflow. Based on that perhaps we can think of integrating DrWatsonSim directly into DrWatson if possible? Seems like many other people have asked for something similar and it always goes back to something like your approach...

sebastianpech commented 2 years ago

@Datseris Yes, seems so. I'll open a PR and polish the code a bit. I think I kept it general enough to integrate it in DrWatson. All features are opt-in, so no existing workflows will break. The metadata directory is only created the first time a metadata-related function is called, so the folder structure also remains unchanged.

sebastianpech commented 2 years ago

Oh, and @Datseris, which file format do we prefer? The version on the master branch uses BSON for storing metadata, but I also have a separate branch using JLD2. I haven't merged it yet as I didn't want to update all my old project folders, but I guess as we have JLD2 as a dependency in DrWatson, I will do the switch.

Datseris commented 2 years ago

Yeah there is a clear statement in DrWatson now for JLD2 preference since 2.0. Do note that JLD2 cannot save functions though. Do you need this?

sebastianpech commented 2 years ago

Aha! I remember why I didn't do the switch yet. Yes, I store all kinds of stuff in the BSON files. However, it's not a requirement for the metadata functions to work. I think it's easy to work around this. I will add a note to the docs.

Datseris commented 2 years ago

We have a similar difficulty in Agents.jl where we can't store functions-part of the model, only parameters that are not functions. However, JLD2 can store function-like objects with e.g., singleton dispatch.

struct Object end
(o::Object)(x, y) = x+y

instances of Object can be stored and when loaded will behave like functions.

sebastianpech commented 2 years ago

Interesting syntax. Never seen it. Cool. Good to know.

Datseris commented 2 years ago

Isn't it possible to make the system backend independent? Just choose via a keyword or environment variable which save backend to use?

sebastianpech commented 2 years ago

Sure, that's a good option.
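
A minimal sketch of what such a switch could look like. Everything here is hypothetical: the names save_metadata, load_metadata and the METADATA_BACKEND environment variable are made up for illustration, and the stdlib Serialization module stands in for a real JLD2 or BSON backend.

```julia
using Serialization  # stdlib, used only as a stand-in backend in this sketch

# Hypothetical: read the backend from an environment variable unless the
# caller passes the `backend` keyword explicitly.
default_backend() = Symbol(get(ENV, "METADATA_BACKEND", "serialization"))

# User-facing functions: turn the backend symbol into a Val and dispatch on it.
save_metadata(path, data; backend = default_backend()) =
    save_metadata(Val(backend), path, data)
load_metadata(path; backend = default_backend()) =
    load_metadata(Val(backend), path)

# One method pair per backend; a JLD2 or BSON backend would add its own pair.
save_metadata(::Val{:serialization}, path, data) = serialize(path, data)
load_metadata(::Val{:serialization}, path) = deserialize(path)

# Fallbacks so that an unknown backend fails with a readable message.
save_metadata(::Val{B}, path, data) where {B} = error("unsupported backend: $B")
load_metadata(::Val{B}, path) where {B} = error("unsupported backend: $B")
```

With this layout the calling code never mentions a file format, so switching backends would only mean changing the keyword or the environment variable.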

JonasIsensee commented 2 years ago

> We have a similar difficulty in Agents.jl where we can't store functions-part of the model, only parameters that are not functions. However, JLD2 can store function-like objects with e.g., singleton dispatch.
>
> struct Object end
> (o::Object)(x, y) = x+y
>
> instances of Object can be stored and when loaded will behave like functions.

Hi, quick note: this only works when the Object type and its method are already defined in the new session.

sebastianpech commented 2 years ago

I just tested this

using JLD2 # makes FileIO's save/load available
foo(x) = x^2
save("test.jld2", Dict("fun"=>foo))
load("test.jld2")["fun"](2) # Gives 4

and it works. @Datseris what did you mean by storing functions? Do you mean actually storing them so that they can be loaded without having to define them again in the session that loads the file?

I just ran through 1000+ metadata files I stored and I can convert all of them from BSON to JLD2.

Datseris commented 2 years ago

@JonasIsensee please comment here, as far as I know JLD2 cannot save functions. (Or, to put it in a better way: it is advised to not save functions with it. By whom, I actually do not remember anymore)

liuyxpp commented 2 years ago

> I just ran through 1000+ metadata files I stored and I can convert all of them from BSON to JLD2

Is it possible to use toml or yml files for storing metadata?

Before I knew about DrWatson, I used to do my simulation work by storing the configurations, the metadata, and the results (which are obtained by analyzing the simulation data on the fly) in separate yml files. I really like the produce_or_load and collect_results functions, so I want to give DrWatson a try :)

Datseris commented 2 years ago

@liuyxpp to me it feels like using Yml would lead to artificial limitations. I mean, why? Why use this format instead of a native Julia format? In your use case it might be that every metadata that you save is either a number or a string, but why not have the possibility to save arbitrary Julia types as metadata? That's exactly why one should go with JLD2 or BSON.

liuyxpp commented 2 years ago

Yes, only strings and numbers are stored. I use yml partly because sometimes I want to check the files by eye. The other reason is my simulation work involves several programs and some of them are written in C++. But I am working on rewriting them in Julia.

I get your point now and the native Julia format is OK for me once I have done the transition of my simulation programs.

sebastianpech commented 2 years ago

@liuyxpp I get your point. I usually additionally convert the parameters dict into a string representation and store this separately to quickly see the parameters I used for each simulation. You could, for example, auto-generate this string representation and store it in an additional file in each simulation directory:

if in_simulation_mode()
    m = Metadata(simdir()) # load metadata for this simulation
    # store m["parameters"] somehow in simdir()
end

simdir works just like the directory functions in DrWatson. If you are in an active simulation, simdir("params.toml") resolves to [absolute path to simulation directory]/params.toml.
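
One way to fill in the "store somehow" step, as a sketch only: the helper write_params is hypothetical and takes a plain directory path, so it runs without DrWatsonSim. In the snippet above one would presumably pass simdir() as the directory and m["parameters"] as the dict. It assumes the parameters have string keys and TOML-compatible values (numbers, strings, booleans, arrays, nested dicts).

```julia
using TOML  # stdlib

# Hypothetical helper: write a parameter dict as human-readable TOML into `dir`
# and return the path of the file that was written.
function write_params(dir, params::AbstractDict)
    path = joinpath(dir, "params.toml")
    open(path, "w") do io
        # `sorted = true` keeps the key order stable between runs,
        # which makes the files easier to compare by eye or with diff.
        TOML.print(io, params; sorted = true)
    end
    return path
end
```

The resulting file can be read back with TOML.parsefile, or by non-Julia tools, which may help while parts of a pipeline are still written in other languages.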

Datseris commented 2 years ago

I now realize this is useful for me as well. I'm saving my simulations directly into a master dataframe. But now I want to make a new column and everything breaks down. Then I realized we have this super awesome collect_results function. But my input data are simply too large and complex for me to use savename to uniquely extract a file name for each simulation. So this is where the hashing of DrWatsonSim would be very useful! @sebastianpech did you make any progress on getting this PR started?

sebastianpech commented 2 years ago

@Datseris I'm a little bit low on time at the moment. I'll try to get it started over the weekend.

Datseris commented 2 years ago

no stress!

Ickaser commented 2 years ago

Bumping to ask--did this ever get implemented into DrWatson proper? This sounds very useful to me. I can also try using DrWatsonSim on its own, of course, which I may do in the meantime.

Datseris commented 2 years ago

In PR https://github.com/JuliaDynamics/DrWatson.jl/pull/366 I am providing an intermediate solution that uses hash on the given configuration container passed to produce_or_load, to provide a unique string for more complicated input configurations.
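
The general idea can be sketched as follows. This is not the code from the PR; hashed_name is a hypothetical helper that only illustrates how a hash turns an arbitrarily large configuration into a short, fixed-length file name. As far as I know, the values returned by Base.hash are not guaranteed to be stable across Julia versions, so treat that as a caveat of this sketch.

```julia
# Hypothetical helper: build a file name from the hash of a configuration
# container instead of spelling out every parameter in the name.
hashed_name(config; prefix = "sim", suffix = "jld2") =
    string(prefix, "_", string(hash(config); base = 16), ".", suffix)

config = Dict("pressure" => 1.5, "camera" => "A", "delays" => [0.1, 0.2, 0.4])
name = hashed_name(config)

# Dicts and arrays are hashed by content, so an equal configuration
# maps to the same name.
same = hashed_name(copy(config)) == name
```

The name stays the same length no matter how many settings the container holds, which sidesteps the filesystem length limits mentioned at the top of the thread. The trade-off is that the name is no longer human-readable, so the parameters need to be stored somewhere else.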

JuliaDynamics / DrWatson.jl

Workflow for managing multiple output files, or files with too many parameters, and metadata #316