VIDA-NYU / reprozip

ReproZip is a tool that simplifies the process of creating reproducible experiments from command-line executions, a frequently-used common denominator in computational science.
https://www.reprozip.org/
BSD 3-Clause "New" or "Revised" License

`reprozip combine` creates non-canonical config.yml #265

Closed kaczmarj closed 7 years ago

kaczmarj commented 7 years ago

Hello, I am using reprozip combine to combine multiple traces. The output config.yml file has the key additional_patterns, which to my knowledge makes it non-canonical. Is this behavior intended?

I ask because I am working on a script to merge multiple reprozip pack files. All of the pack files I have been trying to merge are version 2, and none of the config.yml files has the key additional_patterns. The config.yml file created by reprozip combine does contain this key. This causes the following error when trying to unpack the merged pack file:

reprounzip.common.InvalidConfig: Canonical configuration file shouldn't have additional_patterns key anymore

Here is the script I wrote to merge pack files. I'm happy to share the traces databases.

Testing whether 'additional_patterns' is present in the various config.yml files:

(repro) root@3502d3e935d5:~/repro/nd# grep "additional_patterns" trace1/METADATA/config.yml
(repro) root@3502d3e935d5:~/repro/nd# grep "additional_patterns" trace2/METADATA/config.yml
(repro) root@3502d3e935d5:~/repro/nd# grep "additional_patterns" trace3/METADATA/config.yml
(repro) root@3502d3e935d5:~/repro/nd# grep "additional_patterns" merged/METADATA/config.yml
additional_patterns:

I am doing this work on a debian stretch docker image.

(repro) root@3502d3e935d5:~/repro/nd# reprozip --version
reprozip version 1.0.10
(repro) root@3502d3e935d5:~/repro/nd# uname -a
Linux 3502d3e935d5 4.9.38-moby #1 SMP Wed Jul 26 10:02:46 UTC 2017 x86_64 GNU/Linux

kaczmarj commented 7 years ago

The config.yml files above (e.g., traceN/METADATA/config.yml) are from the untarred rpz files. Those rpz files were made with reprozip pack. The merged pack file was made with this script.

remram44 commented 7 years ago

Hi @kaczmarj,

Indeed, the reprozip combine command only merges trace files (the trace.sqlite3 files), not packages or configs. After the traces are merged, a new non-canonical configuration file is written. The intent here is that you combine those right after tracing and before generating an RPZ.

The particular use case for this is MPI, where you get multiple traces for all the machines (since they usually use a shared file system), combine them, and then make a single package with the files that were used on any of the machines.

Merging RPZ packages would need a bit more logic than what we currently have:

That is why it is not implemented yet.

You can make your script work by using a little bit of awk to remove the 'additional_patterns' section, since this is currently the only difference between a 'canonical' config file (found in an RPZ) and a 'non-canonical' one (found in .reprozip-trace, for the user to edit). However, be aware that reprozip combine ignores the input configuration files entirely and will generate a default one from the trace information.
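
For illustration, here is a rough Python equivalent of that awk step, in case the merge script is already in Python. It is only a sketch. The function name `strip_additional_patterns` is made up for this example and is not part of reprozip. It also assumes that `additional_patterns:` sits at the top level of config.yml, and that everything up to the next top-level key (list items, indented lines, comments) belongs to that section and can be dropped with it.

```python
import re

# Assumption: a top-level YAML key starts at column 0 with a plain identifier
# followed by a colon. List items ("- ...") and comments ("# ...") do not match.
TOP_LEVEL_KEY = re.compile(r"^[A-Za-z_][\w-]*\s*:")


def strip_additional_patterns(text: str) -> str:
    """Return config text without the top-level 'additional_patterns' section.

    Hypothetical helper, not part of reprozip. Works line by line so that the
    rest of the file (comments, ordering) is left untouched.
    """
    kept = []
    skipping = False
    for line in text.splitlines(keepends=True):
        if line.startswith("additional_patterns:"):
            # Drop the key itself and start skipping its contents.
            skipping = True
            continue
        if skipping and TOP_LEVEL_KEY.match(line):
            # The next top-level key ends the section being removed.
            skipping = False
        if not skipping:
            kept.append(line)
    return "".join(kept)
```

To use it, read merged/METADATA/config.yml, pass the contents through the function, and write the result back before re-creating the pack file. Working on the text directly, rather than loading and re-dumping the YAML, keeps the rest of the file byte-for-byte identical.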