VIDA-NYU / reprozip

ReproZip is a tool that simplifies the process of creating reproducible experiments from command-line executions, a frequently-used common denominator in computational science.
https://www.reprozip.org/
BSD 3-Clause "New" or "Revised" License

could other_files be not dereferenced? #390

Closed: yarikoptic closed this issue 1 year ago

yarikoptic commented 1 year ago
❯ reprozip trace -d /tmp/reprozip-trace-`git describe` /bin/ls -l
...
Configuration file written in /tmp/reprozip-trace-1.1-274-gcac47797/config.yml
Edit that file then run the packer -- use 'reprozip pack -h' for help

❯ grep /bin/ls /tmp/reprozip-trace-1.1-274-gcac47797/*
/tmp/reprozip-trace-1.1-274-gcac47797/config.yml:  - /bin/ls
/tmp/reprozip-trace-1.1-274-gcac47797/config.yml:  binary: /bin/ls
/tmp/reprozip-trace-1.1-274-gcac47797/config.yml:  - "/usr/bin/ls" # 143.80 KB
grep: /tmp/reprozip-trace-1.1-274-gcac47797/trace.sqlite3: binary file matches

So running /bin/ls results in /usr/bin/ls being listed under other_files in config.yml, even though /usr/bin/ls is never explicitly executed by /bin/ls:

❯ strace -f -o /tmp/ls-strace.log /bin/ls -l > /dev/null
❯ grep bin/ls /tmp/ls-strace.log
1952259 execve("/bin/ls", ["/bin/ls", "-l"], 0x7fffe83bf550 /* 101 vars */) = 0
❯ md5sum {,/usr}/bin/ls
26f446f5c92841a0ba71dd041011baa7  /bin/ls
26f446f5c92841a0ba71dd041011baa7  /usr/bin/ls

This is most likely simply because on Debian systems /bin is now a symlink:

❯ ls -l /bin
lrwxrwxrwx 1 root root 7 Nov  5  2019 /bin -> usr/bin/

That would be fine for my purposes (tracing in reproman) if dpkg could resolve /usr/bin/ls to a package, but it cannot:

❯ dpkg -S /bin/ls /usr/bin/ls
coreutils: /bin/ls
dpkg-query: no path found matching pattern /usr/bin/ls

So I wonder: could tracing avoid dereferencing paths, or is that unavoidable?
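
One possible workaround on the consumer side, independent of what reprozip records, is to expand each canonical path into the other spellings that reach it through a symlinked directory and to try each of them with dpkg -S. The sketch below is only an illustration: the function name alias_candidates and the dir_symlinks mapping are hypothetical, and the mapping is assumed to hold absolute targets (for example /bin resolving to /usr/bin).

```python
import posixpath


def alias_candidates(canonical_path, dir_symlinks):
    """Return canonical_path followed by its spellings through symlinked directories.

    dir_symlinks is a hypothetical mapping from a symlinked directory to its
    absolute target, e.g. {"/bin": "/usr/bin"}.
    """
    candidates = [canonical_path]
    for link, target in dir_symlinks.items():
        prefix = posixpath.normpath(target) + "/"
        if canonical_path.startswith(prefix):
            # Re-root the part below the target directory under the symlink.
            candidates.append(posixpath.join(link, canonical_path[len(prefix):]))
    return candidates


print(alias_candidates("/usr/bin/ls", {"/bin": "/usr/bin"}))
```

Each candidate could then be passed to dpkg -S in turn until one of them matches a package.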

remram44 commented 1 year ago

reprozip writes a list of files, with their canonical names. There is no /bin/ls on disk, there is only a symlink /bin and a single file /usr/bin/ls.

If you want to know which files were executed, the trace has the correct information (in the executed_files table). The /bin symlink will also be automatically included in the other_files since it has to exist for /bin/ls to be reached.
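
As a rough sketch of reading that table directly, the Python below lists the distinct executed paths from a trace database. The table name executed_files comes from the comment above; the column name `name` and the helper executed_paths are assumptions, so check the actual schema (for example with `.schema` in the sqlite3 shell) before relying on it.

```python
import sqlite3


def executed_paths(trace_db):
    """Return the sorted, distinct paths found in the executed_files table.

    Assumes the table has a text column called `name` holding the path.
    """
    conn = sqlite3.connect(trace_db)
    try:
        rows = conn.execute("SELECT DISTINCT name FROM executed_files").fetchall()
    finally:
        conn.close()
    return sorted(name for (name,) in rows)


# Example call, using the trace directory from the report above:
# executed_paths("/tmp/reprozip-trace-1.1-274-gcac47797/trace.sqlite3")
```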

I am not sure I understand your use-case. The point of config.yml is to list the files to be packed, and /bin/ls is not an existing file. Could you perhaps say a bit more about what you want to do?

remram44 commented 1 year ago

I would add that if you want information that is not intended for creating an RPZ, config.yml might be the wrong file to read. Reading trace.sqlite3 is probably more appropriate; you can even adapt the get_files() function to build the list you want from it. My version of that function does other things you might not want, such as applying filters that silently remove files we don't want to pack (for example .pyc files).
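
A minimal sketch of such an adaptation, which does not reproduce get_files() itself, could read the recorded file accesses straight from the database and apply no filters at all. The table name opened_files and the columns `name` and `is_directory` are assumptions about the trace schema, and opened_paths is a hypothetical helper.

```python
import sqlite3


def opened_paths(trace_db, include_directories=False):
    """Return the sorted, distinct paths recorded as opened, unfiltered.

    Assumes a table `opened_files` with columns `name` and `is_directory`.
    """
    query = "SELECT DISTINCT name FROM opened_files"
    if not include_directories:
        query += " WHERE is_directory = 0"
    conn = sqlite3.connect(trace_db)
    try:
        # sorted() consumes the cursor before the connection is closed.
        return sorted(name for (name,) in conn.execute(query))
    finally:
        conn.close()
```

Because nothing is filtered here, files such as .pyc files stay in the result, which may be what a tool other than the packer wants.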