VIDA-NYU / reprozip

ReproZip is a tool that simplifies the process of creating reproducible experiments from command-line executions, a frequently-used common denominator in computational science.
https://www.reprozip.org/
BSD 3-Clause "New" or "Revised" License
305 stars, 34 forks

[DEBIAN-913781] Script accesses internal dpkg database #329

Closed remram44 closed 6 years ago

remram44 commented 6 years ago

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=913781

This package contains a script that directly accesses the dpkg internal database instead of using one of the public interfaces provided by dpkg. The code in «reprozip/tracer/linux_pkgs.py» should be switched to use «dpkg-query --listfiles PKGNAME...». To avoid a performance loss, the code can batch multiple packages in a single call (subject to the command-line length limit); the output is then emitted as separate stanzas, one per package, separated by a blank line (even if a package does not exist).

This is a problem for several reasons. Even though the layout and format of the dpkg database are administrator-friendly, and administrators are expected to need to modify it in an emergency, this “interface” does not extend to programs outside the dpkg suite of tools. The admindir can also be configured differently at dpkg build time or run time. Finally, the contents and their format will be changing in the near future.

A better way would be nice for sure, but I'm not sure it exists. The database changing format will be a problem though.

remram44 commented 6 years ago

On Thu, 2018-11-15 at 00:37:10 -0500, Rémi Rampin wrote:

Upstream developer here. ReproZip needs to match from filename to package, not the other way around. It formerly used dpkg-query -S FILENAME to do this, but this was switched to reading the database directly for performance reasons (exact commit is https://github.com/ViDA-NYU/reprozip/commit/b085c41035959a451d77b0defa8c8b4ba025f47e). When many files need to be looked up, it is more efficient to do a single read through the info/*.list files than to do many such queries.

Ah right sorry, read that code sideways.

If I am missing a fast way to do this using the correct external interface, please let me know, and I will gladly update the code.

The same principle I proposed for --listfiles can be used for --search: you'd just batch as many filenames as can possibly fit within the command-line length limit (ARG_MAX - environment length), so that as few dpkg-query calls as possible are needed. Doing a «dpkg-query --search» call per filename will indeed end up being very expensive.

I didn't know this was possible!
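
The batching suggested above for --search could look roughly like the following. This is a minimal sketch, not ReproZip's actual code: the helper names (`batch_args`, `parse_search_output`, `search_files`) are hypothetical, the byte budget is a deliberately conservative guess derived from ARG_MAX, and the parser assumes the usual `PACKAGE[, PACKAGE...]: PATH` output lines of «dpkg-query --search» under the C locale.

```python
import os
import subprocess


def batch_args(args, budget):
    """Yield lists of arguments whose combined encoded size fits in `budget` bytes."""
    batch, size = [], 0
    for arg in args:
        cost = len(os.fsencode(arg)) + 1  # +1 for the terminating NUL byte
        if batch and size + cost > budget:
            yield batch
            batch, size = [], 0
        batch.append(arg)  # an oversized single argument still gets its own batch
        size += cost
    if batch:
        yield batch


def parse_search_output(output):
    """Map each reported path to the list of packages that own it.

    Assumes lines of the form "pkg1, pkg2: /path"; diversion lines and
    anything else without a ": " separator are skipped.
    """
    owners = {}
    for line in output.splitlines():
        if line.startswith("diversion by ") or ": " not in line:
            continue
        pkgs, path = line.split(": ", 1)
        owners.setdefault(path, []).extend(p.strip() for p in pkgs.split(","))
    return owners


def search_files(filenames, budget=None):
    """Look up the owning packages for many absolute paths in few dpkg-query calls."""
    if budget is None:
        # Assumption: use a quarter of (ARG_MAX - environment size) to leave
        # room for pointer overhead and other per-process limits.
        env_size = sum(len(os.fsencode(k)) + len(os.fsencode(v)) + 2
                       for k, v in os.environ.items())
        budget = max(4096, (os.sysconf("SC_ARG_MAX") - env_size) // 4)
    owners = {}
    for batch in batch_args(filenames, budget):
        # No check=True: dpkg-query exits non-zero when some path has no owner,
        # but stdout still lists the paths that were found.
        proc = subprocess.run(
            ["dpkg-query", "--search"] + batch,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
            env=dict(os.environ, LC_ALL="C"),
        )
        owners.update(parse_search_output(proc.stdout))
    return owners
```

Paths missing from the returned mapping were not claimed by any package. One caveat: --search treats its arguments as patterns, so filenames containing shell wildcard characters such as `*`, `?` or `[` would need separate handling, which this sketch does not attempt.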