Open nuest opened 5 years ago
Files that are only stat'ed (their content is not read) don't get packed, so it is probable that the program actually read all those files. I can see that being a problem size-wise, and you might want to remove them from the configuration file. Unfortunately, ReproZip can't really tell the difference between files "opened and needed" and files "opened just to look".
Special code could be written for the case of fonts, but only if we can figure out which ones actually get used...
I can reproduce it here, matplotlib looks at all the fonts. They actually get read.
I am not sure there is something we can do about it :sweat_smile: Maybe something about the specific parts of the files that are read, or we can hook whatever library is used to enumerate the fonts? Thank you for reporting!
Could it be possible to filter specific files from the rpz? E.g. everything in `~/.local/share/fonts`?
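A minimal sketch of such a filter, applied by hand to the configuration file before packing. It assumes that file entries in `config.yml` appear as YAML list items holding absolute paths (optionally quoted, optionally followed by a `#` comment); `drop_paths_under` is a hypothetical helper, not part of ReproZip.

```python
from pathlib import PurePosixPath


def drop_paths_under(config_lines, prefix):
    """Return the config lines minus list entries whose path lies under `prefix`.

    Assumption: entries look like `  - "/abs/path"  # optional comment`.
    Lines that do not match that shape are kept unchanged.
    """
    prefix = PurePosixPath(prefix)
    kept = []
    for line in config_lines:
        stripped = line.strip()
        if stripped.startswith("- "):
            # Drop a trailing comment and any surrounding quotes.
            entry = stripped[2:].split(" #", 1)[0].strip().strip("\"'")
            if entry.startswith("/"):
                path = PurePosixPath(entry)
                # Match the directory itself or anything below it.
                if path == prefix or prefix in path.parents:
                    continue
        kept.append(line)
    return kept
```

Usage would be along the lines of reading `config.yml` with `splitlines()`, calling `drop_paths_under(lines, "/home/daniel/.local/share/fonts")`, and writing the result back before running the pack step. Comparing whole path components avoids accidentally dropping a sibling directory such as `fonts-extra`.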
The many font files might also be the reason why the created graphs are unusable... and I noticed a few copyrighted fonts there, too, which I shouldn't upload as part of an rpz to a repo.
The file `/home/daniel/.cache/matplotlib/fontList.json` was also on the list of a second run I made, so I investigated: if that file exists, then the font files in `~/.local/share/fonts` are not touched.
Without the font cache, `.local/share/fonts` occurs 330 times in my `config.yml` and 87 packages named `fonts-...` are added to the list.
With the font cache and a manual configuration of the default font only the selected font (`/usr/share/fonts/truetype/msttcorefonts/Arial.ttf`) and
Maybe the recommendation is simply to run the trace twice and define a default font if you use matplotlib?
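As a rough sketch of how one might decide whether a second trace is worthwhile, the following checks whether the font cache already exists before tracing. The default directory and file name are taken from the path reported above and are assumptions: they may differ between matplotlib versions and platforms. `font_cache_exists` and `trace_advice` are hypothetical helpers.

```python
import os


def font_cache_exists(cache_dir="~/.cache/matplotlib", name="fontList.json"):
    """Return True if the matplotlib font cache file is already present.

    Both defaults are assumptions based on the path reported in this thread.
    """
    return os.path.isfile(os.path.join(os.path.expanduser(cache_dir), name))


def trace_advice(cache_dir="~/.cache/matplotlib", name="fontList.json"):
    """Return a short hint on whether the next trace is likely to scan all fonts."""
    if font_cache_exists(cache_dir, name):
        return "font cache present: the trace should not touch every font file"
    return "no font cache: run the experiment once, then trace again"
```

Calling `print(trace_advice())` before `reprozip trace` gives a quick indication of which of the two cases described above applies.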
@remram44 It would be great if you could reproduce my findings. Not sure what reprozip can do better... reducing the size of the rpz is not really my issue, but the understandability of the `config.yml` surely is.
tl;dr By running `reprozip trace` twice and using a default font I could reduce the size of `config.yml` from 10543 to 6492 lines, and the file size of the `.rpz` went from 328 MB to 162 MB.
This might be something ReproZip could look for, e.g. if `matplotlib/fontList.json` gets written, recommend that the user run the trace a second time.
Or even check if the trace writes anything to `~/\.*` and point that out? When a workflow writes something into a config directory in my home, it's probably good for the user to know about it, although it could also confuse them. It is probably better to connect this kind of observation with a clear recommendation, as in the case of `matplotlib/fontList.json`.
This is likely to catch things that write there every time, so we can't recommend that the user run the experiment a second time on all those files.
There could be a warning though, the same as we have for files that are read then overwritten.
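The check discussed here could be sketched roughly as follows: given the paths a run wrote to, flag those that sit under a hidden (dot-prefixed) entry directly in the home directory. This is only an illustration of the idea, not how ReproZip implements its warnings; `hidden_home_writes` and its inputs are hypothetical.

```python
from pathlib import PurePosixPath


def hidden_home_writes(written_paths, home):
    """Return the written paths located under a dot-prefixed entry of `home`."""
    home = PurePosixPath(home)
    flagged = []
    for p in written_paths:
        try:
            rel = PurePosixPath(p).relative_to(home)
        except ValueError:
            # Not inside the home directory at all.
            continue
        # The first component below home decides whether the path is hidden.
        if rel.parts and rel.parts[0].startswith("."):
            flagged.append(p)
    return flagged
```

A warning could then list the flagged paths and leave it to the user to decide whether a second run makes sense, since some programs write to such locations on every run.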
I am trying to pack a Python 2-based model, and see a lot of font-related files in my config.yml.
Note: I have not packed or unzipped yet!
More information below; a "Don't worry about it, storage for fonts is cheap" might be a perfectly fine answer to this issue.
Snippet of `packages`:
Snippet of `other_files`:
I'm roughly doing the following:
I am running this in a virtual environment. If that might shake things up, I'm happy to extend the script above to also do that.