We are getting annoying file diff triggers when reprocessing the pipeline, even if nothing changes in the file. This is important to fix so that we can isolate the actual changes that result from reprocessing output data.
As @shntnu notes in #48, the gzip files trigger positive diffs because of a timestamp that gzip adds to the file.
The way to remove the timestamp from the file is to pass the --no-name (-n) flag to the gzip command. See http://linuxcommand.org/lc3_man_pages/gzip1.html
Fortunately, it looks like pandas-dev/pandas#33398 has added the ability to pass args to pandas gzip compression. This improvement will be included in pandas version 1.1, which is scheduled for an Aug 1 release.
Three Options
pandas v1.1 option (assuming that it solves this problem!)
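A minimal sketch of what the pandas option might look like, assuming pandas >= 1.1 and assuming that extra keys in the compression dict are forwarded to Python's gzip.GzipFile (whether this fully solves the diff problem is still to be confirmed). The helper name write_gzip_csv is hypothetical and not an existing pycytominer function.

```python
import pandas as pd


def write_gzip_csv(df: pd.DataFrame, path: str) -> None:
    """Write df as a gzip-compressed CSV with a fixed header timestamp.

    Assumption: pandas >= 1.1, where keys other than "method" in the
    compression dict are passed through to gzip.GzipFile. A fixed mtime
    keeps the current time out of the gzip header.
    """
    df.to_csv(
        path,
        index=False,
        compression={"method": "gzip", "mtime": 0},
    )
```

Writing the same data frame to the same path twice should then give byte-identical files. The gzip header can also store the file name, so output written under a different name may still differ in its header.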
For the pandas or python option, the solution should ideally live in pycytominer. I've created a stub for this at cytomining/pycytominer#83
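For the python option, here is a rough sketch using only the standard library gzip module. It shows where the timestamp lives (the 4-byte MTIME field at offset 4 of the gzip header) and how a fixed mtime makes the output reproducible. The helper names gzip_bytes and header_mtime are hypothetical and only for illustration; they are not part of pycytominer.

```python
import gzip
import io


def gzip_bytes(payload: bytes, mtime=None) -> bytes:
    """Compress payload in memory.

    With mtime=None, gzip stamps the current time into the header, which
    is what makes otherwise identical outputs differ between runs.
    """
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb", mtime=mtime) as handle:
        handle.write(payload)
    return buffer.getvalue()


def header_mtime(compressed: bytes) -> int:
    """Return the little-endian MTIME field stored at bytes 4..7."""
    return int.from_bytes(compressed[4:8], "little")


if __name__ == "__main__":
    data = b"a,b\n1,2\n"
    # A fixed mtime gives identical bytes on every run.
    print(gzip_bytes(data, mtime=0) == gzip_bytes(data, mtime=0))
    # The default embeds the current time in the header.
    print(header_mtime(gzip_bytes(data)))
```

Passing mtime=0 plays the same role for the timestamp as the -n flag on the gzip command line.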