broadinstitute / lincs-cell-painting

Processed Cell Painting Data for the LINCS Drug Repurposing Project
BSD 3-Clause "New" or "Revised" License

Add --no-name gzip flag to compression file output #50

Closed gwaybio closed 3 years ago

gwaybio commented 4 years ago

We are getting annoying file diffs when reprocessing the pipeline, even if nothing changes in the file. This is important to fix so that we can isolate actual changes that result from reprocessing output data.

As @shntnu notes in #48, the gzip files trigger positive diffs because gzip adds a timestamp to the file header.

The way to remove the timestamp from the file is to pass a --no-name (-n) flag to the gzip command. See http://linuxcommand.org/lc3_man_pages/gzip1.html
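For anyone who wants to see the effect from Python rather than the command line, here is a minimal sketch using only the standard library. It assumes that pinning `mtime=0` and passing an empty `filename` in `gzip.GzipFile` is an acceptable stand-in for `gzip -n`; the helper names and the sample payload are hypothetical and are not part of this repo.

```python
import gzip
import io
import struct


def gzip_bytes(data: bytes, mtime=None) -> bytes:
    """Compress data in memory; mtime=None lets gzip stamp the current time."""
    buffer = io.BytesIO()
    # filename="" keeps the original file name out of the header as well
    with gzip.GzipFile(filename="", mode="wb", fileobj=buffer, mtime=mtime) as handle:
        handle.write(data)
    return buffer.getvalue()


def header_mtime(blob: bytes) -> int:
    """Read the 4-byte little-endian MTIME field at offset 4 of the gzip header."""
    return struct.unpack("<I", blob[4:8])[0]


# Hypothetical payload standing in for a profile file
payload = b"Metadata_Well,feature_1\nA01,0.1\nA02,0.2\n"

stamped = gzip_bytes(payload)            # header carries the current time
pinned_a = gzip_bytes(payload, mtime=0)  # header timestamp pinned to zero
pinned_b = gzip_bytes(payload, mtime=0)

print("default mtime:", header_mtime(stamped))
print("pinned mtime:", header_mtime(pinned_a))
print("pinned outputs identical:", pinned_a == pinned_b)
```

With the timestamp pinned, two runs over the same input should produce byte-identical archives, which is what keeps the diffs quiet.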

Fortunately, it looks like pandas-dev/pandas#33398 has added the ability to pass extra arguments through to pandas gzip compression. This improvement will be included in pandas version 1.1, which is scheduled for an Aug 1 release.
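A rough sketch of what the pandas route could look like, assuming pandas 1.1 or later, where extra keys in the `compression` dict are forwarded to `gzip.GzipFile`. The function name, the file name, and the example data frame are hypothetical, and this is not necessarily how the fix ends up being implemented.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_reproducible_csv(df: pd.DataFrame, path: Path) -> bytes:
    """Write df as a gzip-compressed CSV with a fixed header timestamp; return the file bytes."""
    # Assumes pandas >= 1.1: keys other than "method" are passed on to gzip.GzipFile
    df.to_csv(path, index=False, compression={"method": "gzip", "mtime": 1})
    return Path(path).read_bytes()


# Hypothetical profile data
profiles = pd.DataFrame({"Metadata_Well": ["A01", "A02"], "feature_1": [0.1, 0.2]})

with tempfile.TemporaryDirectory() as tmp:
    output = Path(tmp) / "profiles.csv.gz"
    first = write_reproducible_csv(profiles, output)
    second = write_reproducible_csv(profiles, output)
    round_trip = pd.read_csv(output)

print("identical bytes across writes:", first == second)
```

Because `mtime` is fixed, rewriting the same data to the same file name should not change the bytes on disk, so reprocessing only shows a diff when the data itself changes.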

Three Options

For the pandas or Python option, the solution should ideally live in pycytominer. I've created a stub for this at cytomining/pycytominer#83.

gwaybio commented 3 years ago

fixed in #63

shntnu commented 3 years ago

Awesome!!!