Closed shntnu closed 3 years ago
md5's of the .csv.gz files differ but the data frames are identical
library(tidyverse)
filename <- "2016_04_01_a549_48hr_batch1_consensus_modz.csv.gz"
base_path <- "~/work/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/software/lincs-cell-painting/consensus"
gway_path <- "2016_04_01_a549_48hr_batch1"
shsingh_path <- "2016_04_01_a549_48hr_batch1_shsingh"
gway_file <- file.path(base_path, gway_path, filename)
shsingh_file <- file.path(base_path, shsingh_path, filename)
tools::md5sum(gway_file)
#> /Users/shsingh/work/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/software/lincs-cell-painting/consensus/2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_modz.csv.gz
#>"f3ef52cd3ccf9cce0e2e9713ee11a721"
tools::md5sum(shsingh_file)
#> /Users/shsingh/work/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/software/lincs-cell-painting/consensus/2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_modz.csv.gz
#> "c6838f40b5a39de2f48ec6dc86b4360f"
gway_df <- read_csv(gway_file)
shsingh_df <- read_csv(shsingh_file)
compare::compare(gway_df, shsingh_df)
#> TRUE
Created on 2020-05-29 by the reprex package (v0.3.0)
Simpler test:
No diffs when uncompressed
$ diff <(gzcat 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz) <(gzcat 2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz)
But binaries differ
$ diff 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz 2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz
Binary files 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz and 2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz differ
and that's because of the date
$ file 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz 2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz
2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz: gzip compressed data, was "2016_04_01_a549_48hr_batch1_consensus_median.csv",
last modified: Fri May 22 12:56:15 2020, max compression, original size 157348070
2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz: gzip compressed data, was "2016_04_01_a549_48hr_batch1_consensus_median.csv",
last modified: Sat May 30 00:14:25 2020, max compression, original size 157348070
There must be some git attribute setting to address this
One solution is to specify the diff rule for gz
files in your ~/.gitconfig
(below) and use that in the .gitattributes
(as I have done in this PR):
[diff "gzip"]
binary = true
textconv = /usr/bin/gzcat
@gwaygenomics I just realized that you must have encountered this issue, right? Do you explicitly check for diffs rather than relying on git e.g. diff <(gunzip -c a.csv.gz) <(gunzip -c b.csv.gz)
(there's an additional, separate issue related to this https://github.com/cytomining/pycytominer/issues/82 which affects the GCT file)
I just realized that you must have encountered this issue, right?
Woah, never knew this was a thing. Thanks for digging into it! So, if I'm understanding correctly, the md5sum for the file we provided CLUE will fail to validate?
Woah, never knew this was a thing. Thanks for digging into it! So, if I'm understanding correctly, the md5sum for the file we provided CLUE will fail to validate?
Nope, that aspect of it is fine.
It is actually a somewhat minor but annoying issue.
If you rerun any notebook that produces a file and then gzips it, git will detect the .gz as a modified file because the hashes are different (because the dates of the compressed data are different, even if the contents the file are identical). See below.
We now have a fix! (old notes are collapsed below). We need to use gzip -n
(don't save original file name or time stamp) and we are all set!
mkdir gittest
cd gittest/
git init .
# Initialized empty Git repository in /private/tmp/gittest/.git/
echo a b c > x.txt
file x.txt
# x.txt: ASCII text
gzip -n x.txt
file x.txt.gz
# x.txt.gz: gzip compressed data, from Unix, original size 6
md5sum x.txt.gz
# ca04a6662ec96a20339f793db203b9c6 x.txt.gz
git add x.txt.gz
git commit -m "add file"
# [master (root-commit) 540f687] add file
# 1 file changed, 0 insertions(+), 0 deletions(-)
# create mode 100644 x.txt.gz
echo a b c > x.txt
file x.txt
# x.txt: ASCII text
gzip -n x.txt
# x.txt.gz already exists -- do you wish to overwrite (y or n)? y
file x.txt.gz
# x.txt.gz: gzip compressed data, from Unix, original size 6
md5sum x.txt.gz
# ca04a6662ec96a20339f793db203b9c6 x.txt.gz
git status
# On branch master
# nothing to commit, working tree clean
@gwaygenomics Please see the updates to the previous comment. We need to modify these lines appropriately so that gzip
does not save the original file name nor time stamp (mimic as gzip -n
) and then we are all set.
Unfortunately pandas does not currently support specifying options if the compression method is gzip, so you'd first need to save as CSV and then compress
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
What a coincidence! https://github.com/pandas-dev/pandas/issues/28103
For now, I suggest we take the non-pretty approach from Dan's snippet to address this issue: https://github.com/pandas-dev/pandas/issues/28103#issuecomment-524354731
For now, I suggest we take the non-pretty approach from Dan's snippet to address this issue:
Awesome - great we could track this down! Will this change be made in this PR?
Awesome - great we could track this down! Will this change be made in this PR?
can be replaced with this?
cyto_utils.output(
df=consensus_df,
output_filename=consensus_file,
float_format="%5g",
compression=gzip,
)
I think yes, and if so, then the changes should be made in cyto_utils.output
https://github.com/cytomining/pycytominer/blob/dd064c2185435e19541bafa7c976d55da15cf09e/pycytominer/cyto_utils/output.py#L56-L62
In that case, this PR will do 3 things, after the cyto_utils.output
code is updated:
environment.yml
to point to the version of pycytominer
that has the new cyto_utils.output
gunzip
s all the .gz
files in this repo and then runs gzip -n
on eachbuild-consensus-signatures.{py,ipynb}
to use cyto_utils.output
build-consensus-signatures.ipynb
for neatnessNote that I recommend doing (2) instead of running all the scripts again (except the consensus one because that's easy) – that's too much work!
Bash script
# create gz,csv pairs
for gzip_file in `find . -name "*.csv.gz" `;
do
echo $gzip_file,${gzip_file%.*}
done > /tmp/rename.csv
# re-zip the file
parallel -a /tmp/rename.csv -C "," -L 2 \
"echo gunzip {1} && echo gzip -n {2}"
Yep, we can make that update - indeed i do think it will improve pycytominer.cyto_utils.output
.
In general, though, we want to avoid processing version 1 lincs-cell-painting with multiple pycytominer versions. i.e. the whole image-based-profiling pipeline was processed with pycytominer@dd064c2185435e19541bafa7c976d55da15cf09e, a small change introduced here for a relatively minor improvement would bump the version. So, unless we then reprocess the whole pipeline with the updated hash, it would be tough to track down which pieces were processed with which pycytominer version.
I recommend adding this fix to a version 2 wishlist, and writing the consensus output with the proposed gzip solution in the consensus
notebook.
Is there something I am overlooking here? What do you think?
I updated my previous comment (looks like we were responding simultaneously :D).
My only additional note is that this PR can be part of Version 2; no need to produce the GCT file for Version 1 in any case. Admittedly the GCT and the CSV issue are distinct components, and this PR might be a little too bulky for our liking, but it's not terrible.
@gwaygenomics I don't think I've fully absorbed https://github.com/broadinstitute/lincs-cell-painting/pull/48#issuecomment-636473468. But given that this PR will go in version 2, is it ok I delay looking into this further? Or is that blocking you on your immediate goals?
From https://github.com/broadinstitute/lincs-cell-painting/pull/48#issuecomment-636473468 @gwaygenomics said
I recommend ... writing the consensus output with the proposed gzip solution in the
consensus
notebook.
Although pycytominer
now has the fix needed https://github.com/cytomining/pycytominer/issues/83#issuecomment-754640470, I'll do what's suggested above to avoid bumping up the version of pycytominer
.
This PR will do 2 things:
build-consensus-signatures.ipynb
for neatness~gunzip
s all the .gz
files in this repo and then runs gzip -n
on eachBash script
# create gz,csv pairs
for gzip_file in `find . -name "*.csv.gz" `;
do
echo $gzip_file,${gzip_file%.*}
done > /tmp/rename.csv
# re-zip the file
parallel -a /tmp/rename.csv -C "," -L 2 \
"echo gunzip {1} && echo gzip -n {2}"
a quick heads up @shntnu - I am running through the spherize operations and I noticed a couple things:
pycytominer.normalize
.What should we do? If we update pycytominer version, we'll likely need to reprocess everything. It'll also enable us to output gzip profiles without timestamps
- Add a bash script that
gunzip
s all the.gz
files in this repo and then runsgzip -n
on each
This is not a good fix either, because it seems to introduce other, undesirable metadata (see metadata of new file)
Original file, i.e. before running the script:
file 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz
2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz:
gzip compressed data, was "2016_04_01_a549_48hr_batch1_consensus_median.csv",
last modified: Tue Mar 9 14:19:07 2021,
max compression,
original size modulo 2^32 157348070
New file, i.e.after running the script:
file 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz
2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz:
gzip compressed data,
from Unix,
original size modulo 2^32 157348070 gzip compressed data,
unknown method,
has CRC,
extra field,
has comment,
encrypted,
from FAT filesystem (MS-DOS, OS/2, NT),
original size modulo 2^32 157348070
@gwaygenomics Can you run file
on .csv.gz
file that was created using the updated output.py
https://github.com/cytomining/pycytominer/pull/119 and paste it here.
If that does not have this extra metadata, it wouldn't make sense for us to include this manual gzip -n
fix. Instead, I will push the changes as is (without the fix) and then actually re-run it later.
Actually, coming to think of it, there's no upside to including the fix, because our next version will anyway have the updated pycytominer. I'll ponder and ping again.
There's literally no upside.
I verified that this command did produce any outputs.
# copy newly create files to a new folder
cp -r 2016_04_01_a549_48hr_batch1 2016_04_01_a549_48hr_batch1_shsingh
# restore old files
git stash
# do a diff on gz contents
(ls 2016_04_01_a549_48hr_batch1/*.csv.gz) | parallel basename {} | parallel "diff <(gzcat 2016_04_01_a549_48hr_batch1/{}) <(gzcat 2016_04_01_a549_48hr_batch1_shsingh/{})"
We are all set.
In a future version, we will re-run using the update output.py
. Until then, re-running the notebook (ipython scripts/nbconverted/build-consensus-signatures.py
) will produce .csv.gz
files that differ from the repo, because of the timestamp issue. The profiles
module of this repo will have the same issue.
LGTM - my only question is: do we want to perform feature selection before creating the .gct?
Ah yes, we should! Will do
oh, @shntnu - I just realized you can also address #49 in this PR! It should be a simple fix to add MOA and target variables to the replicate_cols
variable
a quick heads up @shntnu - I am running through the spherize operations and I noticed a couple things:
- In the current pycytominer version in this repo, we still use the terms "blacklist" and "whiten" (we addressed this in cytomining/pycytominer#102)
- There is no current way to alter epsilon in the spherize implementation of
pycytominer.normalize
.What should we do? If we update pycytominer version, we'll likely need to reprocess everything. It'll also enable us to output gzip profiles without timestamps
I'm not too worried about it because we've done all the right things to fix it in the library, and it will make it into this repo in the next version anyway.
https://github.com/cytomining/pycytominer/issues/128. If the default value of epsilion=1e-6
is fine, then we needn't fix that issue right now. How would we know whether it is fine or not? I suppose we can just do it very crudely and empirically for now: do the results improve similar to what we've seen in past analysis by Ted et al.?
epsilion=1e-6
is fine and there's nothing to be done here.gzip
to me is the most annoying thing :D But we can live with it using this solution https://github.com/broadinstitute/lincs-cell-painting/pull/48#issuecomment-636273257
So in all, we don't need to bump the pycytominer version if 2. works out fine
@gwaygenomics bumping to make sure you saw that it's back on your desk to review (no hurry from my end)
Edit: I just realized that you're working on the spherizing thingie, so we need to sort that out first. Ping me if it needs anything from me.
gzip -n
problem https://github.com/broadinstitute/lincs-cell-painting/pull/48#issuecomment-794409806