broadinstitute / lincs-cell-painting

Processed Cell Painting Data for the LINCS Drug Repurposing Project
BSD 3-Clause "New" or "Revised" License
25 stars 13 forks source link

Feature selected files available, with more row annotations included #48

Closed shntnu closed 3 years ago

shntnu commented 4 years ago
shntnu commented 4 years ago

md5's of the .csv.gz files differ but the data frames are identical

library(tidyverse)

filename <- "2016_04_01_a549_48hr_batch1_consensus_modz.csv.gz"

base_path <- "~/work/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/software/lincs-cell-painting/consensus"
gway_path <- "2016_04_01_a549_48hr_batch1"
shsingh_path <- "2016_04_01_a549_48hr_batch1_shsingh"

gway_file <- file.path(base_path, gway_path, filename)
shsingh_file <- file.path(base_path, shsingh_path, filename)

tools::md5sum(gway_file)
#> /Users/shsingh/work/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/software/lincs-cell-painting/consensus/2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_modz.csv.gz 
#>"f3ef52cd3ccf9cce0e2e9713ee11a721"
tools::md5sum(shsingh_file)
#> /Users/shsingh/work/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/software/lincs-cell-painting/consensus/2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_modz.csv.gz 
#> "c6838f40b5a39de2f48ec6dc86b4360f"

gway_df <- read_csv(gway_file)
shsingh_df <- read_csv(shsingh_file)
compare::compare(gway_df, shsingh_df)
#> TRUE

Created on 2020-05-29 by the reprex package (v0.3.0)

shntnu commented 4 years ago

Simpler test:

No diffs when uncompressed

$ diff <(gzcat 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz) <(gzcat 2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz)

But binaries differ

$ diff 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz 2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz
Binary files 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz and 2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz differ

and that's because of the date

$ file 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz 2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz
2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz:         gzip compressed data, was "2016_04_01_a549_48hr_batch1_consensus_median.csv", 
last modified: Fri May 22 12:56:15 2020, max compression, original size 157348070
2016_04_01_a549_48hr_batch1_shsingh/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz: gzip compressed data, was "2016_04_01_a549_48hr_batch1_consensus_median.csv", 
last modified: Sat May 30 00:14:25 2020, max compression, original size 157348070

There must be some git attribute setting to address this

shntnu commented 4 years ago

Relevant links

shntnu commented 4 years ago

One solution is to specify the diff rule for gz files in your ~/.gitconfig (below) and use that in the .gitattributes (as I have done in this PR):

[diff "gzip"]
  binary = true
  textconv = /usr/bin/gzcat
shntnu commented 4 years ago

@gwaygenomics I just realized that you must have encountered this issue, right? Do you explicitly check for diffs rather than relying on git e.g. diff <(gunzip -c a.csv.gz) <(gunzip -c b.csv.gz)

shntnu commented 4 years ago

(there's an additional, separate issue related to this https://github.com/cytomining/pycytominer/issues/82 which affects the GCT file)

gwaybio commented 4 years ago

I just realized that you must have encountered this issue, right?

Woah, never knew this was a thing. Thanks for digging into it! So, if I'm understanding correctly, the md5sum for the file we provided CLUE will fail to validate?

shntnu commented 4 years ago

Woah, never knew this was a thing. Thanks for digging into it! So, if I'm understanding correctly, the md5sum for the file we provided CLUE will fail to validate?

Nope, that aspect of it is fine.

It is actually a somewhat minor but annoying issue.

If you rerun any notebook that produces a file and then gzips it, git will detect the .gz as a modified file because the hashes are different (because the dates of the compressed data are different, even if the contents the file are identical). See below.

We now have a fix! (old notes are collapsed below). We need to use gzip -n (don't save original file name or time stamp) and we are all set!

mkdir gittest
cd gittest/
git init .
# Initialized empty Git repository in /private/tmp/gittest/.git/

echo a b c > x.txt
file x.txt
# x.txt: ASCII text
gzip -n x.txt
file x.txt.gz
# x.txt.gz: gzip compressed data, from Unix, original size 6
md5sum x.txt.gz
# ca04a6662ec96a20339f793db203b9c6  x.txt.gz

git add x.txt.gz
git commit -m "add file"
# [master (root-commit) 540f687] add file
#  1 file changed, 0 insertions(+), 0 deletions(-)
#  create mode 100644 x.txt.gz

echo a b c > x.txt
file x.txt
# x.txt: ASCII text
gzip -n x.txt
# x.txt.gz already exists -- do you wish to overwrite (y or n)? y
file x.txt.gz
# x.txt.gz: gzip compressed data, from Unix, original size 6
md5sum x.txt.gz
# ca04a6662ec96a20339f793db203b9c6  x.txt.gz

git status
# On branch master
# nothing to commit, working tree clean
Click to expand old notes ``` $ mkdir gittest $ cd gittest/ $ git init . Initialized empty Git repository in /private/tmp/gittest/.git/ $ echo a b c > x.txt $ file x.txt x.txt: ASCII text $ gzip x.txt $ file x.txt.gz x.txt.gz: gzip compressed data, was "x.txt", last modified: Sat May 30 14:07:46 2020, from Unix, original size 6 $ md5sum x.txt.gz 3ab6b3c300d2e33106f6aa13afe23a60 x.txt.gz $ git add x.txt.gz $ git commit -m "add file" [master (root-commit) 540f687] add file 1 file changed, 0 insertions(+), 0 deletions(-) create mode 100644 x.txt.gz $ echo a b c > x.txt $ file x.txt x.txt: ASCII text $ gzip x.txt x.txt.gz already exists -- do you wish to overwrite (y or n)? y $ file x.txt.gz x.txt.gz: gzip compressed data, was "x.txt", last modified: Sat May 30 14:08:39 2020, from Unix, original size 6 $ md5sum x.txt.gz b7fed9241609e14b7278bf486dece3fd x.txt.gz $ git diff $ git status On branch master Changes not staged for commit: (use "git add ..." to update what will be committed) (use "git restore ..." to discard changes in working directory) modified: x.txt.gz no changes added to commit (use "git add" and/or "git commit -a") $ git diff x.txt.gz diff --git a/x.txt.gz b/x.txt.gz index 8a132e5..6c6122d 100644 Binary files a/x.txt.gz and b/x.txt.gz differ $ cp x.txt.gz x.txt.gz.new $ git stash Saved working directory and index state WIP on master: 540f687 add file $ diff x.txt.gz x.txt.gz.new Binary files x.txt.gz and x.txt.gz.new differ $ diff <(gunzip -c x.txt.gz) <(gunzip -c x.txt.gz.new) (no difference) ``` However, if we do these two things: 1. Append this to **your** `~/.gitconfig` ``` [diff "gzip"] binary = true textconv = /usr/bin/gunzip -c ``` 2. Modify the `.gitattributes` of **this** repo, by changing this line ``` *.gz filter=lfs diff=lfs merge=lfs -text ```` to this ``` *.gz filter=lfs diff=gzip merge=lfs -text ``` then although a git status will show that the gz files have been modified, no diffs will be reported because it will use the `textconv` string to diff the contents. But this still doesn't solve the issue that a user may inadvertently do a `git add .` resulting in all the recreated (but not actually modified) gz files being added to the repo. More pondering.
shntnu commented 4 years ago

@gwaygenomics Please see the updates to the previous comment. We need to modify these lines appropriately so that gzip does not save the original file name nor time stamp (mimic as gzip -n) and then we are all set.

  1. https://github.com/cytomining/pycytominer/blob/dd064c2185435e19541bafa7c976d55da15cf09e/pycytominer/cyto_utils/output.py#L56-L62

  2. https://github.com/broadinstitute/lincs-cell-painting/blob/e6852b49992b4fa2f0d36cc07011856ebee6326a/consensus/scripts/nbconverted/build-consensus-signatures.py#L206-L208

Unfortunately pandas does not currently support specifying options if the compression method is gzip, so you'd first need to save as CSV and then compress

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

shntnu commented 4 years ago

What a coincidence! https://github.com/pandas-dev/pandas/issues/28103

For now, I suggest we take the non-pretty approach from Dan's snippet to address this issue: https://github.com/pandas-dev/pandas/issues/28103#issuecomment-524354731

gwaybio commented 4 years ago

For now, I suggest we take the non-pretty approach from Dan's snippet to address this issue:

Awesome - great we could track this down! Will this change be made in this PR?

shntnu commented 4 years ago

Awesome - great we could track this down! Will this change be made in this PR?

Can this line https://github.com/broadinstitute/lincs-cell-painting/blob/e6852b49992b4fa2f0d36cc07011856ebee6326a/consensus/scripts/nbconverted/build-consensus-signatures.py#L206-L208

can be replaced with this?

cyto_utils.output(
    df=consensus_df,
    output_filename=consensus_file,
    float_format="%5g",
    compression=gzip,
)

I think yes, and if so, then the changes should be made in cyto_utils.output https://github.com/cytomining/pycytominer/blob/dd064c2185435e19541bafa7c976d55da15cf09e/pycytominer/cyto_utils/output.py#L56-L62

In that case, this PR will do 3 things, after the cyto_utils.output code is updated:

  1. Update environment.yml to point to the version of pycytominer that has the new cyto_utils.output
  2. A bash script that gunzips all the .gz files in this repo and then runs gzip -n on each
  3. Update build-consensus-signatures.{py,ipynb} to use cyto_utils.output
  4. Re-run build-consensus-signatures.ipynb for neatness

Note that I recommend doing (2) instead of running all the scripts again (except the consensus one because that's easy) – that's too much work!

Bash script

# create gz,csv pairs
for gzip_file in `find . -name "*.csv.gz" `; 
do 
  echo $gzip_file,${gzip_file%.*}  
done > /tmp/rename.csv

# re-zip the file
parallel -a /tmp/rename.csv -C "," -L 2 \
  "echo gunzip {1} && echo gzip -n {2}"
gwaybio commented 4 years ago

Yep, we can make that update - indeed i do think it will improve pycytominer.cyto_utils.output.

In general, though, we want to avoid processing version 1 lincs-cell-painting with multiple pycytominer versions. i.e. the whole image-based-profiling pipeline was processed with pycytominer@dd064c2185435e19541bafa7c976d55da15cf09e, a small change introduced here for a relatively minor improvement would bump the version. So, unless we then reprocess the whole pipeline with the updated hash, it would be tough to track down which pieces were processed with which pycytominer version.

I recommend adding this fix to a version 2 wishlist, and writing the consensus output with the proposed gzip solution in the consensus notebook.

Is there something I am overlooking here? What do you think?

shntnu commented 4 years ago

I updated my previous comment (looks like we were responding simultaneously :D).

My only additional note is that this PR can be part of Version 2; no need to produce the GCT file for Version 1 in any case. Admittedly the GCT and the CSV issue are distinct components, and this PR might be a little too bulky for our liking, but it's not terrible.

shntnu commented 4 years ago

@gwaygenomics I don't think I've fully absorbed https://github.com/broadinstitute/lincs-cell-painting/pull/48#issuecomment-636473468. But given that this PR will go in version 2, is it ok I delay looking into this further? Or is that blocking you on your immediate goals?

shntnu commented 3 years ago

From https://github.com/broadinstitute/lincs-cell-painting/pull/48#issuecomment-636473468 @gwaygenomics said

I recommend ... writing the consensus output with the proposed gzip solution in the consensus notebook.

Although pycytominer now has the fix needed https://github.com/cytomining/pycytominer/issues/83#issuecomment-754640470, I'll do what's suggested above to avoid bumping up the version of pycytominer.

shntnu commented 3 years ago

This PR will do 2 things:

  1. ~Re-run build-consensus-signatures.ipynb for neatness~
  2. Add a bash script that gunzips all the .gz files in this repo and then runs gzip -n on each
  3. Run the bash script

Bash script

# create gz,csv pairs
for gzip_file in `find . -name "*.csv.gz" `; 
do 
  echo $gzip_file,${gzip_file%.*}  
done > /tmp/rename.csv

# re-zip the file
parallel -a /tmp/rename.csv -C "," -L 2 \
  "echo gunzip {1} && echo gzip -n {2}"
gwaybio commented 3 years ago

a quick heads up @shntnu - I am running through the spherize operations and I noticed a couple things:

  1. In the current pycytominer version in this repo, we still use the terms "blacklist" and "whiten" (we addressed this in cytomining/pycytominer#102)
  2. There is no current way to alter epsilon in the spherize implementation of pycytominer.normalize.

What should we do? If we update pycytominer version, we'll likely need to reprocess everything. It'll also enable us to output gzip profiles without timestamps

shntnu commented 3 years ago
  • Add a bash script that gunzips all the .gz files in this repo and then runs gzip -n on each

This is not a good fix either, because it seems to introduce other, undesirable metadata (see metadata of new file)

Original file, i.e. before running the script:

file 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz

2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz: 
gzip compressed data, was "2016_04_01_a549_48hr_batch1_consensus_median.csv", 
last modified: Tue Mar  9 14:19:07 2021, 
max compression, 
original size modulo 2^32 157348070

New file, i.e.after running the script:

file 2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz

2016_04_01_a549_48hr_batch1/2016_04_01_a549_48hr_batch1_consensus_median.csv.gz:
gzip compressed data, 
from Unix, 
original size modulo 2^32 157348070 gzip compressed data, 
unknown method, 
has CRC, 
extra field, 
has comment, 
encrypted, 
from FAT filesystem (MS-DOS, OS/2, NT), 
original size modulo 2^32 157348070

@gwaygenomics Can you run file on .csv.gz file that was created using the updated output.py https://github.com/cytomining/pycytominer/pull/119 and paste it here.

If that does not have this extra metadata, it wouldn't make sense for us to include this manual gzip -nfix. Instead, I will push the changes as is (without the fix) and then actually re-run it later.

shntnu commented 3 years ago

Actually, coming to think of it, there's no upside to including the fix, because our next version will anyway have the updated pycytominer. I'll ponder and ping again.

shntnu commented 3 years ago

There's literally no upside.

I verified that this command did produce any outputs.

# copy newly create files to a new folder
cp -r 2016_04_01_a549_48hr_batch1 2016_04_01_a549_48hr_batch1_shsingh

# restore old files
git stash

# do a diff on gz contents
(ls 2016_04_01_a549_48hr_batch1/*.csv.gz) | parallel basename {} | parallel   "diff <(gzcat 2016_04_01_a549_48hr_batch1/{}) <(gzcat 2016_04_01_a549_48hr_batch1_shsingh/{})"

We are all set.

In a future version, we will re-run using the update output.py. Until then, re-running the notebook (ipython scripts/nbconverted/build-consensus-signatures.py) will produce .csv.gz files that differ from the repo, because of the timestamp issue. The profiles module of this repo will have the same issue.

shntnu commented 3 years ago

LGTM - my only question is: do we want to perform feature selection before creating the .gct?

Ah yes, we should! Will do

gwaybio commented 3 years ago

oh, @shntnu - I just realized you can also address #49 in this PR! It should be a simple fix to add MOA and target variables to the replicate_cols variable

shntnu commented 3 years ago

a quick heads up @shntnu - I am running through the spherize operations and I noticed a couple things:

  1. In the current pycytominer version in this repo, we still use the terms "blacklist" and "whiten" (we addressed this in cytomining/pycytominer#102)
  2. There is no current way to alter epsilon in the spherize implementation of pycytominer.normalize.

What should we do? If we update pycytominer version, we'll likely need to reprocess everything. It'll also enable us to output gzip profiles without timestamps

  1. I'm not too worried about it because we've done all the right things to fix it in the library, and it will make it into this repo in the next version anyway.

  2. https://github.com/cytomining/pycytominer/issues/128. If the default value of epsilion=1e-6 is fine, then we needn't fix that issue right now. How would we know whether it is fine or not? I suppose we can just do it very crudely and empirically for now: do the results improve similar to what we've seen in past analysis by Ted et al.?

  3. gzip to me is the most annoying thing :D But we can live with it using this solution https://github.com/broadinstitute/lincs-cell-painting/pull/48#issuecomment-636273257

So in all, we don't need to bump the pycytominer version if 2. works out fine

shntnu commented 3 years ago

@gwaygenomics bumping to make sure you saw that it's back on your desk to review (no hurry from my end)

Edit: I just realized that you're working on the spherizing thingie, so we need to sort that out first. Ping me if it needs anything from me.