dhimmel / lincs

Library of Integrated Cellular Signatures L1000
https://think-lab.github.io/d/43/

Download modzs.gctx #3

Status: Closed (stuppie closed this issue 7 years ago)

stuppie commented 7 years ago

I can't figure out where to get modzs.gctx, which is needed to construct the signature dataframe sig_expr_df in consensi.ipynb. From here, you say:

The z-score signature vectors are retrieved from the /xchip/cogs/data/build/a2y13q1/modzs.gctx file on the C3 cloud.

But this was 2 years ago and the link doesn't work anymore. Also, I'm not sure exactly what this file is or how it was generated.

I appreciate your help in advance!

dhimmel commented 7 years ago

@stuppie, It looks like the lincscloud website is no longer functional. I'll upload this file to figshare.

dhimmel commented 7 years ago

The file is 42.5 GB which exceeds my figshare quota. I sent figshare an email to see if they can make an exception. In the meantime, I'm running an aggressive compression:

xz --extreme -9 --threads=0 --verbose --keep modzs.gctx

I expect the size reduction to be small, however (~15%), since the file is already compressed, hence the x in the .gctx extension.
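Before committing to a full run on a 42.5 GB file, the expected savings can be estimated on a sample. The following is a minimal sketch using Python's standard `lzma` module with the same preset as the command above (`-9 --extreme`). The function name `xz_savings`, the `sample_bytes` parameter, and the chunk size are illustrative assumptions, not part of this repository. Note that the standard library compressor is single-threaded, unlike `xz --threads=0`, and that the first bytes of a file may not be representative of the whole.

```python
import lzma

CHUNK = 1 << 20  # read 1 MiB at a time so memory use stays bounded


def xz_savings(path, sample_bytes=64 * CHUNK, preset=9 | lzma.PRESET_EXTREME):
    """Return the fraction of space saved by xz on the first sample_bytes of path.

    0.15 means the compressed sample is 15% smaller than the original.
    Hypothetical helper for illustration only.
    """
    compressor = lzma.LZMACompressor(format=lzma.FORMAT_XZ, preset=preset)
    read = written = 0
    with open(path, "rb") as f:
        while read < sample_bytes:
            chunk = f.read(min(CHUNK, sample_bytes - read))
            if not chunk:  # reached end of file before the sample limit
                break
            read += len(chunk)
            written += len(compressor.compress(chunk))
    written += len(compressor.flush())  # emit any buffered output and the footer
    if read == 0:
        return 0.0
    return 1.0 - written / read
```

For example, `xz_savings("modzs.gctx")` would report the savings on the first 64 MiB of the file. A value near zero suggests the data is already compressed and that a long `xz` run is unlikely to help much.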

stuppie commented 7 years ago

Great. Thanks. Can you tell me what this file is exactly? Does it contain the CD (characteristic directions) / Z-scores / etc. for all perturbations across all 22k probes?

Is this data the same as what is present here: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE70138

dhimmel commented 7 years ago

My understanding is that modzs.gctx is the LINCS L1000 data at the SIG stage in the following pipeline:

[Figure: l1000_data_flow]

In other words, modzs.gctx is a LINCS L1000 legacy dataset of differential expression signatures. It contains a matrix of signatures and probes. Each value is a differential expression z-score. This file belongs in the download directory of this repository, but was not uploaded to GitHub due to its large size. See the "Differential Expression (Signature Generation)" section of this help page for more information on signatures.

modzs.gctx can be read into Python using cmap/l1ktools. The cmap directory of this repository is copied from cmap/l1ktools, with perhaps some small modifications (I forget; I should use a submodule next time).

@stuppie, does this make sense? If you are just looking for the consensus signatures we generated, you can download those on figshare.

The GEO SuperSeries seems to correspond to the Level 4 data, although I'm not sure what, if any, differences there are. If you are starting fresh work, I assume the L1000 team would prefer that you use the official GEO datasets. But for reproducibility and extensibility of this repository, I'll work on making modzs.gctx available.

dhimmel commented 7 years ago

I posted modzs.gctx to figshare. Thanks to figshare for temporarily raising their file size limit and allowing this upload!

@stuppie, let me know if you need anything else. In general, I advise working with the output datasets from this repository or the raw production LINCS L1000 data from GEO, since I'm not sure whether the LINCS L1000 team still supports the use of modzs.gctx.