cmap / cmapPy

Assorted tools for interacting with .gct, .gctx files and other Connectivity Map (Broad Institute) data/tools
https://clue.io/cmapPy/index.html
BSD 3-Clause "New" or "Revised" License

Weird performance in parsing gctx file #1

Closed pittacus closed 7 years ago

pittacus commented 7 years ago

In the Jupyter notebook cmapPy_pandasGEXpress_tutorial.ipynb, I tried the following code:

from cmapPy.pandasGEXpress import parse
gctx_fn="GSE92742_Broad_LINCS_Level2_GEX_delta_n49216x978.gctx"
%time my_col_metadata = parse(gctx_fn, meta_only=True)
%time gctx = parse(gctx_fn)

But the result on my computer is strange: parse(gctx_fn, meta_only=True) takes much longer than parse(gctx_fn). Here are the timings on my side.

%time my_col_metadata = parse(gctx_fn, meta_only=True)
CPU times: user 4.06 s, sys: 462 ms, total: 4.52 s
Wall time: 4.52 s

%time gctx = parse(gctx_fn)
CPU times: user 67.2 ms, sys: 166 ms, total: 234 ms
Wall time: 887 ms
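
Outside a notebook, the same kind of comparison can be sketched with the standard library. The helper name `timed` is hypothetical, and the workload below is a stand-in so the snippet runs without cmapPy or the GEO file. Note that `time.perf_counter` reports wall-clock time only, not the user/sys CPU split shown above.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload (an assumption for illustration, not the gctx parse itself).
total, seconds = timed(sum, range(1_000_000))
print(f"sum took {seconds:.4f} s")
```

With cmapPy installed and the file available, the two calls from this comment would be timed as `timed(parse, gctx_fn, meta_only=True)` and `timed(parse, gctx_fn)`.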

oena commented 7 years ago

Hi @pittacus,

Hm, that's certainly an unexpected result. I'll look into it and get back to you; give me a couple days because I'm out for the next two.

-Oana

oena commented 7 years ago

Hi @pittacus,

Two things:

1) Cause of increased parse time for metadata only

After timing different parts of parse, here's what I found: when meta_only=True, the data_df attribute of the returned GCToo instance is (for compatibility with other code) an empty pandas DataFrame, with index = row/gene ids and columns = column/sample ids. In this case, it seems that having pandas build a DataFrame of NaNs takes longer than simply reading in the data matrix from the original gctx file. Does that make sense?

I do agree that it's unusual behavior; at least in my analyses, the pattern you observed no longer holds for larger data matrices. However, for now we'd like to keep compatibility with other code (and I can't do much about the performance of pandas DataFrames), so unfortunately I can't fix this at the moment.
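
To make point 1 concrete, here is a minimal sketch of two ways a labelled all-NaN DataFrame can be built in pandas, with a small timer around each. This is not cmapPy's actual code: the function names, the ids, and the matrix size are assumptions chosen for illustration, and which variant is faster may depend on the pandas version and the shape.

```python
import time

import numpy as np
import pandas as pd


def placeholder_from_labels(row_ids, col_ids):
    """Labels only: pandas fills every cell with NaN (typically object dtype)."""
    return pd.DataFrame(index=row_ids, columns=col_ids)


def placeholder_from_array(row_ids, col_ids):
    """Allocate a float64 NaN array first, then attach the labels."""
    data = np.full((len(row_ids), len(col_ids)), np.nan)
    return pd.DataFrame(data, index=row_ids, columns=col_ids)


def time_build(builder, row_ids, col_ids):
    """Return (DataFrame, elapsed wall-clock seconds) for one builder call."""
    start = time.perf_counter()
    df = builder(row_ids, col_ids)
    return df, time.perf_counter() - start


# Hypothetical ids; far fewer columns than the 49216 in the GEO file.
row_ids = [f"gene_{i}" for i in range(978)]
col_ids = [f"sample_{i}" for i in range(2000)]

for builder in (placeholder_from_labels, placeholder_from_array):
    df, seconds = time_build(builder, row_ids, col_ids)
    print(builder.__name__, df.shape, f"{seconds:.4f} s")
```

Either way, the placeholder costs time proportional to rows × columns even though no data is read from the file, which is consistent with the metadata-only parse not being free.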

2) That being said, the code you show above doesn't actually get you the metadata you want!

You're actually only reading in the row & column ids, because for the GEO data we've deposited the metadata separately from the data matrices (long story), as tab-delimited text files. If you can clarify what you're trying to do, I'm happy to point you to the appropriate metadata (there are 7 different metadata tables available at the GSE92742 accession). Otherwise/regardless, I recently made a tutorial specific to interacting with GEO data; the Table of Contents links aren't working properly on GitHub at the moment, but the section relevant to your workflow is Parse in metadata only/How to parse in metadata only from GEO.
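
As a rough sketch of what working with one of those tab-delimited tables could look like in pandas (this is not taken from the tutorial; the function names, the column names `inst_id`, `pert_iname`, `cell_id`, and the in-memory example table are all assumptions for illustration):

```python
import io

import pandas as pd


def load_metadata(source, id_col):
    """Read a tab-delimited metadata table, indexed by its id column."""
    return pd.read_csv(source, sep="\t", index_col=id_col, dtype=str)


def align_metadata(meta, ids):
    """Reorder metadata rows to match `ids`; ids absent from the table become NaN rows."""
    return meta.reindex(ids)


# Tiny in-memory stand-in for a metadata file (hypothetical columns and values).
example = io.StringIO(
    "inst_id\tpert_iname\tcell_id\n"
    "s1\tDMSO\tA375\n"
    "s2\tvorinostat\tMCF7\n"
)
meta = load_metadata(example, id_col="inst_id")

# In practice `ids` would be the column ids obtained from the gctx file.
aligned = align_metadata(meta, ["s2", "s1"])
print(aligned)
```

`pd.read_csv` also accepts a file path in place of the `StringIO` object, and it infers gzip compression from a `.gz` extension, so a downloaded metadata file can be passed directly.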

Hope that helps! Thanks for your feedback and definitely let me know if you run into any other issues/bugs.

Best, Oana