Closed: pittacus closed 7 years ago
Hi @pittacus,
Hm, that's certainly an unexpected result. I'll look into it and get back to you; give me a couple days because I'm out for the next two.
-Oana
Hi @pittacus,
Two things:
1) Cause of increased parse time for metadata only
After timing different parts of parse, here's what I found: the increased time for reading in only the metadata to a GCToo instance is because (for compatibility with other code) when meta_only = True, the data_df attribute of a GCToo instance is actually an empty pandas DataFrame (with index = row/gene ids, and columns = column/sample ids). It seems that in this case, pandas making a DataFrame of NaNs takes longer than just reading in the data matrix from the original gctx file. Does that make sense?
I do agree that it's unusual behavior; at least in my analyses, the pattern you observed no longer holds for larger data matrices. However, for now we'd like to keep compatibility with other code (and I can't do much about pandas DataFrame performance), so unfortunately I can't fix this at the moment.
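To make the empty-DataFrame point concrete, here is a minimal sketch of building a frame from ids alone and timing it. This is not the actual cmapPy code; the helper name `build_empty_frame`, the id strings, and the matrix size are made up for illustration, and the timing will vary by machine and pandas version.

```python
import time

import pandas as pd


def build_empty_frame(row_ids, col_ids):
    """Build a DataFrame from ids only; with no data passed, every cell is NaN."""
    # Hypothetical helper for illustration, not part of cmapPy.
    return pd.DataFrame(index=row_ids, columns=col_ids)


# Made-up ids standing in for the row/gene ids and column/sample ids.
row_ids = [f"gene_{i}" for i in range(1000)]
col_ids = [f"sample_{j}" for j in range(200)]

start = time.perf_counter()
empty_df = build_empty_frame(row_ids, col_ids)
elapsed = time.perf_counter() - start

print(empty_df.shape)               # (rows, columns) taken from the ids
print(empty_df.isna().all().all())  # every cell is NaN
print(f"construction took {elapsed:.4f}s")
```

Timing this next to a plain read of the data matrix on your own file should show which of the two steps dominates at a given matrix size.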
2) That being said, the code you show above doesn't actually get you the metadata you want!
You're actually only reading in the row & column ids because for the GEO data, we've deposited metadata separate from the data matrices (long story) as tab-delimited text files. If you can clarify what you're trying to do, I'm happy to point you to the appropriate metadata (there are 7 different metadata tables available at the GSE92742 accession). Otherwise/regardless, I recently made a tutorial specific to interacting with GEO data; the Table of Contents links aren't working properly on GitHub at the moment, but the section relevant to your workflow is Parse in metadata only/How to parse in metadata only from GEO.
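Since the GEO metadata lives in separate tab-delimited text files, a rough sketch of reading one with pandas and lining it up with the column ids might look like the following. The file contents, the field names (`sample_id`, `cell`, `dose`), and the ids are all invented for this example and do not reflect the actual GSE92742 tables; in practice you would pass the path of the downloaded file to `pd.read_csv` instead of a `StringIO` object.

```python
from io import StringIO

import pandas as pd

# Invented tab-delimited metadata standing in for a downloaded GEO text file.
metadata_text = (
    "sample_id\tcell\tdose\n"
    "s1\tA\t10\n"
    "s2\tB\t5\n"
    "s3\tA\t1\n"
)

# Read the table, using the (hypothetical) id field as the index.
col_meta = pd.read_csv(StringIO(metadata_text), sep="\t", index_col="sample_id")

# Column ids as they might come back from a metadata-only parse (made up here).
col_ids = ["s3", "s1", "s2"]

# Reorder the metadata rows to match the order of the column ids.
aligned = col_meta.loc[col_ids]
print(aligned)
```

Indexing with `.loc` raises a `KeyError` if an id is missing from the metadata table, which is a useful early check that the table matches the data matrix.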
Hope that helps! Thanks for your feedback and definitely let me know if you run into any other issues/bugs.
Best, Oana
In the Jupyter notebook cmapPy_pandasGEXpress_tutorial.ipynb, I tried the following code:
But the result on my computer is strange. The execution time of parse(gctx_fn, meta_only=True) is much slower than that of parse(gctx_fn). Here is the result on my side.