cmap / cmapPy

Assorted tools for interacting with .gct, .gctx files and other Connectivity Map (Broad Institute) data/tools
https://clue.io/cmapPy/index.html
BSD 3-Clause "New" or "Revised" License

parse_gctx: don't sort returned values #36

Open dllahr opened 6 years ago

dllahr commented 6 years ago

Hi @oena @levlitichev

I was thinking about opening a pull request that modifies parse_gctx so it no longer returns the dataframes sorted by index/column. My reasoning: if the rows and columns come back in the order they appear in the file, you can pick the ones you are interested in, work out their positional indices, and then load just those with the ridx/cidx options, which is much faster.

Alternatively, sorting could be made an option. What do you think?

oena commented 6 years ago

Hi @dllahr! Not sure I totally follow. Do you mean just for the metadata-only options? Otherwise the IDs are subsetted before hyperslab selection occurs.

dllahr commented 6 years ago

Sorry, no, I mean that right now when you get the metadata back (and I think when you get everything back) all of the IDs have been sorted. The use case I ran into was:

  1. got just the row metadata back
  2. identified the overlap between the genes I wanted and those that were present
  3. identified the indices of the genes I wanted in the row metadata
  4. attempted to load them using the ridx option and got a completely different set of genes
  5. realized that the row metadata had been sorted, rather than returned in the order it appears in the file
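
The mismatch in steps 3 to 5 can be reproduced without cmapPy at all. The sketch below is plain Python with made-up gene IDs; `file_order_ids`, `sorted_ids`, and `load_by_index` are hypothetical stand-ins for the file contents, the sorted row metadata, and an ridx-style load. None of them are cmapPy functions.

```python
# Hypothetical row IDs in the order they are stored in the file.
file_order_ids = ["g5", "g2", "g9", "g1", "g7"]

# What the caller sees when the metadata comes back sorted by ID.
sorted_ids = sorted(file_order_ids)

# The genes the caller actually wants.
wanted = {"g1", "g9"}


def load_by_index(indices):
    """Stand-in for an ridx-style load: positions refer to file order."""
    return [file_order_ids[i] for i in indices]


# Positions looked up in the sorted metadata (step 3 of the use case).
idx_from_sorted = [i for i, rid in enumerate(sorted_ids) if rid in wanted]

# Positions looked up in the IDs as they appear in the file.
idx_from_file = [i for i, rid in enumerate(file_order_ids) if rid in wanted]

wrong = load_by_index(idx_from_sorted)  # a different set of genes
right = load_by_index(idx_from_file)    # the genes that were asked for
```

In this toy case the positions taken from the sorted list point at other rows in the file, which is the behavior described in step 4.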

oena commented 6 years ago

Ok, gotcha. That does seem like a useful thing to do. Maybe we can start with having it as an option and see how things go?
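
One way the opt-in could look, as a minimal sketch only: `read_row_ids` and its `sort_ids` flag are hypothetical names invented for illustration, not part of parse_gctx. The default keeps the sorted behavior described in this thread, so existing callers would be unaffected.

```python
def read_row_ids(file_order_ids, sort_ids=True):
    """Hypothetical reader: return IDs sorted (default) or in file order."""
    ids = list(file_order_ids)
    return sorted(ids) if sort_ids else ids


def positions_of(ids, wanted):
    """Return the positions in `ids` of every ID contained in `wanted`."""
    return [i for i, rid in enumerate(ids) if rid in wanted]


# Made-up IDs in file order, purely for illustration.
stored = ["g5", "g2", "g9", "g1", "g7"]

# With sorting switched off, positions match the file and are usable as ridx.
unsorted_ids = read_row_ids(stored, sort_ids=False)
ridx = positions_of(unsorted_ids, {"g1", "g9"})
selected = [stored[i] for i in ridx]
```

Keeping the flag's default at the current behavior would make it possible to "see how things go" before changing what callers get back.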

saksham219 commented 5 years ago

@oena Has this been taken up? I was thinking of working on this.

oena commented 5 years ago

@dllahr did you follow up on this? No worries if not, just checking

ghost commented 5 years ago

Sorry for the late reply. I did not get around to doing anything with this.