cmap / cmapPy

Assorted tools for interacting with .gct, .gctx files and other Connectivity Map (Broad Institute) data/tools
https://clue.io/cmapPy/index.html
BSD 3-Clause "New" or "Revised" License
126 stars 76 forks source link

gmt.py #8

Closed levlitichev closed 7 years ago

levlitichev commented 7 years ago

Added a module for parsing GMTs. I need to parse a GMT for changes I want to make to evaluate_sets.py, and I figured it would be worth adding here while I'm at it.

I followed Corey's model of representing GMTs: a GMT is a list of dictionaries, where each dictionary corresponds to each line of the GMT file, and each dictionary has the following fields: head, desc, entry (paralleling Matlab).

Let me know what you guys think. (There are a few other minor changes in this pull request as well.)

oena commented 7 years ago

@levlitichev Mostly looks good to me (thanks for writing tests too! 😉)-- the only part I'm not clear on were the changes you made to concat_gctoo (wrote a comment in the place where I'm confused). Can you clarify please?

dllahr commented 7 years ago

Can we / have we checked with @rajivnarayan and @tnat1031 about the spec and structure of the in-data memory/object that is created when Matlab / R parse GMT's? If it matches the above, great! Otherwise we should be consistent.

levlitichev commented 7 years ago

Good call. I'll check with them.

oena commented 7 years ago

Also @levlitichev @dllahr, this is a fairly minor point but the reason why I didn't include the GMT/GRP code initially was to limit the scope of this submodule to GCT/X files only. So just to clarify, are we all good with expanding?

levlitichev commented 7 years ago

I think I remember that discussion, but either way, it looks like plategrp.py snuck into the repo. I can't think of a better place for these scripts besides maybe adding a module that would sit next to pandasGEXpress and clue_api_client.

oena commented 7 years ago

And, while we're having philosophical discussions--

@levlitichev and I were just talking about whether the (current) list of dictionaries structure is really the best data structure to use for these. So that we're all on the same page/can discuss more widely, the pros and cons we found were:

Pros (keep it the way it is):

Cons (change it):

Thoughts? Lev, feel free to revise anything I got wrong.

tnat1031 commented 7 years ago

@oena @levlitichev I agree that it would be nice to have the ability to access an individual item in a GMT object given that item's UID. I think since we're already tweaking the code and it's not a dramatic change, I'd vote for making the change.

levlitichev commented 7 years ago

@oena brings up a good alternate structure. I did a side-by-side of a few common tasks that we might want to do with a GMT to see how they stack up.

"""
Option 1: the old way

A GMT is stored as a list of dictionaries.
Each line is its own dictionary.
Each dictionary has the following keys:
    - head (string): identifier for the set
    - desc (string): longer description of the set
    - entries (list): members of the set

Option 2: proposed new way

A GMT is stored as a dictionary, where the keys are ids, and the values are
dictionaries. The values could also be tuples, but I think dictionary is
better so that you can refer to fields by name, rather than remembering
what's what in the tuple.

The nested dictionaries have the following keys:
    - desc (string): longer description of the set
       - entries (list): members of the set
"""

gmt1 = [{"head": "A", "desc": "this one is A", "entry": ["a1", "a3", "a2"]},
        {"head": "B", "desc": "this one is B", "entry": ["b4", "b2", "b3"]}]
gmt2 = {"A": {"desc": "this one is A", "entry": ["a1", "a3", "a2"]},
        "B": {"desc": "this one is B", "entry": ["b4", "b2", "b3"]}}

# Get names of sets
print [g["head"] for g in gmt1]
print gmt2.keys()

# Get descriptions of sets
print [g["desc"] for g in gmt1]
print [g["desc"] for g in gmt2.itervalues()]

# Get entries in sets
print [g["entry"] for g in gmt1]
print [g["entry"] for g in gmt2.itervalues()]

# Get lengths of sets
print [len(g["entry"]) for g in gmt1] # [g["length"]) for g in gmt1]
print [len(g["entry"]) for g in gmt2.itervalues()] # [g["length"]) for g in gmt2.itervalues()]

# Sort each set and return result as dict
dict1 = {}
for g in gmt1:
    dict1[g["head"]] = sorted(g["entry"])
print dict1

dict2 = {}
for set_name, set_rest in gmt2.iteritems():
    dict2[set_name] = sorted(set_rest["entry"])
print dict2

N.B. @tnat1031 , you'll notice that in this comparison, I have not introduced 'length' as another field for two reasons: 1) it's quite easy to extract length using the 'entry' field, and 2) I can see annoyances down the line in having to update the 'length' field if/when the 'entry' field is updated (and not worrying about whether the two are in sync doesn't seem like a good move either). But I included in the comments what the syntax would be if we did include it as a field.

Let me know what you all think.

levlitichev commented 7 years ago

Some brief commentary: it is really nice to pull out the set identifiers using .keys, but the other tasks seem to have been made more difficult. I may also be missing a more elegant way to mess around with dictionaries. So my inclination is to follow the original approach. Looking forward to others' thoughts.

dllahr commented 7 years ago

Are we reinventing the wheel here? Is there a reason we're using a different structure than r or Matlab? If we choose a different structure here are we going to update the others to conform?

On Aug 10, 2017 11:11 AM, "Lev Litichevskiy" notifications@github.com wrote:

Some brief commentary: it is really nice to pull out the set identifiers using .keys, but the other tasks seem to have been made more difficult. I may also be missing a more elegant way to mess around with dictionaries. So my inclination is to follow the original approach. Looking forward to others' thoughts.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cmap/cmapPy/pull/8#issuecomment-321581074, or mute the thread https://github.com/notifications/unsubscribe-auth/ABfjIgUPSH7gAAl-4KJ9Er0bEuozAcsZks5sWx2wgaJpZM4OyPCj .

levlitichev commented 7 years ago

In Matlab, it's a struct array that stores cell arrays. The 'head' and 'desc' fields are cell arrays of strings, the 'entry' field is a cell array of cell arrays, and the 'len' field is a cell array of integers. As I understand, the Python equivalent is a dictionary containing lists of strings ('head' and 'desc'), lists of lists ('entry'), and lists of integers ('length').

With @oena, we just checked out R. My understanding is the following: we have a list of lists of lists. The outermost list is equal to the number of lines in the GMT. The next layer (i.e. each corresponding to a single line) is a named list where the names are 'head', 'desc', 'entry', and 'len' . Since you can't just store 'head' and 'desc' as strings, they're stored as single-element lists, and then 'entry' is a list with several elements. 'length' is a single-element list containing an integer.

The main difference is that in R, the outermost container is a list whose length is equal to the number of sets (similar to option 1 above). In Matlab, it's a dictionary storing lists, where each list's length equals the number of sets (no equivalent option proposed).

So... I dunno. Happy to discuss in person when you get back, Dave. @tnat1031 and @rajivnarayan, feel free to chime in.

tnat1031 commented 7 years ago

R is a little unique because lists can also act as dictionaries in that it's possible to access individual elements by ID or by index. So in that sense the current GMT implementation can be used in similar ways as both the current and proposed Python implementations.

Thinking more generally, while it would be nice to have these implementations behave the same way in multiple languages, it might also be worth thinking about whether each language-specific implementation takes best advantage of that language's capabilities (or mitigates limitations).

The dictionary option (option 2) doesn't seem too much more complicated than the list of dictionaries option, plus it also lets you access any individual item by ID, which I think is a pretty common use case. So my vote would be option 2.

On Thu, Aug 10, 2017 at 3:25 PM Lev Litichevskiy notifications@github.com wrote:

In Matlab, it's a struct array that stores cell arrays. The 'head' and 'desc' fields are cell arrays of strings, the 'entry' field is a cell array of cell arrays, and the 'len' field is a cell array of integers. As I understand, the Python equivalent is a dictionary containing lists of strings ('head' and 'desc'), lists of lists ('entry'), and lists of integers ('length').

With @oena https://github.com/oena, we just checked out R. My understanding is the following: we have a list of lists of lists. The outermost list is equal to the number of lines in the GMT. The next layer (i.e. each corresponding to a single line) is a named list where the names are 'head', 'desc', 'entry', and 'len' . Since you can't just store 'head' and 'desc' as strings, they're stored as single-element lists, and then 'entry' is a list with several elements. 'length' is a single-element list containing an integer.

The main difference is that in R, the outermost container is a list whose length is equal to the number of sets (similar to option 1 above). In Matlab, it's a dictionary storing lists, where each list's length equals the number of sets (no equivalent option proposed).

So... I dunno. Happy to discuss in person when you get back, Dave. @tnat1031 https://github.com/tnat1031 and @rajivnarayan https://github.com/rajivnarayan, feel free to chime in.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/cmap/cmapPy/pull/8#issuecomment-321649357, or mute the thread https://github.com/notifications/unsubscribe-auth/AB3i74iEgUfRAT7U2BYq_DKj7pOfH2vCks5sW1kogaJpZM4OyPCj .

levlitichev commented 7 years ago

@oena I've made the changes that we discussed and also updated the README. Please make any other changes if necessary! Thanks a lot.