a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License

Protein graphs #38

Closed a-r-j closed 3 years ago

a-r-j commented 3 years ago

Did some refactoring to the protein graph construction.

I cleaned up the high-level graph construction function and refactored some of process_dataframe such that the various steps are their own functions.

There are a couple of things I'd really appreciate your take on, @ericmjl:

  1. Refactoring the dataframe processing so that users can provide a list of functions that operate on the atom/hetatom dataframes, similar to the metadata annotation family of functions.

  2. Having high-level functions that can be used with just a config object, but that also take additional optional arguments that override the config. At the moment, I've partially done this in construct_graph for the metadata funcs. The flow is to load a default config if none is provided and then optionally overwrite the function list parameters if they are provided. Do you think it makes sense to have this available for all the config parameters? It would blow up the number of arguments that construct_graph takes, but they would all be optional.
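As a rough sketch of the override flow in point 2, and one way to avoid a long argument list, the overrides could be collected as keyword arguments and merged into the config. Everything here is hypothetical: GraphConfig, its field names, and construct_graph_sketch are illustrative stand-ins, not the library's actual config or API.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, List, Optional


@dataclass
class GraphConfig:
    """Hypothetical stand-in for the config object; field names are illustrative."""

    granularity: str = "CA"
    node_metadata_functions: List[Callable] = field(default_factory=list)
    edge_construction_functions: List[Callable] = field(default_factory=list)


def construct_graph_sketch(
    config: Optional[GraphConfig] = None, **overrides
) -> GraphConfig:
    """Resolve the effective config: defaults, then the given config, then overrides."""
    if config is None:
        config = GraphConfig()  # load the default config
    # Ignore overrides left as None so they do not clobber config values.
    supplied = {k: v for k, v in overrides.items() if v is not None}
    # dataclasses.replace returns a new object and raises TypeError for
    # unknown field names, which catches misspelled overrides.
    return replace(config, **supplied)


effective = construct_graph_sketch(granularity="atom")
```

With this shape, any config field can be overridden without adding a named parameter for each one, at the cost of a less self-documenting signature.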

a-r-j commented 3 years ago

I've added support in process_dataframe for lists of functions that process the atom and hetatm dataframes. If these are provided, they do all the processing; if not, the default workflow runs.

I decided to leave the default workflow in place for now. I think it is useful for high-level users because it keeps the config explicit, instead of requiring them to correctly partial a bunch of functions and sequence them. Relying only on function lists would remove a lot of the oversight that the config object provides.
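A minimal sketch of that "function list or default workflow" behaviour, assuming each processing step takes the data and returns the processed data. To keep it dependency-free, a list of dicts stands in for the atom dataframe; process_rows, remove_element, keep_chain and default_workflow are hypothetical names, not the functions in the PR.

```python
from functools import partial, reduce
from typing import Callable, Dict, List, Optional

Rows = List[Dict]  # stand-in for a pandas DataFrame of atom records


def remove_element(rows: Rows, element: str) -> Rows:
    """Hypothetical processing step: drop all atoms of one element."""
    return [r for r in rows if r["element"] != element]


def keep_chain(rows: Rows, chain: str) -> Rows:
    """Hypothetical processing step: keep a single chain."""
    return [r for r in rows if r["chain"] == chain]


def default_workflow(rows: Rows) -> Rows:
    """Placeholder for the config-driven default processing."""
    return remove_element(rows, "H")


def process_rows(
    rows: Rows, funcs: Optional[List[Callable[[Rows], Rows]]] = None
) -> Rows:
    """Apply user-supplied functions in order, or fall back to the default workflow."""
    if funcs is None:
        return default_workflow(rows)
    return reduce(lambda acc, f: f(acc), funcs, rows)


atoms = [
    {"chain": "A", "element": "C"},
    {"chain": "A", "element": "H"},
    {"chain": "B", "element": "N"},
]
# The user has to partial and order the steps themselves.
custom = process_rows(
    atoms,
    [partial(remove_element, element="H"), partial(keep_chain, chain="A")],
)
```

The custom call shows the burden mentioned above: the caller must bind every parameter and get the ordering right, whereas the default path needs no arguments.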

a-r-j commented 3 years ago

I found it tricky to elegantly refactor the functions that operate on sequences to work with both protein graphs and PPI graphs. I settled on this. What do you think?

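from functools import partial

import networkx as nx


def molecular_weight(input, seq_type="protein"):
    from Bio import SeqUtils

    # Bind seq_type so the function can be applied to a bare sequence
    func = partial(SeqUtils.molecular_weight, seq_type=seq_type)

    # If a graph is provided, e.g. a protein graph, compute the feature over its chains
    if isinstance(input, nx.Graph):
        return compute_feature_over_chains(
            input, func, feature_name="molecular_weight"
        )

    # If a sequence string is provided, e.g. from a PPI graph node, compute the weight directly
    if isinstance(input, str):
        return func(input)

    # Fail loudly instead of silently returning None for unsupported inputs
    raise TypeError(f"Unsupported input type: {type(input).__name__}")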
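For comparison, one possible alternative to the explicit type checks is functools.singledispatch from the standard library, which selects an implementation based on the type of the first argument. This is only a sketch of that technique, not what the PR does: ChainGraph, sequence_length and length_feature are hypothetical stand-ins for a protein graph, SeqUtils.molecular_weight and the feature function.

```python
from functools import singledispatch


class ChainGraph:
    """Hypothetical stand-in for a protein graph holding per-chain sequences."""

    def __init__(self, sequences):
        self.sequences = dict(sequences)  # chain id -> sequence
        self.features = {}


def sequence_length(seq: str) -> int:
    """Placeholder sequence function standing in for a real descriptor."""
    return len(seq)


@singledispatch
def length_feature(obj):
    """Fallback for unsupported input types."""
    raise TypeError(f"Unsupported input type: {type(obj).__name__}")


@length_feature.register
def _(obj: str):
    # Bare sequence, e.g. taken from a PPI graph node
    return sequence_length(obj)


@length_feature.register
def _(obj: ChainGraph):
    # Graph input: compute the feature per chain and store it on the graph
    obj.features["length"] = {
        chain: sequence_length(seq) for chain, seq in obj.sequences.items()
    }
    return obj
```

Each input type gets its own small function, so supporting another graph type means registering one more implementation rather than extending an if/elif chain.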