Reed-CompBio / spras

Signaling Pathway Reconstruction Analysis Streamliner (SPRAS)
MIT License
11 stars 20 forks source link

Supporting directed edges #107

Closed annaritz closed 10 months ago

annaritz commented 1 year ago

@ntalluri is starting to work on incorporating directed edges, so we're talking about that at our meeting today. We realized there are a number of issues, so we wanted to start a conversation about them.

  1. Some algorithms support directed edges, some don't. Issue #28 suggested that perhaps this isn't an algorithm-level decision (and shouldn't be in the CONFIG file as a user option), but we should decided on how each algorithm supports directed vs. undirected edges. @ntalluri is collecting information about each algorithm, which will be very helpful.
  2. Directed vs. undirected edges is actually a data attribute (the interactome). Further, if a method supports either directed or undirected edges, then this information must be used in the generate_inputs() function to prepare the network files (unless it's passed directly to the algorithm as an argument). For example, if a network is undirected, then RWR should convert it to a bidirected graph before running. If it's directed, RWR doesn't need to convert it. (The decision that RWR runs on a bidirected network is one that we, the developers, made).
  3. Maybe a general table like the one below will help us make these decisions for all algorithms?
Network Type Algorithm Runs on Undirected Only Algorithm Runs on Directed Only Algorithm Runs on Dir or Undir
Undirected Ready to Run Ignore? Make Bi-Directed? Ready to Run
Directed Ignore? Make Undirected? Ready to Run Ready to Run
Mixed TODO TODO TODO
ntalluri commented 1 year ago

Directionality of Algorithms

MEO

Omics Integrator 1

Omics Integrator 2

MinCostFlow

Pathlinker

All Pairs Shortest Paths

Domino

RWR

TieDie

BowTieBuilder

ntalluri commented 1 year ago

APSP

MEO

Pathlinker

Min Cost Flow

Omics Integrator 1

Omics Integrator 2

Domino

RWR:

TieDie:

BowTieBuilder:

ntalluri commented 1 year ago
Network Type Algorithm Runs on Undirected Only Algorithm Runs on Directed Only Algorithm Runs on Dir or Undir
Undirected Ready to Run Make Bi-Directed Ready to Run
Directed Make Undirected Ready to Run Ready to Run
Mixed Make directed edges into undirected (not bidirectional), keep undirected edges as is Make undirected edges into bidirectional edges, keep directed edges as is Ready to Run

Converting from Undirected to Directed graph:

Converting from Directed to Undirected graph:

OVERALL FORMAT -> FORMAT PER ALGORITHM:

MCF (deals with directed):

A B 1 D B C 1 U

->

A B 1 B C 1 C B 1

PathLinker(deals with directed):

A B 1 D B C 1 U

->

A B 1 B C 1 C B 1

MEO(deals with mix): A B 1 D B C 1 U

->

A (pd) B 1 B (pp) C 1

Omics 1(deals with mix):

A B 1 D B C 1 U

-> stays the same

Omics 2(deals with undirected):

A B 1 D B C 1 U

->

A B 1 B C 1

Domino(deals with undirected):

A B 1 D B C 1 U

->

A ppi B B ppi C

RWR(deals with directed): A B 1 D B C 1 U

->

Node1 Node2 Weight A B 1 B C 1 C B 1

TieDie(deals with directed): A B 1 D B C 1 U

->

A -a> B B -a> C C -a> B

BowTieBuilder: N/A

ntalluri commented 1 year ago

Algorithms that support bidirectional edges:

Algorithms that support repeated edges:

Algorithms that don't support bidirectional edges:

Algorithms that don't support repeated edges:

Algorithms unsure of: RWR

ntalluri commented 1 year ago

Input network file format Ideas

First Idea Input file format ideas:

Type 1:

A B 1 C B 1

Type 2:

A B 1 D B C 1 U

Second idea

ntalluri commented 1 year ago

Util.py functions for parse inputs

There is a similarity between completely directed algorithms and a similarity between completely undirected algorithms based on the existing algorithm formats.

It is possible to build two functions:

  1. if three columns, then we need to know if directed or undirected to be able to pass it to the functions to let it know how much it will need to modify the df based on the conversion logic from the table above
  2. if there are 4 columns then we will have to check between the direction marked and then convert to the form needed based on the conversion logic above.

At the moment there is no commonality between the algorithms that deal with mixed graphs; however, in the future there may be.

It is possible to build a function mixedConversion(universal_input, directed_sep, undirected_sep, col_loc(?), directionality) that will convert the universal input into the format MEO uses, or if the column location for the direction column is given, then it can be generalized to any algorithm that uses a format somewhat similar to MEO.

And for Omics Integrator 1, it uses the universal format straight away (there is a chance other algorithms in the future will too)

Future

Algorithm conversion Example

Pathlinker

Omics Integrator 2:

Meo: everything will call mixedConversion:

Omics Integrator 1:

ntalluri commented 1 year ago

Util.py functions for parse outputs

A df that is provided to these functions should be modified to the point where it can call these functions last, with the respected columns (if necessary) still present, so that the direction column can be included in the outputs.

For the fully directed and fully undirected algorithms, we can have two functions in util.py: add_direction_column_directed(df)

add_direction_column_undirected(df)

For the mixed algorithms, we can have 1 function add_direction_column_mixed(df, directed_dlim, undirected_dlim, column_loc(?))

ntalluri commented 1 year ago

Dealing with interpreting the input network

Config.yaml We can put per dataset a graphtype: "mixed"/"undirected"/"directed"

Dataset.py


self.interactome = pd.read_table(os.path.join(data_loc,interactome_loc), header=None)
        num_cols = self.interactome.shape[1]
        if num_cols == 3:
            # TODO: it may be easier for the generate_inputs implementations if this is always transformed to
            # have a default Direction directed or undirected based on what is written in the config file
            self.interactome.columns = ["Interactor1", "Interactor2", "Weight"]
        elif num_cols == 4:
            self.interactome.columns = ["Interactor1", "Interactor2", "Weight", "Direction"]
        else:
            raise ValueError(f'edge_files must have three or four columns but found {num_cols}')

Generate Inputs

agitter commented 1 year ago

Omics Integrator 2 can deal with fully undirected graphs

I looked at its code as well and agree it uses undirected graphs. It has NetworkX Graph objects, which are undirected, and uses pcst_fast solver, which supports undirected graphs.

Now that you have done all of this work, I suggest that your pull request include some of this information in the Class or generate_inputs docstrings of each method. For instance, we can document what you learned about the method's support for undirected, directed, and mixed graphs and the assumptions SPRAS makes when preparing input graphs. We could even document the expected raw input file format.

I agree with your updated table about the behavior for each type of algorithm on each type of graph.

We may want to put the helper functions for converting graphs in Dataset.py instead of util.py. util.py is already crowded. I see two types of operations needed:

I think it will help reusability and implementation to keep those separated. I'm imagining less algorithm-specific logic and suggest making the 3 vs. 4 column input file handling standardized when the file is first loaded by Dataset.py so by the time generate_inputs accesses the dataframe it will always have 4 columns.

In our last meeting we talked about using the config file to specify whether the 3 column format is all directed or undirected edges. That would be a nice feature. However, I propose keeping this simple at first and assuming the 3 column format is always undirected. If there is demand for this extra flexibility, we could easily add it later.

agitter commented 1 year ago

@annaritz we noticed that most of SPRAS currently works with undirected graphs, but PathLinker uses directed graphs. We are currently taking the input graph to SPRAS (which the user likely assumes is undirected) and then passing those edges to PathLinker, which interprets them as directed edges. That seems wrong to me.

@ntalluri's enhancements will fix this behavior, but the current PathLinker wrapper is not ideal in this respect.

Separately, we discussed that there isn't a practical need for a helper function to make a directed graph undirected. The algorithms that would use this instead can simply drop the fourth column with the directionality information.

annaritz commented 1 year ago

Good catch! Yes, you're right. we should be making the (assumed undirected) input graph bidirected for PathLinker.

ntalluri commented 1 year ago

Parse Outputs

Network Type Algorithm Runs on Undirected Only Algorithm Runs on Directed Only Algorithm Runs on Dir or Undir
Undirected Add a column of U's Add a columns of D's Add a columns of U's
Directed Add a columns of U's Add a column of D's Add a column of D's
Mixed Add a column of U's Add a column of D's Add a D or U based on the directionality separator given back in raw-pathway

MEO:

Omics Integrator 1:

ntalluri commented 1 year ago

Updating ML Post Processing

Two methods to change:

The edges are dealt within for in the index, and the columns are the algorithms. The rest of the methods are not dependent on the edges/index (the first line of code for each method drops the index) so nothing has to change for them.

ml.py

summarize_networks

ensemble_network

ntalluri commented 1 year ago

summary.py and graphspace.py

https://stackoverflow.com/questions/55496703/is-it-possible-to-add-undirected-and-directed-edges-to-a-graph-object-in-network

ntalluri commented 1 year ago

For networkx digraphs and graphs, it doesn't allow for multiple parallel edges. With that in mind, I need to test this on the algorithms to see which ones fail.

example:

A B 1 U B A 1 U

convert to directed A B 1 D B A 1 D A B 1 D B A 1 D

UPDATE: through testing, we know the code won’t crash thus the algortihms can run with inputs of multiple parallel edges; however, for parallel edges, networkx digraph and graph will replace the edge with the most recent of the edges from the input when a parallel edge is encountered.

ntalluri commented 1 year ago

Summary.py

In summary.py, I think the approach is to work exclusively with undirected graphs:

  1. Allowing mixed directionality could potentially inflate the actual number of nodes and edges if any conversions are done, leading to inaccuracies in our graph representation.

  2. The majority of the operations we are interested in performing require the graph to be undirected. This makes me question the initial option to choose between directed and undirected graphs in the run function, especially when the summarize_networks function mandates an undirected graph.

    • As for the summarize_networks function, it was initially set up to create a weighted edge list, but those weights were never actually utilized in any calculations. To accommodate the "Direction" column in the edge list, my plan is to read the edge list as is (nx.read_edgelist) and rename the "rank" attribute to "weight." This way, if we ever decide to incorporate weights, we'll have that data readily available.

I'm still uncertain about the correct approach for this part of the code, but for the time being, focusing on undirected graphs appears to be the most straightforward and useful strategy.