Closed annaritz closed 10 months ago
Directionality of Algorithms
MEO
Omics Integrator 1
Omics Integrator 2
edges
: a 2D int64 array. Each row (of length 2) specifies an undirected edge in the input graph. The nodes are labeled 0 to n-1, where n is the number of nodes. “MinCostFlow
Pathlinker
All Pairs Shortest Paths
Domino
RWR
TieDie
BowTieBuilder
APSP
MEO
Pathlinker
Min Cost Flow
Omics Integrator 1
Omics Integrator 2
Domino
RWR:
TieDie:
BowTieBuilder:
Network Type | Algorithm Runs on Undirected Only | Algorithm Runs on Directed Only | Algorithm Runs on Dir or Undir |
---|---|---|---|
Undirected | Ready to Run | Make Bi-Directed | Ready to Run |
Directed | Make Undirected | Ready to Run | Ready to Run |
Mixed | Make directed edges into undirected (not bidirectional), keep undirected edges as is | Make undirected edges into bidirectional edges, keep directed edges as is | Ready to Run |
Converting from Undirected to Directed graph:
Converting from Directed to Undirected graph:
MCF (deals with directed):
A B 1 D B C 1 U
->
A B 1 B C 1 C B 1
PathLinker(deals with directed):
A B 1 D B C 1 U
->
A B 1 B C 1 C B 1
MEO(deals with mix): A B 1 D B C 1 U
->
A (pd) B 1 B (pp) C 1
Omics 1(deals with mix):
A B 1 D B C 1 U
-> stays the same
Omics 2(deals with undirected):
A B 1 D B C 1 U
->
A B 1 B C 1
Domino(deals with undirected):
A B 1 D B C 1 U
->
A ppi B B ppi C
RWR(deals with directed): A B 1 D B C 1 U
->
Node1 Node2 Weight A B 1 B C 1 C B 1
TieDie(deals with directed): A B 1 D B C 1 U
->
A -a> B B -a> C C -a> B
BowTieBuilder: N/A
Algorithms that support bidirectional edges:
Algorithms that support repeated edges:
Algorithms that don't support bidirectional edges:
Algorithms that don't support repeated edges:
Algorithms unsure of: RWR
First Idea Input file format ideas:
Type 1:
A B 1 C B 1
Type 2:
A B 1 D B C 1 U
Second idea
There is a similarity between completely directed algorithms and a similarity between completely undirected algorithms based on the existing algorithm formats.
It is possible to build two functions:
At the moment there is no commonality between the algorithms that deal with mixed graphs; however, in the future there may be.
It is possible to build a function mixedConversion(universal_input, directed_sep, undirected_sep, col_loc(?), directionality) that will convert the universal input into the format MEO uses, or if the column location for the direction column is given, then it can be generalized to any algorithm that uses a format somewhat similar to MEO.
And for Omics Integrator 1, it uses the universal format straight away (there is a chance other algorithms in the future will too)
Future
Algorithm conversion Example
Pathlinker
Omics Integrator 2:
Meo: everything will call mixedConversion:
Omics Integrator 1:
A df that is provided to these functions should be modified to the point where it can call these functions last, with the respected columns (if necessary) still present, so that the direction column can be included in the outputs.
For the fully directed and fully undirected algorithms, we can have two functions in util.py: add_direction_column_directed(df)
add_direction_column_undirected(df)
For the mixed algorithms, we can have 1 function add_direction_column_mixed(df, directed_dlim, undirected_dlim, column_loc(?))
Config.yaml We can put per dataset a graphtype: "mixed"/"undirected"/"directed"
Dataset.py
self.interactome = pd.read_table(os.path.join(data_loc,interactome_loc), header=None)
num_cols = self.interactome.shape[1]
if num_cols == 3:
# TODO: it may be easier for the generate_inputs implementations if this is always transformed to
# have a default Direction directed or undirected based on what is written in the config file
self.interactome.columns = ["Interactor1", "Interactor2", "Weight"]
elif num_cols == 4:
self.interactome.columns = ["Interactor1", "Interactor2", "Weight", "Direction"]
else:
raise ValueError(f'edge_files must have three or four columns but found {num_cols}')
Generate Inputs
Omics Integrator 2 can deal with fully undirected graphs
I looked at its code as well and agree it uses undirected graphs. It has NetworkX Graph objects, which are undirected, and uses pcst_fast solver, which supports undirected graphs.
Now that you have done all of this work, I suggest that your pull request include some of this information in the Class or generate_inputs
docstrings of each method. For instance, we can document what you learned about the method's support for undirected, directed, and mixed graphs and the assumptions SPRAS makes when preparing input graphs. We could even document the expected raw input file format.
I agree with your updated table about the behavior for each type of algorithm on each type of graph.
We may want to put the helper functions for converting graphs in Dataset.py
instead of util.py
. util.py
is already crowded. I see two types of operations needed:
I think it will help reusability and implementation to keep those separated. I'm imagining less algorithm-specific logic and suggest making the 3 vs. 4 column input file handling standardized when the file is first loaded by Dataset.py
so by the time generate_inputs
accesses the dataframe it will always have 4 columns.
In our last meeting we talked about using the config file to specify whether the 3 column format is all directed or undirected edges. That would be a nice feature. However, I propose keeping this simple at first and assuming the 3 column format is always undirected. If there is demand for this extra flexibility, we could easily add it later.
@annaritz we noticed that most of SPRAS currently works with undirected graphs, but PathLinker uses directed graphs. We are currently taking the input graph to SPRAS (which the user likely assumes is undirected) and then passing those edges to PathLinker, which interprets them as directed edges. That seems wrong to me.
@ntalluri's enhancements will fix this behavior, but the current PathLinker wrapper is not ideal in this respect.
Separately, we discussed that there isn't a practical need for a helper function to make a directed graph undirected. The algorithms that would use this instead can simply drop the fourth column with the directionality information.
Good catch! Yes, you're right. we should be making the (assumed undirected) input graph bidirected for PathLinker.
Network Type | Algorithm Runs on Undirected Only | Algorithm Runs on Directed Only | Algorithm Runs on Dir or Undir |
---|---|---|---|
Undirected | Add a column of U's | Add a columns of D's | Add a columns of U's |
Directed | Add a columns of U's | Add a column of D's | Add a column of D's |
Mixed | Add a column of U's | Add a column of D's | Add a D or U based on the directionality separator given back in raw-pathway |
MEO:
Omics Integrator 1:
Two methods to change:
The edges are dealt within for in the index, and the columns are the algorithms. The rest of the methods are not dependent on the edges/index (the first line of code for each method drops the index) so nothing has to change for them.
ml.py
summarize_networks
ensemble_network
For networkx digraphs and graphs, it doesn't allow for multiple parallel edges. With that in mind, I need to test this on the algorithms to see which ones fail.
example:
A B 1 U B A 1 U
convert to directed A B 1 D B A 1 D A B 1 D B A 1 D
UPDATE: through testing, we know the code won’t crash thus the algortihms can run with inputs of multiple parallel edges; however, for parallel edges, networkx digraph and graph will replace the edge with the most recent of the edges from the input when a parallel edge is encountered.
In summary.py, I think the approach is to work exclusively with undirected graphs:
Allowing mixed directionality could potentially inflate the actual number of nodes and edges if any conversions are done, leading to inaccuracies in our graph representation.
The majority of the operations we are interested in performing require the graph to be undirected. This makes me question the initial option to choose between directed and undirected graphs in the run function, especially when the summarize_networks function mandates an undirected graph.
I'm still uncertain about the correct approach for this part of the code, but for the time being, focusing on undirected graphs appears to be the most straightforward and useful strategy.
@ntalluri is starting to work on incorporating directed edges, so we're talking about that at our meeting today. We realized there are a number of issues, so we wanted to start a conversation about them.
generate_inputs()
function to prepare the network files (unless it's passed directly to the algorithm as an argument). For example, if a network is undirected, thenRWR
should convert it to a bidirected graph before running. If it's directed,RWR
doesn't need to convert it. (The decision thatRWR
runs on a bidirected network is one that we, the developers, made).