Closed lgeistlinger closed 2 years ago
A small comparison of computing connected components with graphframes, igraph, and Bioconductor's graph package:
> bp.293t <- BioPlex::getBioPlex(cell.line = "293T", version = "3.0")
> bp.gr <- BioPlex::bioplex2graph(bp.293t)
> bp.gr
## A graphNEL graph with directed edges
## Number of Nodes = 13689
## Number of Edges = 115868
(1) With Bioconductor's graph package:
> system.time(ccgn <- graph::connComp(bp.gr))
user system elapsed
291.250 7.533 298.959
# Number of components
> length(ccgn)
[1] 15
# Number of nodes in each component
> lengths(ccgn)
[1] 13661 2 2 2 2 2 2 2 2 2 2 2
[13] 2 2 2
(2) With graphframes (using a local Spark connection sc; see the [graph2graphframe](https://ccb-hms.github.io/BioPlexAnalysis/reference/graph2graphframe.html) reference):
> gf <- BioPlexAnalysis::graph2graphframe(bp.gr, sc)
> gf
## GraphFrame
## Vertices:
## Database: spark_connection
## $ id <chr> "P00813", "Q8N7W2", "Q6ZMN8", "P20138", "P55039", "Q17R55", "…
## $ entrezid <chr> "100", "222389", "645121", "945", "1819", "148109", "54363", …
## $ symbol <chr> "ADA", "BEND7", "CCNI2", "CD33", "DRG2", "FAM187B", "HAO1", "…
## $ isoform <chr> "P00813", "Q8N7W2-2", "Q6ZMN8", "P20138", "P55039", "Q17R55",…
## Edges:
## Database: spark_connection
## $ src <chr> "P00813", "Q8N7W2", "Q8N7W2", "Q8N7W2", "Q8N7W2", "Q8N7W2", "Q8N7…
## $ dst <chr> "A5A3E0", "P26373", "Q09028", "Q9Y3U8", "P36578", "P23396", "Q070…
## $ pW <dbl> 6.881844e-10, 1.340380e-18, 7.221401e-21, 7.058372e-17, 1.632313e…
## $ pNI <dbl> 1.176357e-04, 2.256645e-01, 6.416690e-05, 1.281827e-01, 2.006379e…
## $ pInt <dbl> 0.9998824, 0.7743355, 0.9999358, 0.8718173, 0.7993621, 0.9989736,…
> sparklyr::spark_set_checkpoint_dir(sc, tempdir())
> system.time(cc <- graphframes::gf_connected_components(gf))
user system elapsed
0.056 0.003 9.825
> ccd <- data.frame(cc)
# Number of nodes in each component
> table(ccd$component)
          0         412  8589934790  8589935412 25769804245 25769804437
      13661           2           2           2           2           2
42949673590 51539607674 51539608004 60129542431 60129542493 68719476747
          2           2           2           2           2           2
68719476964 68719477146 94489280821
          2           2           2
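The component labels in the table() output above are not consecutive indices, so they are awkward to compare directly with the other two methods. A minimal sketch of one way to renumber them, assuming the labels and counts are copied by hand from that output (the vector `component` is a stand-in for `ccd$component`, and all other variable names are illustrative):

```r
# Component labels and sizes copied from the table() output above.
labels <- c(0, 412, 8589934790, 8589935412, 25769804245, 25769804437,
            42949673590, 51539607674, 51539608004, 60129542431,
            60129542493, 68719476747, 68719476964, 68719477146,
            94489280821)
counts <- c(13661, rep(2, 14))

# Stand-in for ccd$component: one label per node.
component <- rep(labels, times = counts)

# Map the arbitrary labels to consecutive indices 1..k.
component_index <- match(component, unique(component))

# Component sizes, largest first, for comparison across methods.
gf_sizes <- sort(tabulate(component_index), decreasing = TRUE)

# Sizes reported by graph::connComp for the same graph.
ref_sizes <- c(13661, rep(2, 14))
stopifnot(identical(as.numeric(gf_sizes), ref_sizes))
```

With the real data, `component` would simply be `ccd$component`, and `ref_sizes` could be `lengths(ccgn)`.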
(3) With igraph:
> ig <- igraph::graph_from_graphnel(bp.gr)
> system.time(cci <- igraph::components(ig))
user system elapsed
0.007 0.002 0.008
# Number of connected components
> cci$no
[1] 15
# Number of nodes in each component
> cci$csize
[1] 13661 2 2 2 2 2 2 2 2 2 2 2
[13] 2 2 2
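To put the three timings side by side, here is a small sketch that uses only the elapsed values reported above. Note that these timings cover the component computation only; the conversion steps (graph2graphframe, graph_from_graphnel) were not timed, so the ratios are rough.

```r
# Elapsed seconds copied from the system.time() outputs above.
elapsed <- c(graph = 298.959, graphframes = 9.825, igraph = 0.008)

# Speedup of each method relative to graph::connComp.
speedup <- elapsed[["graph"]] / elapsed

print(round(speedup, 1))
```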
One thing we should probably do is compute the connected components for the two graphs (hct116 and hek293T) and store them somewhere. They take a long time to compute on a laptop, and you sort of need them for much graph analysis: most of the algorithms work better (I think) on connected components.
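One way the "compute once and store" idea could look is sketched below with base R's saveRDS/readRDS. The helper `cached_result`, the cache name, and the toy compute function are all hypothetical; in practice the compute function would wrap a call such as graph::connComp(bp.gr), and the cache directory would be somewhere persistent rather than a temporary folder.

```r
# Hypothetical helper: compute a result once, then reuse the stored copy.
cached_result <- function(name, compute, cache_dir = tempdir()) {
  path <- file.path(cache_dir, paste0(name, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# Toy stand-in for an expensive connected-components call.
n_calls <- 0
slow_components <- function() {
  n_calls <<- n_calls + 1
  list(c("A", "B", "C"), c("D", "E"))
}

# Use a fresh directory so the example starts with an empty cache.
cache_dir <- tempfile("cc-cache-")
dir.create(cache_dir)

first  <- cached_result("toy-graph", slow_components, cache_dir)
second <- cached_result("toy-graph", slow_components, cache_dir)

# The second call reads the stored file instead of recomputing.
stopifnot(identical(first, second), n_calls == 1)
```

Keying the file name on the cell line and BioPlex version (for example one file per graph) would keep the stored components for the two networks apart.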