ccb-hms / BioPlex

R-side access to PPI data from Gygi lab
https://ccb-hms.github.io/BioPlex

Pre-compute connected components #16

Closed: lgeistlinger closed this issue 2 years ago

lgeistlinger commented 2 years ago

One thing we should probably do is compute the connected components for the two graphs (HCT116 and HEK293T) and store them somewhere. They take a long time to compute on a laptop, and you more or less need them for much graph analysis: most of the algorithms work better (I think) on connected components.
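A minimal sketch of the "compute once, store, reload" idea, assuming igraph is available and using a toy graph in place of the BioPlex networks. The cache file name and location are hypothetical, not something the package provides:

```r
library(igraph)

# Toy stand-in for a PPI network: two components, {A, B} and {C, D, E}.
g <- graph_from_literal(A - B, C - D - E)

# Compute the connected components once ...
cc <- components(g)

# ... and keep them as a list of node-name vectors, one per component.
cc_nodes <- split(names(cc$membership), cc$membership)

# Hypothetical cache location; a real package would need a persistent store.
cache_file <- file.path(tempdir(), "toy_components.rds")
saveRDS(cc_nodes, cache_file)

# Later sessions can reload the stored components instead of recomputing.
cc_cached <- readRDS(cache_file)
lengths(cc_cached)
```

Storing node names rather than graph objects keeps the cached file small and independent of the graph class used downstream.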

lgeistlinger commented 2 years ago

A small comparison of computing connected components with graphframes, igraph, and Bioconductor's graph package:

> bp.293t <- BioPlex::getBioPlex(cell.line = "293T", version = "3.0")
> bp.gr <- BioPlex::bioplex2graph(bp.293t)
> bp.gr
## A graphNEL graph with directed edges
## Number of Nodes = 13689 
## Number of Edges = 115868

(1) With Bioconductor's graph package:

> system.time(ccgn <- graph::connComp(bp.gr))
   user  system elapsed 
291.250   7.533 298.959 
# Number of components
> length(ccgn)
[1] 15
# Number of nodes in each component
> lengths(ccgn)
 [1] 13661     2     2     2     2     2     2     2     2     2     2     2
[13]     2     2     2

(2) With graphframes (using a local spark connection sc):

> gf <- BioPlexAnalysis::graph2graphframe(bp.gr, sc)
> gf
## GraphFrame
## Vertices:
##   Database: spark_connection
##   $ id       <chr> "P00813", "Q8N7W2", "Q6ZMN8", "P20138", "P55039", "Q17R55", "…
##   $ entrezid <chr> "100", "222389", "645121", "945", "1819", "148109", "54363", …
##   $ symbol   <chr> "ADA", "BEND7", "CCNI2", "CD33", "DRG2", "FAM187B", "HAO1", "…
##   $ isoform  <chr> "P00813", "Q8N7W2-2", "Q6ZMN8", "P20138", "P55039", "Q17R55",…
## Edges:
##   Database: spark_connection
##   $ src  <chr> "P00813", "Q8N7W2", "Q8N7W2", "Q8N7W2", "Q8N7W2", "Q8N7W2", "Q8N7…
##   $ dst  <chr> "A5A3E0", "P26373", "Q09028", "Q9Y3U8", "P36578", "P23396", "Q070…
##   $ pW   <dbl> 6.881844e-10, 1.340380e-18, 7.221401e-21, 7.058372e-17, 1.632313e…
##   $ pNI  <dbl> 1.176357e-04, 2.256645e-01, 6.416690e-05, 1.281827e-01, 2.006379e…
##   $ pInt <dbl> 0.9998824, 0.7743355, 0.9999358, 0.8718173, 0.7993621, 0.9989736,…
> spark_set_checkpoint_dir(sc, tempdir()) 
> system.time(cc <- gf_connected_components(gf))
   user  system elapsed 
  0.056   0.003   9.825
> ccd <- data.frame(cc) 
# Number of nodes in each component
> table(ccd$component)
          0         412  8589934790  8589935412 25769804245 25769804437 
      13661           2           2           2           2           2 
42949673590 51539607674 51539608004 60129542431 60129542493 68719476747 
          2           2           2           2           2           2 
68719476964 68719477146 94489280821 
          2           2           2

(3) With igraph:

> ig <- igraph::graph_from_graphnel(bp.gr)
> system.time(cci <- igraph::components(ig))
   user  system elapsed 
  0.007   0.002   0.008 
# Number of connected components
> cci$no
[1] 15
# Number of nodes in each component
> cci$csize
 [1] 13661     2     2     2     2     2     2     2     2     2     2     2
[13]     2     2     2
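
Since nearly all nodes fall into one large component, downstream analysis will often want just that component as its own graph. A rough sketch of how this could be done with igraph, shown on a toy graph rather than the BioPlex network (for a directed graph such as the one converted from `bp.gr`, `igraph::components()` defaults to weakly connected components):

```r
library(igraph)

# Toy graph with two components: {A, B} and {C, D, E}.
g <- graph_from_literal(A - B, C - D - E)

cc <- components(g)

# Index of the component with the most nodes.
largest <- which.max(cc$csize)

# Vertex ids belonging to that component.
keep <- which(cc$membership == largest)

# Subgraph restricted to the largest component, edges included.
g_main <- induced_subgraph(g, keep)

vcount(g_main)
ecount(g_main)
```

The same steps should apply to `ig` from the igraph timing above, with `cci` taking the place of `cc`.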