ofircohn commented 3 months ago

Hey!

Thank you for this great package! I ran my multiomic data with the stream pipeline and found that it got stuck when generating igraphs.

Specifically in the part below. Any idea?. (I used >10 cores).

Make graphs

return( pbmcapply::pbmclapply(obj.list, function(x) { x.signac <- signac.links[signac.links$node1 %in% x$peaks,] if (nrow(x.signac) < 1) { return(NULL) } if (ifWeighted) { x.signac <- cbind(x.signac, weight = sapply(1:nrow(x.signac), function(i) { xx <- x.signac[i,] sum(atac.dis[xx$node1, x$cells] > 0 & rna.dis[xx$node2, x$cells] > 0) }) ) %>% dplyr::filter(weight > 0) } x.cicero <- coaccess.links[coaccess.links$node1 %in% x$peaks & coaccess.links$node2 %in% x$peaks,] if (ifWeighted) { x.cicero <- cbind(x.cicero, weight = sapply(1:nrow(x.cicero), function(i) { xx <- x.cicero[i,] sum(atac.dis[xx$node1, x$cells] > 0 & atac.dis[xx$node2, x$cells] > 0) }) ) %>% dplyr::filter(weight > 0) }

ig <- igraph::graph_from_data_frame(rbind(x.signac, x.cicero), directed = F)
if (ifWeighted) {
  igraph::E(ig)$weight <- length(x$cells) - c(x.signac$weight, x.cicero$weight)
}

Thank you!

YangLi-Bio commented 2 months ago

Hello!

Thank you for your interest in our STREAM program.

We understand that constructing a weighted heterogeneous graph based on large datasets can be computationally expensive. This is primarily because STREAM needs to process all enhancer-enhancer and enhancer-gene relationships, leading to a significant computational load.

To address this challenge, we employ two strategies. First, we filter highly variable genes using the parameter var.genes and top-ranked peaks with the parameter top.peaks. Second, we construct an unweighted heterogeneous graph. Our performance tests indicate that edge weights have minimal impact on accuracy, with the Steiner forest problem model predominantly determining the prediction of core enhancer-enhancer and enhancer-gene relationships.

To further enhance computational efficiency, we are rewriting the graph construction steps in C++, including integration with Signac and Cicero. We anticipate that this modification will substantially accelerate the process.

We welcome you to continue following and utilizing the STREAM program!

Best regards,

ofircohn commented 1 month ago

Hey! Thank you for this detailed answer! Meanwhile, I reinstalled the package and reduced the number of variable genes to 3000. However, it still seems to get stuck at the same step:

1 clusters are written to rna_counts.txt.chars.blocks [3068.795 seconds elapsed] End QUBIC biclustering ===================

|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s Read co-conditioned cells (n = 25685) in 1 biclusters |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s Read co-regulated genes (n = 3) in 1 biclusters pearson is used to calculate enhancer-gene relations ...

Testing 3000 genes and 240099 peaks Registered S3 method overwritten by 'SeuratDisk': method from as.sparse.H5Group Seurat

Should I generate graphs with a job array to accelerate this?

Thanks!

OSU-BMBL / STREAM

Slow graphs generation #1

Make graphs