buenrostrolab / FigR

Functional Inference of Gene Regulation
https://buenrostrolab.github.io/FigR/
MIT License
31 stars 10 forks source link

How can I generate the cellKNN.mat? #19

Closed sylestiel closed 1 year ago

sylestiel commented 1 year ago

@vkartha

dorcMat.smooth <- smoothScoresNN(NNmat=cellKNN.mat,mat=dorcMat,nCores=10) Error in nrow(NNmat) : object 'cellKNN.mat' not found

Thanks!

vkartha commented 1 year ago

Hi there, there is a small snippet in the documentation website vignette on how this can be determined. Essentially, use the same dimensionality reduction (PC / LSI) components that were used to derive your 2D projection / cell clusters as input and construct a cell x neighbors (index) matrix:

# If using LSI (ATAC)
# nComponents is usually 30/50 depending on what you computed
nComponents <- 50
lsi <- Signac::atac_small@reductions$lsi@cell.embeddings[,1:nComponents]
set.seed(123)
cellkNN.mat <- FNN::get.knn(lsi,k = 30)$nn.index
dim(cellkNN.mat)
rownames(cellkNN.mat) <- rownames(lsi)
sylestiel commented 1 year ago

Thank you! Does it matter if we are choosing to use LSI or PCA? Having problems running

Screen Shot 2023-01-18 at 2 17 15 PM

dorcMat.smooth <- smoothScoresNN(cellkNN.mat,colnames(dorcMat),nCores=1)

Screen Shot 2023-01-18 at 5 07 38 PM

Can't get the smoothing functions to run dorcMat.s <- smoothScoresNN(cellkNN.mat[,1:30],mat = dorcMat,nCores = 4)

dorcMat.s <- smoothScoresNN(NNmat = cellkNN[,1:30],mat = dorcMat,nCores = 4)

RNAmat.s <- smoothScoresNN(NNmat = cellkNN[,1:30],mat = RNAmat,nCores = 4)

yjchen1201 commented 1 year ago

Thank you for developing this great tool! To follow up @sylestiel 's question-- "does it matter if we are choosing to use LSI or PCA?". Especially when we are dealing with scMultiome data (RNA+ATAC). I followed Signac/Seurat, PCA is for RNA part, LSI is for ATAC part. In this case, how should we choose which one to use? Should we choose based on what we used for getting the final 2D projection UMAP? What if the final 2D projection UMAP/cell clustering are based on WNN corrected result (https://satijalab.org/seurat/articles/weighted_nearest_neighbor_analysis.html)? Do you have any comment/suggestion? Thank you!

vkartha commented 1 year ago

Ya, that's a v good question. In general, since we are doing correlative analysis (tying two assays together), being as unbiased as possible within each assay helps (i.e. smooth each assay by it's own nearest neighbors). Definitely use what led you to getting the 2D projection because in the end I'm assuming you want to tie the scores back to your projected and visualized cell states/clusters (if say you use different number of PCs or a different embedding altogether, you may see weird things on your original UMAP). For WNN, I believe it's possible to smooth both RNA/ATAC by the WNN matrix that is derived:

seu.multi <- FindMultiModalNeighbors(
  seu.multi, reduction.list = list("pca", "lsi"), 
  dims.list = list(1:30, 1:30), modality.weight.name = "RNA.weight",
  k.nn=30
)
nn.wnn = seu.multi@neighbors$weighted.nn

# Smoothing RNA gene expression using WNN neighbors
smoothScores <- smoothScoresNN(NNmat = nn.wnn, mat = seu.multi@assays$SCT@data, geneList = c('GATA2','IRF4','IRF8'))

The issue arises when either of your assays (ATAC/RNA) has a relatively low depth or per-cell yied, so then smoothing the deeper modality by (say) the more shallowly profiled assay can lead to issues (basically your KNN itself is noisy), where as doing the other way around (if multi-modal data from the same single cells) can help. WNN can also suffer in this case, so the short answer is it depends. We did not extensively test the effects of different smoothing techniques / neighborhood sizes etc, but it would be good to hash out in the future. Since we do background computations, I'd like to think it only matters so much, but the main reason we do it is since we utilize correlation coefficient computations, dealing with dropout (0's) helps reduce the noise.

vkartha commented 1 year ago

@sylestiel Can you paste what your cellKNN.mat looks like? (run head(cellKNN.mat). The command that was just preceding that error statement, you've provided cellKNN as the NNmat parameter, but your variable was actually called cellKNN.mat?

sylestiel commented 1 year ago

@vkartha

Screen Shot 2023-01-19 at 3 04 06 PM Screen Shot 2023-01-19 at 3 05 34 PM

The smoothScoresNN script with NNmat=n.wnn is not working. Besides I would like to use the dorcMat and not seu.multi@assays$SCT@data for mat parameter. Also does running the about script with FindMultiModalNeighbors obviate the need for establishing cellkNNmat?

yjchen1201 commented 1 year ago

Ya, that's a v good question. In general, since we are doing correlative analysis (tying two assays together), being as unbiased as possible within each assay helps (i.e. smooth each assay by it's own nearest neighbors). Definitely use what led you to getting the 2D projection because in the end I'm assuming you want to tie the scores back to your projected and visualized cell states/clusters (if say you use different number of PCs or a different embedding altogether, you may see weird things on your original UMAP). For WNN, I believe it's possible to smooth both RNA/ATAC by the WNN matrix that is derived:

seu.multi <- FindMultiModalNeighbors(
  seu.multi, reduction.list = list("pca", "lsi"), 
  dims.list = list(1:30, 1:30), modality.weight.name = "RNA.weight",
  k.nn=30
)
nn.wnn = seu.multi@neighbors$weighted.nn

# Smoothing RNA gene expression using WNN neighbors
smoothScores <- smoothScoresNN(NNmat = nn.wnn, mat = seu.multi@assays$SCT@data, geneList = c('GATA2','IRF4','IRF8'))

The issue arises when either of your assays (ATAC/RNA) has a relatively low depth or per-cell yied, so then smoothing the deeper modality by (say) the more shallowly profiled assay can lead to issues (basically your KNN itself is noisy), where as doing the other way around (if multi-modal data from the same single cells) can help. WNN can also suffer in this case, so the short answer is it depends. We did not extensively test the effects of different smoothing techniques / neighborhood sizes etc, but it would be good to hash out in the future. Since we do background computations, I'd like to think it only matters so much, but the main reason we do it is since we utilize correlation coefficient computations, dealing with dropout (0's) helps reduce the noise.

Thank you @vkartha for your fast reply! I will try this way out :)!

vkartha commented 1 year ago

@sylestiel can you also print head(dorcMat) (looks like you were able to generate the matrix OK). That cellKNN looks correct, it's just a matrix of neighbor indices, each row being a cell barcode that is matched on the colnames of dorcMat internally.

sylestiel commented 1 year ago
Screen Shot 2023-01-20 at 10 09 37 AM

@vkartha Hi suspect it may be the disparity with the object I'm reading in (multi) from Signac as opposed to the object used to obtain dorcMat (ATAC.se object) while the lisi and cellkNN.mat script is based is on the Signac object multi.

Care to help modify the script to address this?

vkartha commented 1 year ago

@sylestiel But it looks to me that the dimensions match (i.e. you're using the exact same cells - which you should be for both cell KNN determination and what is in your dorc matrix)? The reason I force check barcode names is to avoid accidentally using incorrect (or unmatched) cells.

Update: closer inspection, I checked your output and it does look like your dorcMat barcodes are CCGTTATG.... where as your celKNN is AAACAGC.... (so 1 must be ATAC and the other RNA, but better to use the same barcode names for both). I thought 10x data automatically uses scRNA barcodes for both ATAC and RNA. If you know the data is indeed matched (from the same cells), the simplest solution here prior to running the smoothing is to set the rownames to the barcodes that are in your dorcMat, so:

rownames(cellKNN.mat) <- colnames(dorcMat)

and then smooth.

Again the only reason I have these checks for barcode name matching is so that we don't accidentally smooth with the wrong neighborhood matrix or in a matrix that is in a different order (since the cellKNN.mat is just indices)

sylestiel commented 1 year ago

@vkartha Thank you so much!

The last script helped address the smoothing function. I was hoping it would work like the imputation of weights using MAGIC in ArchR. But I see a diffuse enrichment for the queried DORC gene(s) and not regionally concentrated enrichment. Can I increase the Window size from 10kb to 60kb or 100kb around TSS (preferably upstream of TSS) when looking at peak-Gene associations? More Issues: I get the UMAP with the original 'multi 'object but not the 'ATAC.se' object.

Also if I want to alter the layout of the 3D GRN network (more prominent text and brighter color nodes) how do I do that? For the Global regulation profile how can I label the specific geom_points.

Many Thanks!!!!