gao-lab / GLUE

Graph-linked unified embedding for single-cell multi-omics data integration

Question about preprocessing of ATAC data #90

Closed chenc327 closed 1 year ago

chenc327 commented 1 year ago

Thanks for developing such a great tool! I have some questions about the preprocessing of scATAC-seq data. Due to memory limitations, I used ArchR to export the 50,000 most variable peaks as the input matrix. However, I noticed a significant difference between the UMAP computed after processing with scglue.data.lsi and the initial clustering from ArchR: my UMAP is in a very mixed state, and I have difficulty distinguishing the original clusters. Could you please let me know if something might be wrong?
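For reference, the steps I ran are roughly the standard scglue LSI workflow (a minimal sketch; the exact parameter values and the "ArchR_clusters" label column are illustrative):

import scanpy as sc
import scglue

# LSI embedding of the 50,000-peak matrix (component count illustrative)
scglue.data.lsi(atac, n_components=100, n_iter=15)

# Neighbour graph and UMAP on the LSI representation
sc.pp.neighbors(atac, use_rep="X_lsi", metric="cosine")
sc.tl.umap(atac)
sc.pl.umap(atac, color="ArchR_clusters")  # placeholder column holding the ArchR cluster labels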

Jeff1995 commented 1 year ago

Hi @chenc327! Thanks for your interest in GLUE.

I'm not particularly familiar with ArchR. Does it use all peaks or just the 50,000 variable peaks for clustering and UMAP? My experience with ATAC-seq data is that using the complete peak set generally produces better results than a selected subset. If that is the case, could you export all peaks as input?
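If you do export the full peak-by-cell matrix, building the AnnData input could look roughly like this (a sketch assuming a Matrix Market export plus peak and barcode files; the file names are placeholders, not ArchR-specific commands):

import anndata as ad
import pandas as pd
import scipy.io

# Placeholder file names for an exported peak-by-cell matrix
mat = scipy.io.mmread("peak_matrix.mtx").T.tocsr()  # transpose to cells x peaks
peaks = pd.read_csv("peaks.bed", sep="\t", header=None,
                    names=["chrom", "chromStart", "chromEnd"])
barcodes = pd.read_csv("barcodes.txt", header=None)[0]

# Peak names like "chr1:100-500" as var index, keeping coordinate columns for graph construction
atac = ad.AnnData(
    X=mat,
    obs=pd.DataFrame(index=barcodes),
    var=peaks.set_index(
        peaks["chrom"] + ":" + peaks["chromStart"].astype(str) + "-" + peaks["chromEnd"].astype(str)
    ),
)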

One alternative solution is to keep using the 50,000 variable peaks, while also exporting the ArchR cell embeddings and storing them in the obsm slot of your AnnData object. You can then specify the input embedding in the configure_dataset function so that the ArchR embeddings are used, e.g.:

atac.obsm["X_archr"] = archr_embedding

scglue.models.configure_dataset(
    atac, "NB", use_highly_variable=True,
    use_rep="X_archr"
)
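Here archr_embedding could come from, e.g., a CSV exported from ArchR (a sketch; the file name and the alignment by barcode are illustrative):

import pandas as pd

# Hypothetical CSV of ArchR cell embeddings (rows = cell barcodes, columns = dimensions)
archr_df = pd.read_csv("archr_embedding.csv", index_col=0)

# Align rows to the AnnData cell order before storing in obsm
archr_embedding = archr_df.loc[atac.obs_names].to_numpy()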

But the caveat is that not all of the 50,000 variable peaks will be linked to highly variable genes in the other modalities, so you still risk losing information useful for data integration. To get the best results, I would still recommend exporting all peaks as input, if that's possible.
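As a rough way to gauge how much is retained, one could count how many of the variable peaks are connected to a highly variable gene in the guidance graph (a sketch assuming a graph built with scglue.genomics.rna_anchored_guidance_graph as in the tutorials; not part of the original reply):

import scglue

# Assumes `rna` and `atac` are preprocessed AnnData objects with a
# "highly_variable" column in rna.var
guidance = scglue.genomics.rna_anchored_guidance_graph(rna, atac)

hvgs = set(rna.var_names[rna.var["highly_variable"]])
linked = sum(
    any(neighbor in hvgs for neighbor in guidance.neighbors(peak))
    for peak in atac.var_names
)
print(f"{linked} of {atac.n_vars} peaks are linked to highly variable genes")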

Let me know if there are any further problems.

chenc327 commented 1 year ago

You're so kind, and please forgive my delayed response. Yes, I've examined your code for the TF-IDF implementation, and I agree that using only a subset of peaks might not be appropriate, so it's best to avoid recomputing on the subset. Due to computational resource limitations, I'm currently only able to try the alternative approach you mentioned.