Closed chenc327 closed 1 year ago
Hi @chenc327! Thanks for your interest in GLUE.
I'm not particularly familiar with ArchR. Does it use all peaks or just the 50,000 variable peaks to do clustering and UMAP? My experience with ATAC-seq data is that using the complete peak set generally produces better results than using only a selected subset. If that is the case, could you export all peaks as input?
One alternative is to keep using the 50,000 variable peaks while also exporting the ArchR cell embeddings and storing them in the obsm slot of your AnnData object. You can then specify the input embedding in the configure_dataset function to use the ArchR embeddings, e.g.:
atac.obsm["X_archr"] = archr_embedding
scglue.models.configure_dataset(
    atac, "NB", use_highly_variable=True,
    use_rep="X_archr"
)
But the caveat is that not all of the 50,000 variable peaks will be linked to highly variable genes in other modalities, so you still risk losing information useful for data integration. To get the best results, I would still recommend exporting all peaks as input, if that's possible.
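One practical detail when storing an external embedding in obsm: its rows have to follow the same cell order as atac.obs_names. Below is a minimal sketch of that alignment step using only NumPy. The helper name align_embedding and the toy cell names are hypothetical and are not part of scglue or ArchR. It also assumes the exported cell names have already been reformatted to match the AnnData object (ArchR typically labels cells as sample#barcode, so some renaming may be needed).

```python
import numpy as np


def align_embedding(obs_names, emb_names, emb):
    """Reorder the rows of `emb` (labelled by `emb_names`) to follow `obs_names`.

    Hypothetical helper for illustration; not part of scglue or ArchR.
    """
    emb = np.asarray(emb, dtype=float)
    if emb.shape[0] != len(emb_names):
        raise ValueError("emb must have one row per entry in emb_names")
    # Map each cell name to its row index in the exported embedding.
    row_of = {name: i for i, name in enumerate(emb_names)}
    missing = [name for name in obs_names if name not in row_of]
    if missing:
        raise KeyError(f"{len(missing)} cells have no embedding, e.g. {missing[0]}")
    # Fancy indexing returns a reordered copy; cells absent from obs_names are dropped.
    return emb[[row_of[name] for name in obs_names]]


# Toy data standing in for an exported ArchR embedding (assumed values).
obs_names = ["cellA", "cellB", "cellC"]   # order of cells in the AnnData object
emb_names = ["cellC", "cellA", "cellB"]   # order of cells in the exported file
emb = np.array([[3.0, 3.5], [1.0, 1.5], [2.0, 2.5]])

aligned = align_embedding(obs_names, emb_names, emb)
```

With a real dataset, the result would be assigned as atac.obsm["X_archr"] = align_embedding(atac.obs_names, emb_names, emb) before calling configure_dataset. Raising an error on missing cells, rather than silently filling in zeros, makes a name mismatch visible early.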
Let me know if there are any further problems.
You're so kind, and please forgive me for my delayed responses during this period. Yes, I've examined your code related to the TF-IDF implementation, and I agree that using only a subset of peaks might not be appropriate. Therefore, I believe it's best to avoid recomputation. Due to computational resource limitations, I'm currently only able to try the alternative methods you mentioned.
Thanks for developing such a great tool! I have a question about the preprocessing of scATAC data. Due to memory limitations, I used ArchR to output 50,000 variable peaks as the input matrix. However, I noticed a significant difference between the UMAP representation after processing with scglue.data.lsi and the initial clustering based on ArchR. My UMAP is very mixed, and I am having difficulty distinguishing the original clusters. Could you please let me know what might be wrong?