gao-lab / GLUE

Graph-linked unified embedding for single-cell multi-omics data integration
MIT License

Some questions about scclue #25

Closed HelloWorldLTY closed 2 years ago

HelloWorldLTY commented 2 years ago

Hi Dr. Cao, I'm back with a confusing problem after re-running the CLUE code published by the single cell problems organizers. I thought this code would yield cell embeddings with dim=50 if I set the latent dim to 50. However, even with that setting I still get dim=100 embeddings, which matches the dimensionality of the network input for this model.

Is there anything wrong with my understanding? Thanks a lot.

Jeff1995 commented 2 years ago

Thanks for the question! The CLUE model uses modality-specific latent subspaces, which are concatenated to form a complete latent space. The latent_dim parameter sets the subspace dimensionality, so the complete latent space would be 50 * 2 = 100 dimensional.
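To make the arithmetic concrete, here is a minimal NumPy sketch of two 50-dimensional subspaces being concatenated into a 100-dimensional embedding. It does not call the scglue API; the array names, the random values, and the cell count are purely illustrative assumptions.

```python
import numpy as np

latent_dim = 50  # per-modality subspace size, as set in the question
n_cells = 4      # arbitrary toy value

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the two modality-specific latent subspaces.
rna_latent = rng.normal(size=(n_cells, latent_dim))
atac_latent = rng.normal(size=(n_cells, latent_dim))

# Concatenating along the feature axis gives the complete latent space.
full_latent = np.concatenate([rna_latent, atac_latent], axis=1)
print(full_latent.shape)
```

With two modalities, the number of columns is `2 * latent_dim`, which is why setting the latent dimension to 50 yields 100 columns.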

HelloWorldLTY commented 2 years ago

So does this mean that dimensions [0:50] of the latent space come from scRNA-seq, while dimensions [50:100] come from scATAC-seq? If I want to test the effect of multi-omics data integration using the multimodal labels, what should I do with scclue? Thanks a lot.

Jeff1995 commented 2 years ago

There is a matrix of encoders in play, including:

  1. RNA data -> RNA latent [0:50]
  2. RNA data -> ATAC latent [50:100]
  3. ATAC data -> RNA latent [0:50]
  4. ATAC data -> ATAC latent [50:100]

When both RNA and ATAC data are observed in a multimodal cell, the RNA latent [0:50] is computed by taking the mean of the RNA latent from both RNA and ATAC data (encoder 1 & 3). The same goes for the ATAC latent [50:100] (encoder 2 & 4).

When only one modality, e.g., the RNA data, is observed in a unimodal cell, the RNA latent [0:50] is computed only from the available RNA data (encoder 1). The same goes for the ATAC latent [50:100] (encoder 2).
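The combination rule described above can be sketched in plain NumPy. This is an illustration only, not the scglue implementation: the encoders are hypothetical random linear maps, the input size of 100 features per modality is an assumption, and the names `make_encoder`, `encoders`, and `embed` are invented for the example.

```python
import numpy as np

LATENT_DIM = 50   # per-modality subspace size
N_FEATURES = 100  # assumed toy input size for both modalities

def make_encoder(in_dim, out_dim, seed):
    """Hypothetical stand-in encoder: a fixed random linear map."""
    w = np.random.default_rng(seed).normal(size=(in_dim, out_dim))
    return lambda x: x @ w

# The 2 x 2 "matrix of encoders", keyed by (source data, target latent).
encoders = {
    ("rna", "rna"): make_encoder(N_FEATURES, LATENT_DIM, 1),    # encoder 1
    ("rna", "atac"): make_encoder(N_FEATURES, LATENT_DIM, 2),   # encoder 2
    ("atac", "rna"): make_encoder(N_FEATURES, LATENT_DIM, 3),   # encoder 3
    ("atac", "atac"): make_encoder(N_FEATURES, LATENT_DIM, 4),  # encoder 4
}

def embed(observed):
    """Map {modality: (n_cells, n_features) array} to an (n_cells, 100) embedding."""
    blocks = []
    for target in ("rna", "atac"):
        # One estimate of this subspace per observed modality.
        estimates = [encoders[(src, target)](x) for src, x in observed.items()]
        # Average over whichever modalities are available.
        blocks.append(np.mean(estimates, axis=0))
    # RNA latent occupies [0:50], ATAC latent occupies [50:100].
    return np.concatenate(blocks, axis=1)

rna = np.ones((3, N_FEATURES))
atac = np.full((3, N_FEATURES), 2.0)

z_multi = embed({"rna": rna, "atac": atac})  # multimodal cells
z_uni = embed({"rna": rna})                  # RNA-only cells
print(z_multi.shape, z_uni.shape)
```

In both cases the result has 100 columns: a unimodal cell still gets an ATAC latent block, it is just estimated from the RNA data alone.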

Could you clarify what kind of data you are trying to evaluate the integration on (multimodal / unimodal)? I think in most cases the concatenated representation [0:100] should be used for CLUE.

HelloWorldLTY commented 2 years ago

Hi, my data contain only scRNA + scATAC data. I intend to use scclue to generate a joint embedding of the data. Therefore, I think I need to use [0:50] as the RNA embedding and [50:100] as the ATAC embedding. Is this correct? Thanks a lot.

I cannot use glue because of the bedtools problem.

Jeff1995 commented 2 years ago

Do you mean that the data is unpaired scRNA-seq and scATAC-seq data?

If all cells are unpaired, then CLUE is not applicable (it relies on a subset of paired multimodal cells to train). You can only use GLUE in this case. You will be able to set a custom bedtools path in the next release (coming soon!)

If the cells are paired, then CLUE can be used, and you should use the full [0:100] embedding to represent each cell.

HelloWorldLTY commented 2 years ago

Oh, I think I got your ideas. Many thanks for your reply!

Jeff1995 commented 2 years ago

Great! Let me know if there are any further issues!