bm2-lab / scMVP


Shape of 'latent_atac' doesn't match that of 'latent_rna' #3

Closed maren-ha closed 2 years ago

maren-ha commented 2 years ago

Hi, super cool method + paper! I'm trying to run the model on another scRNA+scATAC-seq dataset following the steps in the 10x_pbmc_demo. (Unfortunately, I cannot open the Baidu cloud disk so I cannot reproduce the issue on your data atm)

Specifically, when I run

full = trainer.create_posterior(trainer.model, dataset, indices=np.arange(len(dataset)),type_class=MultiPosterior)
latent, latent_rna, latent_atac, cluster_gamma, cluster_index, batch_indices, labels = full.sequential().get_latent()

the shapes of the latent_atac and latent_rna don't match. latent_atac.shape gives me (160,), while latent_rna.shape gives me (500, 20) (my dataset has 500 cells).

I've tried to track this down and realised that get_latent() calls sample_from_posterior_z here:

latent_temp = sample_from_posterior_z(full.model,
                [sample_batch_rna, sample_batch_atac], y=label, give_mean=True
            )

Here, the first two elements of latent_temp are of type <class 'tuple'>, while the third one is a <class 'torch.Tensor'>. Thus, when appending latent_temp[1][0].cpu() to the latent_rna list, it selects the first element of the tuple, which is a torch.Tensor of the correct shape (500,20). However, when appending latent_temp[2][0].cpu() to the latent_atac list, it selects the first row of the torch.Tensor, i.e., a vector of length 20.

I suppose this traces back to the Encoder_nb_selfattention module used for the ATAC encoder, which outputs a plain tensor, whereas the Encoder_nb_attention module used for the RNA encoder outputs a tuple (?)

My quick fix would be to replace latent_temp[2][0].cpu() with just latent_temp[2].cpu(), but I'm not a Python expert and don't know whether this messes up something else in the code.
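
To illustrate what I mean, here is a minimal standalone sketch (the shapes and the tuple-vs-tensor return types are my assumptions from stepping through get_latent(), not the actual encoder code):

import torch

rna_out = (torch.zeros(500, 20), torch.zeros(500, 20))  # RNA encoder output behaves like a tuple of tensors
atac_out = torch.zeros(500, 20)                         # ATAC encoder output behaves like a bare tensor

print(rna_out[0].shape)    # torch.Size([500, 20]): [0] selects the first tuple element, the full matrix
print(atac_out[0].shape)   # torch.Size([20]): [0] selects the first row, i.e. a single cell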

Can you reproduce this behaviour with your data as well? I would greatly appreciate any advice on how to handle this - many thanks in advance!

adamtongji commented 2 years ago

Hi maren-ha, thank you for your kind words about our method and paper!

We did not encounter the error you mentioned. I guess it might arise from the input data. Can you check the input scATAC data matrix? Since our tool only supports joint profiling datasets, the input scATAC matrix should contain the same cells as (or at least share cells with) the scRNA matrix. You can also check the "dataset" object for the RNA cells and ATAC cells after loading your data in Python.
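
For example, a minimal sanity check could look like the following (the barcode file names are only placeholders for your own input files, not files shipped with our package):

import numpy as np

rna_barcodes = np.loadtxt('rna_cell_barcode.txt', dtype=str)    # placeholder file name
atac_barcodes = np.loadtxt('atac_cell_barcode.txt', dtype=str)  # placeholder file name
print(len(rna_barcodes), len(atac_barcodes))                    # should report the same number of cells
print(np.array_equal(rna_barcodes, atac_barcodes))              # should be True for a joint profiling dataset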

I am also sorry that you cannot open the Baidu cloud disk. I have put the sciCAR demo dataset (the other datasets are larger than my Dropbox space) into my Dropbox folder at the link below. The sciCAR demo is the smallest demo dataset, its data format is exactly the same as the "10x pbmc" data you used, and the processing code is also similar.

Can you try the sciCAR data using the following shared Dropbox link?

https://www.dropbox.com/sh/p0wyyeuutzw9je9/AAAD362HYPXDGrjgltXpYmHza?dl=0

maren-ha commented 2 years ago

Hi, thanks so much for the quick reply! And also many thanks for the sciCAR dataset, this is super helpful!

I ran the model now on the sciCAR dataset following your demo notebook, and the same issue appears. Please find my code below:

input_path='path_to_sciCAR_demo/'
sciCAR_cellline_dataset = {
                "gene_names": 'sciCAR_cellline_scale_gene.txt',
                "gene_expression": 'sciCAR_cellline_rna_normalize_count.mtx',
                "gene_barcodes": 'sciCAR_cellline_cell_barcode.txt',
                "atac_names": 'sciCAR_cellline_peak.txt',
                "atac_expression": 'sciCAR_cellline_atac_normalize_count.mtx',
                "atac_barcodes": 'sciCAR_cellline_cell_barcode.txt'
                }
dataset = LoadData(dataset=sciCAR_cellline_dataset,data_path=input_path,
                       dense=False,gzipped=False, atac_threshold=0.001,
                       cell_threshold=1)
n_epochs = 1 # just to run faster
lr = 1e-3
use_batches = False
use_cuda = False # I am using CPU 
n_centroids = 5 
n_alfa = 1.0

multi_vae = Multi_VAE_Attention(dataset.nb_genes, len(dataset.atac_names), n_batch=0, n_latent=20, n_centroids=n_centroids, n_alfa = n_alfa, mode="mm-vae") # should provide ATAC num, alfa, mode and loss type
trainer = MultiTrainer(
    multi_vae,
    dataset,
    train_size=0.9,
    use_cuda=use_cuda,
    frequency=5,
)
trainer.train(n_epochs=n_epochs, lr=lr)
full = trainer.create_posterior(trainer.model, dataset, indices=np.arange(len(dataset)),type_class=MultiPosterior)
latent, latent_rna, latent_atac, cluster_gamma, cluster_index, batch_indices, labels = full.sequential().get_latent()

Now, when I check the shape of latent_rna and latent_atac, it gives me:

>>> latent_rna.shape
(4316, 20)
>>> latent_atac.shape
(1360,)

I suspect it comes from the issue mentioned above: appending latent_temp[2][0].cpu() to the latent_atac list collects only the first row of the torch.Tensor for each mini-batch, i.e., a vector of length 20. This would explain the shape (1360,), since len(full.data_loader) = 68 and 68 × 20 = 1360. When I make the change proposed above, i.e., replacing latent_temp[2][0].cpu() with just latent_temp[2].cpu(), the shapes match.
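
To show where the flat shape comes from, here is a rough sketch of the shapes involved (the mini-batch size of 64 is an assumption; the last batch is smaller in practice, which is why the fixed version gives 4316 rather than 68 * 64 rows):

import numpy as np

rows = [np.zeros(20) for _ in range(68)]        # one length-20 row per mini-batch (current behaviour)
print(np.concatenate(rows).shape)               # (1360,), i.e. 68 * 20
mats = [np.zeros((64, 20)) for _ in range(68)]  # one (batch_size, 20) matrix per mini-batch (proposed fix)
print(np.concatenate(mats).shape)               # (4352, 20); with a smaller final batch this becomes (4316, 20)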

adamtongji commented 2 years ago

You are CORRECT! It is an output dimension bug in our original code, and we have fixed it to "latent_temp[2].cpu()" as you suggested. Since we did not use "latent_atac" in the downstream analysis, we missed the dimension error in "latent_atac".
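
For reference, the fixed collection behaves roughly like this (a self-contained sketch paraphrased from this thread; the batch size and the exact loop structure are assumptions, not a copy of the repository code):

import torch

latent_rna, latent_atac = [], []
for _ in range(68):                                    # one pass per mini-batch (68 batches, batch size 64 assumed)
    latent_temp = (
        (torch.zeros(64, 20), torch.zeros(64, 20)),    # joint latent: a tuple in this sketch
        (torch.zeros(64, 20), torch.zeros(64, 20)),    # RNA latent: a tuple, so [0] picks the tensor
        torch.zeros(64, 20),                           # ATAC latent: a bare tensor
    )
    latent_rna += [latent_temp[1][0].cpu()]            # unchanged line
    latent_atac += [latent_temp[2].cpu()]              # fixed line (previously latent_temp[2][0].cpu())
print(torch.cat(latent_rna).shape)                     # torch.Size([4352, 20])
print(torch.cat(latent_atac).shape)                    # torch.Size([4352, 20]) after the fix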

Thank you again for your comment!