Closed: maren-ha closed this issue 2 years ago.
Hi maren-ha, thank you for your appreciation of our method and paper!
We have not encountered the error you mentioned. I guess it might arise from the input data. Can you check the input scATAC data matrix? As our tool only supports joint profiling datasets, the input scATAC data matrix should contain the same cells as (or at least share cells with) the scRNA data matrix. You can further check the object `dataset` for the RNA cells and ATAC cells after you load your data in Python.
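For example, a quick standalone check along these lines (the file names are placeholders for your own barcode files, not part of scMVP) can confirm the two modalities share cells before you call `LoadData`:

```python
# Hypothetical file names; substitute your own barcode files.
with open("rna_cell_barcode.txt") as f:
    rna_cells = set(f.read().split())
with open("atac_cell_barcode.txt") as f:
    atac_cells = set(f.read().split())

print(len(rna_cells), "RNA cells,", len(atac_cells), "ATAC cells,",
      len(rna_cells & atac_cells), "shared")
assert rna_cells & atac_cells, "no shared cells: not a joint profiling dataset"
```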
And I am sorry that you cannot open the Baidu cloud disk. I have put the sciCAR demo dataset (the other datasets are larger than my Dropbox space) into my Dropbox folder at the link below. The sciCAR demo dataset is the smallest demo dataset, its data format is exactly the same as the "10x pbmc" data you used, and the processing code is also similar.
Can you try the sciCAR data using the following shared Dropbox link?
https://www.dropbox.com/sh/p0wyyeuutzw9je9/AAAD362HYPXDGrjgltXpYmHza?dl=0
Hi, thanks so much for the quick reply! And also many thanks for the sciCAR dataset; this is super helpful!
I ran the model now on the sciCAR dataset following your demo notebook, and the same issue appears. Please find my code below:
```python
import numpy as np
# Imports as in the scMVP demo notebook (module paths may differ in your version):
from scMVP.dataset import LoadData
from scMVP.models import Multi_VAE_Attention
from scMVP.inference import MultiTrainer, MultiPosterior

input_path = 'path_to_sciCAR_demo/'
sciCAR_cellline_dataset = {
    "gene_names": 'sciCAR_cellline_scale_gene.txt',
    "gene_expression": 'sciCAR_cellline_rna_normalize_count.mtx',
    "gene_barcodes": 'sciCAR_cellline_cell_barcode.txt',
    "atac_names": 'sciCAR_cellline_peak.txt',
    "atac_expression": 'sciCAR_cellline_atac_normalize_count.mtx',
    "atac_barcodes": 'sciCAR_cellline_cell_barcode.txt'
}
dataset = LoadData(dataset=sciCAR_cellline_dataset, data_path=input_path,
                   dense=False, gzipped=False, atac_threshold=0.001,
                   cell_threshold=1)

n_epochs = 1       # just to run faster
lr = 1e-3
use_batches = False
use_cuda = False   # I am using CPU
n_centroids = 5
n_alfa = 1.0

# should provide ATAC num, alfa, mode and loss type
multi_vae = Multi_VAE_Attention(dataset.nb_genes, len(dataset.atac_names),
                                n_batch=0, n_latent=20, n_centroids=n_centroids,
                                n_alfa=n_alfa, mode="mm-vae")
trainer = MultiTrainer(
    multi_vae,
    dataset,
    train_size=0.9,
    use_cuda=use_cuda,
    frequency=5,
)
trainer.train(n_epochs=n_epochs, lr=lr)

full = trainer.create_posterior(trainer.model, dataset,
                                indices=np.arange(len(dataset)),
                                type_class=MultiPosterior)
latent, latent_rna, latent_atac, cluster_gamma, cluster_index, batch_indices, labels = \
    full.sequential().get_latent()
```
Now, when I check the shapes of `latent_rna` and `latent_atac`, it gives me:

```python
>>> latent_rna.shape
(4316, 20)
>>> latent_atac.shape
(1360,)
```
I suspect it comes from the issue mentioned above: when appending `latent_temp[2][0].cpu()` to the `latent_atac` list inside `sample_from_posterior_z`, it selects the first row of the `torch.Tensor`, i.e., a vector of length 20. This would fit the shape `(1360,)`, since `len(full.data_loader)` = 68 and 68 batches * 20 latent dimensions = 1360. When I make the change proposed above, i.e., replacing `latent_temp[2][0].cpu()` with just `latent_temp[2].cpu()`, the shapes match.
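For illustration, here is a standalone sketch (plain NumPy, not the repo's code) of why the numbers work out this way:

```python
import numpy as np

# 4316 cells with 20 latent dimensions, split into 68 minibatches.
n_cells, n_latent = 4316, 20
batches = np.array_split(np.zeros((n_cells, n_latent)), 68)

# Appending b[0] (only the FIRST ROW of each minibatch) and concatenating:
first_rows = [b[0] for b in batches]
print(np.concatenate(first_rows).shape)  # (1360,) == 68 batches * 20 dims

# Appending each full minibatch instead keeps one row per cell:
print(np.concatenate(batches).shape)     # (4316, 20)
```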
You are CORRECT! It's an output-dimension bug in our original code, and we have fixed it to `latent_temp[2].cpu()` as you suggested. As we did not use `latent_atac` in the downstream analysis, we missed the error in the `latent_atac` dimension.
Thank you again for your comment!
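For reference, a toy sketch of the corrected collection loop (a paraphrase with stand-in shapes and a fake sampling function, not the exact repo code):

```python
import torch

# Toy stand-in for the per-minibatch return value of sample_from_posterior_z:
# two tuples (joint and RNA latents) and a plain Tensor (ATAC latent).
def fake_sample_from_posterior_z(batch_size=64, n_latent=20):
    z_joint = (torch.zeros(batch_size, n_latent),)
    z_rna = (torch.zeros(batch_size, n_latent),)
    z_atac = torch.zeros(batch_size, n_latent)
    return z_joint, z_rna, z_atac

latent_rna, latent_atac = [], []
for _ in range(68):  # 68 minibatches, as in the example above
    latent_temp = fake_sample_from_posterior_z()
    latent_rna.append(latent_temp[1][0].cpu())  # tuple: [0] picks out the Tensor
    latent_atac.append(latent_temp[2].cpu())    # fixed: append the whole batch, no [0]

print(torch.cat(latent_rna).shape, torch.cat(latent_atac).shape)
# torch.Size([4352, 20]) torch.Size([4352, 20]) -> matching shapes, one row per cell
```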
Hi, super cool method + paper! I'm trying to run the model on another scRNA + scATAC-seq dataset following the steps in the 10x_pbmc_demo. (Unfortunately, I cannot open the Baidu cloud disk, so I cannot reproduce the issue on your data atm.)
Specifically, when I run the demo pipeline up to `full.sequential().get_latent()`, the shapes of `latent_atac` and `latent_rna` don't match: `latent_atac.shape` gives me `(160,)`, while `latent_rna.shape` gives me `(500, 20)` (my dataset has 500 cells).

I've tried to track this down and realised that `get_latent()` calls `sample_from_posterior_z`.
Here, the first two elements of `latent_temp` are of type `tuple`, while the third one is a `torch.Tensor`. Thus, when appending `latent_temp[1][0].cpu()` to the `latent_rna` list, it selects the first element of the tuple, which is a `torch.Tensor` of the correct shape `(500, 20)`. However, when appending `latent_temp[2][0].cpu()` to the `latent_atac` list, it selects the first row of the `torch.Tensor`, i.e., a vector of length 20.

I suppose this goes further down to the `Encoder_nb_selfattention` module for the ATAC encoder, which outputs a tensor, while the `Encoder_nb_attention` module for the RNA encoder outputs a tuple (?)

My quick fix would be to replace
`latent_temp[2][0].cpu()` with just `latent_temp[2].cpu()`, but I'm not a Python expert and don't know whether this messes up something else in the code.

Can you reproduce this behaviour also with your data? I would greatly appreciate any advice on how to handle this - many thanks in advance!
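To illustrate the indexing difference, a tiny standalone example (plain PyTorch, not the repo's code):

```python
import torch

# Stand-ins for latent_temp[1] (a tuple) and latent_temp[2] (a plain Tensor),
# assuming a batch of 4 cells and 20 latent dimensions.
as_tuple = (torch.zeros(4, 20),)   # [0] on a tuple returns its first ELEMENT
as_tensor = torch.zeros(4, 20)     # [0] on a Tensor returns its first ROW

print(as_tuple[0].shape)   # torch.Size([4, 20]) -> whole batch, as intended
print(as_tensor[0].shape)  # torch.Size([20])    -> a single cell's vector
```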