Spencerfar / LatentVelo


Velocity genes and data preprocessing #3

Open hannanebelung opened 1 year ago

hannanebelung commented 1 year ago

Hi,

I have been trying to use LatentVelo with my data. I managed to install the tool and run it on one of the example datasets from scVelo. However, when I use it with my own data, I run into the following error:

Warning, folder already exists. This may overwrite a previous fit. 
0 velocity genes used
Traceback (most recent call last):
  File "/rna-velocity/src/scRNA_latentvelo.py", line 106, in <module>
    epochs, val_ae, val_traj = ltv.train_anvi(model, adata, batch_size = batch_size,
  File "/lib/python3.9/site-packages/latentvelo-0.1-py3.9.egg/latentvelo/trainer_anvi.py", line 104, in train_anvi
  File "/lib/python3.9/site-packages/latentvelo-0.1-py3.9.egg/latentvelo/models/annot_vae_model.py", line 223, in loss
  File "/lib/python3.9/site-packages/latentvelo-0.1-py3.9.egg/latentvelo/utils.py", line 22, in unique_index
  File "/lib/python3.9/site-packages/latentvelo-0.1-py3.9.egg/latentvelo/utils.py", line 22, in <listcomp>
RuntimeError: max(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.

I suspect the error results from the 0 velocity genes used in the model. Could you explain what role the velocity genes play in the calculation and how they are determined?

For the data preprocessing, I followed the examples in your repository and have tried both ltv.utils.anvi_clean_recipe and ltv.utils.standard_clean_recipe. Would you recommend any other preprocessing steps for the data?
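For reference, the recipes are invoked roughly like this in my script (the keyword arguments are from memory and may not match the repository notebooks exactly):

import scvelo as scv
import latentvelo as ltv

adata = scv.datasets.pancreas()  # placeholder; in my case this is my own AnnData object

# standard recipe for the unannotated model
adata = ltv.utils.standard_clean_recipe(adata)

# or the annotated recipe; the celltype_key argument is a guess based on the notebooks
# adata = ltv.utils.anvi_clean_recipe(adata, celltype_key='clusters')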

Any help would be really appreciated. Your tool looks very interesting, so it would be great to use it!

Spencerfar commented 1 year ago

Generally this happens when the genes are very noisy. I would try something like sc.pp.filter_cells(adata, min_genes=30) to see if that fixes it. I have some updates coming soon that might also help; for example, increasing the number of neighbors when computing the moments for unspliced and spliced counts would also reduce noise. In the examples I also always run scv.pp.filter_genes(adata, min_shared_counts=20) or similar for real data. When you run this, how many genes are left over? If it removes almost all of the genes, then noise is likely the issue.
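Concretely, something like this (the thresholds are just starting points to experiment with, and the moments call is the standard scVelo one in case you are not relying on the recipe to compute them):

import scanpy as sc
import scvelo as scv

sc.pp.filter_cells(adata, min_genes=30)           # drop near-empty cells
scv.pp.filter_genes(adata, min_shared_counts=20)  # drop genes with few shared spliced/unspliced counts
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)   # more neighbors -> smoother moments, less noise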

Alternatively, you can disable the velocity-gene requirement by setting corr_velo_mask=False when specifying the model; then it should run without any velocity genes. You can also set correlation_reg=False alongside this to turn off the part of the model that uses the velocity genes entirely.
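In code that just means adding the flags to the model constructor; the class name and the other arguments below are placeholders, so keep whatever you already use and only add the two flags, and the training call stays the same as in your script:

model = ltv.models.AnnotVAE(   # whichever model class you are already using
    observed=2000,             # placeholder for your other constructor arguments
    corr_velo_mask=False,      # don't require correlation-selected velocity genes
    correlation_reg=False      # optionally also turn off the correlation regularizer
)

epochs, val_ae, val_traj = ltv.train_anvi(model, adata, batch_size=100)  # batch_size as in your script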

hannanebelung commented 1 year ago

Thanks for the suggestions! After running scv.pp.filter_genes(adata, min_shared_counts=20), a large fraction of the genes, approximately 60%, was filtered out, so the data does seem somewhat noisy. However, even with this filtering step it still does not work.
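For what it's worth, this is how I checked the gene counts (just a before/after comparison):

n_before = adata.n_vars
scv.pp.filter_genes(adata, min_shared_counts=20)
print(f'{adata.n_vars} of {n_before} genes kept')  # roughly 40% remain in my case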

I also tried disabling the velocity genes in the VAE model with both methods you suggested. With corr_velo_mask=False or correlation_reg=False, I get either a RuntimeError (RuntimeError: max(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.) or a ValueError (ValueError: With n_samples=1, test_size=0.1 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.). I only disabled the velocity genes in the model definition. Do I need to change anything else in the training as well?

AlePur commented 9 months ago

I had the same problem when I tried to integrate multiple samples that were too different from each other. I would guess that this tool doesn't work for data that is too noisy and complex, and your best bet is running it on a subset of the data -- but this is just my guess, and I have only experimented with LatentVelo for a week.