aristoteleo / dynamo-release

Inclusive model of expression dynamics with conventional or metabolic labeling based scRNA-seq / multiomics, vector field reconstruction and differential geometry analyses
https://dynamo-release.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
415 stars 59 forks source link

Debatch before dynamo pipeline running #204

Closed 117999 closed 2 years ago

117999 commented 3 years ago

Hi Dr.Qiu,

I am recently analyzing scEU-seq data and conventional scRNA-seq data with dynamo, where I have found a huge batch effect between samples of different developmental stages or different labelling times. So, I have tried to do BBKNN to debatch before running the dyn.pp.recipe_monocle() My code is as follows:

import bbknn
import scanpy.external as sce
sc.tl.pca(E10X)
bbknn.bbknn(E10X,batch_key='batch')
dyn.pp.recipe_monocle(E10X)

but it reports key error 'X'. The 'X_unspliced' is not calculated.

Then, I tried to run preprocessing and dynamics calculation first

dyn.pp.recipe_monocle(E10X)
dyn.tl.dynamics(E10X, model='stochastic', cores=3,experiment_type='conventional')
sc.tl.pca(E10X)
bbknn.bbknn(E10X,batch_key='batch')
sc.tl.umap(Embryo_10X)
dyn.tl.cell_velocities(Embryo_10X, method='pearson', basis='umap',color='louvain')

where it report TypeError: Unordered Categoricals can only compare equality or not

I have checked a similar procedure on scVelo, and it works. So, I am wondering whether you could give some advice on how to do de-batch before dynamo analysis. Also, the performance of de-batch BBKNN on python is worse than that of reciprocal PCA anchors from Seurat. In this regard, is it possible to do de-batch in Seurat and transport the data into python and run dynamo afterwards?

Thanks, Yuchen

Xiaojieqiu commented 3 years ago

Hi @117999

you need to do batch correction after recipe_monocle because it will need to first normalize the data. something like this:

dyn.pp.recipe_monocle(E10X)
bbknn.bbknn(E10X,batch_key='batch')
dyn.tl.dynamics(E10X, model='stochastic', cores=3)
dyn.tl.reduceDimension(Embryo_10X)
dyn.tl.cell_velocities(Embryo_10X)

You can try to debatch the data in seurat and convert the seurat object to adata object and then do the downstream analysis with dynamo. You can find multiple resources on how to convert seurat object to adata.

I have to comment that the current batch correction methods were designed for total RNA analysis but maybe not ideal for velocity estimation. Though I am a computational biologist, I am NOT a big fan of any batch correction methods as it is very difficult to discern what is the real biological difference and batch effects. In general, I prefer to find solutions to problems experimentally instead of computationally. I would suggest stacking and multiplexing your experiments to avoid potential batch effects directly from your experiment design.

Lastly ensure you have upgraded dynamo

117999 commented 3 years ago

Thanks, Dr Qiu, The solution you have provided does work. I have also tried to transfer the unlabelled dataset (leftover transcriptome) to do integration in Seurat, and transfer back the integrated UMAP back for velocity projection. As you've said, the batch correction is not perfectly compatible with velocity estimation, the performance of velocity projection on batch-corrected UMAP is not ideal. None of the strategies I have tried (BBKNN, Harmony, Scanorama, Seurat CCA) produce optimal projection plot

Xiaojieqiu commented 3 years ago

what do you mean transfer leftover transcriptome to do integration and transfer back?

nonetheless, the umap should be created with the total RNA information.

and yeah, batch correction for velocity will be tricky and it is not my intention to work on this in the foreseeable future for reasons I have mentioned above.

117999 commented 3 years ago

In scEU-seq, there are two libraries, the nascent transcriptome on Beads, and the leftover transcriptome on Supernatant. The way I transfer data is to initially load all the data into python (let's say you have three batches of dataset B1/B2/B3; and S1/S2/S3 [B for Beads, S for supernatant]) By the way, here we use STARsolo to do the spliced splitting.

B2_spliced=sc.read_10X('./2B_100bpSolo.out/spliced')
B2_unspliced=sc.read_10X('./2B_100bpSolo.out/unspliced')
B2_raw=sc.read_10X('./2B_100bpSolo.out/Gene/raw')
B2_raw.layers['spliced']=B2_spliced.X
B2_raw.layers['unspliced']=B2_unspliced.X

similar data loading for B1/B3 and S1/S2/S3, Then we merge transform Beads data as layers of Supernatant

Dynamo2=S2_1_raw3.copy()
Dynamo2.layers['uu']=S2_raw.layers['unspliced']
Dynamo2.layers['ul']=B2_raw.layers['unspliced']
Dynamo2.layers['su']=S2_raw.layers['spliced']
Dynamo2.layers['sl']=B2_raw.layers['spliced']
Dynamo2.layers['new']=Dynamo2.layers['ul']+Dynamo2.layers['sl']
Dynamo2.layers['total']=Dynamo2.layers['ul']+Dynamo2.layers['sl']+Dynamo2.layers['uu']+Dynamo2.layers['su']
del Dynamo2.layers['spliced']
del Dynamo2.layers['unspliced']
del Dynamo2.layers['ul']
del Dynamo2.layers['sl']
del Dynamo2.layers['uu']
del Dynamo2.layers['su']

A similar procedure works for B1/S1 and B3/S3; Afterwards, we concatenate the data and prepare of transferring them to Seurat We applied a self-developed package scDIOR, a cross platform data IO software for single cell RNA-seq data

Dynamo=Dynamo2.concatenate(Dynamo1,Dynamo3,join='outer') 
import anndata
import PyIOH5 as myh5 
ad = anndata.AnnData(X=Dynamo.X,obs=Dynamo.obs,var=Dynamo.var,obsm=Dynamo.obsm)
myh5.output.write_h5(ad, file='Dynamo.h5', save_raw=False)

Next, we process the integration in Seurat and use FindIntegrationAnchors() to do the batch correction

library(Seurat)
library(SingleCellExperiment)
library(hdf5r)
library(RIOH5)
library(dplyr)
dynamo_seurat = read_h5(target.object = 'seurat', file = './Dynamo.h5')

After generating the UMAP of the batch corrected data. We reload them back to python write_h5(data=Dynamo.combined.sct,object.type = 'seurat',file = './Dynamo_integration_r.h5') Then, we run normal dynamo pipeline.

dyn.pp.recipe_monocle(Dyanmo,total_layers=False, n_top_genes=2000)
dyn.tl.dynamics(Dyanmo, model='deterministic',tkey='time',assumption_mRNA='ss',experiment_type='kin')
dyn.tl.reduceDimension(Dyanmo,reduction_method='umap')

Finally, before velocity projection, we substituted the umap generated by dynamo with the one produced in Seurat

Dynamo_integrated= myh5.input.read_h5(file='./Dynamo_integration_r.h5')
Dynamo.obsm['X_umap']=Dynamo_integrated.obsm['X_umap']
dyn.tl.cell_velocities(Dyanmo, enforce=True,vkey='velocity_S',ekey='M_t',basis='umap')
dyn.pl.streamline_plot(Dynamo,color=['bacth'],show_legend=True,pointsize=0.1)

Considering there is no counterpart of layers of adata in seurat object, we thought that the nascent transcriptome on Beads is not processed in Seurat of batch correction, instead, only the leftover transcriptome (Supernatant) are involved in the batch correction in seurat.

Xiaojieqiu commented 3 years ago

thanks for the detailed code, I have comments here and will address your question in the end:

B2_spliced=sc.read_10X('./2B_100bpSolo.out/spliced')
B2_unspliced=sc.read_10X('./2B_100bpSolo.out/unspliced')
B2_raw=sc.read_10X('./2B_100bpSolo.out/Gene/raw')
B2_raw.layers['spliced']=B2_spliced.X
B2_raw.layers['unspliced']=B2_unspliced.X

you may need to make sure the row and column names of B2_spliced/B2_unspliced is the same as B2_raw


Dynamo2=S2_1_raw3.copy()
Dynamo2.layers['uu']=S2_raw.layers['unspliced']
Dynamo2.layers['ul']=B2_raw.layers['unspliced']
Dynamo2.layers['su']=S2_raw.layers['spliced']
Dynamo2.layers['sl']=B2_raw.layers['spliced']
Dynamo2.layers['new']=Dynamo2.layers['ul']+Dynamo2.layers['sl']
Dynamo2.layers['total']=Dynamo2.layers['ul']+Dynamo2.layers['sl']+Dynamo2.layers['uu']+Dynamo2.layers['su']
del Dynamo2.layers['spliced']
del Dynamo2.layers['unspliced']
del Dynamo2.layers['ul']
del Dynamo2.layers['sl']
del Dynamo2.layers['uu']
del Dynamo2.layers['su']

You could you also keep the spliced/unspliced layers and you should set Dynamo2.X = Dynamo2.layers['total'].copy()

dyn.pp.recipe_monocle(Dyanmo,total_layers=False, n_top_genes=2000)
dyn.tl.dynamics(Dyanmo, model='deterministic',tkey='time',assumption_mRNA='ss',experiment_type='kin')
dyn.tl.reduceDimension(Dyanmo,reduction_method='umap')

Please try the dyn.tl.recipe_kin_data(Dyanmo) to run the kinetic data expression model. This is our analysis recipe for kinetics dataset. You can read its documentation for information.

Dynamo_integrated= myh5.input.read_h5(file='./Dynamo_integration_r.h5')
Dynamo.obsm['X_umap']=Dynamo_integrated.obsm['X_umap']
dyn.tl.cell_velocities(Dyanmo, enforce=True,vkey='velocity_S',ekey='M_t',basis='umap')
dyn.pl.streamline_plot(Dynamo,color=['bacth'],show_legend=True,pointsize=0.1)

The above is inaccurate, vkey='velocity_S', should pair with ekey='M_s' and vkey='velocity_T', should pair with ekey='M_t'

Also note that the dyn.tl.recipe_kin_data(Dyanmo) will calculate all the velocity_T' andvelocity_S` for you.

considering your question, when you set Dynamo2.X = Dynamo2.layers['total'].copy(), and save the data to seurat to analyze, it will use the total RNA for the anchoring and batch correction

117999 commented 3 years ago

Hi Dr.Qiu

Thanks for the help. In my previous trials, I kept only the 'new' and 'total' layers, and only the 'Velocity_T' and 'Velocity_N' generated after running dyn.tl.dynamics().

I have tried to preserve the 'sl', 'ul', 'uu', and 'su' layers in later trials, and found the corresponding ekey and vkey yielded.

Still, I am curious about the interpretation of the velocity projection, since the applying of the four different combinations of [vkey and ekey] generated purely four different and even opposite velocity directions (trends).

Sincerely, Yuchen

Xiaojieqiu commented 3 years ago

In general, you should focus on velocity_T (total RNA velocity) and velocity_S (spliced RNA velocity). velocity_N (new RNA velocity) will be also non-negative because it reflected new RNA velocity. Unspliced RNA velocity (velocity_U) only reflected velocity of unspliced RNA and may not be the best representation of cell state.

Also the labeling velocity (velocity_T) should be more accurate while the reversed velocity direction for velocity_S probably reflects some assumption for the RNA velocity estimation on the splicing data is not satisfied. Let me know whether this is the case

117999 commented 3 years ago

Thanks, Dr. Qiu

I am wondering in datasets where Unspliced RNA over-represent in cells (sci-fate and sci-RNA-seq, unspliced% >50%), should we use the velocity_U instead?

Xiaojieqiu commented 3 years ago

it is more common to use spliced RNA or total RNA instead of unspliced RNA to represent the cell state but if the unspliced RNA fraction is very high, it is worth trying to use those to represent cell states.

Nonetheless, RNA velocity using splicing data relies on whether the constant transcription rate assumption is correct. and also since it assumes splicing rate is universally one across all genes, it is relative and the downstream differential geometry analysis from dynamo will be less accurate than the labeling data based velocity.

Xiaojieqiu commented 3 years ago

I wanna quickly add one more key comment: the new RNA data often gives a much better revelation of cell states. If this is the case for your data, you may use velocity_N instead of velocity_T to reveal a better RNA velocity flow. There are some intricacies if you want to do this though, here is a demo on how to do it properly:

adata_labeling.obsm['X_pca'] = adata_labeling.obsm['New_pca'][:, :30] # you need to use the adata_labeling.layer['X_new'] to generate the pca space based on the new data here.  

# delete those so you can recalculate then based on the new RNA's pca information 
for i in ['M_t', 'M_tt', 'M_n', 'M_tn', 'M_nn', 'velocity_N', 'velocity_T']:
    del adata_labeling.layers[i]

dyn.tl.dynamics(adata_labeling, one_shot_method='sci_fate', model='deterministic',  group='time') # if you are using just one-shot experiment 
dyn.tl.reduceDimension(adata_labeling, X_data=adata_filtered[adata_labeling.obs_names].obsm['X_pca'][:, :30], enforce=True) # also use the new pca to do the dimension reduction 

dyn.tl.cell_velocities(adata_labeling,
                       basis='umap',
                       X=adata_labeling.layers['M_n'], # here you need to use M_n and velocity_N to get the new RNA velocity and project it to umap space generated also with new RNA
                       V=adata_labeling.layers['velocity_N'],
                       enforce=True,
                       )
dyn.pl.streamline_plot(
   adata_labeling,
    ncols=4,
    basis='umap',
    show_legend='best'
)