Galaxy8172 / scMultiGAN


workflow with multiple batches #1

Open JensFGG opened 1 month ago

JensFGG commented 1 month ago

Hello, I'm playing around with a demo set and so far so good! Great tool!

I have a question on how to manage 15 different batches. Across these batches there are 33k unique reads and 13k intersecting reads. Because of this large dropout I want to impute the missing genes.

I usually run all the QC steps separately per batch and then merge the batches with a batch correction. However, if I also impute each dataset separately per batch, I will still lose most of the genes: if gene x is present in batch 1 but absent from batch 2, there is no way it will be correctly imputed in batch 2, since the model never had the training data to impute it there.

How would you go about this? I'm thinking about the following approach: pad each batch with zeros for the genes it is missing, so that all batches share the same gene set and gene order before imputation.
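Concretely, something like this (a minimal pandas sketch; the `batches` dict, gene names, and counts are toy stand-ins for my real per-batch matrices):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for per-batch cells x genes count matrices.
batches = {
    "batch1": pd.DataFrame(np.random.poisson(1, (5, 3)),
                           columns=["geneA", "geneB", "geneC"]),
    "batch2": pd.DataFrame(np.random.poisson(1, (4, 3)),
                           columns=["geneB", "geneC", "geneD"]),
}

# Fixed, shared gene order: the union of genes over all batches.
all_genes = sorted(set().union(*(df.columns for df in batches.values())))

# Reindex every batch against that gene list; genes a batch never
# measured are filled with zeros, so all matrices end up with the
# same shape and the same column order.
aligned = {name: df.reindex(columns=all_genes, fill_value=0)
           for name, df in batches.items()}
```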

Do you have suggestions/comments on this strategy and how to implement your tool with multiple batches?

Galaxy8172 commented 1 month ago

Hello, first of all, thank you for using our tool. To answer your question: our model is inspired by image processing. We convert the gene expression of each cell in scRNA-seq data into an "image" and then use convolution to capture the interdependencies between genes. However, unlike in traditional image processing, arbitrarily changing the position of genes in scRNA-seq data (e.g., swapping the data in the first row with the tenth row) theoretically does not affect the relationships between genes. During training, though, the model cannot recognize this, so the trained model can only be applied to the training dataset and cannot be transferred to a completely different dataset.
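Schematically, the conversion works roughly like this (a simplified toy sketch, not the exact preprocessing code in this repository):

```python
import numpy as np

def cell_to_image(expr: np.ndarray, side: int) -> np.ndarray:
    """Zero-pad one cell's expression vector and reshape it to side x side."""
    img = np.zeros(side * side, dtype=expr.dtype)
    img[: expr.size] = expr          # pixel i always holds gene i
    return img.reshape(side, side)

expr = np.random.poisson(1.0, size=2000).astype(np.float32)  # one cell, 2000 genes
side = int(np.ceil(np.sqrt(expr.size)))                      # a 45 x 45 grid here
image = cell_to_image(expr, side)                            # CNN input for this cell
```

Because pixel i is always gene i, the mapping from genes to positions is baked into the trained weights; that is exactly why the gene order cannot change between training and imputation.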

Therefore, regarding the batch datasets you mentioned: if you want to use a model trained on one batch to fill in data from other batches, you must ensure that the structure of all batch datasets is consistent, i.e., they contain the same genes and the positions of the genes in the data remain the same. This is what you referred to as adding zero values for missing genes.
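For example, before applying a model trained on one batch to another, it is worth verifying this (assuming both batches are stored as AnnData files; the file names are placeholders):

```python
import anndata as ad

train = ad.read_h5ad("batch1.h5ad")
target = ad.read_h5ad("batch2.h5ad")

# The model ties gene i to a fixed position in the "image", so both
# the gene set and the gene order must match exactly.
assert list(train.var_names) == list(target.var_names), \
    "gene sets or gene order differ; reindex the target batch first"
```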

In my processing, there are two methods:

1. First remove the batch effect, then split the dataset by batch, use one batch as the training set to train the model, and then fill in the data.
2. Use the merged dataset for training, and then use the model to fill in each batch of data.
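As a rough outline of method (1), using scanpy's ComBat for the batch-effect removal (a sketch only; the file names and batch labels are placeholders, and the actual scMultiGAN training/imputation commands are documented in the README):

```python
import scanpy as sc

# Merged file with all 15 batches, genes already aligned as above.
adata = sc.read_h5ad("merged.h5ad")
sc.pp.combat(adata, key="batch")          # remove the batch effect in place

# One batch serves as the training set for the model ...
train = adata[adata.obs["batch"] == "batch1"].copy()
train.write_h5ad("train_batch1.h5ad")     # train scMultiGAN on this file

# ... and every batch is written out for imputation with the trained model.
for b in adata.obs["batch"].unique():
    sub = adata[adata.obs["batch"] == b].copy()
    sub.write_h5ad(f"impute_{b}.h5ad")
```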

Note that the premise of model transferability is that the batch effect has been removed and that the number and order of genes in each batch are consistent.

JensFGG commented 1 month ago

Thank you very much for your fast and elaborate answer. The batch effect will be removed based on the highly variable genes. This, however, will not take the zero counts into consideration, so those zeros remain and might be batch-specific in the merged file. Do you think that could pose a model-fitting problem?