How to simulate dataset with several batches?

ZJUFanLab / scCube

An SRT simulator for simulating multiple types of spatial variability in spatially resolved transcriptomics and generating unbiased simulated SRT data

GNU General Public License v3.0


How to simulate dataset with several batches? #2

Open VivLon opened 1 month ago

VivLon commented 1 month ago

Hi,

I have two questions:

1. I have an scRNA-seq dataset with 3 batches and I want to simulate a spatial dataset that preserves the batch effect. Is it better to simulate each batch separately or to simulate the entire dataset directly?
2. Is the output matrix of scCube logcounts? Do I need further log-normalization before analysis?

Thank you very much!

ForwardYang98 commented 1 month ago

Hi, sorry for the delayed response. I was unfortunately infected with COVID-19 last week, so I didn't keep up with the activity on GitHub. For your questions:

1. I have not previously simulated datasets with batch effects, but I surmise that both of the approaches you mentioned should be able to preserve the batch effect. If you have any results to share, I would be most interested in seeing them!
2. The output matrix of scCube is logcounts, so you don't need further log-normalization before analysis.

VivLon commented 1 month ago

Sorry to hear that. Hope you are feeling better now.

For the questions:

1. I personally chose to simulate each batch separately, so that I get separate spatial coordinates for each batch instead of one aligned set of spatial coordinates for all batches.
2. I tried selecting 2000 highly variable genes from the simulated data without log-normalization, but it fails with "ValueError: cannot specify integer bins when input data contains infinity". The answer in https://github.com/scverse/scanpy/issues/2242 recommends log-normalizing the data. I did so and the error no longer appears. Any suggestions?
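A minimal sketch of one likely cause of this error, using NumPy only. It assumes that the highly-variable-gene step treats its input as log1p-transformed and applies `expm1` before computing per-gene means (my understanding of scanpy's `seurat` flavor, not verified against a specific version). The helper names `library_size_log_normalize` and `gene_means_after_expm1` are hypothetical and are not part of scanpy or scCube.

```python
import numpy as np

def library_size_log_normalize(counts, target_sum=1e4):
    """Scale each cell (row) to target_sum total counts, then apply log1p."""
    counts = np.asarray(counts, dtype=np.float64)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / totals * target_sum)

def gene_means_after_expm1(X):
    """Assumed HVG step: undo log1p with expm1, then average per gene (column)."""
    with np.errstate(over="ignore"):  # silence the overflow warning for the demo
        return np.expm1(np.asarray(X, dtype=np.float64)).mean(axis=0)

rng = np.random.default_rng(0)
raw = rng.poisson(5.0, size=(50, 20)).astype(np.float64)  # toy cells x genes matrix
raw[0, 0] = 5000.0  # a single large raw count is enough to overflow expm1

means_raw = gene_means_after_expm1(raw)
means_log = gene_means_after_expm1(library_size_log_normalize(raw))

print(np.isinf(means_raw).any())  # infinite mean when raw counts are passed in
print(np.isinf(means_log).any())  # finite means after log-normalization
```

If the simulated matrix contains large, count-like values, the per-gene means become infinite under this assumption, and binning them would fail in the way the error message describes. That would also suggest the simulated output was not log-normalized in the first place.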

ForwardYang98 commented 1 month ago

Thanks for your feedback. By default, scCube's VAE framework takes log-normalized data as input, and as a result the simulated data it generates is also log-normalized. You can check whether the data is log-normalized before training the VAE model. Please let me know if you have any further questions or concerns.
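One rough way to run that check is sketched below. It is a heuristic, not part of scCube: the function `looks_log_normalized` and its `max_log_value` threshold are assumptions made for illustration. It treats a matrix as raw counts if every entry is an integer, and as plausibly log-normalized if it has non-integer entries and a small maximum.

```python
import numpy as np

def looks_log_normalized(X, max_log_value=20.0):
    """Heuristic check (assumption): log-normalized data is finite, non-negative,
    contains non-integer values, and has a small maximum."""
    X = np.asarray(X, dtype=np.float64)
    if not np.isfinite(X).all() or (X < 0).any():
        return False
    all_integers = bool(np.all(np.mod(X, 1) == 0))  # raw counts are whole numbers
    return (not all_integers) and bool(X.max() <= max_log_value)

rng = np.random.default_rng(0)
raw_counts = rng.poisson(5.0, size=(50, 20)).astype(np.float64)  # toy raw counts
logged = np.log1p(raw_counts)  # toy log-transformed matrix

print(looks_log_normalized(raw_counts))  # integer-valued, so treated as raw counts
print(looks_log_normalized(logged))      # non-integer and small, so plausibly logged
```

Running the same check on both the input expression matrix and the simulated output shows whether the two are on the same scale before any downstream normalization is applied.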