KevinMenden / scaden

Deep Learning based cell composition analysis with Scaden.
https://scaden.readthedocs.io
MIT License
71 stars 25 forks

Simulate distribution #89

Closed khkk378 closed 1 year ago

khkk378 commented 3 years ago

As it is now, you sample random fractions for each cell type. I wonder if it would be more efficient to sample based on the prior distributions in the training data, or at least to have this as a setting, maybe just from a normal distribution around the current proportions. In most cases the training data would be rather similar to the bulk: a large fraction of kidney cells would be tubular, liver would be hepatocytes, heart would be cardiomyocytes, and so on. Just a suggestion :)
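To make the idea concrete, here is a rough sketch of the two sampling modes. It is not Scaden's actual code: the function names, the `sd` parameter, and the example proportions are all made up for illustration, and normalising uniform draws is only an assumed stand-in for the current random sampling.

```python
import random


def random_fractions(n_cell_types, rng):
    """Unbiased mode (assumed stand-in): uniform draws normalised to sum to 1."""
    raw = [rng.random() for _ in range(n_cell_types)]
    total = sum(raw)
    return [x / total for x in raw]


def prior_fractions(observed, rng, sd=0.1):
    """Prior mode: normal draws centred on the observed proportions.

    `observed` holds the cell type proportions seen in the scRNA-seq data,
    `sd` (hypothetical parameter) controls how far samples may drift from them.
    """
    # Negative draws are clipped to 0, since a negative fraction is meaningless.
    raw = [max(0.0, rng.gauss(p, sd)) for p in observed]
    total = sum(raw)
    if total == 0.0:
        # Every draw was clipped; fall back to the observed proportions.
        total_observed = sum(observed)
        return [p / total_observed for p in observed]
    return [x / total for x in raw]


rng = random.Random(0)
observed = [0.7, 0.2, 0.1]  # hypothetical proportions for three cell types
print(random_fractions(len(observed), rng))
print(prior_fractions(observed, rng))
```

A larger `sd` moves the prior mode closer to unbiased sampling, so it could serve as the knob for how much to trust the proportions in the single-cell data.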

KevinMenden commented 3 years ago

Hi @khkk378 ,

yes, that might make sense as an additional option. We intentionally didn't do it because it introduces some bias into the training set. If you have only one dataset for data simulation and its cell type distribution is unusual, that could be problematic. Also, scRNA-seq data is not the best tool for estimating cell type fractions, and sometimes the cells have been selected beforehand as well.

So it would be an interesting thing to try as an option, but I believe the default should still be random fractions.

But if you want to cook up a PR, I would be happy to include that :)