jonathandesmedt92 opened this issue 3 years ago
Jonathan, thanks for your interest in our simulator.
First, note that if your data is affected by technical noise, it becomes harder to reliably estimate the master regulators' basal expression from the data. You might have noticed that we used a seqFISH dataset in our paper, which is known to be less affected by technical noise such as dropout. In general there are two cases:

1) You have a dataset for which you don't know a reliable GRN, and you only want to generate synthetic data with an arbitrary GRN that resembles the real data in some statistics. All you need to do is parameterize your GRN arbitrarily and then add technical noise such that some summary statistics match between the real and synthetic data (this is what we did in the first part of our paper; a minimal sketch of such a comparison is given after this list).
2) You have a dataset for which you know a reliable GRN, and therefore you know the master regulators. In this case, if you want to reliably reproduce the real data, you first need to reduce the technical noise in the real data. If your data comes from seqFISH, it is already relatively clean and you are good to go; otherwise, you first need to impute the data with an existing denoising algorithm.
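For concreteness, here is a minimal sketch of that kind of summary-statistic comparison (the matrices `real_counts` and `synth_counts`, genes x cells, are hypothetical placeholders for your own data, not part of the simulator's API):

```python
# Compare per-gene summary statistics between a real and a synthetic
# count matrix (genes x cells). `real_counts` and `synth_counts` are
# hypothetical placeholders for your own matrices.
import numpy as np

def summary_stats(counts):
    """Per-gene mean, variance, and fraction of zeros."""
    return {
        "mean": counts.mean(axis=1),
        "var": counts.var(axis=1),
        "zero_frac": (counts == 0).mean(axis=1),
    }

real = summary_stats(real_counts)
synth = summary_stats(synth_counts)

# Tune the technical-noise parameters until these distributions agree,
# e.g. on a mean-variance plot or by comparing quantiles as below.
for key in real:
    print(key,
          np.percentile(real[key], [25, 50, 75]),
          np.percentile(synth[key], [25, 50, 75]))
```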
I am assuming that you are working on the second problem. If that's the case, here is what I suggest:

a) Impute the data if it's not clean enough.
b) Estimate the master regulators' basal expression from raw count data (before TPM normalization and before log normalization). Since some imputation algorithms already perform library-size normalization, you can project the imputed data back to the original raw count space by applying reverse TPM normalization to each imputed single cell (see the sketch after this list).
c) Simulate the data using the estimated values.
d) Now it's time to add technical noise. To do so, compare the simulated data with the original data (not the imputed one): add technical noise to the raw simulated data and compare it with the original data. I suggest doing this comparison in logTPM space. This means you add technical noise to the raw simulated data, then apply library-size normalization and a log transform to both the real (raw count) data and the simulated data (after the technical noise is added), and then compare the distributions.
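A minimal sketch of the reverse projection in step (b), assuming the imputation returned counts-per-million-style values (for UMI counts there is no gene-length term, so TPM and CPM coincide); `raw_counts` and `imputed_cpm` are hypothetical names:

```python
# Project an imputed, library-size-normalized matrix (genes x cells)
# back to the raw-count scale using the original per-cell depths.
# `raw_counts` and `imputed_cpm` are hypothetical placeholders.
lib_sizes = raw_counts.sum(axis=0)           # original per-cell library sizes
imputed_raw = imputed_cpm * lib_sizes / 1e6  # undo the per-million scaling

# `imputed_raw` is now on the raw-count scale; estimate the master
# regulators' basal expression for input_file_regs.txt from this matrix.
```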
Keep in mind that the simulator simulates the endogenous mRNA content of a cell, which is most closely approximated by the raw mRNA counts (here we assume that the capture efficiency was the same for all cells in the real scRNA-seq experiment). So all parameters are better estimated from raw data, and then the same normalizations should be applied to both the real and the synthetic data; a minimal sketch of this shared normalization follows below. Let me know if you have more questions.
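For completeness, the shared normalization could look like this (one function applied identically to both matrices; the matrix names are again hypothetical placeholders):

```python
# Apply the same library-size normalization and log transform to the
# real raw counts and to the synthetic counts after technical noise,
# then compare the two distributions. Matrix names are placeholders.
import numpy as np

def log_tpm(counts, scale=1e6):
    """Library-size normalize each cell to `scale`, then log2(x + 1)."""
    cpm = counts / counts.sum(axis=0, keepdims=True) * scale
    return np.log2(cpm + 1)

real_log = log_tpm(real_counts)               # real data, raw counts
synth_log = log_tpm(synth_counts_with_noise)  # simulated + technical noise

# Compare e.g. gene-wise means or pooled histograms of the two matrices.
```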
Dear Payam,
Thank you very much for your extensive reply! I was able to make things work in the meantime (and I'll play around some more with other datasets), and your answer helped me a lot!
Kind regards,
Jonathan
Dear Payam,
Nice research! I think these kinds of tools are much needed in the community.
At the moment I'm trying to generate a synthetic dataset with characteristics similar to a particular real dataset; in particular, I'm most interested in generating the same dataset with different levels of dropout noise. I estimated the K and Q parameters from the real dataset. I have a few questions, though, if you would be so kind as to help me out?
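Roughly, here is how I estimated them, assuming a three-parameter logistic f(x) = K / (1 + exp(-Q(x - Y0))) for the per-gene zero fraction as a function of mean log expression (that parametrization is my reading of the paper, so please correct me if the form is different; `real_counts` is a placeholder for my genes x cells count matrix):

```python
# Fit a three-parameter logistic to per-gene zero fraction versus mean
# log expression. The form f(x) = K / (1 + exp(-Q*(x - Y0))) is my
# assumption; `real_counts` is a hypothetical placeholder.
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, K, Q, Y0):
    return K / (1.0 + np.exp(-Q * (x - Y0)))

mean_log = np.log2(real_counts.mean(axis=1) + 1)  # per-gene mean expression
zero_frac = (real_counts == 0).mean(axis=1)       # per-gene dropout rate

# Dropout decreases with expression, so I expect a negative fitted Q.
(K, Q, Y0), _ = curve_fit(logistic, mean_log, zero_frac,
                          p0=[1.0, -1.0, float(mean_log.mean())])
print(K, Q, Y0)
```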
My question is: which expression values do you recommend using as input for the 'input_file_regs.txt' file? In the paper (Synthetic data set generation) you mention that the basal production rate is based on the expression state and several datasets, but do you recommend using raw counts, TPMs, log TPMs, or some other unit as the basal production rate? I assume different units (especially log units vs. non-log units) would affect the outcome of the Langevin equations, as well as the distribution of the simulated expression values.

For instance, when I calculated log2 TPMs of the real dataset, these show the expected, roughly normal distribution. But when I used master regulator expression values from this real dataset (i.e., log2 TPMs) as basal production rates, the simulated data doesn't have this distribution at all; it looks more like a Poisson distribution to me. I was wondering whether this is somehow expected from the use of the Langevin equations and the fact that the production rate of any gene is bounded between zero and infinity (or at least between zero and the sum of all Ks) (Equations 5-7)? How would you recommend dealing with this, so that the synthetic data distribution approximates the real data distribution, and consequently so that the dropout logistic curve has similar parameters (i.e., K, Y0, and Q) for the real and synthetic data?
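My rough intuition for why a Poisson-like shape might actually be expected (just the textbook birth-death result that the Langevin equation approximates, so please correct me if it doesn't apply here): for an unregulated gene with constant production rate b and decay rate lambda, the stationary molecule count is Poisson,

```latex
P(x_\infty = n) = e^{-b/\lambda}\,\frac{(b/\lambda)^n}{n!},
\qquad
\mathbb{E}[x_\infty] = \mathrm{Var}[x_\infty] = \frac{b}{\lambda}
```

so feeding compressed log-scale values as b keeps b/lambda small and the output Poisson-shaped rather than roughly normal like the log2 TPMs.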
Many thanks in advance for any comments you can provide!
Kind regards, Jonathan