bcgsc / NanoSim

Nanopore sequence read simulator

Ability to change error rate #111

Open DafniG opened 3 years ago

DafniG commented 3 years ago

Hi, thank you for a great tool. I have a couple of questions.

1) The cell line NA12878 is heterozygous at about 2M sites. When generating simulated reads from the pre-trained model, is it assumed that 50% of the reads at these sites carry each of the two alleles, or are expression patterns (along with potential biases) from the original reads maintained?

2) I want to vary the error rate of the reads to see the effect on my downstream analyses. Would altering the values in training_error_rate.tsv be enough, or would I need to do something more complex?

Best

SaberHQ commented 3 years ago

Hi @DafniG, thanks for your interest in our tool.

To answer your second question: you can definitely change the numbers in the tsv file you mentioned, increasing or decreasing them to study the effect on your downstream analysis. Please also remember that NanoSim learns the probabilities of transitions between error types using Markov models. If you want, you may also edit that file (training_error_markov_model) manually, but I do not recommend it.

For your first question, I will leave it to @cheny19 to comment. But please note that NanoSim quantifies transcript expression levels and uses them for transcriptome simulation. So if I understood your question correctly, the answer is yes: expression patterns are maintained from the original reads used for training. You can also provide your own expression levels, or use the standalone expression quantification pipeline from NanoSim to quantify expression patterns from any input read set and feed them to the simulation pipeline.

DafniG commented 3 years ago

Thank you Saber. This is very useful.

DafniG commented 3 years ago

Hi Saber. I had a look at the proportion of reads overlapping NA12878 heterozygous sites that contain the reference or the alternative allele. I used your pre-trained cDNA guppy model. This proportion should be around 0.5 across all sites. However, I observed that almost all reads contained the reference allele (see attached image). Therefore, I think the patterns at heterozygous sites are not maintained from the raw data, and instead the allele present in the reference file gets used by default. I thought I'd let you know of this in case it comes up again in the future.

[Attached image: ref_ratio_simulated_reads]
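For readers who want to repeat this kind of check, the per-site proportion described above can be sketched as below. This is only a minimal illustration: the function name `ref_allele_fraction` and the counts are made up, and it assumes the number of reads supporting each allele has already been tallied per heterozygous site with some other tool.

```python
from typing import Iterable, List, Tuple


def ref_allele_fraction(counts: Iterable[Tuple[int, int]]) -> List[float]:
    """Return the fraction of reads carrying the reference allele at each site.

    `counts` holds (ref_reads, alt_reads) pairs, one per heterozygous site.
    Sites with no covering reads are skipped.
    """
    fractions = []
    for ref_reads, alt_reads in counts:
        total = ref_reads + alt_reads
        if total == 0:
            continue  # no coverage, no ratio to report
        fractions.append(ref_reads / total)
    return fractions


# Made-up counts for illustration only (not taken from the issue).
site_counts = [(10, 10), (20, 0), (0, 0), (3, 1)]
fractions = ref_allele_fraction(site_counts)
```

With balanced allele representation the values would cluster around 0.5; values near 1.0 at most sites would match the observation reported here.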

cheny19 commented 3 years ago

Hi @DafniG ,

Sorry for the late reply. For your first question: if the heterozygous allele is not provided in the reference, NanoSim won't know how to introduce SNPs or other variants to mimic the heterozygosity. If you want, you can simulate two datasets, one with the reference and the other with the heterozygous sites, and then mix them together. Your use case is very practical, and we will think about incorporating it into the suite in the future.
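The mixing step suggested above could be sketched roughly as follows. This is not part of NanoSim: `read_fasta` and `mix_reads` are hypothetical helpers, the tiny in-memory FASTA strings stand in for the two simulated read sets, and the sketch assumes the reads are in FASTA format. It draws a fixed fraction of reads from each set, which gives an overall 50/50 mix but does not guarantee an exact 50/50 split at every individual site.

```python
import io
import random
from typing import List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(handle) -> List[Record]:
    """Parse FASTA records from an open text handle."""
    records: List[Record] = []
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def mix_reads(ref_reads: List[Record], alt_reads: List[Record],
              n_total: int, alt_fraction: float = 0.5,
              seed: int = 0) -> List[Record]:
    """Sample reads from both sets without replacement and shuffle them."""
    rng = random.Random(seed)
    n_alt = round(n_total * alt_fraction)
    n_ref = n_total - n_alt
    if n_ref > len(ref_reads) or n_alt > len(alt_reads):
        raise ValueError("not enough reads in one of the input sets")
    mixed = rng.sample(ref_reads, n_ref) + rng.sample(alt_reads, n_alt)
    rng.shuffle(mixed)
    return mixed


# Placeholder data standing in for the two simulated datasets.
ref_reads = read_fasta(io.StringIO(">ref_1\nACGT\n>ref_2\nACGA\n>ref_3\nACGC\n"))
alt_reads = read_fasta(io.StringIO(">alt_1\nATGT\n>alt_2\nATGA\n>alt_3\nATGC\n"))
mixed = mix_reads(ref_reads, alt_reads, n_total=4)
```

In practice the two inputs would be opened from the files produced by the two simulation runs, and the mixed records written back out in the same format.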

For your second question: no, changing training_error_rate.tsv won't change the simulation results, as that file is not used for simulation. For simulation, we only use the Markov model and the mixture statistical model file. If you really want control over the error rates, you can change the probabilities in the Markov model and see how it goes.
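One way to experiment with the Markov model probabilities is sketched below. It is a rough illustration under stated assumptions: the state names (`start`, `mis`, `ins`, `del`), the numeric values, and the helper `reweight_row` are all hypothetical, and the actual layout of the training_error_markov_model file should be checked before editing it. The only point shown is that after scaling one transition probability, each row should be renormalized so it still sums to 1.

```python
from typing import Dict

Row = Dict[str, float]


def reweight_row(row: Row, target: str, factor: float) -> Row:
    """Multiply the probability of `target` by `factor`, then renormalize."""
    if factor < 0:
        raise ValueError("factor must be non-negative")
    scaled = {k: (p * factor if k == target else p) for k, p in row.items()}
    total = sum(scaled.values())
    if total == 0:
        raise ValueError("row has no probability mass left")
    return {k: p / total for k, p in scaled.items()}


# Hypothetical transition table: each row gives the probability of the
# next error type given the previous one. Values are made up.
markov = {
    "start": {"mis": 0.5, "ins": 0.25, "del": 0.25},
    "mis": {"mis": 0.4, "ins": 0.3, "del": 0.3},
    "ins": {"mis": 0.3, "ins": 0.4, "del": 0.3},
    "del": {"mis": 0.3, "ins": 0.3, "del": 0.4},
}

# Make deletions twice as likely to follow any state, keeping rows normalized.
adjusted = {state: reweight_row(row, "del", 2.0) for state, row in markov.items()}
```

Reweighting transitions like this changes which error type tends to follow another. How much the overall error rate moves as a result is something to verify empirically on the simulated reads.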

Thanks, Chen