Error rate of cDNA model in NanoSim

gy-chop commented 3 years ago

Hi @cheny19 ,

Thank you for developing and maintaining NanoSim. I am using NanoSim to simulate long-read RNA-seq data; it has worked well so far and looks very useful for evaluating our pipeline. However, the only existing model for bulk cDNA sequencing in the "pre-trained_models/" folder, human_NA12878_cDNA_Bham1_albacore, seems to have a rather unusual error profile. The readme lists it as '1D2', yet its error rate (>17%) is even higher than that of 1D reads (usually 10-15%). More surprisingly, the insertion rate is much higher than the deletion rate (7.22% vs. 5.07%), which I have never seen in real Nanopore sequencing data.

We would like to use one of the built-in models to test our pipeline, to avoid potential questions from reviewers, but the only model currently available for bulk cDNA-seq does not seem typical. Do you think more models could be added to NanoSim in the near future? We would highly appreciate it if this request could be considered, and I believe many other users would have a similar request.

Thanks a lot!

cheny19 commented 3 years ago

Hi @gy-chop ,

Thanks for your interest in NanoSim. Yes, we will provide more models if the community provides high-quality datasets. But you can always train a model with your own dataset; as long as you document the parameters you used, I don't see how this could become a question for reviewers.

As for your first question, I think the 1D2 in the readme is a typo. The higher-than-usual error rate we report is probably due to how it is calculated. In our calculation, the denominator is all aligned bases, but to our knowledge people sometimes use the query length as the denominator instead, so our error rate inevitably comes out higher than usual. That being said, I'm not sure why the insertion rate is higher than the deletion rate for this particular dataset. I'm currently preparing a new release and will re-train all the models for compatibility, so I'll take a look then.

Chen

gy-chop commented 3 years ago

Thanks for the reply. If 1D2 is a typo, that relieves my concern. However, I don't think using either all aligned bases or the query length as the denominator would result in an insertion rate higher than the deletion rate, as long as both rates are calculated with the same denominator. Another big concern: in the simulated reads, the proportion of incomplete isoforms (also known as ISM) is similar to or larger than that of full-length isoforms (FSM), whereas in real data the latter usually far outnumber the former. I know NanoSim simulates the read length distribution, but does it also consider the proportion of full-length isoforms? A similar length distribution does not guarantee a similar proportion of FSM, and the latter is more important in RNA-seq analysis. Thanks!
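To illustrate the denominator point with a minimal sketch: the counts below are hypothetical (not taken from the NanoSim model or any real alignment), and the three denominator definitions are only my assumptions about what "all aligned bases" and "query length" could mean. Whichever one is used, the insertion and deletion rates are scaled by the same factor, so their order cannot change.

```python
def indel_rates(insertions, deletions, denominator):
    """Return (insertion_rate, deletion_rate) for one shared denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return insertions / denominator, deletions / denominator


# Hypothetical per-base event counts, for illustration only.
counts = {"match": 8300, "mismatch": 500, "insertion": 722, "deletion": 507}

# Assumed candidate denominators; none is confirmed to be NanoSim's definition.
denominators = {
    "match + mismatch": counts["match"] + counts["mismatch"],
    "query length (match + mismatch + insertion)": (
        counts["match"] + counts["mismatch"] + counts["insertion"]
    ),
    "all alignment columns": sum(counts.values()),
}

for name, denom in denominators.items():
    ins_rate, del_rate = indel_rates(counts["insertion"], counts["deletion"], denom)
    # The ratio ins/del is independent of the denominator.
    print(f"{name}: ins={ins_rate:.4f} del={del_rate:.4f} ratio={ins_rate / del_rate:.3f}")
```

The absolute rates shift with the denominator, but the insertion/deletion ratio stays at the ratio of the raw counts, so a higher insertion rate has to come from the counts themselves.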

cheny19 commented 3 years ago

Hi @gy-chop , sorry for the late reply. Yes, NanoSim takes the proportion of full-length isoforms into account. We use a kernel density function to simulate the ratio of sequenced bases to the full length.
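For readers unfamiliar with the idea, here is a rough sketch of sampling such a ratio from a Gaussian kernel density estimate. This is not NanoSim's actual code: the function names, the observed ratios, the rule-of-thumb bandwidth, and the rejection step that keeps draws in (0, 1] are all assumptions made for illustration.

```python
import random
import statistics


def rule_of_thumb_bandwidth(values):
    """Normal-reference rule of thumb: 1.06 * sd * n^(-1/5)."""
    return 1.06 * statistics.stdev(values) * len(values) ** (-1 / 5)


def sample_ratios(observed, n_samples, rng=None, bandwidth=None):
    """Draw ratios from a Gaussian KDE fitted to `observed`.

    Sampling from a Gaussian KDE is equivalent to picking one observed
    value at random and adding Gaussian noise with sd equal to the bandwidth.
    Draws outside (0, 1] are rejected, since a read cannot cover less than
    nothing or more than the full transcript.
    """
    rng = rng or random.Random()
    h = bandwidth if bandwidth is not None else rule_of_thumb_bandwidth(observed)
    samples = []
    while len(samples) < n_samples:
        draw = rng.gauss(rng.choice(observed), h)
        if 0.0 < draw <= 1.0:
            samples.append(draw)
    return samples


# Hypothetical (sequenced bases / transcript length) ratios from training reads.
observed_ratios = [0.35, 0.50, 0.62, 0.80, 0.90, 0.95, 0.97, 0.98, 1.00, 1.00]

simulated = sample_ratios(observed_ratios, 1000, rng=random.Random(0))
print(f"mean observed ratio:  {statistics.mean(observed_ratios):.3f}")
print(f"mean simulated ratio: {statistics.mean(simulated):.3f}")
```

Note that a ratio close to 1 only says the read covers most of the transcript; whether a read is classified as FSM or ISM by a downstream tool depends on that tool's own criteria.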

bcgsc / NanoSim

Error rate of cDNA model in NanoSim #95