HadrienG / InSilicoSeq

:rocket: A sequencing simulator
https://insilicoseq.readthedocs.io
MIT License

Handling of biological variation during model creation #156

Closed: mikemc closed this issue 4 years ago

mikemc commented 4 years ago

Thanks for creating and maintaining this tool. I have a question about how the model creation works, in order to understand what types of datasets can be used. I was afraid that it might be necessary to have sequenced a sample where the true genome is known, with essentially no variation from the reference. But I see that the docs say that an ocean metagenomics dataset was used for the MiSeq model. In that case there would be a lot of true biological variation, and many mismatches in the mapped reads would be biological rather than errors. Does InSilicoSeq try to distinguish biological variants from errors in the model-building process? Or do SNPs etc. get counted as errors? Are there limitations on the types of datasets and the degree of unknown biological variation within them, or can any metagenomic dataset be used to build a reliable model?

HadrienG commented 4 years ago

Good question that warrants a detailed answer! I'm very very busy at the moment (my thesis is due for printing in a few days) but I'll come back to you when things have calmed down.

/hadrien

HadrienG commented 4 years ago

Hi again,

There will indeed be biological variation, such as different strains being collapsed into single contigs/genomes in the environmental metagenomes. InSilicoSeq does not try to distinguish between sequencing errors and biological variation, so SNPs do get counted as errors.

That said, InSilicoSeq's models have individual error probability distributions for each base position in the reads. Exact technical replicates are quite uncommon in genomics datasets, and therefore the SNPs will be distributed more or less evenly across the reads. You can expect a low background noise from this, but given the diversity of the dataset and the high number of reads, the errors should largely dominate.
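To make the argument above concrete, here is a toy sketch of how a position-independent SNP rate adds to a position-dependent error rate. It is not InSilicoSeq's implementation: the read length, the linear error curve, the SNP rate, and all function names are made-up assumptions chosen only for illustration.

```python
# Toy model, NOT InSilicoSeq's code: every rate below is a hypothetical value.

READ_LENGTH = 150
SNP_RATE = 0.0005  # assumed per-base rate of true biological variants


def sequencing_error_rate(position: int) -> float:
    """Hypothetical error rate that rises linearly along the read."""
    start, end = 0.002, 0.03
    return start + (end - start) * position / (READ_LENGTH - 1)


def observed_mismatch_rate(position: int, snp_rate: float = SNP_RATE) -> float:
    """Probability that a base mismatches the reference at this read position.

    A mismatch is observed if a sequencing error occurs, a SNP is present,
    or both; the two events are treated as independent. The rare case of a
    miscall landing back on the reference base at a SNP site is ignored.
    """
    error = sequencing_error_rate(position)
    return 1.0 - (1.0 - error) * (1.0 - snp_rate)


def snp_background(position: int) -> float:
    """Part of the observed mismatch rate that is not sequencing error."""
    return observed_mismatch_rate(position) - sequencing_error_rate(position)


if __name__ == "__main__":
    for pos in (0, 75, 149):
        print(
            f"position {pos:3d}: "
            f"true error {sequencing_error_rate(pos):.4f}, "
            f"observed {observed_mismatch_rate(pos):.4f}, "
            f"SNP background {snp_background(pos):.5f}"
        )
```

Under these assumptions the SNP term shifts the whole curve upward by roughly the SNP rate without changing its shape, so a model fitted to such data would slightly overestimate the error rate at every position while keeping the positional trend.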

Hope that helps! /Hadrien

mikemc commented 4 years ago

Ok, that makes sense, thanks for the clarification! From this, I gather that the most accurate error models would come from sequencing of a clonal isolate mapped against a reference for that specific strain. I will go ahead and close this issue as my main question was answered.

HadrienG commented 4 years ago

> I gather that the most accurate error models would come from sequencing of a clonal isolate mapped against a reference

Yes, that is correct. But be careful: error profiles are likely influenced by genomic characteristics such as GC content. That is why I used a diverse metagenomic dataset, fairly close to the data I usually analyse, despite the background noise from strain variation.