bodkan / demografr

Fast and simple simulation-based population genetic inference in R
https://bodkan.net/demografr

How to handle sequencing error and ancient DNA damage? #1

Closed bodkan closed 1 year ago

bodkan commented 1 year ago

One major selling point of this R package is efficient, fast ABC inference using tree sequences and slendr.

For lots of data sets, in particular low-coverage and/or ancient DNA data, erroneous SNP calls present an issue.

In the context of ABC, it is generally assumed that the summary statistics come from data that are reasonably clean and where various errors (especially aDNA damage) have been taken care of via filtering.

For many (?) analyses and summary statistics, aDNA errors would only add noise around the true values of the summary statistics. In such cases, simulating summary statistics from perfect, clean tree sequences would not be a problem.

Still, it might be interesting to add an option to sprinkle artificial mutations that correspond to damage or sequencing errors on top of standard tree sequences. Summary statistics computation would then proceed as normal, except that it would be mutation-based rather than branch-based (which would normally be the mode of operation).

Although this would be useful more generally, it doesn't make sense to propose adding this functionality to tskit or msprime. Perhaps this package could have a tiny built-in Python submodule which would add damage or errors on top of a mutated tree sequence.
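
To make the idea more concrete, here is a minimal sketch of what the error-sprinkling step might look like. It is not existing demografr, slendr, tskit, or msprime functionality: the names `add_errors` and `site_diversity` are hypothetical, and the error model is an assumption (each 0/1 genotype call is flipped independently with a fixed probability). It works on a plain sites-by-samples matrix of biallelic genotypes, such as the one tskit's `TreeSequence.genotype_matrix()` returns, so it only touches sites that are already variable and does not model aDNA-specific damage patterns.

```python
import random


def add_errors(genotypes, error_rate, seed=None):
    """Return a copy of a 0/1 genotype matrix (rows = sites, columns = samples)
    in which each call is flipped independently with probability `error_rate`.

    Hypothetical helper: a deliberately simple, symmetric error model.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    rng = random.Random(seed)
    return [
        [1 - g if rng.random() < error_rate else g for g in site]
        for site in genotypes
    ]


def site_diversity(genotypes):
    """Mutation-based (site-based) nucleotide diversity: the mean number of
    pairwise differences between samples, summed over all sites."""
    total = 0.0
    for site in genotypes:
        n = len(site)          # number of sampled chromosomes
        k = sum(site)          # number of derived alleles at this site
        total += k * (n - k) / (n * (n - 1) / 2)
    return total


# Toy "clean" data: 3 sites x 4 sampled chromosomes
clean = [
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]

# Same data with simulated genotyping errors layered on top
noisy = add_errors(clean, error_rate=0.05, seed=42)

print("clean diversity:", site_diversity(clean))
print("noisy diversity:", site_diversity(noisy))
```

Because the errors live in the genotypes rather than in the trees, any statistic computed afterwards has to be mutation-based, as noted above. A fuller version would also need to introduce errors at sites that are monomorphic in the simulation, which this sketch does not attempt.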