Closed grahamgower closed 2 years ago
This is something I wanted to discuss: the input data format. On one hand, I think it is very easy to just input tskit.TreeSequence
data, extract the genotype matrix from that and simulate the allele counts from the matrix. However, this is very limiting and specific for this tskit data type. If a genotype matrix is inputed instead (which is a more general data type), the user will be able to tweak the matrix in multiple ways (subset SNPs or individuals, filter SNPs by frequency...) prior to simulating GL. However, things get more complicated since genotype matrix do not contain all the info that the tskit.TreeSequence
data (e.g., SNP positions, samples IDs...).
In my opinion, I would make the general function to process genotype matrix. Additionally, I would add extra functions that handle other input types (such as tskit.TreeSequence
).
I think it makes sense to start with a function that works from the genotype matrix, and not to worry about tree sequence specifics. It might be possible to define an analogous function that works from a tskit.TreeSequence
, and uses the general_stat()
framework to speed things up. But there are many different ways one could choose to simulate read counts, and I think it would be helpful to formalise the model(s) for generating read counts before working on TreeSequence specific code.
Since the function is going to take only a genotype matrix, I think I can close this issue.
sim_allelereadcounts()
can accept input as a tskit.TreeSequence, or as a genotype matrix. This code will be clearer and more easily tested if the function is split into two: one for each input.