Open dramanica opened 5 months ago
A package to check out https://cran.r-project.org/web/packages/StAMPP/index.html
A possible mixed ploidy dataset that looks pretty suitable as an example is: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6758580/ We should see how easy it is to put together the dataset, but it could work as a test set for mixed ploidy.
An interesting paper detailing tools and theory https://pubmed.ncbi.nlm.nih.gov/36720820/
There is now a ploidy
branch. It implements storing the ploidy value, which is currently set to 2 by default in all cases. There are also check in most basic functions that stop them from operating if ploidy is not 2 (some might be fine, but they need to tested properly).
We now have loci_maf()
, loci_alt_freq()
and loci_missingness()
working with mixed ploidy. The unit tests (not complete) are bunched up in test_show_ploidy
, but later it would be better to move them under each function. I envision that, for each function, we would have a testthat section for diploid, and one for polyploid.
In terms of implementation, I am of the mind of using an optimised version for diploid, and a more generic function for polyploids (working on both homogenous ploidy, and mixed ploidy). Right now, I am coding the latter mostly in R, we can then think whether some functions would be best moved to C for speed. But I think it makes sense to get functionality first, and then focus on the bottlenecks for optimisation.
Finally, @Euphrasiologist we should probably have a look together at vcf reading to think on how to bring in polyploids (again, probably with an optimised diploid version like we have now, and then a more general option for polyploids). I think we could look at the first locus, count the alleles for each individual, and then use to decide how to process the vcf. In terms of parsing, if we need a generic function, then it would get the odd characters and sum them (1/0/1/1), summing the 1st, 3rd, 5th and 7th element.
A quick polyploid parser for a vector of genotypes:
genotypes<-c("1/0/1/1","0/0/0/1","1|1|0|1","./././.")
# get ploidy for each individual
sapply(strsplit(genotypes,"[/|]"),function(x) length(x) )
# get dosage for each individual
poly_dosage <- function (x){
if (x[1]!="."){
sum(as.numeric(x))
} else {
return(NA)
}
}
sapply(strsplit(genotypes,"[/|]"),poly_dosage )
We now have the full ploidy infrastructure in main. We don't do anything with multiple ploidy data, but in principle we do have the infrastructure for it.
A lot of pop gen formulae for mixed ploidy rely on computing the pop frequencies as mean of individual frequencies (i.e. standardising the impact of ploidy so that the individual is the unit of replication, rather than the allele; this distinction is not important when dealing with uniform ploidy). If we adapt a couple of the cpp functions to compute frequencies from individual frequencies, then we should be able to easily adapt a lot of functions to multiple ploidy.
We could support multiple ploidy. A column can only contain one vector, so storing the ploidy information there is difficult. However, we could simply add a ploidy column to the
$fam
slot of thebigSNP
object in the attributes ofgenotypes
. To implement that, we would need:ploidy
attribute to thegenotypes
column, with an integer corresponding to the ploidy of all individuals (if it is the same) or '0' to indicate multiple ploidy. Then we should make sure that functions that rely on diploids check forploidy=2
.