Ploidy - Githubissues

EvolEcolGroup / tidypopgen

https://evolecolgroup.github.io/tidypopgen/

GNU General Public License v3.0

1 stars 0 forks source link

Ploidy #19

Open dramanica opened 5 months ago

dramanica commented 5 months ago

We could support multiple ploidy. A column can only contain one vector, so storing the ploidy information there is difficult. However, we could simply add a ploidy column to the $fam slot of the bigSNP object in the attributes of genotypes. To implement that, we would need:

[x] When we create a gen_tibble, we should start adding a ploidy attribute to the genotypes column, with an integer corresponding to the ploidy of all individuals (if it is the same) or '0' to indicate multiple ploidy. Then we should make sure that functions that rely on diploids check for ploidy=2.
[x] Figure out how we read multiple ploidy (parsing a VCF) so that ploidy information is stored when the data is imported.
[x] Create a couple of example functions to operate on mixed ploidy to illustrate how it is done.

dramanica commented 5 months ago

A package to check out https://cran.r-project.org/web/packages/StAMPP/index.html

dramanica commented 5 months ago

A possible mixed ploidy dataset that looks pretty suitable as an example is: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6758580/ We should see how easy it is to put together the dataset, but it could work as a test set for mixed ploidy.

dramanica commented 4 months ago

An interesting paper detailing tools and theory https://pubmed.ncbi.nlm.nih.gov/36720820/

dramanica commented 4 months ago

There is now a ploidy branch. It implements storing the ploidy value, which is currently set to 2 by default in all cases. There are also check in most basic functions that stop them from operating if ploidy is not 2 (some might be fine, but they need to tested properly).

dramanica commented 4 months ago

We now have loci_maf(), loci_alt_freq() and loci_missingness() working with mixed ploidy. The unit tests (not complete) are bunched up in test_show_ploidy, but later it would be better to move them under each function. I envision that, for each function, we would have a testthat section for diploid, and one for polyploid. In terms of implementation, I am of the mind of using an optimised version for diploid, and a more generic function for polyploids (working on both homogenous ploidy, and mixed ploidy). Right now, I am coding the latter mostly in R, we can then think whether some functions would be best moved to C for speed. But I think it makes sense to get functionality first, and then focus on the bottlenecks for optimisation. Finally, @Euphrasiologist we should probably have a look together at vcf reading to think on how to bring in polyploids (again, probably with an optimised diploid version like we have now, and then a more general option for polyploids). I think we could look at the first locus, count the alleles for each individual, and then use to decide how to process the vcf. In terms of parsing, if we need a generic function, then it would get the odd characters and sum them (1/0/1/1), summing the 1st, 3rd, 5th and 7th element.

dramanica commented 4 months ago

A quick polyploid parser for a vector of genotypes:

genotypes<-c("1/0/1/1","0/0/0/1","1|1|0|1","./././.")
# get ploidy for each individual
sapply(strsplit(genotypes,"[/|]"),function(x) length(x) )
# get dosage for each individual
poly_dosage <- function (x){
  if (x[1]!="."){
    sum(as.numeric(x))
    } else {
    return(NA)
  }
}
sapply(strsplit(genotypes,"[/|]"),poly_dosage )

dramanica commented 4 months ago

We now have the full ploidy infrastructure in main. We don't do anything with multiple ploidy data, but in principle we do have the infrastructure for it.

dramanica commented 3 months ago

A lot of pop gen formulae for mixed ploidy rely on computing the pop frequencies as mean of individual frequencies (i.e. standardising the impact of ploidy so that the individual is the unit of replication, rather than the allele; this distinction is not important when dealing with uniform ploidy). If we adapt a couple of the cpp functions to compute frequencies from individual frequencies, then we should be able to easily adapt a lot of functions to multiple ploidy.