dcgerard / updog

Flexible Genotyping of Polyploids using Next Generation Sequencing Data
https://dcgerard.github.io/updog/

Missing data for tetraploid multiparenting population #25

Open PaulaEB opened 1 year ago

PaulaEB commented 1 year ago

Hello David, thanks for developing updog! My project's goal is to identify QTLs for pest resistance. We have a multiparent population similar to a NAM population (four pollen recipients and one pollen donor), which gives four half-sib families. We are treating each family separately, but I'd like to know your thoughts on whether it's possible to use the whole population for genotype calling.

A last question is about missing data in the geno field. In the multidog$inddf output we don't see any missing data. Is this normal?

Thank you very much! Paula E

dcgerard commented 1 year ago

Hey @PaulaEB,

Thanks for trying out {updog}!

I haven't gotten around to allowing for multiparent populations yet. Some things you can look into:

  1. Are the genotypes estimated to be the same for the same parent for runs on different populations?
  2. Are the sequencing error rates, allele biases, and overdispersions estimated to be about the same at the same SNP?

If the answer to both is yes, then combining the different populations would not help much, since the benefit of a larger sample size is better estimates of the parent genotypes and those parameters.
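
One way to run the two checks above is to line up the per-SNP estimates from two separate family runs and compare them. The sketch below uses small hand-made data frames in place of real output. The column names (`snp`, `seq`, `bias`, `od`, `p1geno`) are assumed to match the `snpdf` element returned by `multidog()` under an F1 model, and every number is invented for illustration. Which parent slot holds the shared pollen donor depends on how each run was set up.

```r
# Toy stand-ins for the snpdf element of two per-family multidog() fits.
# Column names are assumptions; the numbers are invented for illustration.
fam1 <- data.frame(
  snp    = c("snp1", "snp2", "snp3"),
  seq    = c(0.004, 0.006, 0.005),
  bias   = c(1.02, 0.85, 1.10),
  od     = c(0.003, 0.010, 0.004),
  p1geno = c(2, 1, 3)   # assumed: shared pollen donor stored as parent 1
)
fam2 <- data.frame(
  snp    = c("snp1", "snp2", "snp3"),
  seq    = c(0.005, 0.006, 0.004),
  bias   = c(1.00, 0.40, 1.08),
  od     = c(0.004, 0.012, 0.004),
  p1geno = c(2, 2, 3)
)

# Align the two runs by SNP; shared columns get the suffixes .f1 and .f2.
both <- merge(fam1, fam2, by = "snp", suffixes = c(".f1", ".f2"))

# Check 1: does the shared parent get the same genotype in both runs?
both$same_parent <- both$p1geno.f1 == both$p1geno.f2

# Check 2: how far apart are the nuisance parameters?
# Bias is positive with 1 meaning no bias, so compare it on the log scale.
both$log_bias_diff <- abs(log(both$bias.f1) - log(both$bias.f2))
both$seq_diff      <- abs(both$seq.f1 - both$seq.f2)
both$od_diff       <- abs(both$od.f1 - both$od.f2)

# SNPs where the shared parent's genotype disagrees between runs.
discordant <- both$snp[!both$same_parent]
print(discordant)
print(both[, c("snp", "same_parent", "log_bias_diff", "seq_diff", "od_diff")])
```

SNPs that show up in `discordant`, or that have large parameter differences, are the ones where pooling information across families would be most likely to change the calls.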

As for the missing data, if an individual has NA listed for its counts, then {updog} should return NA in the output. If it has 0 listed for the read depth, then {updog} will impute the genotype from the prior distribution (which is the best you can do if you aren't using information from other SNPs). E.g., consider:

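library(updog)
# Individual 3 has zero reads at this SNP
refvec <- c(3, 4, 0, 8, 3)
sizevec <- c(10, 10, 0, 10, 10)
fout <- flexdog(refvec = refvec, sizevec = sizevec, ploidy = 4)
fout$geno
# With no reads, individual 3's posterior should equal the estimated
# genotype distribution, so the points should fall on the y = x line
plot(fout$postmat[3, ], fout$gene_dist)
abline(0, 1)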

# Same data, with individual 3 recorded as NA instead of 0
refvec <- c(3, 4, NA, 8, 3)
sizevec <- c(10, 10, NA, 10, 10)
fout <- flexdog(refvec = refvec, sizevec = sizevec, ploidy = 4)
fout$geno

Best, David

PaulaEB commented 10 months ago

Hello @dcgerard, many thanks for the clarification! I am going back to this data, and I would like zero-depth (0) calls to stay missing, since GATK marks missing values in DP as DP=0 (https://gatk.broadinstitute.org/hc/en-us/articles/6012243429531-GenotypeGVCFs-and-the-death-of-the-dot).

Is it possible to change that in updog, or should I do it in the VCF with another tool?

Thanks again Paula

dcgerard commented 10 months ago

Hey @PaulaEB,

You can do that in R really easily.

E.g., suppose this is the matrix containing the read-depths:

sizemat <- matrix(c(0, 1, 2, 1,
                    1, 0, 1, 1,
                    1, 2, 1, 0), ncol = 4, byrow = TRUE)

Then we can convert those 0's to NA's via:

sizemat[sizemat == 0] <- NA
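
When both count matrices are passed to a genotyper, the reference counts at the same positions presumably need masking too, so that the two matrices stay consistent. A minimal sketch, using a made-up `refmat` alongside the `sizemat` above (the `refmat` values and the SNP/individual labels are illustrative only). The commented-out `multidog()` call at the end is an assumption about how the masked matrices would then be used:

```r
# Read depths: rows are SNPs, columns are individuals (labels are made up).
sizemat <- matrix(c(0, 1, 2, 1,
                    1, 0, 1, 1,
                    1, 2, 1, 0), ncol = 4, byrow = TRUE,
                  dimnames = list(paste0("snp", 1:3), paste0("ind", 1:4)))

# Hypothetical reference-allele counts with the same layout.
refmat <- matrix(c(0, 1, 1, 0,
                   1, 0, 0, 1,
                   0, 2, 1, 0), ncol = 4, byrow = TRUE,
                 dimnames = dimnames(sizemat))

# Build the mask once, before either matrix is modified.
no_reads <- sizemat == 0

refmat[no_reads]  <- NA
sizemat[no_reads] <- NA

# Number of SNP-by-individual entries now marked as missing.
sum(no_reads)

# Assumed downstream use (requires the updog package):
# mout <- updog::multidog(refmat = refmat, sizemat = sizemat,
#                         ploidy = 4, model = "norm")
```

Computing `no_reads` before touching either matrix matters: once `sizemat` contains NA, `sizemat == 0` returns NA at those positions and can no longer be used as a clean logical index.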

Cheers, David