Sequencing Error Rate Estimation in F1 Populations

dcgerard commented 3 years ago

{updog} can sometimes have poor performance when all of the following occur:

model = "f1" or model = "s1",
Offspring sequenced at low depth, but
Parents are sequenced to large depth, and
All offspring have the same genotype. So the parental genotypes could be AA and aa (for diploids).

The issue is that {updog} can prefer to estimate a large sequencing error rate and place the parents to both have genotypes of 1.

This should only affect the model = "f1" and model = "s1" options.

eaa231c869e8d2e4c1fd11dc28e53e6270c9d5c0 patches things a little bit, by placing an upper bound on the sequencing error rate.

However, full correction will need me to dive back into the code. Right now, I don't use the parental sequences when optimizing the sequencing error rate (just the offspring sequences). Adjusting the objective to include parental sequences should fix the problem.

dcgerard commented 1 year ago

@Cristianetaniguti, could you take a look at the f1 branch of {updog} and see if this mostly resolves the issues that you discovered?

You can install that repo via:

remotes::install_github(repo = "dcgerard/updog", ref = "f1")

Cristianetaniguti commented 1 year ago

Hey @dcgerard,

Sorry for taking a long time to test it. But I finally did!

I also tested another idea that you gave me when we met in San Diego, about using a combination of the updog genotype probabilities with a global error of 5%. I included as an error rate in OneMap HMM: 1 - (1 - estimated genotype probability by updog)*(1-global error)

I built linkage maps for f1 populations using a subset of the data and the combinations:

markers from freebayes and GATK
using updog from the main branch and from the f1 branch
two diploid datasets, one with a depth ~6x and another ~90x
using the genotype probabilities from updog, a global error rate of 5%, and the combination of updog genotype probabilities and a global error rate of 5%
Removing non-informative (when both parents are homozygous or at least one of them is missing) markers before running updog

My conclusions are:

Removing the non-informative before performing the updog improves the map quality while using the f1 and main branch, mainly in the high-depth dataset.
Except in one of the scenarios (map with freebayes markers and using updog probabilities + global error), using the f1 branch provided maps slightly worse than using the main branch
Using the genotype probabilities + global error provided the same or better maps than using only the global error 👍

I hope this answers your question. I think it is still worth using the main branch and filtering the non-informative before using updog.

dcgerard commented 1 year ago

Thanks for the work @Cristianetaniguti! This is very interesting (and counterintuitive, since the f1 branch should give better results when using model = "f1"). Some questions:

You used model = "f1" in both cases, right?
These results are only for diploids, right?

Cristianetaniguti commented 1 year ago

Hey @dcgerard,

Yes, I used the exact same code. I only changed the updog version.
Yes, I am still adapting Reads2Map to polyploids.

These issues are only details, overall updog performs pretty well and it usually improves low-depth data genotypes. Let me know if you have other ideas to test. I can do it easily now.

dcgerard commented 1 year ago

Thank you so much! I really appreciate you working on this!

dcgerard commented 10 months ago

Closing for now. Don't think there is anything more to do here at the moment.

dcgerard / updog

Sequencing Error Rate Estimation in F1 Populations #19