DEploid-dev / DEploid

dEploid is designed for deconvoluting mixed genomes with unknown proportions. Traditional ‘phasing’ programs are limited to diploid organisms. Our method modifies Li and Stephen’s algorithm with Markov chain Monte Carlo (MCMC) approaches, and builds a generic framework that allows haloptype searches in a multiple infection setting.
http://deploid.readthedocs.io/en/latest/index.html
GNU General Public License v3.0
20 stars 9 forks source link

dEploid mcmc not converging #346

Open bredelings opened 3 years ago

bredelings commented 3 years ago

Describe the bug

When run multiple times on the same data set, dEploid oscillates between two different answers. For example,

first answer:

         Effective_K: 1.00005
          Inferred_K: 1
Adjusted_effective_K: 1.00005

Proportions:
1.09651e-05   0.999976  1.32491e-05 2.21877e-08

second answer:

         Effective_K: 1.75907
          Inferred_K: 2
Adjusted_effective_K: 1.75907

Proportions:
  0.314951  4.04259e-06   0.685045  3.59593e-08

In this specific, case its clear from looking at the data that K=1 is right.

dEploid version: 659c770cb7ed

To Reproduce Steps to reproduce the behavior:

  1. dEploid -vcf ERR2678989.vcf -plaf plaf.txt -o ERR2678989-nopanel -noPanel
  2. -seed=0 produces one answer, but -seed=1 produces a different answer

I also tried increasing the run time from 800 to 8000 generations, but it didn't help.

Expected behavior It seems that that the MCMC would be give the same answer each time. However, it seems that there are two modes, and the MCMC cannot move between them.

Additional context I might be able to help you improve the mixing.

bredelings commented 3 years ago

Here are some files: plaf.txt ERR2678989.vcf.txt

shajoezhu commented 3 years ago

Hi @bredelings , had a quick look at the data. You have got a lot of missing data. Would it be ok to encode your "." to zeros? does it make sense? I am not an expert in the sequence you are dealing with, but DEploid does not handle missing data at the moment.

Made a quick plot with the alt vs ref count, on the left is what's in the data, on the right, is after encode "." as zeros.

tmp

Eyeballing this data, it should be a mixed sample, unless there is a lot of error with the reading that you may want to filter them out.

DEploid MCMC struggle with certain mixing is a known problem. I don't have a solution yet, It would be awesome if you could help to solve this problem. Very often, we find it is the data is very difficult. our strategy is to take multiple runs, and take the consensus result. There is a newer version, "Best-practice" is soon coming. I want to finalize some of the documentation. But for this version, you must provide a reference panel. We found that it is very useful when learning k, especially if you have a very balanced mixing.

vcfFileName = "ERR2678989.vcf.txt"
numberOfHeaderLines = 42
ADFieldIndex=7
vcf <- read.table( gzfile(vcfFileName), skip = numberOfHeaderLines,
        header = T, comment.char = "", stringsAsFactors = FALSE,
        check.names = FALSE)

    sampleName <- names(vcf)[10]

    tmp <- vcf[[sampleName]]
    field <- strsplit(as.character(tmp), ":")

    tmpCovStr <- unlist(lapply(field, `[[`, ADFieldIndex))
    tmpCov <- strsplit(as.character(tmpCovStr), ",")

    refCount <- as.numeric(unlist(lapply(tmpCov, `[[`, 1)))
    altCount <- as.numeric(unlist(lapply(tmpCov, `[[`, 2)))

refCount0 = refCount
refCount0[is.na(refCount)] = 0
altCount0 = altCount
altCount0[is.na(altCount)]=0
plot(refCount0, altCount0)
png("tmp.png", width = 1200, height =600)
par(mfrow=c(1,2))
plot(refCount, altCount)
plot(refCount0, altCount0)
dev.off()
bredelings commented 3 years ago

I modified the utilities scripts locally to convert . to 0, similar to the PR for DEploid itself. Octopus writes VCFs with . when there is no haplotype that covers on of the alleles -- the author insists that it is a legal way to write VCFs, so probably the scripts should handle . by converting it to 0. I don't think this indicates anything wrong with the data. I can submit a PR to the utilities repo.

I did not think this was a mixed sample, because a comparison with other samples that are clearly mixed shows a very sparse scattering of mixed sites. Here are two plots, one for ERR2678989 (not mixed, I think) and ERR2678993 (mixed). ERR2678989altVsRefAndWSAFvsPLAF ERR2678993altVsRefAndWSAFvsPLAF Do you still think that ERR2678989 is a mixed sample? If it is really mixed, would additional loci be mixed as well? Could it be that some loci just have mismapped reads and so appear mixed, but are not?

shajoezhu commented 3 years ago

@bredelings I am with you. Should be clonal I think. Do you have a reference panel for this?

What your initial k is set to?

shajoezhu commented 3 years ago

https://deploid.readthedocs.io/en/latest/FAQ.html#over-fitting

Hey @bredelings We have seen this problem before as well. You can force and set k to 1. And just use the plaf as the reference to learn the haplotype, and use this as a reference strain and build a panel to deconvolve your other samples. I have used this approach for pf3k

bredelings commented 3 years ago

Hi @shajoezhu, thanks for the advice! I will try adding a reference panel. I wasn't sure it would help because it seems that rho/theta is pretty high in vivax, so linkage disequilibrium is pretty low. But maybe the panel could still help. Do you know if rho/theta is lower in falciparum -- i.e. there are more SNPs per recombination event?

I haven't had time to fully look at the mixing in sufficient detail yet - I will get back to you soon, hopefully next week.

shajoezhu commented 3 years ago

Hi Ben, it works for vivax as well. You can consider to use some of these data

https://www.malariagen.net/data/p-vivax-genome-variation-may-2016-data-release

Pretty sure I had done this some time ago, can send across if I can find it them

On Sat, 6 Feb 2021 at 07:43, Benjamin Redelings notifications@github.com wrote:

Hi @shajoezhu https://github.com/shajoezhu, thanks for the advice! I will try adding a reference panel. I wasn't sure it would help because it seems that rho/theta is pretty high in vivax, so linkage disequilibrium is pretty low. But maybe the panel could still help. Do you know if rho/theta is higher in falciparum?

I haven't had time to fully look at the mixing in sufficient detail yet - I will get back to you soon, hopefully next week.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/DEploid-dev/DEploid/issues/346#issuecomment-774419279, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA4FP7NDR2IHWF2T73EMUSTS5TXLLANCNFSM4WZOJQMQ .

bredelings commented 3 years ago

OK, I had a few ideas about fixing mixing problems. It looks like you have transition kernels to update (w|h) and transition kernels to update (h|w), where h is the haplotypes and w is the weights. The transition kernels for updating (h|w) can resample one or two haplotypes at a time. Does that sound right?

I think the issue is that if h and w are correlated, then updating w but not h might not work. So, what you need is a move that

I can write down the math for detailed balance if needed.

It looks like computing the probability by summing out the haplotypes could be implemented fairly easily in UpdatePairHap::calcFwdBwdProbs() by just summing the last column.

I'm not sure if I'm missing something in UpdatePairHap:: calcExpectedWsaf( ) though. What is that doing?

bredelings commented 3 years ago

Another idea was to allow adding and removing haplotypes instead of fixing the number at five and setting the prior so that some have very low frequencies. This probably makes summarizing the posterior more complex. Maybe you would want to generate multiple summaries in that case, i.e. a summary for K=1 with one haplotype, a summary for K=2 with two haplotypes, a summary for K=3 with 3 haplotypes.

In any case, I think that you could add a 0/1 variable Z[k] for each of the 5 haplotypes, and say that x'[k] ~ normal(eta, sigma^2) x[k] = if z[k] == 1 then x'[j] else -Inf

It would be possible to actually remove x'[k] and h[k] from the MCMC state when Z[k] ==0 and the haplotype is missing. You would draw x'[k] from the prior and draw h[k] using UpdateSingleHap when adding the haplotype back in. But it would also be possible to keep all 5 haplopypes even when Z[k] == 0. However, for the purposes of mixing, it would still be important to resample h[k] from the posterior using UpdateSingleHap when flipping Z[k] to 1.

bredelings commented 3 years ago

Any thoughts?

It also looks like the doc branch has disappeared... do you know where it went?

shajoezhu commented 3 years ago

OK, I had a few ideas about fixing mixing problems. It looks like you have transition kernels to update (w|h) and transition kernels to update (h|w), where h is the haplotypes and w is the weights. The transition kernels for updating (h|w) can resample one or two haplotypes at a time. Does that sound right?

Yes, that is correct

The correlation will depend on what type of infection this is. Super infection vs co infection. We talk about this in the elife DEploid ibd paper. The former you can consider that they are independent, but the latter is heavily correlated, which is more complex, biologically more interesting

I think the issue is that if h and w are correlated, then updating h but not w might not work. So, what you need is a move that

  • proposes a move from w1 to w2 by changing the relative frequencies of strain i and strain j. This could perhaps perturb x[i] and x[j]
  • computes the probability of w1 by summing out the haplotypes for i and j using dynamic programming.

Can you be more specific with the dynamic programming part? And what do you mean by summing out the haplotye if I may ask.

  • computes the probability of w2 by summing out the haplotypes for i and j using dynamic programming.
  • accept or reject the new weights w2
  • if w2 is accepted, resample haplotypes i and j from the dynamic programming matrix associated with w2

I was wondering can you be more specific with this matrix? What’s the size, it will probably help me to better understand the dynamic programming part.

  • if w2 is rejected, then resample haplotypes i and j from the dynamic programming matrix associated with w1 I can write down the math for detailed balance if needed.

It looks like computing the probability by summing out the haplotypes could be implemented fairly easily in UpdatePairHap::calcFwdBwdProbs() by just summing the last column.

I'm not sure if I'm missing something in UpdatePairHap:: calcExpectedWsaf( ) though. What is that doing?

at each site, there is the observed wsaf, which is alt/(ref+alt), and expected wsaf from the model, of inner product of the proportion with the genotypes or strains at this site.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/DEploid-dev/DEploid/issues/346#issuecomment-777823271, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA4FP7O2QDBAJV4FHTABLWDS6RIBXANCNFSM4WZOJQMQ .

shajoezhu commented 3 years ago

I really like this idea. I had done something like this before, didn’t succeed! Please let me know what you think, and feel free to give a go. I really would like to know why I failed for this approach, cos I thought it would have worked, or at least with some improvement

On Thu, 11 Feb 2021 at 23:36, Benjamin Redelings notifications@github.com wrote:

Another idea was to allow adding and removing haplotypes instead of fixing the number at five and setting the prior so that some have very low frequencies. This probably makes summarizing the posterior more complex. Maybe you would want to generate multiple summaries in that case, i.e. a summary for K=1 with one haplotype, a summary for K=2 with two haplotypes, a summary for K=3 with 3 haplotypes.

In any case, I think that you could add a 0/1 variable Z[k] for each of the 5 haplotypes, and say that x'[k] ~ normal(eta, sigma^2) x[k] = if z[k] == 1 then x'[j] else -Inf

It would be possible to actually remove x'[k] and h[k] from the MCMC state when Z[k] ==0 and the haplotype is missing. You would draw x'[k] from the prior and h[k] using UpdateSingleHap when adding the haplotype back in.

Please feel free to try it out for any new ideas. I would love to hear your thoughts, and perhaps we can schedule calls as well. My time is a little unpredictable at the moment due to child care, and I apologize for the late reply.

Some idea and experience from my part is that, you also need to consider the sequencing technology and mapping quality. The sequence coverage varies along sequence, simple cumulative probability can be heavily affected by some high read count. It would be better to do by chuncks, DEploid ibd has some functionality to adjust that. But my experience is that nothing does better job when you have a good quality reference panel, that’s why I spent quite a bit of effort in the latest best practices version to improve the panel. I am happy with the results, just taking time to get all documentation ready for release

You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/DEploid-dev/DEploid/issues/346#issuecomment-777867183, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA4FP7LVHAQOEN5DZFLFZ2TS6RSYHANCNFSM4WZOJQMQ .

shajoezhu commented 3 years ago

I removed the remote one cos I was using the local version heavily, and it started to branch off again. I didn’t realize you were rebased to that. Perhaps, we can find the hash, and create a branch that is dedicated for your tests? Let me know. Thanks!

On Fri, 19 Feb 2021 at 01:20, Benjamin Redelings notifications@github.com wrote:

Any thoughts?

It also looks like the doc branch has disappeared... do you know where it went?

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/DEploid-dev/DEploid/issues/346#issuecomment-781747341, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA4FP7LM52PRHS6CO72MZPTS7W4FJANCNFSM4WZOJQMQ .

bredelings commented 3 years ago

Sorry, I just saw this because the e-mails went into my spam folder. Yeah, finding some time to talk would be great. I am also really interested in your ideas about the sequencing technology and mapping quality. More later.

bredelings commented 3 years ago

I have some patches I wanted to submit to illustrate some of what I was talking about and clarify about the dynamic programming. Is there a good branch for me to make the patches against? Mostly its against updateHap.cpp

bredelings commented 3 years ago

BTW, it looks like there might be some double-counting in this line of UpdateSingleHap::calcFwdProbs():

            fwdTmp[i] = emission_[j][panel_->content_[hapIndex][i]] * (fwdProbs_.back()[i] * pNoRec + massFromRec);

It looks like the transition from i -> i is included in both fwdProbs_.back()[i] * pNoRec and also in massFromRec. If so, maybe this could be fixed by changing the line to:

            fwdTmp[i] = emission_[j][panel_->content_[hapIndex][i]] * (fwdProbs_.back()[i] * (pNoRec - pRecEachHap) + massFromRec);

Not entirely sure...

shajoezhu commented 3 years ago

Hi Ben, just branch off from the latest master should be fine. I will avoid to work on this file as well. Thank you!

On Fri, 26 Feb 2021 at 20:16, Benjamin Redelings notifications@github.com wrote:

I have some patches I wanted to submit to illustrate some of what I was talking about and clarify about the dynamic programming. Is there a good branch for me to make the patches against? Mostly its against updateHap.cpp

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/DEploid-dev/DEploid/issues/346#issuecomment-786871483, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA4FP7ID6R67CPZ64X6AXVLTA76R5ANCNFSM4WZOJQMQ .