CompEvol / beast2

Bayesian Evolutionary Analysis by Sampling Trees
www.beast2.org
GNU Lesser General Public License v2.1

beast2vs1 test for EBSP fails #999

Open rbouckaert opened 3 years ago

rbouckaert commented 3 years ago

The posterior of the testEBSP.xml fails for TreePriorTest. At some point it must have passed. When did this change and why? This might be a symptom of a bug introduced at some point. Finding out which commit made the test fail would be a good starting point.

walterxie commented 3 years ago

Hi Remco, after my investigation it seems the expectations for EBSP were simply wrong. I vaguely remember the test was still in development and may have been excluded from the test suites in the previous Ant call; I cannot figure out when it became active. I apologise for this past mistake. I ran testEBSP.xml using BEAST 1.8.4 and corrected all the expectations in the EBSP test using the attached log, but there still seems to be a gap between the two posteriors. The expected-posterior check fails: |(-3841.2851) - (-3762.515419489472)| is not <= 2 * (8.559 + 3.0511764457087116). Two logs are attached.
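
For reference, the acceptance criterion quoted above can be sketched as follows (the function name is hypothetical, not the actual beast2vs1 test code): two runs agree when their posterior means differ by no more than twice the sum of their standard errors of the mean.

```python
def posteriors_agree(mean_a, se_a, mean_b, se_b):
    """True if two posterior means differ by no more than 2 * (sum of standard errors)."""
    return abs(mean_a - mean_b) <= 2 * (se_a + se_b)

# Plugging in the figures quoted above shows why the test fails:
# |(-3841.2851) - (-3762.5154...)| = 78.77, but 2 * (8.559 + 3.0512) = 23.22
posteriors_agree(-3841.2851, 8.559, -3762.515419489472, 3.0511764457087116)  # False
```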

BEAST 1 b1.tar.gz. testEBSP.xml

BEAST 2 b2.tar.gz. testEBSP.xml

Five more parameters in other tests also failed, as attached: results.tar.gz

walterxie commented 3 years ago

There were a few problems running BEAST 1:

  1. I could not use BEAST 1.10.* on my Mac: it forces the use of BEAGLE 3, but throws an exception when using it.
  2. I had to downgrade to 1.8.4 and run without BEAGLE.
  3. testEBSP.xml in BEAST 1 seems difficult to converge. After I increased the chain length to 4 times the original, the ESS of the posterior was still only 85 in the attached log. I am running a 100M-state analysis, which will hopefully finish tonight.
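
As a reminder of what that ESS figure measures, here is a crude autocorrelation-based effective-sample-size estimate in the spirit of the one Tracer/BEAST report (a sketch, not BEAST's actual implementation, which truncates the autocorrelation sum differently):

```python
def ess(samples):
    """Crude effective sample size: n / (1 + 2 * sum of positive-lag autocorrelations)."""
    n = len(samples)
    mean = sum(samples) / n
    d = [v - mean for v in samples]
    var = sum(v * v for v in d) / n
    if var == 0:
        return float(n)
    tau = 1.0  # integrated autocorrelation time
    for lag in range(1, n):
        rho = sum(d[i] * d[i + lag] for i in range(n - lag)) / (n * var)
        if rho <= 0:  # truncate at the first non-positive autocorrelation
            break
        tau += 2 * rho
    return n / tau
```

A strongly autocorrelated chain yields an ESS far below the number of logged states, which is why 85 after 4x the chain length signals poor mixing.
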
walterxie commented 3 years ago

I also fixed the names of parameters.

walterxie commented 2 years ago

Here is the long run. The stats are very similar to the previous log.

b1-100M.tar.gz.

walterxie commented 2 years ago

I updated the two XMLs to use the HCV dataset. There have been some significant changes to BEAST 1, and it looks unfamiliar to me now, but I tried my best to make both XMLs comparable. I used ploidy="2.0" in beast1 and set factor = 2 in beast2. I could not get a Dirichlet prior working in BEAST 1.8, so I used a Uniform prior for the frequencies in both XMLs. I then manually compared the prior and likelihood sections of both XMLs to make them as similar as possible.

The updated results and XMLs are attached. Please use 20% burn-in:

| posterior | beast 2 | beast 1 |
| --- | --- | --- |
| mean | -6867.1698 | -6897.0506 |
| stderr of mean | 1.3096 | 2.5939 |

-6867.1698 - (-6897.0506) = 29.8808, while (1.3096 + 2.5939) * 2 = 7.807, so the difference exceeds the tolerance.

b1.tar.gz

b2.tar.gz

walterxie commented 2 years ago

I just found that beast2 by default does not use "Intervals Middle", but beast 1 does, which made the above two models different. I will re-run the tests tomorrow.

walterxie commented 2 years ago

The new logs from the long runs are attached. Both are supposed to use the same priors and models:

b1-100M.tar.gz

b2-100M.tar.gz

Tree stats are the same, but the coalescent likelihoods and popSizes are different. Could this be caused by different implementations of the models?

rbouckaert commented 2 years ago

@walterxie it looks like the first population size, and the population mean are consistent between the runs, but subsequent population sizes are double in b1 compared to b2. Maybe the factor=2 has something to do with that?

Has there been a time in the past where the test passed? Or did all commits since the beginning of time contain this faulty test? If there is a time in the past where this test passed, then it must be possible to find the time the test stopped passing and investigate the changes made in that commit that caused the change of behaviour.

walterxie commented 2 years ago

@rbouckaert I may be wrong, but according to Joseph's tutorial, the factor looks equivalent to ploidy in BEAST 1. I will go through the code to see whether I can confirm this.

I suspect the implementations of the EBSP model in BEAST 2 and BEAST 1 may differ: not only the popSizes, but also the indicators are very different. I have tried to make the priors as similar as I could in the tests above. The posteriors of the tree stats and substitution model parameters look very similar.

This test was incomplete and should have been excluded from Ant testing. I am not sure when it started to be used.

alexeid commented 2 years ago

I would change the ploidy to 1 to check whether that fixes the difference.

walterxie commented 2 years ago

OK. I also found the EBSP analyser in both, but they require trees to be logged. I will try these two changes.

walterxie commented 2 years ago

I have completely re-designed the tests as Alexei advised:

  1. use the HCV data set.
  2. use factor=1 and ploidy=1. After reading both codebases, I can confirm these are the same parameter, used to scale the popSize.

I ran the analysers in both BEASTs. The new logs, XMLs and plots are attached below.
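
To illustrate the confirmed equivalence (a hypothetical sketch, not BEAST source code): both factor (BEAST 2's EBSP) and ploidy (BEAST 1) multiply the population size before it enters the coalescent density, so the pairwise coalescent rate is divided by the same scaled size.

```python
def scaled_pop_size(pop_size, scale):
    # "factor" in BEAST 2 and "ploidy" in BEAST 1 play the same role here
    return scale * pop_size

def pairwise_coalescent_rate(k, pop_size, scale=1.0):
    """Rate at which any pair among k lineages coalesces, given pop_size."""
    return k * (k - 1) / 2.0 / scaled_pop_size(pop_size, scale)
```

With scale=2 every coalescent rate halves relative to scale=1, so a mismatch in this setting between the two programs would show up as a factor-of-two difference in the inferred popSizes.
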

I also combined the two plots into one, where the X axis is "time before present", blue is beast1 and red is beast2: b1vs2.pdf

I think there are differences between BEAST 1 and 2, but I am not sure why they only affect the popSize near the present. I am also not sure whether the wider HPD interval near the present (beast1) or the narrower one (beast2) is correct.

walterxie commented 2 years ago

Here is Joseph's plot https://bmcecolevol.biomedcentral.com/articles/10.1186/1471-2148-8-289/figures/7

walterxie commented 2 years ago

I found that the difference between the beast 1 and beast 2 plots was caused by the MixedDistributionLikelihood, which exists only in the beast 1 XML prior section.

<mixedDistributionLikelihood>
       <distribution0>
              <exponentialDistributionModel idref="demographic.populationMeanDist"/>
       </distribution0>
       <distribution1>
              <exponentialDistributionModel idref="demographic.populationMeanDist"/>
       </distribution1>
       <data>
              <parameter idref="demographic.popSize"/>
       </data>
       <indicators>
              <parameter idref="demographic.indicators"/>
       </indicators>
</mixedDistributionLikelihood>

BEAST 2 used

<prior id="popSizePrior.alltrees" name="distribution" x="@popSizes.alltrees">
      <Exponential id="popPriorDist.EBSP" mean="@populationMean.alltrees" name="distr"/>
</prior>

I will try to make these two equivalent, before continuing the comparison.

rbouckaert commented 2 years ago

The fragment indicates it is possible to assign a different prior to popSize entries based on the indicators, but since distribution0 and distribution1 refer to the same distribution, it should still produce the same prior as the one in BEAST 2 (assuming an exponential with mean=populationMean is used). If there is a difference, it must be something else.

The plots in b1vs2.pdf show only a very small difference in the uncertainty intervals at the beginning -- the means are a very good match. Perhaps this has something to do with burn-in?
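
The point that the mixed distribution collapses to a single prior when both component distributions are identical can be sketched as follows (a hypothetical illustration, not BEAST code; the numbers are arbitrary):

```python
import math

def exponential_logpdf(x, mean):
    """Log density of an exponential distribution parameterised by its mean."""
    return -math.log(mean) - x / mean

def mixed_log_likelihood(values, indicators, logpdf0, logpdf1):
    """Each indicator picks which of the two distributions scores its value."""
    return sum((logpdf1 if ind else logpdf0)(x) for x, ind in zip(values, indicators))

# When distribution0 and distribution1 refer to the same exponential (as in
# the beast 1 XML above), the mixture reduces to a single exponential prior:
pdf = lambda x: exponential_logpdf(x, mean=8.8)
values, indicators = [7.4, 9.2, 8.8], [0, 1, 1]
mixed = mixed_log_likelihood(values, indicators, pdf, pdf)
plain = sum(pdf(x) for x in values)  # identical, whatever the indicators are
```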

walterxie commented 2 years ago

Yes, you are correct about the mixedDistributionLikelihood. I also double-checked the Java code and XMLs for this and the other priors, except the Coalescent. They look the same, so the only part left unchecked is the EBSP coalescent.

If you look at the beast 1 log, the mean of the posterior of the popSize mean is about 8.8, and all 63 popSize means lie between 7.4 and 9.2. But in the beast 2 log only the mean of the 1st popSize is around 8.8; the remaining 62 means are around 4. The beast 1 result looks more reasonable.
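
A quick way to reproduce that per-parameter comparison from the trace logs (a sketch: BEAST trace logs are tab-separated with '#' comment lines, but the column name below is illustrative):

```python
import csv, io

def read_trace(text):
    """Parse a BEAST-style tab-delimited trace log into {column: [floats]}."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith('#')]
    columns = {}
    for row in csv.DictReader(io.StringIO('\n'.join(lines)), delimiter='\t'):
        for name, value in row.items():
            columns.setdefault(name, []).append(float(value))
    return columns

def posterior_mean(samples, burnin=0.2):
    """Mean after discarding the first 20% of samples as burn-in."""
    kept = samples[int(len(samples) * burnin):]
    return sum(kept) / len(kept)
```

Running this over the popSize columns of both logs would show directly which entries differ by the factor of two described above.
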

If the 1st popSize starts from the tips (the present), I suspect there may be a bug in the BEAST 2 EBSP likelihood calculation that halves the popSizes after the 1st interval towards the root. But if that were true, I do not understand why the plots match away from the present. I will have a try.

walterxie commented 2 years ago

I ran another comparison test with new settings: stepwise, and not using mid points. BEAST 2's plot seemed wrong:

b2.pdf

But BEAST 1 looked fine:

b1.pdf

XMLs and logs are shared in the Dropbox.

rbouckaert commented 2 years ago

Interestingly, values on the left hand side of the graph are significantly different, but on the right hand side they look compatible.

alexeid commented 2 years ago

BEAST2 is definitely broken with these settings. The BEAST1 results are the familiar results for this data set.

walterxie commented 2 years ago

@alexeid I found another potential problem as well: after I used the beast 1 analyser to plot the beast 2 log as you advised, the plot shape looked the same as beast 1, but the mean and HPD intervals were different.

b2-a1.pdf

I think the similar shape cannot confirm that the bug is only in beast 2's beast.app.tools.EBSPAnalyser.

In addition, their plotting mechanisms are totally different; I am not sure why compatibility was not considered at all. The differences are:

  1. the beast 1 analyser uses the original log, taking "popSize" and "indicators" as input.
  2. beast 2 has to write an extra log, named "EBSP.log" by default, which seems to contain the 1st popSize and the coalescent intervals; the beast 2 analyser then takes this special log as input. Since beast 1 does not produce such a log, I cannot plot a beast 1 run with the beast 2 analyser, which would otherwise be another good comparison.

Also, considering the differences and the strange pattern of the posterior distributions of popSizes and indicators, I suspect the bug may still be somewhere in CompoundPopulationFunction.

walterxie commented 2 years ago

I added two new XMLs, examples/beast2vs1/beast1/testEBSP-b1.xml and examples/beast2vs1/testEBSP-b2.xml, to replicate these problems. The previous two XMLs were renamed to testEBSP-old.xml.