OnlinePhylo / sts

Sequential Tree Sampler for online phylogenetics
GNU General Public License v3.0
16 stars 3 forks source link

check out performance of SMC on mixture distributions #58

Closed matsen closed 11 years ago

matsen commented 11 years ago

... a la Mossel and Vigoda 05, which is now in smctex repo.

matsen commented 11 years ago

Given an equal mixture of ((s1,s2),(s3,(s4,s5))); and ((s1,s5),(s3,(s2,s4)));

sts gives....

stoat bppsim/mixture_sim ‹58-mixture-sims*› » grep -v ext=0x0 sts.out | cut -f 2 | nw_order -c a - | nw_topology - | sort | uniq -c
    831 ((s1,(s2,s5)),(s3,s4));
    560 (s1,((s2,s5),(s3,s4)));
      1 (s1,(((s2,s5),s4),s3));
      4 (((s1,s3),(s2,s5)),s4);
      1 ((s1,s3),((s2,s5),s4));
    772 (((s1,s3),s4),(s2,s5));
   7831 ((s1,(s3,s4)),(s2,s5));

Huh?

Here's a run with the seed = 666:

stoat data/bppsim ‹58-mixture-sims*› » grep -v ext=0x0 sts.out | cut -f 2 | nw_order -c a - | nw_topology - | sort | uniq -c
      3 ((((s1,s5),s2),s3),s4);
   3529 ((((s1,s5),s2),s4),s3);
   6444 (((s1,s5),(s2,s4)),s3);
     24 ((s1,s5),((s2,s4),s3));

Rutro. Hopefully I'm doing something wrong.

matsen commented 11 years ago

In general, this is giving some strange results...

stoke bppsim/mixture_sim ‹58-mixture-sims*› » GSL_RNG_SEED=1 ../../../_build/sts -p 1000 combined.fasta | sort -n -r | uniq | head
GSL_RNG_SEED=1
Iter 1 max ll -1088.7 ESS: 1
Iter 2 max ll -1055.34 ESS: 2.58965
Iter 3 max ll -1048.21 ESS: 110.686
Iter 4 max ll -1043.45 ESS: 106.054
-1043.45        (((s5:1.27449,s2:1.61169):0.00636868,s1:0.00629971):1.36231,(s3:0.0251951,s4:1.2471):0.0865571);
-1043.57        (((s5:1.27449,s2:1.61169):0.00636868,s1:0.00629971):1.25463,(s3:0.0251951,s4:1.2471):0.290782);
-1043.7 (s4:0.092823,(((s5:1.27449,s2:1.61169):0.00636868,s1:0.00629971):1.33996,s3:0.0741969):1.22748);
-1043.92        ((s4:1.30228,s3:0.105849):0.544706,((s5:1.27449,s2:1.61169):0.00636868,s1:0.00629971):0.675433);
-1043.94        ((s4:1.30228,s3:0.105849):0.188523,((s5:1.27449,s2:1.61169):0.00636868,s1:0.00629971):1.02658);
-1043.96        (((s5:1.27449,s2:1.61169):0.00636868,s1:0.00629971):0.950441,(s4:1.30228,s3:0.105849):0.253143);
-1044.04        ((((s5:1.27449,s2:1.61169):0.00636868,s1:0.00629971):1.13789,s3:0.0967844):0.866778,s4:0.301352);
-1044.21        ((s4:1.30228,s3:0.105849):0.176417,((s5:1.27449,s2:1.61169):0.00636868,s1:0.00629971):1.42359);
-1044.29        ((s4:1.30228,s3:0.105849):1.15506,((s5:1.27449,s2:1.61169):0.00636868,s1:0.00629971):0.475791);
-1044.32        (((s5:1.27449,s2:1.61169):0.00636868,s1:0.00629971):0.183618,(s4:1.0169,s3:0.245236):1.08609);
stoke bppsim/mixture_sim ‹58-mixture-sims*› » GSL_RNG_SEED=1 ../../../_build/sts -p 10000 combined.fasta | sort -n -r | uniq | head
GSL_RNG_SEED=1
Iter 1 max ll -1088.24 ESS: 1
Iter 2 max ll -1080.95 ESS: 622.46
Iter 3 max ll -1076.33 ESS: 935.247
Iter 4 max ll -1074.01 ESS: 3423.63
-1074.01        (((s5:3.46078,s2:3.79798):0.0500598,s1:0.103553):0.796032,(s4:1.30079,s3:0.0173021):0.457147);
-1074.33        ((s5:3.46078,s2:3.79798):0.00668878,(s1:0.114969,(s3:0.0563485,s4:1.08045):1.1737):0.011216);
-1074.5 (((s3:0.185624,s4:1.13361):1.17912,s1:0.089926):0.00890922,(s5:3.46078,s2:3.79798):0.0303937);
-1074.58        ((s5:3.46078,s2:3.79798):0.0463491,((s4:1.0767,s3:0.0704241):1.21332,s1:0.128209):0.0682112);
-1074.61        ((s5:3.46078,s2:3.79798):0.0949676,((s3:0.0337284,s4:1.36346):1.47112,s1:0.0615826):0.117568);
-1074.61        ((s5:3.46078,s2:3.79798):0.0448418,(s1:0.114969,(s3:0.0563485,s4:1.08045):1.1737):0.0778398);
-1074.62        ((s1:0.114969,(s3:0.0563485,s4:1.08045):1.1737):0.0264555,(s5:3.46078,s2:3.79798):0.0994435);
-1074.69        ((s5:3.46078,s2:3.79798):0.00357332,(s1:0.205729,(s3:0.134276,s4:1.14999):1.30603):0.00708922);
-1074.71        ((s1:0.135411,(s5:3.46078,s2:3.79798):0.118505):1.15157,(s4:1.05382,s3:0.11448):0.0816781);
-1074.74        ((s1:0.0664278,(s5:3.46078,s2:3.79798):0.110003):1.10267,(s3:0.185624,s4:1.13361):0.250725);
stoke bppsim/mixture_sim ‹58-mixture-sims*› » GSL_RNG_SEED=666 ../../../_build/sts -p 1000 combined.fasta | sort -n -r | uniq | head
GSL_RNG_SEED=666
Iter 1 max ll -1061.67 ESS: 1
Iter 2 max ll -1045.69 ESS: 46.2518
Iter 3 max ll -1033.57 ESS: 15.763
Iter 4 max ll -1023.41 ESS: 137.157
-1023.41        (s3:0.0339332,((s5:0.58922,s1:0.189676):0.375351,(s4:0.697584,s2:0.23578):0.345249):0.652979);
-1023.47        (((s5:0.58922,s1:0.189676):0.375351,(s4:0.697584,s2:0.23578):0.345249):0.170208,s3:0.487317);
-1023.74        (s3:0.666061,((s5:0.58922,s1:0.189676):0.235697,(s2:0.137558,s4:0.793384):0.522587):0.164335);
-1023.74        (s3:0.384905,((s5:0.58922,s1:0.189676):0.235697,(s2:0.137558,s4:0.793384):0.522587):0.449171);
-1023.75        (s3:0.24561,((s5:0.58922,s1:0.189676):0.235697,(s2:0.137558,s4:0.793384):0.522587):0.609824);
-1023.76        (s3:0.582374,((s5:0.58922,s1:0.189676):0.235697,(s2:0.137558,s4:0.793384):0.522587):0.281901);
-1023.76        (s3:0.177982,((s5:0.58922,s1:0.189676):0.235697,(s2:0.137558,s4:0.793384):0.522587):0.606797);
-1023.79        (((s5:0.58922,s1:0.189676):0.235697,(s2:0.137558,s4:0.793384):0.522587):0.685696,s3:0.0804986);
-1023.79        (((s5:0.58922,s1:0.189676):0.235697,(s2:0.137558,s4:0.793384):0.522587):0.197177,s3:0.566224);
-1023.81        (((s5:0.58922,s1:0.189676):0.235697,(s2:0.137558,s4:0.793384):0.522587):0.142337,s3:0.613803);
stoke bppsim/mixture_sim ‹58-mixture-sims*› » GSL_RNG_SEED=666 ../../../_build/sts -p 10000 combined.fasta | sort -n -r | uniq | head
GSL_RNG_SEED=666
Iter 1 max ll -1061.71 ESS: 1
Iter 2 max ll -1045.73 ESS: 488.039
Iter 3 max ll -1032.45 ESS: 119.741
Iter 4 max ll -1022.85 ESS: 1278.35
-1022.85        (s3:0.514591,((s2:0.0832919,(s5:0.620892,s1:0.175957):0.814703):0.405304,s4:0.459085):0.334792);
-1022.85        (s3:0.359605,((s2:0.0832919,(s5:0.620892,s1:0.175957):0.814703):0.405304,s4:0.459085):0.493625);
-1022.88        (((s2:0.0832919,(s5:0.620892,s1:0.175957):0.814703):0.405304,s4:0.459085):0.358518,s3:0.530071);
-1022.96        (s3:0.523137,((s2:0.0832919,(s5:0.620892,s1:0.175957):0.814703):0.405304,s4:0.459085):0.23924);
-1023.03        (s3:0.237056,((s2:0.0832919,(s5:0.620892,s1:0.175957):0.814703):0.405304,s4:0.459085):0.721196);
-1023.03        (((s2:0.0832919,(s5:0.620892,s1:0.175957):0.814703):0.405304,s4:0.459085):0.0528902,s3:0.690108);
-1023.05        ((s4:0.306404,(s2:0.0832919,(s5:0.620892,s1:0.175957):0.814703):0.549425):0.367111,s3:0.532221);
-1023.05        ((s4:0.306404,(s2:0.0832919,(s5:0.620892,s1:0.175957):0.814703):0.549425):0.0461595,s3:0.858826);
-1023.05        (s3:0.603603,(s4:0.306404,(s2:0.0832919,(s5:0.620892,s1:0.175957):0.814703):0.549425):0.293025);
-1023.06        ((s4:0.306404,(s2:0.0832919,(s5:0.620892,s1:0.175957):0.814703):0.549425):0.852576,s3:0.0754915);
koadman commented 11 years ago

The site weighting fix changes the picture here but still not recovering the mixture. However, the final generation ESS for my runs was only ~4 with 10000 particles, so maybe not worth reading much into it until we're getting decent ESS.

matsen commented 11 years ago

Thanks for doing this, AD. After thinking about it more, it's not clear to me that we should be sampling the mixture. The theorem says that if one starts an MCMC at T1 it may take exponential time to T2. It doesn't say anything about our problem in particular. I have also worked with these mixtures and they are strange.

Check this out:

stoke bppsim/mixturesim ‹58-mixture-sims› » bppml param=check_sts_tree_on_combo.bpp | grep Log Log likelihood.........................: -6400.45691169429 stoke bppsim/mixturesim ‹58-mixture-sims› » bppml param=check_1_on_combo.bpp | grep Log Log likelihood.........................: -6849.82887574946

On Sun, Nov 11, 2012 at 2:55 PM, Aaron Darling notifications@github.comwrote:

The site weighting fix changes the picture here but still not recovering the mixture. However, the final generation ESS for my runs was only ~4 with 10000 particles, so maybe not worth reading much into it until we're getting decent ESS.

— Reply to this email directly or view it on GitHubhttps://github.com/metatangle/sts/issues/58#issuecomment-10273609.

Frederick "Erick" Matsen, Assistant Member Fred Hutchinson Cancer Research Center http://matsen.fhcrc.org/

koadman commented 11 years ago

Does the above command snippet suggest that sts is finding a tree that is ~450 log units more likely than T1 on the mixture dataset? If so, it seems incredibly unlikely we would ever recover T1. This doesn't bother me too much. If we have a tree mixture then a tree is the wrong model to fit. If there is spatial structure associated with tree changes then this falls in the realm of recombination inference methods.

matsen commented 11 years ago

That is correct, and I agree. The paper setup was explicitly about MCMC-- that if we start in T1 then we won't make it to T2. It never says that T1 or T2 are particularly good models for the data.

The question is: is this simulated data worth keeping around? It's not big. I'd prefer not having it in an orphaned branch.

On Mon, Nov 12, 2012 at 7:52 AM, Aaron Darling notifications@github.comwrote:

Does the above command snippet suggest that sts is finding a tree that is ~450 log units more likely than T1 on the mixture dataset? If so, it seems incredibly unlikely we would ever recover T1. This doesn't bother me too much. If we have a tree mixture then a tree is the wrong model to fit. If there is spatial structure associated with tree changes then this falls in the realm of recombination inference methods.

— Reply to this email directly or view it on GitHubhttps://github.com/metatangle/sts/issues/58#issuecomment-10292517.

Frederick "Erick" Matsen, Assistant Member Fred Hutchinson Cancer Research Center http://matsen.fhcrc.org/