amkozlov / raxml-ng

RAxML Next Generation: faster, easier-to-use and more flexible
GNU Affero General Public License v3.0
374 stars 62 forks source link

all branch lengths always `0.000001` with CATG file #162

Closed dlaehnemann closed 8 months ago

dlaehnemann commented 10 months ago

This issue sounds similar to issue #131 , but this should be resolved in the version of RAxML-NG I am using (v1.2.0).

This awkward result could very well be due to my data, but this might also be some bug. In either case, I just want to double-check here. Basically, I have a CATG file with the first header line being 32 10638. So 32 taxa and 10638 genomic sites with genotypes.

As I do have missing data, and the wiki example of a CATG file doesn't have any missing data, I might have mis-encoded the missing data. I have used N for any taxon where there is no data to provide a most-likely genotype, as this sounded like the natural IUPAC code to use. And I put in a uniform likelihood of 0.01 for each genotype in such cases, which should be fine from a computational/statistical point of view, right? But maybe this encoding is somehow not read correctly?

Here's an example record:

AMAAAMAAAAMNAAAANNNNMNAAANNANAAN        0.9871879816055298,0.0,0.0,0.0,0.012812220867857604,0.0,0.0,0.0,0.0,0.0 0.1819639950990677,0.0,0.0,0.0,0.8180361855775118,0.0,0.0,0.0,0.0,0.0   0.8395419716835022,0.0,0.0,0.0,0.16045808886701707,0.0,0.0,0.0,0.0,0.0       0.98683100938797,0.0,0.0,0.0,0.013169037386745686,0.0,0.0,0.0,0.0,0.0   0.985372006893158,0.0,0.0,0.0,0.014628024706070164,0.0,0.0,0.0,0.0,0.0       0.005896709859371185,0.0,0.0,0.0,0.9941030144691467,0.0,0.0,0.0,0.0,0.0 0.9241740107536316,0.0,0.0,0.0,0.07582635227299761,0.0,0.0,0.0,0.0,0.0  0.7468820214271545,0.0,0.0,0.0,0.2531171469017863,0.0,0.0,0.0,0.0,0.0        0.7929589748382568,0.0,0.0,0.0,0.2070399874719442,0.0,0.0,0.0,0.0,0.0   0.6358969807624817,0.0,0.0,0.0,0.3641033714520745,0.0,0.0,0.0,0.0,0.0        0.08862370252609253,0.0,0.0,0.0,0.9113762155175209,0.0,0.0,0.0,0.0,0.0  0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01       0.9859390258789062,0.0,0.0,0.0,0.014060986421768007,0.0,0.0,0.0,0.0,0.0      0.9867990016937256,0.0,0.0,0.0,0.013201237804437937,0.0,0.0,0.0,0.0,0.0 0.7650189995765686,0.0,0.0,0.0,0.23498036669479916,0.0,0.0,0.0,0.0,0.0  0.8317210078239441,0.0,0.0,0.0,0.16827980155176192,0.0,0.0,0.0,0.0,0.0       0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01       0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01       0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01    0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01       2.419139946141513e-07,0.0,0.0,0.0,0.9999995185062289,0.0,0.0,0.0,0.0,0.0        0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01    0.9872499704360962,0.0,0.0,0.0,0.012750117551583173,0.0,0.0,0.0,0.0,0.0 0.9868090152740479,0.0,0.0,0.0,0.013190938525085016,0.0,0.0,0.0,0.0,0.0      0.9854879975318909,0.0,0.0,0.0,0.014512016528115623,0.0,0.0,0.0,0.0,0.0 0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01       0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01   0.9841110110282898,0.0,0.0,0.0,0.015888730623856873,0.0,0.0,0.0,0.0,0.0  0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01       0.9848629832267761,0.0,0.0,0.0,0.01513656292985388,0.0,0.0,0.0,0.0,0.0       0.9855980277061462,0.0,0.0,0.0,0.014402108243686484,0.0,0.0,0.0,0.0,0.0 0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01

Otherwise, here's the log of the run. Maybe I am also mis-configuring something, but I hope you could spot this from the log:

RAxML-NG v. 1.2.0 released on 09.05.2023 by The Exelixis Lab.
Developed by: Alexey M. Kozlov and Alexandros Stamatakis.
Contributors: Diego Darriba, Tomas Flouri, Benoit Morel, Sarah Lutteropp, Ben Bettisworth, Julia Haag, Anastasis Togkousidis.
Latest version: https://github.com/amkozlov/raxml-ng
Questions/problems/suggestions? Please visit: https://groups.google.com/forum/#!forum/raxml

System: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 20 cores, 125 GB RAM

RAxML-NG was called at 15-Aug-2023 01:38:49 as follows:

raxml-ng --all --msa results/raxml_ng_input/control.ml_gt_and_likelihoods.catg --model GTGTR+FO+ASC_FELS{95997794} --prefix results/raxml_ng/control --prob-msa on --threads auto{24} --tree pars{5},rand{5} --bs-trees autoMRE{10} --redo

Analysis options:
  run mode: ML tree search + bootstrapping (Felsenstein Bootstrap)
  start tree(s): random (5) + parsimony (5)
  bootstrap replicates: parsimony (max: 10) + bootstopping (autoMRE, cutoff: 0.030000)
  random seed: 1692056329
  tip-inner: OFF
  pattern compression: OFF
  per-rate scalers: OFF
  site repeats: OFF
  logLH epsilon: general: 10.000000, brlen-triplet: 1000.000000
  fast spr radius: AUTO
  spr subtree cutoff: 1.000000
  fast CLV updates: ON
  branch lengths: proportional (ML estimate, algorithm: NR-FAST)
  SIMD kernels: AVX
  parallelization: coarse-grained (auto), PTHREADS (auto)

WARNING: Running in REDO mode: existing checkpoints are ignored, and all result files will be overwritten!

[00:00:00] Reading alignment from file: results/raxml_ng_input/control.ml_gt_and_likelihoods.catg
[00:00:01] Loaded alignment with 32 taxa and 10638 sites

Alignment comprises 1 partitions and 10638 sites

Partition 0: noname
Model: GT10GTR+FO+ASC_FELS{95997794}
Alignment sites: 10638
Gaps: 39.08 %
Invariant sites: 0.00 %

Parallelization scheme autoconfig: 4 worker(s) x 6 thread(s)

[00:00:01] Generating 5 random starting tree(s) with 32 taxa
[00:00:01] Generating 5 parsimony starting tree(s) with 32 taxa
Parallel parsimony with 24 threads
Parallel reduction/worker buffer size: 1 KB  / 0 KB

[00:00:01] Data distribution: max. partitions/sites/weight per thread: 1 / 1773 / 17730
[00:00:01] Data distribution: max. searches per worker: 6

Starting ML tree search with 10 distinct starting trees

[00:02:10] [worker #2] ML tree search #3, logLikelihood: -530568.188610
[00:02:55] [worker #1] ML tree search #2, logLikelihood: -530676.936851
[00:03:08] [worker #3] ML tree search #4, logLikelihood: -530609.671563
[00:03:54] [worker #0] ML tree search #1, logLikelihood: -530741.064193
[00:04:16] [worker #2] ML tree search #7, logLikelihood: -530908.939626
[00:05:00] [worker #3] ML tree search #8, logLikelihood: -530711.939062
[00:05:46] [worker #1] ML tree search #6, logLikelihood: -530718.690517
[00:06:27] [worker #0] ML tree search #5, logLikelihood: -530943.188697
[00:07:25] [worker #1] ML tree search #10, logLikelihood: -531145.938345
[00:08:49] [worker #0] ML tree search #9, logLikelihood: -530924.603725

[00:08:49] ML tree search completed, best tree logLH: -530568.188610

[00:08:49] Starting bootstrapping analysis with 10 replicates.

[00:09:56] [worker #0] Bootstrap tree #1, logLikelihood: -528178.871646
[00:10:11] [worker #2] Bootstrap tree #3, logLikelihood: -531041.933717
[00:11:05] [worker #3] Bootstrap tree #4, logLikelihood: -530743.251586
[00:11:09] [worker #1] Bootstrap tree #2, logLikelihood: -532071.815403
[00:11:40] [worker #2] Bootstrap tree #7, logLikelihood: -529495.790236
[00:12:00] [worker #0] Bootstrap tree #5, logLikelihood: -531030.124575
[00:12:14] [worker #3] Bootstrap tree #8, logLikelihood: -530656.554783
[00:12:45] [worker #1] Bootstrap tree #6, logLikelihood: -530754.040792
[00:13:24] [worker #0] Bootstrap tree #9, logLikelihood: -530640.605729
[00:14:07] [worker #1] Bootstrap tree #10, logLikelihood: -529609.084846

Optimized model parameters:

   Partition 0: noname
   Rate heterogeneity: NONE
   Base frequencies (ML): 0.082421 0.107135 0.114934 0.081561 0.057580 0.190904 0.043634 0.065069 0.204401 0.052362 
   Substitution rates (ML): 0.001000 0.001000 0.001000 17.697923 23.073127 22.289657 0.001000 0.001000 0.001000 0.001000 0.001000 20.302416 0.001000 0.001000 19.836154 24.445559 0.001000 0.001000 0.001000 23.084650 0.001000 19.965993 0.001000 23.421316 0.001000 0.001000 23.201109 0.001000 24.812040 21.300049 0.001000 0.056605 0.001000 0.001000 0.001000 0.001000 0.001000 0.001000 0.001000 0.540387 0.001000 0.001000 0.001000 0.001000 1.000000 

Final LogLikelihood: -530568.188610

AIC score: 1061364.377220 / AICc score: 1061366.868904 / BIC score: 1062193.406626
Free parameters (model + branch lengths): 114

WARNING: Best ML tree contains 29 near-zero branches!

Best ML tree with collapsed near-zero branches saved to: /abs/path/to/results/raxml_ng/control.raxml.bestTreeCollapsed
Best ML tree saved to: /abs/path/to/results/raxml_ng/control.raxml.bestTree
All ML trees saved to: /abs/path/to/results/raxml_ng/control.raxml.mlTrees
Best ML tree with Felsenstein bootstrap (FBP) support values saved to: /abs/path/to/results/raxml_ng/control.raxml.support
Optimized model saved to: /abs/path/to/results/raxml_ng/control.raxml.bestModel
Bootstrap trees saved to: /abs/path/to/results/raxml_ng/control.raxml.bootstraps

Execution log saved to: /abs/path/to/results/raxml_ng/control.raxml.log

Analysis started: 15-Aug-2023 01:38:49 / finished: 15-Aug-2023 01:52:57

Elapsed time: 848.108 seconds

Consumed energy: 63.696 Wh
amkozlov commented 10 months ago

Hi David,

I think you just hit the lower branch length limit.

A couple of things to try:

dlaehnemann commented 10 months ago

Thanks, will try those.

Generally, do you expect the ascertainment bias correction to actually change branch length ratios, or will it mostly scale the whole tree?

amkozlov commented 10 months ago

I guess asc. bias should mostly affect tree scaling, but maybe have a look at this paper:

https://academic.oup.com/sysbio/article/64/6/1032/1669226

dlaehnemann commented 8 months ago

Turning off ascertainment bias did work, so this was the lower branch length limit, indeed. Many thanks!