amkozlov / raxml-ng

RAxML Next Generation: faster, easier-to-use and more flexible
GNU Affero General Public License v3.0

`raxml-ng --parse` does not produce `.rba` file for a `.catg` file as `--msa` input in version 1.1.0 #158

Closed dlaehnemann closed 1 year ago

dlaehnemann commented 1 year ago

According to the "Preparing the alignment" section of the tutorial:

In addition to MSA sanity check, this command will perform two useful operations:

  1. Compress alignment patterns and store MSA in the binary format (RAxML Binary Alignment, RBA):
    NOTE: Binary MSA file created: T2.raxml.rba

    Since pattern compression could take quite some time for large MSAs, loading RBA file is (much) faster compared to FASTA or PHYLIP.

Thus, when running the following command, I was expecting a results/raxml_ng_parse/control.raxml.rba file to be created:

raxml-ng --parse --msa results/raxml_ng_input/control.ml_gt_and_likelihoods.catg --model GTGTR+FO --prefix results/raxml_ng_parse/control --log DEBUG

However, this file is not created and no NOTE: Binary MSA file created: ... appears in the logs, even with --log DEBUG. What am I doing wrong? Or is the .rba creation simply not supported for CATG files? Or does it not make sense? If so, I'd suggest adding this information to the Wiki, both in the above location and in the section on output files: https://github.com/amkozlov/raxml-ng/wiki/Output:-files-and-settings#output-files

For more details, here's the --log DEBUG output of the above command:

RAxML-NG v. 1.1 released on 29.11.2021 by The Exelixis Lab.
Developed by: Alexey M. Kozlov and Alexandros Stamatakis.
Contributors: Diego Darriba, Tomas Flouri, Benoit Morel, Sarah Lutteropp, Ben Bettisworth.
Latest version: https://github.com/amkozlov/raxml-ng
Questions/problems/suggestions? Please visit: https://groups.google.com/forum/#!forum/raxml

System: AMD EPYC 7443P 24-Core Processor, 24 cores, 995 GB RAM

RAxML-NG was called at 31-Mar-2023 14:42:18 as follows:

raxml-ng --parse --msa results/raxml_ng_input/control.ml_gt_and_likelihoods.catg --model GTGTR+FO --prefix results/raxml_ng_parse/control --log DEBUG

Analysis options:
  run mode: Alignment parsing and compression
  start tree(s): 
  random seed: 1680266538
  tip-inner: OFF
  pattern compression: ON
  per-rate scalers: OFF
  site repeats: ON
  branch lengths: proportional (ML estimate, algorithm: NR-FAST)
  SIMD kernels: AVX2
  parallelization: coarse-grained (auto), PTHREADS (auto)

RBA partial loading: OFF
|noname|   |GTGTR+FO|   ||
[00:00:00] Reading alignment from file: results/raxml_ng_input/control.ml_gt_and_likelihoods.catg
Failed to load as IPHYLIP: Unable to parse PHYLIP file: results/raxml_ng_input/control.ml_gt_and_likelihoods.catg
 (LIBPLL-233): Sequence 2 (AAAAAMNAAAAAAAMNAMMNMANMNMAANANA) data out of alignment
Failed to load as PHYLIP: Unable to parse PHYLIP file: results/raxml_ng_input/control.ml_gt_and_likelihoods.catg
 (LIBPLL-232): Sequence 1 (sample_x) longer than expected
Failed to load as FASTA: Error parsing FASTA file: results/raxml_ng_input/control.ml_gt_and_likelihoods.catg
 (LIBPLL-203): Illegal header line in query fasta file
CATG: taxa: 32, sites: 1335471
CATG: taxon 0: sample_x
... [all the other taxons]
CATG: site 0 consesus seq: AAAAAMNAAAAAAAMNANNNMANMNMAANANA
CATG: number of states: 10
CATG: site 1 consesus seq: AAAAAANMAAAMAMMNANNNAANMNAAANANA
... [all the other sites]
CATG: site 1335470 consesus seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNKG
[00:00:59] Loaded alignment with 32 taxa and 1335471 sites
[00:00:59] Extracting partitions... 
[00:00:59] Checking the alignment...

Alignment comprises 1 partitions and 1335471 sites

Partition 0: noname
Model: GTGTR+FO
Alignment sites: 1335471
Gaps: 61.92 %
Invariant sites: 0.00 %

* Per-taxon CLV size (elements)                : 13354710
* Estimated memory requirements                : 6318 MB

* Recommended number of threads / MPI processes: 108
* Maximum     number of threads / MPI processes: 344
* Minimum     number of threads / MPI processes: 31

Please note that numbers given above are rough estimates only. 
Actual memory consumption and parallel performance on your system may differ!

Alignment can be successfully read by RAxML-NG.

Execution log saved to: /absolut/path/to/results/raxml_ng_parse/control.raxml.log

Analysis started: 31-Mar-2023 14:42:18 / finished: 31-Mar-2023 14:43:18

Elapsed time: 60.382 seconds

Consumed energy: 1.169 Wh

amkozlov commented 1 year ago

Hi David,

you're right: an RBA file will not be created for probabilistic alignments (e.g. CATG), mainly because (discrete) pattern compression does not work in this case.

I added a corresponding note to the tutorial.
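To illustrate the point about pattern compression, here is a minimal Python sketch. It is not RAxML-NG's actual implementation: the function name `compress_patterns` and all data are made up, and the tuples are a simplified stand-in rather than the CATG file layout. Discrete compression merges identical alignment columns and keeps a weight per unique column. With per-taxon likelihood vectors, two columns merge only if every floating-point value matches exactly, so in practice almost nothing collapses.

```python
from collections import Counter


def compress_patterns(columns):
    """Collapse identical columns into unique patterns plus their counts.

    Hypothetical helper for illustration only; columns must be hashable.
    """
    counts = Counter(columns)
    patterns = list(counts)  # unique columns, in order of first appearance
    weights = [counts[p] for p in patterns]
    return patterns, weights


# Discrete alignment: one character per taxon in each column.
discrete_columns = [
    ("A", "A", "C"),
    ("A", "A", "C"),
    ("G", "T", "T"),
    ("A", "A", "C"),
]

# Probabilistic alignment: one likelihood vector per taxon in each column.
# The numbers are invented; the columns are similar but never identical.
prob_columns = [
    ((0.97, 0.01, 0.01, 0.01), (0.90, 0.05, 0.03, 0.02), (0.02, 0.95, 0.02, 0.01)),
    ((0.96, 0.02, 0.01, 0.01), (0.91, 0.04, 0.03, 0.02), (0.01, 0.96, 0.02, 0.01)),
    ((0.01, 0.01, 0.97, 0.01), (0.02, 0.03, 0.05, 0.90), (0.01, 0.02, 0.02, 0.95)),
    ((0.98, 0.01, 0.00, 0.01), (0.89, 0.06, 0.03, 0.02), (0.03, 0.94, 0.02, 0.01)),
]

discrete_patterns, discrete_weights = compress_patterns(discrete_columns)
prob_patterns, prob_weights = compress_patterns(prob_columns)

print(len(discrete_columns), "discrete columns ->", len(discrete_patterns), "patterns")
print(len(prob_columns), "probabilistic columns ->", len(prob_patterns), "patterns")
```

In this toy input the four discrete columns reduce to two patterns, while the four probabilistic columns all stay distinct, so a compressed binary representation would save nothing.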

dlaehnemann commented 1 year ago

Thanks for the quick response!

Quick follow-up question about the note you added: does VCF input work in the regular raxml-ng version 1.1? I had seen it in the CellPhy project, but from looking at the main repo code here in raxml-ng, I thought VCF support was so far only implemented on the respective CellPhy branch (which is not yet merged, so not part of the 1.1 release).

amkozlov commented 1 year ago

That's correct. As of now, VCF support is only available in the cellphy branch.

dlaehnemann commented 1 year ago

Thanks for the clarification!