amkozlov / raxml-ng

RAxML Next Generation: faster, easier-to-use and more flexible
GNU Affero General Public License v3.0

Making tree from large dataset #181

Closed kevinmyers closed 1 month ago

kevinmyers commented 2 months ago

I want to use RAxML-ng to make a tree for 4,823 species (from NCBI Representative and Reference genome list). I used GTDB-Tk to construct the alignment from the Bac120 marker list for each genome. I am using RAxML-ng v1.2.0 and CentOS 7.

I first ran the parse command: raxml-ng --parse --msa gtdbtk.bac120.user_msa.fasta --model LG+G8+F --prefix T1

It recommended 14 threads and ~1 GB of RAM. I ran this on a cluster requesting 14 CPUs and 4 GB of RAM:

raxml-ng --all --msa T1.raxml.rba --model LG+G8+F --prefix tree --threads 14 --seed 2

I am attaching the output from RAxML-ng. It took over 71 hours to construct the first tree, and the run stopped (or was killed) after more than 1070 hours, having reached ML tree search 13.

Is there anything I can do to speed up construction of the tree? I'd like to use RAxML-ng because I've had great success with it in the past, but with far fewer samples than this (the most I've tried is around 200).

raxml_out.txt
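
For a rough sense of scale, the numbers above can be extrapolated with shell arithmetic. This is only a back-of-the-envelope sketch: it assumes 13 searches had finished after 1070 hours and that the run uses the default of 20 starting trees (10 parsimony + 10 random), which is an assumption since no --tree option was given. The helper name project_hours is made up for illustration, and the bootstrap replicates that --all runs afterwards are not counted.

```shell
# Project the total hours for all ML searches from the hours spent so far.
# Integer arithmetic, so the result is rounded down.
project_hours() {
  elapsed=$1
  done_searches=$2
  total_searches=$3
  echo $(( elapsed * total_searches / done_searches ))
}

elapsed=1070   # hours reported in the post
finished=13    # ML tree searches reached, per the post
planned=20     # assumption: default of 10 parsimony + 10 random starting trees

total=$(project_hours "$elapsed" "$finished" "$planned")
echo "projected hours for $planned searches: $total"
echo "projected days: $(( total / 24 ))"
```

Under these assumptions the ML searches alone would need on the order of 1600 hours on 14 threads, before any bootstrapping.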

amkozlov commented 1 month ago

Dear Kevin,

sorry for the late response.

First of all, I'd recommend trying out our new adaptive version of raxml-ng (https://academic.oup.com/mbe/article/40/10/msad227/7296053), which should be substantially faster on alignments with several thousand taxa:

https://github.com/amkozlov/raxml-ng/wiki/Installation#building-adaptive-branch

Apart from this:

kevinmyers commented 1 month ago

Thank you for the reply.

I am working to install the adaptive raxml-ng on our server cluster. When it is installed I will run it and use at least 32 threads/cores. I can let you know how it works.

amkozlov commented 1 month ago

Yes, it would be great if you could report back on the results.

kevinmyers commented 1 month ago

I had to install cmake via Anaconda, which I have done. The cmake .. command ran fine, but when I tried to run make I got the following error (I'm including the cmake output as well as the make error):

(cmake_env) [kmyers@scarcity-1 raxml-ng-adaptive]$ cd build/
(cmake_env) [kmyers@scarcity-1 build]$ cmake ..
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 4.8.5
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Compiler: GNU 4.8.5 => /usr/bin/c++
-- Building RELEASE
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Using flags: -std=c++11 -Wall -Wextra -D_RAXML_PTHREADS -pthread
-- Building dependencies in: /home/glbrc.org/kmyers/bin/raxml-ng-adaptive/build/localdeps
-- Build type: RELEASE
-- Building coraxlib as a static library.
-- Building documentation: OFF
-- Building tests: OFF
-- Building benchmarks: OFF
-- Building difficutly prediction: ON
-- Enable SSE SIMD kernels: ON
-- Enable AVX SIMD kernels: ON
-- Enable AVX2 SIMD kernels: ON
-- Libs: coraxlib_difficulty_prediction_lib;corax;m
-- Could NOT find GTest (missing: GTEST_INCLUDE_DIR GTEST_MAIN_LIBRARY) 
-- GTest not found
CMake Warning at test/src/CMakeLists.txt:5 (message):
  Skipping building tests.

-- Configuring done (9.8s)
-- Generating done (1.1s)
-- Build files have been written to: /home/glbrc.org/kmyers/bin/raxml-ng-adaptive/build
(cmake_env) [kmyers@scarcity-1 build]$ make
[  1%] Building C object libs/coraxlib/lib/difficulty_prediction/src/CMakeFiles/coraxlib_difficulty_prediction_lib.dir/difficulty.c.o
/home/glbrc.org/kmyers/bin/raxml-ng-adaptive/libs/coraxlib/lib/difficulty_prediction/src/difficulty.c: In function ‘corax_msa_predict_difficulty’:
/home/glbrc.org/kmyers/bin/raxml-ng-adaptive/libs/coraxlib/lib/difficulty_prediction/src/difficulty.c:86:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (int i = 0; i < num_features; ++i)
   ^
/home/glbrc.org/kmyers/bin/raxml-ng-adaptive/libs/coraxlib/lib/difficulty_prediction/src/difficulty.c:86:3: note: use option -std=c99 or -std=gnu99 to compile your code
make[2]: *** [libs/coraxlib/lib/difficulty_prediction/src/CMakeFiles/coraxlib_difficulty_prediction_lib.dir/difficulty.c.o] Error 1
make[1]: *** [libs/coraxlib/lib/difficulty_prediction/src/CMakeFiles/coraxlib_difficulty_prediction_lib.dir/all] Error 2
make: *** [all] Error 2

The GitHub page on installing the adaptive raxml-ng suggests the following:

Problem #1: CMake fails to find the correct GCC version. Solution: Manually set CXX and CC environment variables, e.g.: CXX=/cm/shared/apps/gcc/5.3.0/bin/gcc CC=/cm/shared/apps/gcc/5.3.0/bin/gcc cmake ..

I cannot find the /cm/ directory on the server or in the Conda directory for the cmake_env.

I have emailed this to our IT staff about the server cluster, but wanted to post it here in case there's something simple I'm missing.
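
For context on the make error: GCC releases before 5 default to a pre-C99 C dialect, which rejects declarations inside the for statement, and the cmake output shows GCC 4.8.5. The sketch below checks a version string against that cutoff. The function names are made up for illustration, and the version is hard-coded from the log; on a live system `gcc -dumpversion` would supply it.

```shell
# Print the major version from a dotted version string, e.g. "4.8.5" -> "4".
gcc_major() {
  printf '%s\n' "${1%%.*}"
}

# Succeed (exit 0) if this GCC version defaults to a pre-C99 dialect.
# GCC switched its default C dialect to gnu11 in release 5.
needs_c99_flag() {
  [ "$(gcc_major "$1")" -lt 5 ]
}

version="4.8.5"   # taken from the cmake output above
if needs_c99_flag "$version"; then
  echo "GCC $version: pass -std=gnu99 or switch to a newer compiler"
else
  echo "GCC $version: C99 loop declarations work by default"
fi
```

One possible way to pass the flag through CMake is `cmake -DCMAKE_C_FLAGS=-std=gnu99 ..`, but that is untested here and other parts of the build may still need a newer compiler.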

amkozlov commented 1 month ago

You need to replace /cm/shared/apps/gcc/5.3.0/bin/gcc with the actual path to the gcc compiler.

But in a cluster environment, you should probably load the corresponding modules for both cmake and gcc instead.

Please check your cluster documentation, or wait for IT staff to reply.
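
As a rough sketch of the advice above, the compiler paths can be looked up on PATH instead of being copied from the wiki example. The helper pick_compiler and the module names are hypothetical; the real module names depend on the cluster. The script prints the cmake command line rather than running it, so the paths can be checked first.

```shell
# Print the full path of the first candidate that exists on PATH.
pick_compiler() {
  for name in "$@"; do
    if command -v "$name" >/dev/null 2>&1; then
      command -v "$name"
      return 0
    fi
  done
  return 1
}

# On a cluster, load the toolchain first. These module names are
# hypothetical; `module avail` lists the real ones.
# module load gcc cmake

cc_path=$(pick_compiler gcc cc) || cc_path=""
cxx_path=$(pick_compiler g++ c++) || cxx_path=""

if [ -n "$cc_path" ] && [ -n "$cxx_path" ]; then
  # Printed, not executed: run the line from the build directory once it looks right.
  echo "CC=$cc_path CXX=$cxx_path cmake .."
else
  echo "no C/C++ compiler found on PATH" >&2
fi
```

It is worth confirming with `gcc --version` that the compiler found this way is the newer one and not the system default.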

kevinmyers commented 1 month ago

The IT staff have compiled the binary and it is working now.

However, I have another question. I always run the parsing option before making the tree, using the following command:

raxml-ng-adaptive --parse --msa gtdbtk.bac120.user_msa.fasta --model LG+G8+F --prefix T1

Previously, with raxml-ng, the parsing ran in seconds, even with the 4800+ genome alignment file. Now it gets to the following and sits for a long time:

[00:00:04] Adaptive mode: Predicting difficulty of the MSA ...

Is this normal? Just want to make sure I'm not doing something wrong.

stamatak commented 1 month ago

Dear Kevin,

Yes, this is normal. The difficulty prediction for the MSA is based on quite a few parsimony tree inferences. Normally you would not notice this, but on such a large dataset you will.

How difficulty prediction works (and how it subsequently makes RAxML run faster) is described in this paper:

https://academic.oup.com/mbe/article/39/12/msac254/6832260

Alexis

-- Alexandros (Alexis) Stamatakis

ERA Chair, Institute of Computer Science, Foundation for Research and Technology - Hellas Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.biocomp.gr (Crete lab) www.exelixis-lab.org (Heidelberg lab)

kevinmyers commented 1 month ago

Thanks! Looks like it is running now. The parsing completed in about 6.5 hours. I've set up the tree building. I'll close this out for now. Thanks for your help!