citiususc / veryfasttree

Efficient phylogenetic tree inference for massive taxonomic datasets
Other
111 stars 5 forks source link

Very long runtime/thread lock on a simple alignment with high thread count #2

Closed eernst closed 4 years ago

eernst commented 4 years ago

Thanks for the hard work on this software! It seems VeryFastTree is working like a charm on most input, including very large alignments with over 1B characters. However, on some input it seems to be hanging with only one thread in use (maxed out) when calling with more than 15 threads (example alignment attached):

$ VeryFastTree -threads 16 -double-precision -ext AVX2 OG0008635.fa > test.veryfasttree.OG0008635
Command: VeryFastTree -threads 16 -double-precision -ext AVX OG0008635.fa
VeryFastTree Version 3.0 (OpenMP, AVX2)
Alignment: OG0008635.fa
Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Ignored unknown character * (seen 23 times)
Initial topology in 0.07 seconds
Refining topology: 18 rounds ME-NNIs, 2 rounds ME-SPRs, 9 rounds ML-NNIs
Total branch-length 5.156 after 0.64 sec, 1 of 46 nodes
      0.97 seconds: ML NNI round 1 of 9, 22 splits

OG0008635.fa.zip

niemasd commented 4 years ago

Mine is also getting stuck when specifying multiple threads. I tried the following simple alignment:

test.aln.txt

When I run VeryFastTree single-threaded as follows, it runs just fine:

$ cat test.aln.txt | VeryFastTree -gamma -nt -gtr -double-precision -fastexp 2
Command: VeryFastTree -gamma -nt -gtr -double-precision -fastexp 2
VeryFastTree Version 3.0 (OpenMP, AVX2)
Alignment: standard input
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
Initial topology in 0.00 seconds
Refining topology: 11 rounds ME-NNIs, 2 rounds ME-SPRs, 6 rounds ML-NNIs
Total branch-length 0.140 after 0.00 sec
ML-NNI round 1: LogLk = -542.341 NNIs 0 max delta 0.00 Time 0.00
GTR Frequencies: 0.2480 0.2417 0.2058 0.3044
GTR rates(ac ag at cg ct gt) 1.0260 1.8459 0.8277 0.6470 2.9109 1.0000
Switched to using 20 rate categories (CAT approximation)
Rate categories were divided by 0.654 so that average rate = 1.0
CAT-based log-likelihoods may not be comparable across runs
ML-NNI round 2: LogLk = -523.940 NNIs 0 max delta 0.00 Time 0.03
Turning off heuristics for final round of ML NNIs (converged)
ML-NNI round 3: LogLk = -523.940 NNIs 0 max delta 0.00 Time 0.04 (final)
Optimize all lengths: LogLk = -523.940 Time 0.04
Gamma(20) LogLk = -534.502 alpha = 1.284 rescaling lengths by 1.042
Total time: 0.05 seconds Unique: 7/8 Bad splits: 0/4
((BetaCoV_Wuhan_IVDC_HB_01_2019_EPI_ISL_402119:0.0,BetaCoV_Wuhan_IVDC_HB_04_2020_EPI_ISL_402120:0.0):0.004282957,BetaCoV_Wuhan_IPBCAMS_WH_01_2019_EPI_ISL_402123:0.000000006,(BetaCoV_Wuhan_IVDC_HB_05_2019_EPI_ISL_402121:0.000000006,(BetaCoV_Wuhan_WIV04_2019_EPI_ISL_402124:0.000000006,(MT072688_1:0.000000006,(BetaCoV_bat_Yunnan_RaTG13_2013_EPI_ISL_402131:0.049182114,BetaCoV_pangolin_Guangxi_P4L_2017_EPI_ISL_410538:0.091150889)0.924:0.000000006)0.993:0.000000006)0.858:0.004276744)0.852:0.004269224);
TreeCompleted

However, if I try to multithread, it hangs with only 1 thread used, maxed out (the only difference between the following command and the previous one is the -threads 2 at the end):

$ cat test.aln.txt | VeryFastTree -gamma -nt -gtr -double-precision -fastexp 2 -threads 2
Command: VeryFastTree -gamma -nt -gtr -double-precision -fastexp 2 -threads 2
VeryFastTree Version 3.0 (OpenMP, AVX2)
Alignment: standard input
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
Initial topology in 0.00 seconds
Refining topology: 11 rounds ME-NNIs, 2 rounds ME-SPRs, 6 rounds ML-NNIs
Total branch-length 0.008 after 0.00 sec

I tried it with multiple other values for -threads and got the same results. My cmake command when compiling was the following:

cmake -USE_NATIVE=ON -USE_SEE2=ON -USE_SEE4=ON -USE_AVX=ON -USE_AVX2=ON -USE_AVX512=ON .
cesarpomar commented 4 years ago

Hello

I found two errors in the tool and made a commit, the argument -thread-subtrees was uninitialized, so the maximum number of subtrees per thread was not fixes, if this is a problem for you, you can set a large number in this parameter. Moreover, keep in mind that for small datasets rarely the tree can split into smaller trees, you can check the log, and the other parallel regions inherited from fastree only scale for 3-4 threads. The flag that it uses for this calculation was being ignored and the algorithm was stuck looking for a partitioning of the tree.

Check if the problem is solved and I will create a new release.

Thanks for using the tool and for your feedback

niemasd commented 4 years ago

Thanks for looking into it! I tried using the same dataset I mentioned earlier (and I compiled and ran VeryFastTree the same way as before, just using the cloned repo rather than the release tarball), and now I get the following:

Command: VeryFastTree -gamma -nt -gtr -double-precision -fastexp 2 -threads 2
VeryFastTree Version 3.0 (OpenMP, AVX2)
Alignment: standard input
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
Initial topology in 0.00 seconds
Refining topology: 11 rounds ME-NNIs, 2 rounds ME-SPRs, 6 rounds ML-NNIs
Total branch-length 0.245 after 0.00 sec
The tree has 1 nodes and it was divided into 0 subtrees:
    thread 0(0.00%):roots[], nodes 0
    thread 1(0.00%):roots[], nodes 0
    skipped (100.00%): nodes 1
 total (100.00%), nodes 1, theoretical speedup 1.00 of 2 Time 0.01
ML-NNI round 1: LogLk = -261.997 NNIs 0 max delta 0.00 Time 0.01
The tree has 1 nodes and it was divided into 3 subtrees:
    thread 0(0.00%):roots[0, 1, 2], nodes 0
    thread 1(0.00%):roots[], nodes 0
    skipped (100.00%): nodes 1
 total (100.00%), nodes 1, theoretical speedup 1.00 of 2 Time 0.02
The tree has 1 nodes and it was divided into 3 subtrees:
    thread 0(0.00%):roots[0, 1, 2], nodes 0
    thread 1(0.00%):roots[], nodes 0
    skipped (100.00%): nodes 1
 total (100.00%), nodes 1, theoretical speedup 1.00 of 2 Time 0.03
...
[ it just keeps repeating like this seemingly infinitely ]
cesarpomar commented 4 years ago

I was unable to reproduce the error, the log shows how in a small tree if we use more than one thread, all the nodes go to thread 1. I attach my log, check if I used the same params and if you can give me more details of the error. log.txt

niemasd commented 4 years ago

@cesarpomar Apologies, it seems like it was just that my terminal console was getting flooded with the progress messages. I directed standard error to /dev/null, and it completed quickly as expected

Mine seems to be fully-functioning. Thank you! :-)

eernst commented 4 years ago

With the latest commit, I'm seeing coredumps with some thread counts on the attached test alignment:

$ parallel -j 1 './VeryFastTree -threads {} -double-precision -ext AVX2 OG0008635.fa >/dev/null 2>&1; echo -e "Trying with -threads: {}\t"' ::: {1..16}
Trying with -threads: 1
Trying with -threads: 2
Trying with -threads: 3
Trying with -threads: 4
Trying with -threads: 5
Trying with -threads: 6
Trying with -threads: 7
Trying with -threads: 8
/usr/bin/bash: line 1: 2248788 Aborted                 (core dumped) ./VeryFastTree -threads 8 -double-precision -ext AVX2 OG0008635.fa > /dev/null 2>&1
Trying with -threads: 9
/usr/bin/bash: line 1: 2248807 Aborted                 (core dumped) ./VeryFastTree -threads 9 -double-precision -ext AVX2 OG0008635.fa > /dev/null 2>&1
Trying with -threads: 10
/usr/bin/bash: line 1: 2248824 Aborted                 (core dumped) ./VeryFastTree -threads 10 -double-precision -ext AVX2 OG0008635.fa > /dev/null 2>&1
Trying with -threads: 11
/usr/bin/bash: line 1: 2248844 Aborted                 (core dumped) ./VeryFastTree -threads 11 -double-precision -ext AVX2 OG0008635.fa > /dev/null 2>&1
Trying with -threads: 12
/usr/bin/bash: line 1: 2248865 Aborted                 (core dumped) ./VeryFastTree -threads 12 -double-precision -ext AVX2 OG0008635.fa > /dev/null 2>&1
Trying with -threads: 13
Trying with -threads: 14
Trying with -threads: 15
/usr/bin/bash: line 1: 2248917 Aborted                 (core dumped) ./VeryFastTree -threads 15 -double-precision -ext AVX2 OG0008635.fa > /dev/null 2>&1
Trying with -threads: 16

Maybe a race condition? -threads 9 fails 4/10 times, for example:

$ parallel -j 1 './VeryFastTree -threads 9 -double-precision -ext AVX2 OG0008635.fa >/dev/null 2>&1; echo -e "Trying with -threads: 9\t"' ::: {1..10}
Trying with -threads: 9  1
Trying with -threads: 9  2
/usr/bin/bash: line 1: 2249201 Aborted                 (core dumped) ./VeryFastTree -threads 9 -double-precision -ext AVX2 OG0008635.fa > /dev/null 2>&1
Trying with -threads: 9  3
Trying with -threads: 9  4
/usr/bin/bash: line 1: 2249229 Aborted                 (core dumped) ./VeryFastTree -threads 9 -double-precision -ext AVX2 OG0008635.fa > /dev/null 2>&1
Trying with -threads: 9  5
/usr/bin/bash: line 1: 2249246 Aborted                 (core dumped) ./VeryFastTree -threads 9 -double-precision -ext AVX2 OG0008635.fa > /dev/null 2>&1
Trying with -threads: 9  6
/usr/bin/bash: line 1: 2249264 Aborted                 (core dumped) ./VeryFastTree -threads 9 -double-precision -ext AVX2 OG0008635.fa > /dev/null 2>&1
Trying with -threads: 9  7
Trying with -threads: 9  8
Trying with -threads: 9  9
Trying with -threads: 9  10

Compiled on an Intel Xeon 8280 with GCC 5.5.0, configured with: cmake -DUSE_NATIVE=OFF -DUSE_SEE4=ON -DUSE_AVX=ON -DUSE_AVX2=ON -DCMAKE_INSTALL_PREFIX=$PWD .

cesarpomar commented 4 years ago

I think I found the error, you can check if the last commit has solved it.

Thanks for your feedback

eernst commented 4 years ago

Looks good to me now, tested with 1-32 threads and 112 threads for good measure.

Thanks for the quick support!