ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

create phylogenetic tree with the alignment maf format #1418

Open dandanWang2019 opened 3 months ago

dandanWang2019 commented 3 months ago

Hi,

The alignment is very fast in my case, which is based on a single chromosome from different species. Thanks for the great work. I want to build a phylogenetic tree from the alignment, so I used "cactus-hal2maf" to convert the HAL to MAF format and then to PHYLIP format. However, the alignment seems to be output only in blocks.

Does anyone know how to get one whole aligned sequence per species, rather than sequences within blocks, or how else to solve this in order to build a tree?

emistasis commented 3 months ago

Hi Dandan!

First, I want to preface this by saying that I'm neither an author of nor a collaborator on Cactus. I just use it a lot for my own research and wanted to help.

That said, Dent Earl et al. have a toolkit called mafTools that allows users to work more directly with MAF files. In particular, the mafToFastaStitcher command will allow you to convert your MAF to a FASTA alignment, which can be converted to PHYLIP (if desired). Most tree-building software can handle alignments in FASTA format, so you might not need to convert to PHYLIP.

Also, I'd make sure to read through the rest of the components that mafTools has to offer.

Hope this helps!
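To illustrate what "one row per species instead of blocks" means, here is a minimal Python sketch of naive block concatenation. It is not how mafToFastaStitcher works internally: it ignores strand, block order, and unaligned sequence between blocks. The function names are hypothetical, and it assumes that the species name is the part of the source name before the first "." and that only the first row per species in a block is kept.

```python
def parse_maf_blocks(lines):
    """Yield one list of (source_name, aligned_text) pairs per MAF block."""
    block = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "a":
            # A blank line or a new "a" line ends the current block.
            if block:
                yield block
            block = []
        elif fields[0] == "s" and len(fields) >= 7:
            # s <src> <start> <size> <strand> <srcSize> <text>
            block.append((fields[1], fields[6]))
    if block:
        yield block


def concatenate_blocks(blocks):
    """Join blocks column-wise into one gapped row per species."""
    rows = {}
    total = 0  # number of alignment columns written so far
    for block in blocks:
        width = len(block[0][1])
        seen = set()
        for src, text in block:
            species = src.split(".")[0]  # assumption: "species.contig" naming
            if species in seen:
                continue  # assumption: keep only the first row per species
            seen.add(species)
            # A species first seen now is padded for all earlier blocks.
            rows.setdefault(species, "-" * total)
            rows[species] += text
        total += width
        for species in rows:
            if species not in seen:
                rows[species] += "-" * width  # absent from this block
    return rows
```

Every row returned by concatenate_blocks has the same length, which is what tree-building software expects from an alignment. For real data, a dedicated tool is still the safer choice.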

dandanWang2019 commented 3 months ago

Hi Emmarie,

Thanks! It works! I will close this.

dandanWang2019 commented 3 months ago

I reopened this because the sequences of the different species (converted with mafToFastaStitcher) are not of equal length. The reference genome is a little shorter than the others. The tree-building software always reports an error for this FASTA.

Hope there is a solution.
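A quick way to see which records differ in length, before handing the file to tree-building software, is sketched below in Python. The function names are hypothetical and the parser is deliberately simple (it uses the full header text after ">" as the record name).

```python
def read_fasta(lines):
    """Return {record_name: sequence} from an iterable of FASTA lines."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def length_report(records):
    """Return ({name: length}, True if all lengths are equal)."""
    lengths = {name: len(seq) for name, seq in records.items()}
    return lengths, len(set(lengths.values())) <= 1
```

With a real file this could be called as length_report(read_fasta(open("output.mfa"))), where output.mfa is the file written by mafToFastaStitcher.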

aaannaw commented 3 months ago

Hello, I am running the mafToFastaStitcher command with test data:

/data/01/p1/user157/software/mafTools/bin/mafToFastaStitcher -m input.maf --seqs input.fa --breakpointPenalty 5 --outMfa output.mfa

The input.maf is:

##maf version=1

a score=0.0 status=test.input
s ref.chr1 10 10 + 100 ACGTACGTAC
s seq1.chr@ 0 10 + 100 AAAAAAAAAA
s seq2.chr& 10 5 + 100 -----CCCCC
s seq6.chr1 10 5 + 100 -----GGGGG
s seq7.chr20 0 5 + 100 AAAAA-----

a score=0.0 status=test.input
s ref.chr1 20 10 + 100 GTACGTACGT
s seq2.chr!! 5 5 + 100 CCCCC-----
s seq3.chr0 20 5 + 100 -----GGGGG
s seq6.chr1 22 5 + 100 GGGGG-----

a score=0.0 status=test.input
s ref.chr1 30 10 + 100 ACGTACGTAC
s seq4.chr1 0 5 - 100 GG-----GGG
s seq5.chr2 0 10 + 100 CCCCCCCCCC

The input.fa is:

>ref.chr1
ggggggggggACGTACGTACGTACGTACGTACGTACGTACgg
>seq1.chr@
AAAAAAAAAAgg
>seq2.chr&
aaaaaaaaaaCCCCCaa
>seq2.chr!!
aaaaaCCCCCaa
>seq3.chr0
aaaaaaaaaaaaaaaaaaaGGGGGaa
>seq4.chr1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaCCCCC
>seq6.chr1
aaaaaaaaaGGGGGaaaaaaaGGGGGaa
>seq7.chr20
AAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAATT

However, I got the error:

[1] 3382482 abort (core dumped) /data/01/p1/user157/software/mafTools/bin/mafToFastaStitcher -m input.maf

I tried splitting input.fa into seq1.fa, seq2.fa, seq3.fa, seq4.fa, seq6.fa and seq7.fa and running the command:

/data/01/p1/user157/software/mafTools/bin/mafToFastaStitcher -m input.maf --seqs ref.fa,seq1.fa,seq2.fa,seq3.fa,seq4.fa,seq6.fa,seq7.fa --breakpointPenalty 5 --outMfa output.mfa

Then I got the same error: abort (core dumped)

mafToFastaStitcher compiles correctly and passes make test:

gcc -std=c99 -Wno-unused-but-set-variable -c src/mafToFastaStitcherAPI.c -o test/mafToFastaStitcherAPI.o.tmp -O3 -Wall -Werror --pedantic -funro$ l-loops -DNDEBUG -Wshadow -Wpointer-arith -Wstrict-prototypes -Wmissing-prototypes -I ../../sonLib/lib -I ../inc -I ../external -lm
mv test/mafToFastaStitcherAPI.o.tmp test/mafToFastaStitcherAPI.o
mkdir -p test/
gcc -std=c99 -Wno-unused-but-set-variable -c src/buildVersion.c -o test/buildVersion.o.tmp -O0 -g -Wall -Werror --pedantic -I ../../sonLib/lib -$ ../inc -I ../external
mv test/buildVersion.o.tmp test/buildVersion.o
mkdir -p test/
gcc -std=c99 -Wno-unused-but-set-variable -c src/test.mafToFastaStitcherAPI.c -o test/test.mafToFastaStitcherAPI.o.tmp -O0 -g -Wall -Werror --pe$ antic -I ../../sonLib/lib -I ../inc -I ../external
mv test/test.mafToFastaStitcherAPI.o.tmp test/test.mafToFastaStitcherAPI.o
mkdir -p test/
gcc -std=c99 -Wno-unused-but-set-variable src/allTests.c test/sharedMaf.o test/common.o ../external/CuTest.a test/mafToFastaStitcherAPI.o ../../$ onLib/lib/sonLib.a test/buildVersion.o test/test.mafToFastaStitcherAPI.o -o test/allTests.tmp -O0 -g -Wall -Werror --pedantic -I ../../sonLib/li$ -I ../inc -I ../external -lm
mv test/allTests.tmp test/allTests
mkdir -p test/
gcc -std=c99 -Wno-unused-but-set-variable src/mafToFastaStitcher.c test/sharedMaf.o test/common.o ../external/CuTest.a test/mafToFastaStitcherAPI.o ../../sonLib/lib/sonLib.a test/buildVersion.o -o test/mafToFastaStitcher.tmp -O0 -g -Wall -Werror --pedantic -I ../../sonLib/lib -I ../inc -I ../external -lm
mv test/mafToFastaStitcher.tmp test/mafToFastaStitcher
./test/allTests && python2.7 src/test.mafToFastaStitcher.py --verbose && rm -rf ./test/ && rmdir ./tempTestDir
Running test case test_readingFasta_0
Running test case test_newBlockHashFromBlock_0
Running test case test_addMafLineToRow_0
Running test case test_addMafLineToRow_1
Running test case test_penalize_0
Running test case test_interstitial_0
Running test case test_addBlockToHash_0
Running test case test_addBlockToHash_1
Running test case test_addBlockToHash_2
Running test case test_addBlockToHash_3
Running test case test_addBlockToHash_4
Running test case test_addBlockToHash_5
Running test case test_addBlockToHash_6
.............

OK (13 tests)

testAllTests (__main__.CuTest)
If valgrind is installed on the system, check for memory related errors in CuTests ... ok
testFastaStitch (__main__.FastaStitchTest)
mafToFastaStitcher should produce known output for a given known input ... ok
testMemory1 (__main__.FastaStitchTest)
If valgrind is installed on the system, check for memory related errors (1). ... ok

Ran 3 tests in 19.287s

OK

Could you give me any suggestions? Looking forward to your reply.

Best wishes,
Na Wan

emistasis commented 3 months ago

Hi Dandan (@dandanWang2019),

It's hard to say what the problem is. I had a similar issue once, and here's what one of the authors had to say. Based on that, you can try using --gapFill 0 when converting from HAL to MAF. It's possible that additional gaps are being inserted into your reference sequence during the conversion.

Also, when you converted your HAL to MAF, which --dupeMode parameter did you set? Is it possible that some of the other sequences have more duplications written into their FASTA sequence than the reference does? You can look at your original MAF alignment and see whether there are multiple alignment lines for the same species within a block. If so, then I'd use mafDuplicateFilter from mafTools to filter those (unless you want to preserve them).
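The check for multiple alignment lines per species within a block can be scripted rather than done by eye. The following is a minimal Python sketch, not part of Cactus or mafTools. The function name is hypothetical, and it assumes the species name is the part of the source name before the first ".".

```python
from collections import Counter


def duplicated_species(maf_lines):
    """Return [(block_index, {species: row_count})] for blocks with repeats."""
    blocks = []  # one Counter of species per MAF block
    for line in maf_lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "a":
            blocks.append(Counter())  # an "a" line starts a new block
        elif fields[0] == "s" and blocks:
            species = fields[1].split(".")[0]  # assumption: "species.contig"
            blocks[-1][species] += 1
    return [
        (index, {sp: n for sp, n in counts.items() if n > 1})
        for index, counts in enumerate(blocks)
        if any(n > 1 for n in counts.values())
    ]
```

An empty result suggests that duplicated rows are not the cause of the length difference. A non-empty result lists the zero-based block index and the species that appear more than once in that block.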

emistasis commented 3 months ago

Hi Na (@aaannaw),

As I mentioned in my earliest reply, I just want to let you know that I'm not affiliated with either Cactus or mafTools - I'm just a user.

I'm not entirely sure what the problem is, but I suspect that it is related to you providing multiple sequence FASTAs in the second command. Based on the MAF block and input.fa you shared, it seems that the input.fa already contains all of the sequences in the MAF block?

I'd also consider making an issue on mafTools if the issue persists.
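One way to test whether the FASTA really covers every sequence named in the MAF is sketched below in Python. This is only a consistency check under my own assumptions: the function names are hypothetical, FASTA records are matched on the first word after ">", and I have not confirmed that a missing name is what makes mafToFastaStitcher abort.

```python
def maf_sequence_names(maf_lines):
    """Collect the source names from all "s" lines of a MAF."""
    names = set()
    for line in maf_lines:
        fields = line.split()
        if len(fields) >= 7 and fields[0] == "s":
            names.add(fields[1])
    return names


def fasta_sequence_names(fasta_lines):
    """Collect record names (first word after ">") from a FASTA."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                names.add(words[0])
    return names


def missing_from_fasta(maf_lines, fasta_lines):
    """Return the sorted MAF source names that have no FASTA record."""
    return sorted(maf_sequence_names(maf_lines) - fasta_sequence_names(fasta_lines))
```

If missing_from_fasta returns any names, the FASTA passed to --seqs does not contain those sequences, which would be worth mentioning in a mafTools issue.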

aaannaw commented 3 months ago

Hello emistasis,

The MAF and input.fa shown are both from the test data (https://github.com/dentearl/mafTools/tree/master/mafToFastaStitcher). However, I could not get it to work. I have asked the author of mafToFastaStitcher for help, but have had no reply. Maybe other tools can convert MAF to FASTA, but I don't know of any.

Best wishes!
Na Wan

hhandika commented 3 weeks ago

This seems to be a recurring issue. There was a more recent issue with a similar problem. You could give SEGUL a try. We don't support a FASTA reference yet, but it can get the names from a BED file. The feature is in beta for now. You will need to compile it, but it should work regardless. Feel free to report issues in the SEGUL repo.