ODiogoSilva / TriFusion

Streamlining phylogenomic data gathering, processing and visualization
http://odiogosilva.github.io/TriFusion/
GNU General Public License v3.0

Convert fasta alignment with headers containing whitespace #297

Open DocDer opened 6 years ago

DocDer commented 6 years ago

FASTA sequence headers may contain additional text after the first whitespace, but this text is not considered part of the sequence ID. When such a file is converted (e.g. to relaxed PHYLIP or NEXUS), the extra text is retained, so the resulting file contains invalid sequence names.
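To make the ID/description distinction concrete, here is a minimal Python sketch of the common convention described above. The helper name `split_fasta_header` is hypothetical and is not part of TriFusion or TriSeq. It only illustrates how a header line is usually divided at the first run of whitespace.

```python
def split_fasta_header(line):
    """Split a FASTA header line into (sequence_id, description).

    Hypothetical helper for illustration; not a TriFusion function.
    """
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    # split(None, 1) splits once, at the first run of whitespace
    parts = line[1:].strip().split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description


print(split_fasta_header(">KP136829 rbcL CDS"))
# → ('KP136829', 'rbcL CDS')
```

Under this convention, only `KP136829` would be expected as the sequence name in the converted file.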

==COMMAND== TriSeq -in align.fasta -of nexus -c

==EXAMPLE FILE align.fasta==

>KP136829 rbcL CDS
ATGTCACCACAAACAGAGACTAAAACAGGTATTGGGTTCAAAGCTGGTGTTAAAGATTATCGACTAACTTACTATACTCC
CGATTATGAGACCAAAGATACTGACATCTTGGCAGCCTTCCGGATGACTCCGCAACCCGGGGTACCGCCTGAGGAAGCTG
GAGCTGCAGTAGCTGCAGAATCTTCCACAGGTACGTGGACCACTGTGTGGACGGATGGACTGACTAGTCTCGATCGTTAC
AAGGGTCGATGCTACGACATCGAACCCGTTGCTGGGGAAGAGAATCAATATATCGCATATGTAGCTTATCCTTTGGATCT
ATTTGAAGAGGGTTCCGTCACCAATATGTTCACTTCCATTGTAGGTAACGTATTTGGATTTAAAGCCCCACGAGCTCTAC
GTTTGGAGGATCTGAGAATTCCTCCTGCTTATTCCAAGACTTTCATTGGGCCGCCTCACGGTATCCAAGTCGAAAGGGAT
AAACTGAACAAATATGGTCGTCCCTTGCTGGGATGTACAATCAAGCCAAAATTGGGCTTATCTGCTAAAAACTATGGCAG
GGCTGTTTACGAATGTCTCCGTGGCGGACTTGATTTTACGAAGGATGATGAGAACGTAAATTCTCAACCATTCATGCGTT
---GGGACCGGTTCCTGTTTGTGGCAGAAGCTCTTTTCAAGGCTCAGGCCGAAACGGGCGAAATAAAAGGACATTATCTA
AATGCCACTGCGGGTACGTGTGAGGAAATGATGAAAAGAGCAGTCTTTGCTAGAGAATCGGGAGCACCCATCGTCATGCA
TGATTATTTGACGGGAGGCTTCACTGCAAATACTAGCTTGGCCTTTTATTGTCGAGATAATGGGCTACTGCTTCATATCC
ACCGCGCGATGCATGCTGTTATCGATAGACAGAGAAATCACGGTATCCATTTTCGTGTCCTAGCCAAAGCATTGCGTATG
TCCGGCGGGGATCATATCCACGCCGGGACCGTAGTGGGTAAACTGGAGGGAGAACGAGAAGTCACACTGGGTTTCGTCGA
TTTGCTACGCGACGATTATATCGAGAAAGACCGAAGCCGTGGTATATATTTCACTCAGGATTGGGTATCCATGCCAGGTG
TATTTCCCGTAGCCTCGGGAGGTATCCATGTCTGGCATATGCCCGCTCTAACTGAAATCTTCGGAGATGATTCTGTCTCA
CAGTTCGGCGGAGGAACCTTGGGACACCCCTGGGGAAACGCACCAGGCGCCGTAGCTAATCGAGTTGCATTGGAGGCTTG
TGTACAAGCTCGTAATGAGGGACGTGATCTTGCTCGTGAAGGTAACGAGATTATCCGCGAAGCTAGTAAGTGGAGTCCCG
AATTGGCTGCTGCTTGCGAGGTATGGAAACAGATCAAATTTGAATTCGACACAATTGATACATTG---
>KP136830 rbcL CDS
ATGTCACCACAAACGGAGACTAAAGCAGGTGTTGGATTCAAAGCTGGTGTCAAAGATTACCGATTGACCTATTACACCCC
CGAATACAAGACCAAAGATACCGATATCTCAGCAGCTTTCCGAATGACCCCACAACCCGGAGTACCAGCTGAGGAAGCCG
GAGCTGCGGTAGCTGCGGAATCCTCCACGGGTACGTGGACCACTGTATGGACAGATGGGTTGACCAGTCTTGACCGTTAC
AAGGGCCGATGCTACGATATCGAACCCGTCGCTGGGGAGGAGAACCAGTATATTGCGTATGTAGCTTATCCTTTGGATCT
ATTTGAAGAAGGCTCTGTCACCAATTTGTTCACCTCCATTGTAGGTAACGTTTTCGGATTCAAGGCCCTACGCGCCCTAC
GCTTGGAAGACCTTCGAATCCCTCCTGCTTATTCTAAAACTTTCATTGGACCGCCTCACGGTATTCAGGTCGAAAGGGAT
AAACTGAACAAATATGGACGCCCCTTGTTGGGATGTACAATCAAACCAAAATTAGGTCTATCTGCTAAAAATTATGGTAG
AGCCGTCTATGAATGCCTTCGTGGTGGACTTGATTTTACAAAGGATGATGAAAACGTAAATTCCCAGCCATTCATGCGTT
GGAGAGATCGCTTCTTATTCGTAGCAGAAGCCCTTTTCAAATCCCAAGCTGAAACAGGCGAAATCAAGGGGCATTACTTA
AATGCTACTGCAGGTACTTGTGAAGAAATGATGAAGAGAGCTGTTTTTGCTAGAGAATTGGGTGCACCGATTGTCATGCA
TGACTACCTGACCGGAGGGTTTACCGCAAATACCAGCTTAGCTTTTTACTGCAGAGACAATGGACTGCTTCTTCATATTC
ACCGTGCGATGCATGCTGTGATCGACAGACAACGAAATCACGGCATACATTTTCGTGTATTGGCCAAAGCTTTACGCATG
TCCGGTGGGGATCATATACACGCCGGAACTGTAGTAGGCAAACTAGAAGGGGAACGAGAAGTCACTTTGGGTTTCGTCGA
TTTACTCCGCGACGATTATATTGAAAAAGATCGTAGCCGTGGCATCTATTTCACACAAGATTGGGTATCTATGCCGGGTG
TACTCCCCGTAGCTTCGGGGGGGATCCACGTCTGGCACATGCCCGCTCTAACCGAAATCTTTGGGGACGACTCTGTCTTA
CAGTTCGGTGGAGGAACCTTGGGACATCCTTGGGGAAACGCACCTGGTGCGGTAGCCAACCGAGTCGCATTAGAAGCTTG
CGTACAGGCTCGTAATGAGGGTCGCGATCTCGCCCGTGAAGGTAATGAAGTTATTCGTGAAGCTAGTAAGTGGAGTCCGG
AATTGGCTGCTGCATGCGAGATATGGAAAGCAATCAAATTTGAATTTGATACAATTGATACGTTGTAA

ODiogoSilva commented 6 years ago

Hi,

By default, TriFusion tries not to interfere with the provided taxa names (with the exception of removing some illegal characters). I understand that whitespace in sample names is not accepted by some downstream software, but removing the substring after the first space may not be ideal in every situation. A quick addition could be to replace spaces with "_" symbols in taxa names to avoid problems with downstream analyses; would that be OK for you?

In the meantime I can also think of a way to proceed only with a substring of the taxa name.
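
The two options discussed here (replacing whitespace with "_" or keeping only a substring of the name) can be sketched as a stand-alone pre-processing step run on the FASTA file before conversion. This is only an illustration under stated assumptions: the function names `clean_taxon_name` and `rewrite_headers` and the `mode` values are hypothetical, and nothing here reflects how TriFusion handles names internally.

```python
import re


def clean_taxon_name(name, mode="underscore"):
    """Return a whitespace-free taxon name (hypothetical helper).

    mode="underscore":  replace each run of whitespace with "_"
    mode="first_token": keep only the text before the first whitespace
    """
    name = name.strip()
    if mode == "underscore":
        return re.sub(r"\s+", "_", name)
    if mode == "first_token":
        parts = name.split(None, 1)
        return parts[0] if parts else ""
    raise ValueError("unknown mode: " + repr(mode))


def rewrite_headers(fasta_text, mode="underscore"):
    """Apply clean_taxon_name to every header line of a FASTA string."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            line = ">" + clean_taxon_name(line[1:], mode)
        out.append(line)  # sequence lines pass through unchanged
    return "\n".join(out)


print(clean_taxon_name("KP136829 rbcL CDS"))
# → KP136829_rbcL_CDS
print(clean_taxon_name("KP136829 rbcL CDS", mode="first_token"))
# → KP136829
```

One reason the `first_token` option may not suit every data set: if two headers share the same first token, truncation would produce duplicate names, whereas the underscore option keeps the full header text.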