bioperl / bioperl-live

Core BioPerl 1.x code
http://bioperl.org

Change Nexus syntax in output from Bio::AlignIO::nexus #298

Open nylander opened 6 years ago

nylander commented 6 years ago

Hi,

The Nexus syntax written by Bio::AlignIO::nexus is at odds with the current Nexus standard. For example, the BioPerl module writes

format interleave datatype=dna   gap=- symbols="CTANG";

but PAUP*, which was written by one of the inventors of the Nexus format, rejects this with the following message:

Error(#329): User-defined symbol 'A' conflicts with predefined DNA state symbol.

                 If you are using a predefined format ('DNA', 'RNA', 'nucleotide', or 'protein'), you
                 may not specify predefined states for this format as symbols in the Format command.

I suggest we comply by having the Bio::AlignIO::nexus module not output the string symbols="CTANG" when datatype=dna is already written. I assume this also applies to the other predefined alphabets ('RNA', 'nucleotide', and 'protein').

Cheers Johan

nylander commented 6 years ago

For a formal reference, see: Maddison, Swofford, Maddison. 1997. Nexus: An Extensible File Format for Systematic Information https://doi.org/10.1093/sysbio/46.4.590, in which we can read (p.599):

"For STANDARD DATATYPEs, a SYMBOLS subcommand will replace the default symbols list of "0 1". For DNA, RNA, NUCLEOTIDE, and PROTEIN DATATYPEs, a SYMBOLS subcommand will not replace the default symbols list, but will add character-state symbols to the SYMBOLS list."

Hence, adding symbols such as C, T, A, and G when they are already defined does not make sense, and it causes some software to throw an error.
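
To make the proposed behaviour concrete, here is a minimal sketch of the rule in Python. It is not BioPerl code and not the module's actual implementation. The function name `nexus_format_line`, its parameters, and the `PREDEFINED_DATATYPES` set are hypothetical names chosen for illustration. The only rule it encodes is the one suggested above: omit the symbols subcommand whenever the datatype is one of the predefined ones.

```python
# Hypothetical sketch (not BioPerl): build a Nexus "format" line and skip
# the symbols subcommand for predefined datatypes, as proposed in this issue.

# Datatypes named in the PAUP* error message as "predefined formats".
PREDEFINED_DATATYPES = {"dna", "rna", "nucleotide", "protein"}


def nexus_format_line(datatype, gap="-", symbols="", interleave=True):
    """Return a Nexus format command as a single string.

    `symbols` is only written when the datatype is not predefined,
    e.g. for datatype=standard.
    """
    parts = ["format"]
    if interleave:
        parts.append("interleave")
    parts.append(f"datatype={datatype}")
    parts.append(f"gap={gap}")
    # Only emit symbols="..." when the datatype has no predefined states.
    if symbols and datatype.lower() not in PREDEFINED_DATATYPES:
        parts.append(f'symbols="{symbols}"')
    return " ".join(parts) + ";"


if __name__ == "__main__":
    print(nexus_format_line("dna", symbols="CTANG"))
    # format interleave datatype=dna gap=-;
    print(nexus_format_line("standard", symbols="01"))
    # format interleave datatype=standard gap=- symbols="01";
```

A stricter variant could instead write only those symbols that are not already predefined for the datatype, which would match the "add to the SYMBOLS list" wording quoted from the paper. Dropping the subcommand entirely is the simpler change and is what the issue asks for.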