The warning and behaviour regarding illegal characters in taxa names could be improved

terrycojones commented 4 years ago

Hi all

I just got this in the raxml-ng (version 0.7.0) output:

NOTE: Following symbols are not allowed in taxa names to ensure Newick compatibility:
NOTE: " " (space), ";" (semicolon), ":" (colon), "," (comma), "()" (parentheses), "'" (quote). 
NOTE: Please either correct the names manually, or use the reduced alignment file
NOTE: generated by RAxML-NG (see above).

There are a couple of things that could be improved. You could tell the user which character is illegal in the name and give its offset. That would allow people to instantly find and fix these things themselves without having to puzzle over which of the chars is causing the error and where it is.

I know that will probably sound trivial and unnecessary, but I always think the programmer should go the extra yard to give the user the exact reason why something has failed.

The reason this caused me some pause is that I was given a "standard" FASTA file to run just now and it failed as above. I didn't see why because what I consider the 'id' in the sequences are the characters up to the first space. That (unfortunate) behaviour is standard in the bioinformatics world. I think it would make more sense for RAxML-NG to follow this approach too, and drop everything from the first space onwards. Of course this could result in duplicates and that can be checked for.

I personally think the convention of separating things at the first space is really unfortunate. But one has to live with the FASTA produced by other tools. RAxML-NG should probably do the same, IMO.

Anyway, just some friendly feedback! Thanks :-)

terrycojones commented 4 years ago

Also, apologies if this has been fixed in a more recent version.

amkozlov commented 4 years ago

Thanks for your feedback, @terrycojones !

First of all, I totally agree that developers shouldn't be lazy such that users can be :) (seriously)

Now regarding your specific situation:

Just before the error message you posted above, you should see the lines like these:

ERROR: Following taxon name contains invalid characters: id1 comment bla bla bla
ERROR: Following taxon name contains invalid characters: id2 comment bla bla bla

Although I do not print the exact offset of the offending characters here, it should be quite obvious that FASTA comment was considered as part of taxon name, and by reading the NOTE message below, it's straightforward that spaces are the problem. Moreover, RAxML_NG will create an MSA file with fixed taxon names for you ($PREFIX.raxml.reduced.phy), so you can re-run the analysis right away. For my taste, that's quite user-friendly already ;)

Regarding FASTA comment treatment, I'm aware of the issue. Unfortunately, the underlying problem is the lack of (strict!) standards for bionforfmatic file formats (be it FASTA, PHYLIP or NEWICK). As a result, there are multiple "flavors" of each format, since every tool (or a custom perl script...) will print whatever they please, introducing extra spaces, newlines, weird symbols. comments etc.

In case of FASTA, the line after ">" is called "description". I could not find any "official" source which states that what goes after the first space is a comment (see eg. https://en.wikipedia.org/wiki/FASTA_format or https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp). So the following FASTA is perfectly valid:

>Homo sapiens
AAAAA
>Homo erectus
AAACC

Given all that, I would rather make this description cutting behavior optional, something like "--msa-format FASTA+comment". Ideally, there should be a heuristic to autodetect "FASTA" vs. "FASTA+comment" format. Any ideas how this might work?

terrycojones commented 4 years ago

Hey - thanks, you gave me a smile :-)

I think I would just go for something simpler than an --msa-format ARG approach since I don't think there will be more than one way you'll want to do this (assuming you want to do it at all!). I have a FASTA filtering command that provides a --removeDescriptions option (see https://github.com/acorg/dark-matter/blob/master/dark/filter.py#L376) so I guess I'd suggest --ignoreDescriptions or something like that.

Another option in generating your error output would be to print out the offending ids with a line under each to show the location of the issue. So:

The following FASTA description lines contain invalid characters
>Homo sapiens
-----^

That would handle most cases (by which I mean it wouldn't handle unprintable characters). At least in Python I would print using the __repr__ method so the user always gets some kind of representation of the character. In the C/C++ world I would print the int() value of the character too (at least in some cases, e.g., if isprint fails) because I really want to see what's going on. People do wind up with weird crap in their files, sadly (tr -d -c [A-Za-z0-9>] then comes in very handy).

OK, enough words... Thanks for the reply. It's super nice to see people being careful in writing bioinformatics code :-)

Terry

alephreish commented 2 years ago

I've also come across this issue.

In case of FASTA, the line after ">" is called "description". I could not find any "official" source which states that what goes after the first space is a comment - This is not correct. I'm not familiar with any definition of FASTA that would allow multi-word identifiers.

The page https://www.ncbi.nlm.nih.gov/blast/fasta.shtml is not an official description of the FASTA format - it only lists what formats BLAST accepts, FASTA is one of them. https://www.ncbi.nlm.nih.gov/genbank/fastaformat/ would be slightly more to the point although it refers to a subformat of FASTA used for genbank submissions - it explicitly mentions that The SeqID must be unique for each nucleotide sequence and should not contain any spaces.

Here is a sample of just the first pages from a google search:

The word following the ">" symbol is the identifier of the sequence, and the rest of the line is a description (optional) separated form the identifier by a white space or tab. - http://bioinformatics.intec.ugent.be/MotifSuite/fastaformat.php
The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (optional). - http://genetics.bwh.harvard.edu/pph/FASTA.html
One line starting with a ">" sign, followed by a sequence identification code. It is optionally be followed by a textual description of the sequence. Since it is not part of the official description of the format, software can choose to ignore this, when it is present. - https://www.bioinformatics.nl/tools/crab_fasta.html
The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description - https://www.bionity.com/en/encyclopedia/FASTA_format.html
The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column, a sequence identifier should follow ">" immediately, no space is allowed between ">" and an identifier. Other information can be added following the sequence identifier separated by a space, but all information should be in one line. - https://bioinformatics.ysu.edu/tools/docs/fasta.html
Each record in a FASTA file begins with one line header a > character (which must be the first character in the line), a sequence label and optional commentary. - https://genome.sph.umich.edu/wiki/FASTA

alvesjmp commented 1 year ago

The way the definition line works is not an unfortunate convention; it is the way that Bill Pearson (the author of the FASTA alignment program) created it decades ago, apparently (I did not search for the original docs, but it's been like that since at least the 1990s). And it is extremely useful and informative. Nearly all bioinformatics software I've come across respects that format that considers the first "word" after > as the identifier and everything else as a free-form "comment" that can be used any way the user wants. The old RAxML certainly did work fine with any FASTA file like that. Now I came here to check why RAxML-NG was choking on my file that works everywhere else...

It is true that there is an annoying amount of FASTA "flavors" out there, but I think the simple >id [optional comments] format is by far the most common one (also, ignoring spaces in the sequence area). Considering the whole line as the ID is definitely bad and very seldom seen, in my experience.

amkozlov commented 1 year ago

Thanks for your feedback!

Starting with raxml-ng 1.2.0, the default behavior is to truncate FASTA sequence names after the first space.

It is also possible to replace spaces with underscores by adding --msa-format fasta_longlabels option.

amkozlov / raxml-ng

The warning and behaviour regarding illegal characters in taxa names could be improved #90