amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0

SNAP doesn't support full FASTA/IUBMB/IUPAC sequence representation #80

Closed taltman closed 3 years ago

taltman commented 7 years ago

It seems that SNAP only supports the following base symbols: ACTGN, while FASTA supports a richer alphabet coming from the IUBMB/IUPAC:

https://en.wikipedia.org/wiki/FASTA_format

See error message below, where valid FASTA character 'R' (A or G; "puRine") is flagged.

At the very least, SNAP should be aware of the difference between valid characters (e.g., "R-") and invalid ones (e.g., "#*@@#(&$"), so that warnings are thrown only for the latter, and not the former (even if the former are silently converted into 'N' internally).

Loading FASTA file '/dev/shm/taltman/Martin_etal_TextS3_13Dec2011.fasta' into memory...
FASTA file contained a character that's not a valid base (or N): 'R', full line 'CACGCGTCGAAAAGGTAAGTACTTCTTTACCGGGTATGTGTTRATTTTTATGACGTCACT';
converting to 'N'.  This may happen again, but there will be no more warnings.
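The distinction requested above can be sketched roughly as follows. This is only an illustration in Python, not SNAP's code: the names `CORE_BASES`, `IUPAC_EXTRA`, `classify`, and `normalize_line` are hypothetical, and the set of "valid but unsupported" symbols is assumed from the nucleic acid codes listed on the FASTA format page linked above.

```python
# Symbols the aligner handles directly (assumption: matches the ACTGN set named in this issue).
CORE_BASES = frozenset("ACGTN")
# Other nucleic acid codes that are valid in FASTA (assumed list: U, ambiguity codes, gap).
IUPAC_EXTRA = frozenset("URYKMSWBDHV-")


def classify(ch):
    """Return 'base', 'iupac', or 'invalid' for a single sequence character."""
    c = ch.upper()
    if c in CORE_BASES:
        return "base"
    if c in IUPAC_EXTRA:
        return "iupac"
    return "invalid"


def normalize_line(line):
    """Convert every unsupported character to 'N'.

    Returns the normalized sequence plus the list of truly invalid
    characters, so a caller can warn about those only.
    """
    out = []
    invalid = []
    for ch in line.strip():
        kind = classify(ch)
        if kind == "base":
            out.append(ch)
        else:
            out.append("N")
            if kind == "invalid":
                invalid.append(ch)
    return "".join(out), invalid
```

With this split, `normalize_line("GTTRATT")` converts the `R` to `N` and reports nothing, while `normalize_line("AC#G")` converts the `#` and also returns it for a warning.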

taltman commented 7 years ago

Also, it would be helpful if SNAP printed the line number (or the FASTA defline identifier for the current entry) where the invalid character was detected, since SNAP is already tracking that internal state.
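As a rough sketch of what such a report could carry, the Python below walks FASTA lines and records the line number and the current defline identifier for each character outside ACGTN. The function name `find_non_acgtn` and the choice of the first whitespace-delimited token of the defline as the identifier are assumptions for illustration; this is not how SNAP parses its input.

```python
SUPPORTED = frozenset("ACGTN")


def find_non_acgtn(lines):
    """Yield (line_number, record_id, character) for each character outside ACGTN.

    `lines` is any iterable of FASTA lines; line numbers are 1-based.
    """
    record_id = None
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Defline: take the first token after '>' as the identifier.
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            continue
        for ch in line:
            if ch.upper() not in SUPPORTED:
                yield line_number, record_id, ch
```

A caller could format each tuple into a warning such as "line 3, entry seq1: 'R' converted to 'N'", which would make the offending record easy to locate in a large reference.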

bolosky commented 7 years ago

There’s essentially no way that SNAP will ever be able to deal gracefully with multi-bases like R; the idea that there are four bases is baked into the code way too deeply. So it seems reasonable to me to put out a warning when it gets a reference with one of these bases in it, to let the user know that it’s going to be treated like an N (which in SNAP never matches anything, even Ns in the read).

I suppose that the warning could be better worded, but I’m inclined to keep it rather than making it go away.

--B
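To make the consequence of that conversion concrete, here is a minimal Python sketch of the matching rule described above, in which N never matches anything, including another N. The helper names `bases_match` and `count_mismatches` are hypothetical, and the sketch only models the stated semantics; it is not taken from SNAP's source.

```python
def bases_match(ref_base, read_base):
    """Two bases match only if they are equal and neither one is N."""
    if ref_base == "N" or read_base == "N":
        return False
    return ref_base == read_base


def count_mismatches(ref, read):
    """Count mismatching positions between two equal-length sequences."""
    if len(ref) != len(read):
        raise ValueError("sequences must have the same length")
    return sum(not bases_match(a, b) for a, b in zip(ref, read))
```

Under this rule, a reference position that held `R` and was converted to `N` costs one mismatch for every read aligned across it, whether the read carries A, G, or N at that position.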

