desmid / mview

MView extracts and reformats the results of a sequence database search or multiple alignment.
GNU General Public License v2.0

Deal with asterisk "*" character in inputs #23

Open Aciole-David opened 2 weeks ago

Aciole-David commented 2 weeks ago

Hello! I'd like to kindly ask whether it is possible to add support for inputs containing the asterisk "*" character. The one I tested was made with sam2fasta.py:

raw

$ cat minitestsam2fastapy.fasta

>NC_045512.2_1bp_to_1680bp
ATTAAAGGTTTATACCTTCCCAGGTAACAAACC
>clone1
ATTAAAGGTTTATACC**CCCAGGTAACAAACC
>clone2
-----------------------------AACC

raw fails

$ mview -in fasta minitestsam2fastapy.fasta

Sequence lengths differ for output format 'mview' - aborting
mview: no alignments found
mview: no alignments found

edited

$ cat minitestsam2fastapy-EDIT.fasta

>NC_045512.2_1bp_to_1680bp
ATTAAAGGTTTATACCTTCCCAGGTAACAAACC
>clone1
ATTAAAGGTTTATACC--CCCAGGTAACAAACC
>clone2
-----------------------------AACC

edited works

$ mview -in fasta minitestsam2fastapy-EDIT.fasta

Identities normalised by aligned length.

                               cov    pid  1 [        .         .         .  ] 33
1 NC_045512.2_1bp_to_1680bp 100.0% 100.0%    ATTAAAGGTTTATACCTTCCCAGGTAACAAACC   
2 clone1                     93.9%  93.9%    ATTAAAGGTTTATACC--CCCAGGTAACAAACC   
3 clone2                     12.1% 100.0%    -----------------------------AACC   

MView 1.67, Copyright (C) 1997-2020 Nigel P. Brown

desmid commented 1 week ago

Hi David,

Historically, the Pearson FASTA format used an optional asterisk as an end marker on each sequence record, like:

>my-sequence
ATTAAAGGTTTATACCTTCCCAGGTAACAAACC
GGTTTATACCTTCCCA*

The last asterisk is not part of the sequence itself and reading the sequence stops at that point. I can't remove this behaviour without breaking older inputs. However, I can skip over internal asterisks as in your example.

The branch https://github.com/desmid/mview/tree/issue-23-handle-asterisks-in-pearson-format makes the change, but notice that if one of your sequences ever actually ends with an asterisk, that will look like the end-of-sequence marker and will be stripped.
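For anyone who needs a workaround before picking up the branch, the distinction between a trailing end marker and internal asterisks can be sketched as a small preprocessing step. This is not MView's code: `read_fasta` and `normalise_asterisks` are hypothetical helper names, and replacing internal asterisks with the gap character `-` is an assumption that simply mirrors the manual edit shown earlier in this thread.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples.

    Hypothetical helper: multi-line records are joined into one string.
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def normalise_asterisks(seq, gap="-"):
    """Drop one trailing '*' (Pearson end marker), then turn any
    remaining '*' into the gap character (an assumption, see above)."""
    if seq.endswith("*"):
        seq = seq[:-1]
    return seq.replace("*", gap)


text = """>ref
ATTAAAGG
>clone1
ATTA**GG
>old-style
ATTAAAGG*
"""

cleaned = [(name, normalise_asterisks(seq)) for name, seq in read_fasta(text)]
for name, seq in cleaned:
    print(f">{name}\n{seq}")
```

The same caveat applies to this sketch: a sequence whose last alignment column really is an asterisk loses that column, because the final asterisk is indistinguishable from the end marker. For example, `normalise_asterisks("ATTA**")` returns a 5-character string from a 6-character input, so that row would again differ in length from the others.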

Nigel