PatrickKueck / FASconCAT-G

FASconCAT-G offers a wide range of options for editing and concatenating multiple nucleotide, amino acid, and structure sequence alignment files for phylogenetic and population genetic purposes. The main options include sequence renaming, file format conversion, sequence translation, consensus generation of predefined sequence blocks, and RY coding as well as site exclusions in nucleotide sequences. The options implemented in FASconCAT-G can be invoked in any combination and performed in a single run. FASconCAT-G can also read and handle different file formats (FASTA, CLUSTAL, and PHYLIP) in a single run.

!FILE-ERROR!: Unknown character found in sequence #6

Open linzhi2013 opened 3 years ago

Hi Patrick,

I found that the program fails when the sequences in an alignment contain internal whitespace.

Say I have two files aln_1.fas and aln_2.fas in the same directory, from which I run the command:

perl /home/gmeng/soft/bin/FASconCAT-G_v1.05.pl  -s -p -p -n -l

the program stopped with the !FILE-ERROR! message quoted in the title: [image]

The content of aln_1.fas file:

>JA12
---------- ---------- ---------- ---------- ---------- ----------
---------- ---------- ---------- ---------- ---------- ----------
---------- ---------- ---------- ---------- ---------- -------cga
ataacagaaa gaggtgttgg ggctggttgg actatttatc cccccttatc tggttcttta
tctattatag gggctattaa ttttatttct actatcatta atatgcgaat tataggggtg
>SA1
tctattatag gggctattaa ttttatttct actatcatta atatgcgaat tataggggtg
ataacagaaa gaggtgttgg ggctggttgg actatttatc cccccttatc tggttcttta
ataacagaaa gaggtgttgg ggctggttgg actatttatc cccccttatc tggttcttta
ataacagaaa gaggtgttgg ggctggttgg actatttatc cccccttatc tggttcttta
ataacagaaa gaggtgttgg ggctggttgg actatttatc cccccttatc tggttcttta

The content of aln_2.fas file:

>JA12
cctgctcaat gtaaatagcc gcagtactgt gctaaggtag cataatcact tgtttcctaa
aagaaaagat tacgacctcg atgttgaatt aattagtctt aaagcaaaaa ttaaagaaag
tctgttcgac ttataaataa tt
>SA1
ataacagaaa gaggtgttgg ggctggttgg actatttatc cccccttatc tggttcttta
cctgctcaat gtaaatagcc gcagtactgt gctaaggtag cataatcact tgtttcctaa
tctgttcgac ttataaataa tt

The code that prints the error message was: [image]

So it is this code that cannot handle whitespace inside the sequences of an alignment.
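To check which characters trip such a validation, a small script can list every sequence character that falls outside an expected alphabet. This is only a sketch: the ALLOWED set below (IUPAC nucleotide codes plus - and ?) is my assumption for illustration, not the character list FASconCAT-G actually uses, and find_unexpected is a hypothetical helper name.

```python
# Assumed alphabet: IUPAC nucleotide codes plus gap and missing-data symbols.
# This is an illustrative guess, not FASconCAT-G's actual character list.
ALLOWED = set("ACGTUNRYSWKMBDHV-?")


def find_unexpected(fasta_text):
    """Return (header, line_number, character) for each sequence
    character that is not in ALLOWED (case-insensitive)."""
    hits = []
    header = None
    for line_number, line in enumerate(fasta_text.splitlines(), start=1):
        if line.startswith(">"):
            # Header line: remember the sequence name, do not validate it.
            header = line[1:].strip()
            continue
        for ch in line:
            if ch.upper() not in ALLOWED:
                hits.append((header, line_number, ch))
    return hits


# Small excerpt in the same style as aln_1.fas (blocks of 10 separated by spaces).
sample = ">JA12\nataacagaaa gaggtgttgg\n>SA1\ntctattatag\n"
for header, line_number, ch in find_unexpected(sample):
    print(header, line_number, repr(ch))
```

On the excerpt above, the only character reported is the space between the two blocks of JA12, which matches the behaviour described here.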

If I remove the whitespace from both files, giving aln_1.1.fas:

>JA12
------------------------------------------------------------
------------------------------------------------------------
---------------------------------------------------------cga
ataacagaaagaggtgttggggctggttggactatttatccccccttatctggttcttta
tctattataggggctattaattttatttctactatcattaatatgcgaattataggggtg
>SA1
tctattataggggctattaattttatttctactatcattaatatgcgaattataggggtg
ataacagaaagaggtgttggggctggttggactatttatccccccttatctggttcttta
ataacagaaagaggtgttggggctggttggactatttatccccccttatctggttcttta
ataacagaaagaggtgttggggctggttggactatttatccccccttatctggttcttta
ataacagaaagaggtgttggggctggttggactatttatccccccttatctggttcttta

aln_2.1.fas:

>JA12
cctgctcaatgtaaatagccgcagtactgtgctaaggtagcataatcacttgtttcctaa
aagaaaagattacgacctcgatgttgaattaattagtcttaaagcaaaaattaaagaaag
tctgttcgacttataaataatt
>SA1
ataacagaaagaggtgttggggctggttggactatttatccccccttatctggttcttta
cctgctcaatgtaaatagccgcagtactgtgctaaggtagcataatcacttgtttcctaa
tctgttcgacttataaataatt

then the program works fine.
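Until the program handles this itself, the manual cleanup can be scripted. The following is a minimal sketch of a preprocessing step, assuming plain FASTA input where every line starting with > is a header; strip_sequence_whitespace is a hypothetical helper, not part of FASconCAT-G.

```python
def strip_sequence_whitespace(fasta_text):
    """Remove all whitespace inside sequence lines of a FASTA string.

    Header lines (starting with '>') are kept, minus trailing whitespace.
    Lines that are empty after stripping are dropped.
    """
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            cleaned.append(line.rstrip())
        else:
            # str.split() with no arguments splits on any run of whitespace,
            # so joining the pieces removes spaces and tabs alike.
            stripped = "".join(line.split())
            if stripped:
                cleaned.append(stripped)
    return "\n".join(cleaned) + "\n"


# Excerpt in the style of aln_1.fas.
example = ">JA12\n---------- -------cga\nataacagaaa gaggtgttgg\n"
print(strip_sequence_whitespace(example), end="")
```

To clean a real file, the same function can be applied to its contents, for example with pathlib: read aln_1.fas with Path.read_text(), pass the text through strip_sequence_whitespace, and write the result to aln_1.1.fas with Path.write_text().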

Should the program strip whitespace from the sequences when it reads an alignment? As far as I know, whitespace has no special meaning in an alignment, unlike - or N, right? If so, it should be safe to remove whitespace from the sequences while reading them.

Cheers,
Guanliang