Closed frankMusacchia closed 10 years ago
Hi Francesco,
It isn't an error because X is a valid code in a protein sequence (and so are G, A, T and C). Check out these links for valid amino acid and nucleotide codes. Bio::SeqIO provides the -alphabet
option to explicitly declare the sequences as dna
, rna
or protein
for situations like yours.
Hope this helps, Paul
Oh forgot to close issue
This is not a BioPerl bug but a bug of the sequence itself, because a nucleotide sequence should always use N for an unknown base and not X, since X is reseved for unknown aminoacid or Xanthine (but Xanthine is part of the degradation pathway of Guanine, so it will not be in the DNA sequence).
See IUPAC nomenclature rules:
BioPerl gets confused and assume X refers to unknown aminoacid because its the standard meaning, and A-T-C-G letters are also used for aminoacids. If you can't control that your sequence source give Xs instead of Ns for unknown bases, and you don't want to change the Xs for Ns in your sequence files, you can always overwrite the alphabet guess performed by BioPerl and set the alphabet yourself by using "$obj->alphabet('dna')" to set it to DNA. =)
Cheers.
Oops, I took too long to write that message! Thanks Paul! :+1:
Thank you! Sorry for this bug report... I should have read better the IUPAC rules before.
bests, Francesco
2014-09-12 17:09 GMT+02:00 Francisco J. Ossandon notifications@github.com:
Oops, I took too long to write that message! Thanks Paul! [image: :+1:]
— Reply to this email directly or view it on GitHub https://github.com/bioperl/bioperl-live/issues/82#issuecomment-55416340.
Francesco Musacchia - Ph.D. Bioinformatics - Animal Physiology and Evolution Stazione Zoologica Anton Dohrn Villa Comunale, 80121 Napoli - Italy
I had a problem with a fasta file to identify the type of sequences. The following code:
!/usr/bin/perl
use strict; use warnings;
SEQUENCE MANAGEMENT MODULES
use Bio::SeqIO;
my $fastaFilePath = $ARGV[0]; my $seqio = Bio::SeqIO->new( -file => $fastaFilePath, -format=>'fasta'); my $obj = $seqio->next_seq(); print $obj->alphabet;
Returns: proteins
The first lines of the file are the following (here I cannot upload the Fasta file):
I already identified the problem.. it is in the presence of the Xs. When I just substitute N to X then the script works well. But isn't it an error to display protein with these sequencese?
best regards
Francesco