bbuchfink / diamond

Accelerated BLAST compatible local sequence aligner.
GNU General Public License v3.0

Error: invalid character (.) in sequence #383

Open cdiazmun opened 4 years ago

cdiazmun commented 4 years ago

Hello,

I'm running Diamond on a set of samples consisting of FASTA files with amino acid sequences. I generated those files from the annotation files by using:

gffread file.gff -g file.fa -y aa_file.fa

The files then consist of FASTA headers and the amino acid sequences. The error is reported at different positions (2408, 2496, 2527, etc.), but when I look at those positions in the FASTA file there's nothing wrong with the sequence, and in particular there isn't any "." in the sequence as the error says. I also checked the other related issues about invalid characters but couldn't relate them to my case. I'll submit one of the files here (from a public reference genome). Thank you in advance.

S288C_genome_nomit.fa.gz

bbuchfink commented 4 years ago

There seems to be a . in your sequence:

>g2497.t1 gene=g2497
MNIYTSPTRTPNIAPKSGQRPSLPMLATDERSTDKESPNEDREFVPCSSLDVRRIYPKGPLLVLPEKIYL
YSEPTVKELLPFDVVINVAEEANDLRMQVPAVEYHHYRWEHDSQIALDLPSLTSIIHAATTKREKILIHC
QCGLSRSATLIIAYIMKYHNLSLRHSYDLLKSRADKINPSIGLIFQLMEWEVALNAKTNVQANSYRKKRS
LSSYLSNVSTRREELEKISKQETSEEEDTAGKHEQRETLSEEVSDKFPENVASFRSQTTSVHQATQNNLN
AKESEDLAHKNDASSHEGEVNGDSRPDDVPETNEKISQAIRAKISSSSSSPNVRNVDIQNHQPFSRDQLR
AMLKEPKRKTVDDFIEEEGLGAVEEEDLSDEVLEKNTTEPENVEKDIEYSDSDKDTDDVGSDDPTAPNSP
IKLGRRKLVRGDQLDATTSSMFNNESDSELSDIDDSKNIALSSSLFRGGSSPVKETNNNLSNMNSSPAQN
PKRGSVSRSNDSNKSSHIAVSKRPKQKKGIYRDSGGRTRLQIACDKGKYDVVKKMIEEGGYDINDQDNAG
NTALHEAALQGHIEIVELLIENGADVNIKSIEMFGDTPLIDASANGHLDVVKYLLKNGADPTIRNAKGLT
AFESVDDESEFDDEEDQKILREIKKRLSIAAKKWTNRAGIHNDKSKNGNNAHTIDQPPFDNTTKAKNEKA
ADSPSMASNIDEKAPEEEFYWTDVTSRAGKEKLFKASKEGHLPYVGTYVENGGKIDLRSFFESVKCGHED
ITSIFLAFGFPVNQTSRDNKTSALMVAVGRGHLGTVKLLLEAGADPTKRDKKGRTALYYAKNSIMGITNS
EEIQLIENAINNYLKKHSEDNNDDDDDDDNNNETYKHEKKREKTQSPILASRRSATPRIEDEEDDTRMLN
LADDDFNNDRDVKESTTSDSRKRLDDNENVGTQYSLDWKKRKTNALQDEEKLKSISPLSMEPHSPKKAKS
VEISKIHEETAAEREARLKEEEEYRKKRLEKKRKKEQELLQKLAEDEKKRIEEQEKQKVLEMERLEKATL
EKARKMEREKEMEEISYRRAVRDLYPLGLKIINFNDKLDYKRFLPLYYFVDEKNDKFVLDLQVMILLKDI
DLLSKDNQPTSEKIPVDPSHLTPLWNMLKFIFLYGGSYDDKKNNMENKRYVVNFDGVDLDTKIGYELLEY
KKFVSLPMAWIKWDNVVIENHAKRKEIEGNMIQISINEFARWRNDKLNKAQQPTRKQRSLKIPRELPVKF
QHRMSISSVLQQTSKEPF.FVQTKALSKATLTDLPERWENMPNLEQKEIADNLTERQKLPWKTLNNEEIK
AAWYISYGEWGPRRPVHGKGDVAFITKGVFLGLGISFGLFGLVRLLANPETPKTMNREWQLKSDEYLKSK
NANPWGGYSQVQSK

cdiazmun commented 4 years ago

Sorry! I thought that the number reported with the error referred to the position (line) in the file, not the entry. I guess that . should be a *. It may have to do with the fact that some genes have introns (like that one, g2497), and in the annotation file (GFF) the position with the . is instead an X. I guess gffread transformed it into a dot. I'll look in more detail at why there's a dot there and whether I should remove it or change it to *. Thank you very much for your prompt answer.
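Since the number in the error refers to a FASTA entry rather than a line, a small script can help locate the offending records. The following is only a sketch: the function name `find_invalid_records` is hypothetical, and the `ALLOWED` set is an assumption about which letters are acceptable in protein input, not something taken from Diamond's source. It numbers records from 1; whether Diamond counts from 0 or 1 is not stated in this thread, so compare the reported header rather than relying on the number alone.

```python
from typing import Iterable, Iterator, Set, Tuple

# Assumption: letters treated as acceptable for protein input.
# This set is illustrative and is NOT copied from Diamond's source code.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYBJZXUO*-")


def find_invalid_records(
    lines: Iterable[str], allowed: Set[str] = ALLOWED
) -> Iterator[Tuple[int, str, str]]:
    """Yield (1-based record number, header, invalid characters) per bad record."""
    record_no = 0
    header = None
    bad = set()
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # A new header closes the previous record; report it if needed.
            if header is not None and bad:
                yield record_no, header, "".join(sorted(bad))
            record_no += 1
            header = line[1:]
            bad = set()
        elif header is not None:
            # Sequence line: collect every character outside the allowed set.
            bad.update(ch for ch in line.upper() if ch not in allowed)
    # Report the final record, which has no following header to close it.
    if header is not None and bad:
        yield record_no, header, "".join(sorted(bad))
```

To use it on a file, pass an open text handle, for example `find_invalid_records(open("aa_file.fa"))`, or `gzip.open(path, "rt")` for a compressed file, and print each tuple it yields.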

peterthorpe5 commented 3 years ago

@bbuchfink (love the tool!!! - great work. I use it loads!). Would it be possible to make diamond ignore "." or "*", basically translated stops?

bbuchfink commented 3 years ago

A * should already be ignored or treated as a stop. I'm not aware that a . is also used to encode a stop. An option to ignore certain characters could certainly be added. If you don't mind doing a little hacking, you can edit src/basic/value.cpp, line 58:

const Value_traits amino_acid_traits(AMINO_ACID_ALPHABET, 23, "UO-", Sequence_type::amino_acid);

In the string "UO-" you can add additional characters that should be ignored and treated as X.
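For anyone who would rather not recompile, a similar effect can be approximated by cleaning the input file before running Diamond. This is a minimal sketch, not part of Diamond: the function name `replace_in_sequences` is hypothetical, and replacing the dot with X is an assumption that mirrors the "treated as X" behaviour described above. Whether X or * is the right replacement depends on what the dot actually represents in the gffread output, which this thread leaves open.

```python
from typing import Iterable, Iterator


def replace_in_sequences(
    lines: Iterable[str], old: str = ".", new: str = "X"
) -> Iterator[str]:
    """Yield FASTA lines with `old` replaced by `new` in sequence lines only."""
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # Headers are left untouched: IDs such as g2497.t1 contain dots.
            yield line
        else:
            yield line.replace(old, new)
```

Writing the yielded lines to a new file, one per line, gives a cleaned copy that leaves the original untouched. Passing `new="*"` instead marks the position as a stop.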

yanyew commented 2 years ago

Hello! I wonder whether the option to ignore certain characters has been added. Thank you!

bbuchfink commented 2 years ago

No, sorry, this has not been added.