arq5x / bedtools

A powerful toolset for genome arithmetic.
http://code.google.com/p/bedtools/
GNU General Public License v2.0
139 stars 86 forks source link

Bedtools getfasta and ORF from a blastX run #158

Closed StromTroopers closed 4 years ago

StromTroopers commented 4 years ago

Hello I made a blastX research (A query genome in nucleotide format translated into protein in 6 reading frames against a protein dabatase)

And here is a head of the result :

IDBA_scaffold_7517  493 YP_009316100    0.360   61  27  1   150 4   126 186 6.775E-03   37
IDBA_scaffold_29149 519 AIG51500    0.428   35  18  1   2792    2896    392 424 2.712E-03   40

I have 2 hits:

The first one is the query IDBA_scaffold_7517 that have a hit with a protein sequence YP_009316100 in the (-) strand because the start begins by 150 and ends with 4.

The seconde one is the query IDBA_scaffold_29149 that have a hit with a protein sequence AIG51500 in the (+) strand because the start begins 2794 and ends by 2893.

so with that I actually created a bed file in order to extract the fasta sequences from my genome ( I assume that since there is a hit with a protein in the Genomes the coordinates of this hit should be in a good ORF (without stop codon).

Here is the bed_file I made :

IDBA_scaffold_7517  4   150 Species 0   -
IDBA_scaffold_29149 2794    2893    Species 0   +

Then in order to extract the sequences from the coordinates in the Genome I used:

bedtools getfasta -fi Genome.fa -bed bed_file -s -fo test

with -s to force strandedness. If the feature occupies the antisense, strand, the sequence will be reverse complemented.

But when I do that the sequences extracted from the query are not in the good ORF. I guess it is because I should take into account the fact that the coordinates are from nuc translated to portein but how could I do that? Should I change the coordinates manually from the bed file according to the fact that it is a negative of positif strand ?

Thank you for your help.