gem-pasteur / Integron_Finder

Bioinformatics tool to find integrons in bacterial genomes
GNU General Public License v3.0
67 stars 22 forks source link

[HELP] How can I get the sequence of whole integron? #105

Closed Galib009 closed 1 year ago

Galib009 commented 2 years ago

I want to get the full sequence of the integron including the gene cassette, is there any option or way to get the sequence from fasta/multi-fasta file?

Many Thanks for this tool

jeanrjc commented 2 years ago

Hello,

With the option --gbk, you can extract the features of the integrons annotated, which you may convert with any gbk parser.

Alternatively, if you're familiar with bash, you can do the following :

# get position for the integron integron_01 in replicon : myreplicon
positions=$(awk 'BEGIN{min=9999999; max=0}$1=="integron_01" && $2=="myreplicon"{if ($4<min) {min=$4}; if ($5>max) {max=$5}}END{print min".."max}' Results_Integron_Finder_myreplicon/*integrons)

# extract sequence with esl-sfetch (gets installed with hmmer, so you should have it)
# first create index for fasta file (need to be done once)
esl-sfetch --index myreplicon.fst
# extract sequence
esl-sfetch -c $positions myreplicon.fst myreplicon > integron_01_myreplicon.fst
Galib009 commented 2 years ago

Hi Jean, Thank you for your reply. However, I am facing this error after the last command "Didn't find sequence myreplicon in the index" - integron_finder_error

here, I have attached the screenshot of the myreplicon.integrons file -

screenshot_myreplicon_integrons

jeanrjc commented 2 years ago

use $2=="contig_4" in your case instead of $2=="myreplicon" (it means "if column 2 is equal to contig_4")

You can also do it by hand and do esl-sfetch -c 1681061..1684879 myreplicon.fst contig_4 > integron_01_contig_4.fst

if you do echo $positions it should return 1681061..1684879 (at least it's supposed to)