gem-pasteur / Integron_Finder

Bioinformatics tool to find integrons in bacterial genomes
GNU General Public License v3.0

Help me understand the output of IF with a multi-fasta file #87

Closed bramvandijk88 closed 3 years ago

bramvandijk88 commented 3 years ago


I need some help making sense of the output of IF. I'm simply not sure what I'm looking at, and I failed to find documentation on what each column means. I ran Integron Finder (see command below) on a simple multi-fasta file, which is a metagenomic bin derived from a compost sample (Thiopseudomonas denitrificans). I have many other bins, but this was one of the better ones (CheckM reports it as 100% complete and 0.0% contaminated).

command used: integron_finder Bin_Thiopseudomonas_denitrificans.fa --func-annot --pdf

Alright, here are the first 10 lines of integron_finder_summary:

ID_replicon     ID_integron     complete        In0     CALIN
NODE_8_length_364158_cov_54.952516      integron_01     1       0       0
NODE_18_length_295240_cov_51.484113     integron_01     0       1       0
NODE_44_length_232357_cov_53.606577     integron_01     0       0       1
NODE_44_length_232357_cov_53.606577     integron_02     0       1       0
NODE_45_length_230972_cov_50.035606     integron_01     0       1       0
NODE_46_length_229500_cov_52.887424     integron_01     0       1       0
NODE_53_length_219586_cov_50.408084     integron_01     0       1       0
NODE_101_length_136309_cov_54.219972    integron_01     0       1       0
NODE_118_length_124795_cov_50.152998    integron_01     0       1       0

This reports that the integron found on contig "NODE_8" is complete. Here are the first 12 lines of integron_finder_results that match that integron:

ID_integron     ID_replicon     element pos_beg pos_end strand  evalue  type_elt        annotation      model   type    default distance_2attC  considered_topology
integron_01     NODE_8_length_364158_cov_54.952516      NODE_53_length_219586_cov_50.408084_212 217493  219586  -1      NA      protein protein NA      complete        Yes     NA      lin
integron_01     NODE_8_length_364158_cov_54.952516      NODE_46_length_229500_cov_52.887424_214 217644  219185  -1      NA      protein protein NA      complete        Yes     NA      lin
integron_01     NODE_8_length_364158_cov_54.952516      NODE_18_length_295240_cov_51.484113_194 218063  222019  1       NA      protein protein NA      complete        Yes     NA      lin
integron_01     NODE_8_length_364158_cov_54.952516      NODE_8_length_364158_cov_54.952516_213  218189  219178  -1      1e-27   protein intI    intersection_tyr_intI   complete        Yes     NA      lin
integron_01     NODE_8_length_364158_cov_54.952516      NODE_44_length_232357_cov_53.606577_227 218920  220359  -1      NA      protein protein NA      complete        Yes     NA      lin
integron_01     NODE_8_length_364158_cov_54.952516      NODE_45_length_230972_cov_50.035606_214 218968  219582  -1      NA      protein protein NA      complete        Yes     NA      lin
integron_01     NODE_8_length_364158_cov_54.952516      NODE_46_length_229500_cov_52.887424_215 219182  220348  -1      NA      protein protein NA      complete        Yes     NA      lin
integron_01     NODE_8_length_364158_cov_54.952516      NODE_8_length_364158_cov_54.952516_214  219220  219480  -1      NA      protein protein NA      complete        Yes     NA      lin
integron_01     NODE_8_length_364158_cov_54.952516      attc_001        219508  219566  1       7.8e-06 attC    attC    attc_4  complete        Yes     NA      lin
integron_01     NODE_8_length_364158_cov_54.952516      NODE_8_length_364158_cov_54.952516_215  219554  219853  -1      NA      protein protein NA      complete        Yes     NA      lin
integron_01     NODE_8_length_364158_cov_54.952516      NODE_45_length_230972_cov_50.035606_215 219603  221861  -1      NA      protein protein NA      complete        Yes     NA      lin

For some of the columns, it is rather obvious what they mean (ID_integron, type_elt, distance_2attC, strand), but for many others I have no clue. The third column especially confuses me: it appears to contain the names of other contigs in the same file. What does that mean? Is the evalue column related to that?

Thanks for the help!

Best,

Bram

jeanrjc commented 3 years ago

Hello,

This is the same issue as #85 and #70. You can read about it at https://github.com/gem-pasteur/Integron_Finder/issues/85#issuecomment-839942307

Long story short, it is fixed if you install IntegronFinder from the master branch rather than from the release candidate 2.0rc6.

See here for how to do that: https://github.com/gem-pasteur/Integron_Finder/issues/85#issuecomment-850339451

jeanrjc commented 3 years ago

Hi again @bramvandijk88,

Concerning your question:

Especially the third column seems to confuse me, which appears to contain the names of other contigs in the same file? What does that mean? Is the evalue column related to that?

The third column, element, is the main column of this table, because the table has one element per line. An element can be a protein, an attC site, an attI site, a promoter, etc. Protein identifiers are those produced by Prodigal and have the form ID_contig_1, where 1 is the protein's number as assigned by Prodigal along the sequence. Because of the bug, proteins from other contigs are aggregated with the contig that carries the integron, hence your confusion. The evalue is given for attC sites and intI, and for proteins when functional annotation is used; all other elements have NA.
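To check whether a results table is affected, one option is to compare each protein's contig prefix with ID_replicon. The following is only a rough sketch, not part of Integron Finder: the helper names `contig_of` and `foreign_proteins` are made up for illustration, the sample keeps just a few of the columns and rows shown above, and it assumes the protein number is always the last underscore-separated field of the element name.

```python
import csv
import io


def contig_of(element_id):
    """Drop the trailing _<n> protein number to recover the contig name.

    Assumes identifiers of the form <contig>_<n>, as described above.
    Returns None when the last underscore-separated field is not a number.
    """
    contig, sep, index = element_id.rpartition("_")
    return contig if sep and index.isdigit() else None


def foreign_proteins(rows):
    """Yield protein rows whose contig prefix differs from ID_replicon."""
    for row in rows:
        if row["type_elt"] != "protein":
            continue  # attC sites and other elements carry no contig prefix
        if contig_of(row["element"]) != row["ID_replicon"]:
            yield row


# Reduced sample: a subset of the columns and rows from the table above.
header = ["ID_integron", "ID_replicon", "element", "pos_beg", "pos_end", "type_elt"]
replicon = "NODE_8_length_364158_cov_54.952516"
records = [
    ["integron_01", replicon, "NODE_53_length_219586_cov_50.408084_212", "217493", "219586", "protein"],
    ["integron_01", replicon, "NODE_8_length_364158_cov_54.952516_213", "218189", "219178", "protein"],
    ["integron_01", replicon, "attc_001", "219508", "219566", "attC"],
    ["integron_01", replicon, "NODE_8_length_364158_cov_54.952516_215", "219554", "219853", "protein"],
]
sample = "\n".join("\t".join(fields) for fields in [header] + records)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
flagged = [row["element"] for row in foreign_proteins(rows)]
print(flagged)
```

On this sample, the only flagged element is the protein whose name starts with NODE_53, which is the kind of row the bug produces. To run it on a real results file, replace the in-memory sample with the opened file.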

I hope this clears up your confusion! Best