gem-pasteur / Integron_Finder

Bioinformatics tool to find integrons in bacterial genomes
GNU General Public License v3.0

End location (191284) must be greater than or equal to start location (271412) #85

Closed by erichy91 3 years ago

erichy91 commented 3 years ago

Hello, I'm looking for integrons in MAGs, and for some contigs, I have an error message:

End location (191284) must be greater than or equal to start location (271412)

When I remove the --gbk option, the program works correctly. Do you have any idea how I can solve this problem?

Thanks!

Version of Integron_Finder:

2.0rc6 (output of integron_finder --version)

OS

Steps to reproduce behavior

integron_finder --lin mysequence.fasta --promoter-attI --gbk --pdf --local-max --cpu 4

Relevant logs and/or screenshots

Traceback (most recent call last):
  File "/softs/contrib/apps/integron_finder/2.0/bin/integron_finder", line 10, in <module>
    sys.exit(main())
  File "/softs/contrib/apps/integron_finder/2.0/lib/python3.7/site-packages/integron_finder/scripts/finder.py", line 595, in main
    integron_res, summary = find_integron_in_one_replicon(replicon, config)
  File "/softs/contrib/apps/integron_finder/2.0/lib/python3.7/site-packages/integron_finder/scripts/finder.py", line 392, in find_integron_in_one_replicon
    add_feature(replicon, integrons_report, protein_db, config.distance_threshold)
  File "/softs/contrib/apps/integron_finder/2.0/lib/python3.7/site-packages/integron_finder/annotation.py", line 183, in add_feature
    f1 = SeqFeature.FeatureLocation(start_integron_1 - 1, end_integron_1)
  File "/softs/contrib/apps/integron_finder/2.0/lib/python3.7/site-packages/Bio/SeqFeature.py", line 676, in __init__
    self.start))

jeanrjc commented 3 years ago

Hello,

Could you share the offending sequence so that we can try to reproduce the problem?

Thanks

erichy91 commented 3 years ago

Hi Jeanrjc,

Thank you for your quick answer. Here is the file:

my_contigs.fasta.gz

jeanrjc commented 3 years ago

OK, this is due to a known bug where, for some reason, proteins from different contigs are reported in every contig (see the mention at the end of issue #70).

Here is the .integrons file we get when running on all the contigs at once:

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
Results_Integron_Finder_my_contigs/Peatland-Pacbio-361-C109.integrons
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
ID_integron ID_replicon element pos_beg pos_end strand  evalue  type_elt    annotation  model   type    default distance_2attC  considered_topology
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C51_72  79033   81651   1   NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C87_68  80963   81361   1   NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C116_73 81011   82678   -1  NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C55_73  81075   81755   1   NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C109_74 81227   81469   -1  NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C87_69  81520   82584   1   NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    attc_001    81524   81583   -1  3.1e-06 attC    attC    attc_4  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C109_75 81632   83827   -1  NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C51_73  81729   83603   -1  NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C55_74  81759   84566   -1  NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C87_70  82678   83346   1   NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C116_74 82679   84484   -1  NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C87_71  83445   86222   -1  NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C51_74  83624   84070   -1  NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C109_76 84116   85201   -1  8.9e-25 protein intI    intersection_tyr_intI   complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C51_260 271413  274100  1   NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C55_219 272531  273454  -1  NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C55_220 273564  273803  1   NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C55_221 273927  274313  1   NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C51_261 274204  275541  1   NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C55_222 274669  275658  -1  NA  protein protein NA  complete    No  NA  lin
integron_02 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C116_113    130832  131791  -1  7.4e-25 protein intI    intersection_tyr_intI   In0 No  NA  lin

And here is what we should get:

ID_integron ID_replicon element pos_beg pos_end strand  evalue  type_elt    annotation  model   type    default distance_2attC  considered_topology
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C109_74 81227   81469   -1  NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    attc_001    81524   81583   -1  3.1e-06 attC    attC    attc_4  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C109_75 81632   83827   -1  NA  protein protein NA  complete    No  NA  lin
integron_01 Peatland-Pacbio-361-C109    Peatland-Pacbio-361-C109_76 84116   85201   -1  8.9e-25 protein intI    intersection_tyr_intI   complete    No  NA  lin
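
The expected table differs from the reported one only in that every protein row belongs to the replicon it is listed under. Here is a minimal sketch of that consistency check. It assumes the .integrons file is tab-separated with the header shown above, and that protein IDs have the form <replicon ID>_<gene number>, as in these rows. The function names are hypothetical and not part of Integron_Finder.

```python
import csv

HEADER = ["ID_integron", "ID_replicon", "element", "pos_beg", "pos_end",
          "strand", "evalue", "type_elt", "annotation", "model", "type",
          "default", "distance_2attC", "considered_topology"]


def is_consistent(row):
    """Return True if the row's element belongs to the replicon it is listed under."""
    if row["type_elt"] != "protein":
        return True  # attC sites and other non-protein elements are kept as-is
    # Assumption: protein IDs look like "<replicon ID>_<gene number>".
    replicon, _, gene_number = row["element"].rpartition("_")
    return replicon == row["ID_replicon"] and gene_number.isdigit()


def filter_integrons(text):
    """Keep only the consistent rows of a tab-separated .integrons table."""
    lines = [l for l in text.splitlines() if l.strip() and not l.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t")
    return [row for row in reader if is_consistent(row)]


# Three rows taken from the table above: one foreign protein, one genuine
# protein and one attC site.
rep = "Peatland-Pacbio-361-C109"
rows = [
    HEADER,
    ["integron_01", rep, "Peatland-Pacbio-361-C51_72", "79033", "81651", "1",
     "NA", "protein", "protein", "NA", "complete", "No", "NA", "lin"],
    ["integron_01", rep, "Peatland-Pacbio-361-C109_74", "81227", "81469", "-1",
     "NA", "protein", "protein", "NA", "complete", "No", "NA", "lin"],
    ["integron_01", rep, "attc_001", "81524", "81583", "-1",
     "3.1e-06", "attC", "attC", "attc_4", "complete", "No", "NA", "lin"],
]
SAMPLE = "\n".join("\t".join(r) for r in rows)

print([row["element"] for row in filter_integrons(SAMPLE)])
# ['Peatland-Pacbio-361-C109_74', 'attc_001']
```

This only cleans up a table that has already been written. It does not recompute integron boundaries, and it does not prevent the crash seen with --gbk.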

This is actually a very problematic bug, as it leads to many other issues. Can you try to fix this, @bneron?

It happens because the .prt file created for each contig contains the proteins of all the contigs. For instance, with 14 proteins spread over 5 contigs, we get 5 .prt files with 14 proteins in each, instead of 5 .prt files that each contain only the proteins of their own contig.

Here we have :

grep ">" -c Results_Integron_Finder_my_contigs/tmp*/*.prt
Results_Integron_Finder_my_contigs/tmp_Peatland-Pacbio-361-C109/Peatland-Pacbio-361-C109.prt:1094
Results_Integron_Finder_my_contigs/tmp_Peatland-Pacbio-361-C51/Peatland-Pacbio-361-C51.prt:1120
Results_Integron_Finder_my_contigs/tmp_Peatland-Pacbio-361-C55/Peatland-Pacbio-361-C55.prt:1120
Results_Integron_Finder_my_contigs/tmp_Peatland-Pacbio-361-C87/Peatland-Pacbio-361-C87.prt:1120

So, in addition to being very annoying, it's very inefficient.
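
The grep count above shows only the total per file. A short sketch that also counts how many headers in each .prt file come from another contig could look like the following. It assumes the tmp_<replicon>/<replicon>.prt layout seen in the output above and protein headers whose first word is <replicon ID>_<gene number>. The function names are hypothetical.

```python
from pathlib import Path


def count_prt_headers(prt_path, replicon_id):
    """Return (total, foreign) counts of FASTA headers in one .prt file."""
    total = foreign = 0
    with open(prt_path) as handle:
        for line in handle:
            if not line.startswith(">"):
                continue
            total += 1
            # Assumption: the protein ID is the first word of the header
            # and has the form "<replicon ID>_<gene number>".
            fields = line[1:].split()
            protein_id = fields[0] if fields else ""
            if protein_id.rpartition("_")[0] != replicon_id:
                foreign += 1
    return total, foreign


def check_results_dir(results_dir):
    """Map each replicon ID to the (total, foreign) header counts of its .prt file."""
    report = {}
    for prt in sorted(Path(results_dir).glob("tmp_*/*.prt")):
        # The file stem (name without ".prt") is taken as the replicon ID.
        report[prt.stem] = count_prt_headers(prt, prt.stem)
    return report
```

Calling check_results_dir("Results_Integron_Finder_my_contigs") would return one (total, foreign) pair per contig. A non-zero foreign count indicates that the run is affected by the bug.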

For now, I can only recommend checking that, for protein rows, the ID_replicon and element columns correspond to each other. To bypass this downstream error, you can split your multi-FASTA file into single-FASTA files and run IF on each of them in a loop.
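
A minimal sketch of that workaround, using only the Python standard library, is given below. It assumes the input has already been decompressed and that the first word of each FASTA header is usable as a file name. The function names are hypothetical, and the integron_finder options are simply those from the original report.

```python
import subprocess
from pathlib import Path


def split_multifasta(fasta_path, out_dir):
    """Write one FASTA file per record and return the list of files written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                # Assumption: the first word of the header is a valid file name.
                fields = line[1:].split()
                record_id = fields[0] if fields else f"record_{len(written) + 1}"
                path = out_dir / f"{record_id}.fasta"
                out = open(path, "w")
                written.append(path)
            if out is not None:
                out.write(line)  # header and sequence lines are copied verbatim
    if out is not None:
        out.close()
    return written


def run_integron_finder(fasta_files):
    """Run integron_finder once per single-contig file."""
    for fasta in fasta_files:
        subprocess.run(
            ["integron_finder", "--lin", str(fasta), "--promoter-attI",
             "--gbk", "--pdf", "--local-max", "--cpu", "4"],
            check=True,  # stop at the first contig that fails
        )
```

For example, run_integron_finder(split_multifasta("my_contigs.fasta", "single_contigs")) would process each contig on its own, so no .prt file can contain proteins from another contig.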

jeanrjc commented 3 years ago

Ah, apparently, it has been fixed on master, but not released yet.

When running IF from master, the problem disappears, and we do get the correct number of proteins per file:

grep ">" -c Results_Integron_Finder_my_contigs/tmp*/*.prt
Results_Integron_Finder_my_contigs/tmp_Peatland-Pacbio-361-C109/Peatland-Pacbio-361-C109.prt:185
Results_Integron_Finder_my_contigs/tmp_Peatland-Pacbio-361-C116/Peatland-Pacbio-361-C116.prt:153
Results_Integron_Finder_my_contigs/tmp_Peatland-Pacbio-361-C51/Peatland-Pacbio-361-C51.prt:314
Results_Integron_Finder_my_contigs/tmp_Peatland-Pacbio-361-C55/Peatland-Pacbio-361-C55.prt:264
Results_Integron_Finder_my_contigs/tmp_Peatland-Pacbio-361-C87/Peatland-Pacbio-361-C87.prt:208

Can you make a release, @bneron, so that people are no longer affected by this?

Thanks

erichy91 commented 3 years ago

Perfect, thank you jeanrjc. I will wait for the version on master to be released.

jeanrjc commented 3 years ago

For the record, here is how to install from the master branch:

$ conda create --name IFv2_env # create an environment
$ conda activate IFv2_env # activate it
$ pip install 'git+https://github.com/gem-pasteur/Integron_Finder/#egg=integron_finder' # install from master
$ integron_finder -V
integron_finder version 2-2021-05-21 # you should get the date of when you did the install
...
$ conda deactivate # quit the environment
$ integron_finder -V
<You should get another IF version like integron_finder version 2.0rc6>