Closed erichy91 closed 3 years ago
Hello,
Could you share the offending sequence, so we can try to reproduce the problem?
Thanks
OK, this is due to a known bug where, for some reason, the proteins of different contigs are reported in every contig (see the mention of this at the end of issue #70).
Here is the .integrons file that we get when running on everything:
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
Results_Integron_Finder_my_contigs/Peatland-Pacbio-361-C109.integrons
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
ID_integron ID_replicon element pos_beg pos_end strand evalue type_elt annotation model type default distance_2attC considered_topology
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C51_72 79033 81651 1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C87_68 80963 81361 1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C116_73 81011 82678 -1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C55_73 81075 81755 1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C109_74 81227 81469 -1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C87_69 81520 82584 1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 attc_001 81524 81583 -1 3.1e-06 attC attC attc_4 complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C109_75 81632 83827 -1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C51_73 81729 83603 -1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C55_74 81759 84566 -1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C87_70 82678 83346 1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C116_74 82679 84484 -1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C87_71 83445 86222 -1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C51_74 83624 84070 -1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C109_76 84116 85201 -1 8.9e-25 protein intI intersection_tyr_intI complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C51_260 271413 274100 1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C55_219 272531 273454 -1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C55_220 273564 273803 1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C55_221 273927 274313 1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C51_261 274204 275541 1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C55_222 274669 275658 -1 NA protein protein NA complete No NA lin
integron_02 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C116_113 130832 131791 -1 7.4e-25 protein intI intersection_tyr_intI In0 No NA lin
And here is what we should get:
ID_integron ID_replicon element pos_beg pos_end strand evalue type_elt annotation model type default distance_2attC considered_topology
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C109_74 81227 81469 -1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 attc_001 81524 81583 -1 3.1e-06 attC attC attc_4 complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C109_75 81632 83827 -1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C109_76 84116 85201 -1 8.9e-25 protein intI intersection_tyr_intI complete No NA lin
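A minimal sketch of how the expected table relates to the buggy one: keep a protein row only when its ID carries the replicon name as a prefix. This is not part of Integron_Finder; the function names are hypothetical, and the sketch assumes that protein IDs have the form <ID_replicon>_<number> (as in the tables above), that no field contains whitespace, and that comment lines, if any, start with '#'.

```python
SAMPLE = """\
ID_integron ID_replicon element pos_beg pos_end strand evalue type_elt annotation model type default distance_2attC considered_topology
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C87_68 80963 81361 1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C109_74 81227 81469 -1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 attc_001 81524 81583 -1 3.1e-06 attC attC attc_4 complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C109_75 81632 83827 -1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C51_73 81729 83603 -1 NA protein protein NA complete No NA lin
integron_01 Peatland-Pacbio-361-C109 Peatland-Pacbio-361-C109_76 84116 85201 -1 8.9e-25 protein intI intersection_tyr_intI complete No NA lin
"""


def parse_integrons(lines):
    """Turn .integrons lines into dicts keyed by the header names.

    Assumption: fields are whitespace-delimited and contain no spaces.
    """
    header, rows = None, []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split()
        if header is None:
            header = fields  # first data line is the column header
        else:
            rows.append(dict(zip(header, fields)))
    return rows


def is_foreign_protein(row):
    """True when a protein row is named after another contig than its replicon."""
    if row["type_elt"] != "protein":
        return False  # attC sites are named attc_NNN, not after the contig
    contig = row["element"].rpartition("_")[0]
    return contig != row["ID_replicon"]


def keep_own_rows(rows):
    """Drop protein rows that belong to a different contig."""
    return [row for row in rows if not is_foreign_protein(row)]


if __name__ == "__main__":
    for row in keep_own_rows(parse_integrons(SAMPLE.splitlines())):
        print(row["ID_replicon"], row["element"], row["annotation"])
```

Note that this only cleans the table. Coordinates or integron types that were computed from the foreign proteins are not corrected by such a filter.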
This is actually a very problematic bug, as it leads to many other issues. Can you try to fix this, @bneron?
This is because a .prt file is created for each contig, but filled with the proteins of all the contigs. For instance, if we have 14 proteins in 5 contigs, we will have 5 .prt files with 14 proteins in each, instead of 5 .prt files that each hold only the proteins of their own contig.
Here we have:
grep ">" -c Results_Integron_Finder_my_contigs/tmp*/*.prt
Results_Integron_Finder_my_contigs/tmp_Peatland-Pacbio-361-C109/Peatland-Pacbio-361-C109.prt:1094
Results_Integron_Finder_my_contigs/tmp_Peatland-Pacbio-361-C51/Peatland-Pacbio-361-C51.prt:1120
Results_Integron_Finder_my_contigs/tmp_Peatland-Pacbio-361-C55/Peatland-Pacbio-361-C55.prt:1120
Results_Integron_Finder_my_contigs/tmp_Peatland-Pacbio-361-C87/Peatland-Pacbio-361-C87.prt:1120
So, in addition to being very annoying, it's very inefficient.
For now, I can only recommend checking that the two columns ID_replicon and element agree for proteins, i.e. that each protein ID starts with the name of its replicon. To bypass this downstream error, you can split your multi-FASTA file into single-FASTA files and run IF on each of them in a loop.
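The splitting step of this workaround could look like the sketch below. It is only an illustration: split_multifasta is a hypothetical helper, not an Integron_Finder function, and it assumes that every header has a non-empty first word that is safe to use as a file name.

```python
from pathlib import Path


def split_multifasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to <out_dir>/<record_id>.fasta.

    Returns the list of files written, in input order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    try:
        with open(fasta_path) as src:
            for line in src:
                if line.startswith(">"):
                    if handle is not None:
                        handle.close()  # finish the previous record
                    # Assumption: the first word of the header is the contig name.
                    record_id = line[1:].split()[0]
                    path = out_dir / f"{record_id}.fasta"
                    handle = open(path, "w")
                    written.append(path)
                if handle is not None:
                    handle.write(line)  # header and sequence lines alike
    finally:
        if handle is not None:
            handle.close()
    return written
```

Each returned path can then be passed to a separate integron_finder run, with the same options as for the original file.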
Ah, apparently, it has been fixed on master, but not released yet.
When running IF from master, the problem disappears, and we indeed have the correct number of proteins per file:
grep ">" -c Results_Integron_Finder_my_contigs/tmp*/*.prt
Results_Integron_Finder_my_contigs/tmp_Peatland-Pacbio-361-C109/Peatland-Pacbio-361-C109.prt:185
Results_Integron_Finder_my_contigs/tmp_Peatland-Pacbio-361-C116/Peatland-Pacbio-361-C116.prt:153
Results_Integron_Finder_my_contigs/tmp_Peatland-Pacbio-361-C51/Peatland-Pacbio-361-C51.prt:314
Results_Integron_Finder_my_contigs/tmp_Peatland-Pacbio-361-C55/Peatland-Pacbio-361-C55.prt:264
Results_Integron_Finder_my_contigs/tmp_Peatland-Pacbio-361-C87/Peatland-Pacbio-361-C87.prt:208
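The counts alone do not show which contigs the proteins come from. A small sketch that groups the headers of a .prt file by contig is given here, assuming the protein IDs follow the <contig>_<number> pattern seen in the .integrons tables; the function names are hypothetical.

```python
from collections import Counter


def contigs_in_prt(lines):
    """Count protein headers per contig prefix in a protein FASTA file."""
    counts = Counter()
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line
        fields = line[1:].split()
        if not fields:
            continue  # header without an ID
        # Assumption: the ID is "<contig>_<number>"; cut at the last underscore.
        contig = fields[0].rpartition("_")[0]
        counts[contig] += 1
    return counts


def foreign_contigs(counts, expected_contig):
    """Contigs other than the expected one that contribute proteins."""
    return sorted(c for c in counts if c != expected_contig)
```

For a file produced by a fixed version, foreign_contigs(contigs_in_prt(open(path)), contig_name) should return an empty list, where path and contig_name stand for one of the .prt files above and the contig it is named after.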
Can you make a release, @bneron, so that people are no longer affected by this?
Thanks
Perfect, thank you jeanrjc. I will wait for this master to be released.
For the record, here is how to install from the master branch:
$ conda create --name IFv2_env # create an environment
$ conda activate IFv2_env # activate it
$ pip install 'git+https://github.com/gem-pasteur/Integron_Finder/#egg=integron_finder' # install from master
$ integron_finder -V
integron_finder version 2-2021-05-21 # you should get the date of when you did the install
...
$ conda deactivate # quit the environment
$ integron_finder -V
<You should get another IF version like integron_finder version 2.0rc6>
Hello, I'm looking for integrons in MAGs, and for some contigs, I have an error message:
End location (191284) must be greater than or equal to start location (271412)
When I remove the --gbk option, the program works correctly. Do you have any idea how I can solve this problem?
Thanks!
Version of Integron_Finder:
2.0rc6 (output of integron_finder --version)
OS
Steps to reproduce behavior
integron_finder --lin mysequence.fasta --promoter-attI --gbk --pdf --local-max --cpu 4
Relevant logs and/or screenshots