flass / pantagruel

a pipeline for reconciliation of phylogenetic histories within a bacterial pangenome
GNU General Public License v3.0
46 stars 7 forks source link

Pantagrel pipeline task 0: failed, adding region features to GFF file #25

Closed mattbawn closed 4 years ago

mattbawn commented 4 years ago

Thank you for your previous help.

I have ran

pantagruel -d database -r . -a genomes/ -T /nbi/Research-Groups/IFR/Rob-Kingsley/R134_Pantagruel/Taxonomy/ncbi-taxonomy-2019-11-07 -I matt.bawn@earlham.ac.uk init 

and then

pantagruel -i database/environ_pantagruel_database.sh all 

but Pantagruel fails at task 0. the standard output:

This is Pantagruel pipeline version 76858aaa0b4189a60271b5eb786d924fb8d6441b using source code from repository '/opt/software/pantagruel'
# will run tasks: 0 1 2 3 4 5 6 7 8 9
[2019-11-07 09:43:57] Pantagrel pipeline task 0: fetch public genome data from NCBI sequence databases and annotate private genomes.
Create new task folder '/nbi/Research-Groups/IFR/Rob-Kingsley/R134_Pantagruel/database/00.input_data'
[2019-11-07 09:43:57] extract assembly data from folder '/nbi/Research-Groups/IFR/Rob-Kingsley/R134_Pantagruel/genomes'
found 134 contig files (raw genome assemblies) in /nbi/Research-Groups/IFR/Rob-Kingsley/R134_Pantagruel/genomes/contigs/
[2019-11-07 09:43:57] ragout_100
found annotation folder '/nbi/Research-Groups/IFR/Rob-Kingsley/R134_Pantagruel/genomes/annotation/ragout_100' ; skip annotation of contigs in '/nbi/Research-Groups/IFR/Rob-Kingsley/R134_Pantagruel/genomes/contigs/ragout_100.fa'
fix annotation to integrate region information into GFF files

and error:

readlink: missing operand
Try 'readlink --help' for more information.
dirname: missing operand
Try 'dirname --help' for more information.
dirname: missing operand
Try 'dirname --help' for more information.
ERROR: something went wrong when adding region features to GFF file (In: /nbi/Research-Groups/IFR/Rob-Kingsley/R134_Pantagruel/database/00.input_data/annotation/ragout_100/100_02272018.gff; Out:/nbi/Research-Groups/IFR/Rob-Kingsley/R134_Pantagruel/database/00.input_data/annotation/ragout_100/100_02272018.ptg.gff)
ERROR: Pantagrel pipeline task 0: failed.

the file 100_02272018.ptg.gff contains:

##gff-version 3
##sequence-region chr 1 4898298
##sequence-region C_100_0 1 24265
##sequence-region C_100_1 1 2758
##sequence-region C_100_2 1 1711
##sequence-region C_100_3 1 2738
##sequence-region C_100_4 1 7154
##sequence-region C_100_5 1 2603
##sequence-region C_100_6 1 7007
##sequence-region C_100_7 1 5359
##sequence-region C_100_8 1 2796
##sequence-region C_100_9 1 820
##sequence-region C_100_10 1 636
##sequence-region C_100_11 1 3887
##sequence-region C_100_12 1 1207
##sequence-region C_100_13 1 1697
##sequence-region C_100_14 1 1802
##sequence-region C_100_15 1 560
##sequence-region C_100_16 1 437
##sequence-region C_100_17 1 440
##sequence-region C_100_18 1 881
##sequence-region C_100_19 1 1059
##sequence-region C_100_20 1 523
##sequence-region C_100_21 1 1613
##sequence-region C_100_22 1 674
##sequence-region C_100_23 1 940
##sequence-region C_100_24 1 6229
##sequence-region C_100_25 1 19376
##sequence-region C_100_26 1 1828
##sequence-region C_100_27 1 24445
##sequence-region C_100_28 1 2910
##sequence-region C_100_29 1 375
##sequence-region C_100_30 1 5396
##sequence-region C_100_31 1 1402
##sequence-region C_100_32 1 304
##sequence-region C_100_33 1 426
##sequence-region C_100_34 1 2731
##sequence-region C_100_35 1 1700
##sequence-region C_100_36 1 7180
##sequence-region C_100_37 1 380
chr Prodigal:2.6    CDS 25  711 .   -   0   ID=100_00001;Parent=100_00001_gene;eC_number=2.1.1.-;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P44676;locus_tag=100_00001;product=putative tRNA/rRNA methyltransferase

but the error message is not clear to me.

Thanks,

Matt

flass commented 4 years ago

Hi again Matt,

it seems there are a few issues going on here:

First of all, I would suggest that you have the std output and error streams directed to the same file, as it allows to put the error in the context of what happens in the pipeline sequence.

Then, the error occurs in the call to the script add_region_feature2prokkaGFF.py, which has its own stdin/stderr streams redirected to a sepcific log file, which in your case should be: database/logs/add_region_feature2prokkaGFF/add_region_feature2prokkaGFF.ragout_100.log

having the content of that log would help me to solve your issue; can you post it here please?

F

mattbawn commented 4 years ago

Hi again Florent,

The log file is:

Traceback (most recent call last):
  File "/opt/software/pantagruel/scripts/add_region_feature2prokkaGFF.py", line 118, in <module>
    straininfo = dstraininfo[loctagprefix]
KeyError: 'chr'

Thanks

Matt

flass commented 4 years ago

Hi again Matt,

Ok tso the issue is that the script is trying to extract the locus_tag prefix from the ##sequence-region line of your input GFF file 100_02272018.ptg.gff

My script expects all such line to be of the type: ##sequence-region C_100_0 1 24265 from which the script would successfully extract the string C_100 and would be able to match it with the locus_tag_prefix field in the strain_info_database.txt file.

but in your file the first line appears to be: ##sequence-region chr 1 4898298 And so it fails.

Is your input GFF file generated by prokka? and if yes, by which version?

Florent

mattbawn commented 4 years ago

Yes it was prokka-1.11. looking into a .gff

the top all have the form:

##gff-version 3
##sequence-region chr 1 4893874
##sequence-region C_1_0 1 3130
##sequence-region C_1_1 1 59147
##sequence-region C_1_2 1 1473
##sequence-region C_1_3 1 9821
##sequence-region C_1_4 1 3587
##sequence-region C_1_5 1 999
##sequence-region C_1_6 1 2435
##sequence-region C_1_7 1 1054
##sequence-region C_1_8 1 1832
##sequence-region C_1_9 1 560
##sequence-region C_1_10 1 1038
##sequence-region C_1_11 1 4382
##sequence-region C_1_12 1 1022
##sequence-region C_1_13 1 529
##sequence-region C_1_14 1 565
##sequence-region C_1_15 1 405
##sequence-region C_1_16 1 28374
##sequence-region C_1_17 1 674
##sequence-region C_1_18 1 447
##sequence-region C_1_19 1 1700
##sequence-region C_1_20 1 1210
##sequence-region C_1_21 1 437
##sequence-region C_1_22 1 617
##sequence-region C_1_23 1 1191
##sequence-region C_1_24 1 815

I'll take a look at renaming.

Thanks

Matt

flass commented 4 years ago

Hi Matt,

OK so it must be a legacy behaviour of older prokka versions. I will make something that can deal with this as there is no real constraint that prevent me to do so - better be flexible when one can be.

flass commented 4 years ago

I allowed for this case in f42134d. Please try and let me know if this works.

in general I would advise to use a more recent prokka, especially if it is to be used within pantagruel I can only ensure support for current v1.14.

mattbawn commented 4 years ago

Hi Florent,

I believe these regions in the GFF are contig names rather than locus_tags

##gff-version 3
##sequence-region chr 1 4881495
##sequence-region C_134_0 1 1851
##sequence-region C_134_1 1 17145
##sequence-region C_134_2 1 2756
##sequence-region C_134_3 1 6270
##sequence-region C_134_4 1 4372
##sequence-region C_134_5 1 1700
##sequence-region C_134_6 1 350
##sequence-region C_134_7 1 447
##sequence-region C_134_8 1 437
##sequence-region C_134_9 1 1071
##sequence-region C_134_10 1 576
##sequence-region C_134_11 1 1405
##sequence-region C_134_12 1 1613
##sequence-region C_134_13 1 391
##sequence-region C_134_14 1 674
##sequence-region C_134_15 1 1308
##sequence-region C_134_16 1 560
##sequence-region C_134_17 1 349
##sequence-region C_134_18 1 704
##sequence-region C_134_19 1 1949
##sequence-region C_134_20 1 332
##sequence-region C_134_21 1 547
##sequence-region C_134_22 1 308
##sequence-region C_134_23 1 531
##sequence-region C_134_24 1 617
##sequence-region C_134_25 1 343
chr Prodigal:2.6    CDS 359 1045    .   -   0   ID=C134_00001;eC_number=2.1.1.207;Name=trmL_1;gene=trmL_1;inference=ab initio prediction:Prodigal:2.6,protein motif:HAMAP:MF_01885;locus_tag=C134_00001;product=tRNA (cytidine(34)-2'-O)-methyltransferase
chr Prodigal:2.6    CDS 1681    2397    .   +   0   ID=C134_00002;Name=arcA_1;db_xref=COG:COG0745;gene=arcA_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P0A9Q1;locus_tag=C134_00002;product=Aerobic respiration control protein ArcA

I have reannotaed with prokka-1.13.3 and get the same error. Now we're installing the latest version on the cluster and I'll give that a go but I was wondering if you were expecting one contig chromosomes?

Thanks,

Matt

flass commented 4 years ago

Hi Matt,

Okay, so it seems that it is not an old feature of prokka, only one I have not come across before! this is possibly because you have closed contig assemblies and that it is documented somewhere - which I hadn't in my test datasets. You should surely not be penalised for that!

While those are the contig names, prokka names them following the value passed to its --locustag argument, i.e. the locus_tag_prefix field in the strain_info_database.txt file. Really, this is a dirty hack on my side to get this information back from the annotation files, but I should find a way proceed in a cleaner way. The fix I provided in f42134d should work in your case but may fail if the chromosome is the only contig present (this may happen!). I'll update on the final fix here. Let me know in the meantime how running with the last pantagruel version goes.

NB: as explained in #27, you don't need to ask your admins to install a new version of the pipeline every time there is a code update: you can clone the git repository into you home folder where you have r/w permissions, and use the scripts from there. if you do that, please make sure to run the following commands on the first time:

cd /where/to/clone/your/own/pantagruel/
git clone https://github.com/flass/pantagruel.git
cd pantagruel/
git submodule init
git submodule update 

and then for updates to run the couple commands:

cd /where/to/clone/your/own/pantagruel/pantagruel/
git pull
git submodule update 
mattbawn commented 4 years ago

Hi Florent.

Thanks for your help. I may be missing something but I thought in the above case the locus_tag is C134 i.e: 'locus_tag=C134_00001', and the chromosome tag is: C_134. so different. I think this is confusing Pantagruel, could it not just ignore the ## lines?

Thanks,

Matt

flass commented 4 years ago

I realise that your genomes have not been annotated with prokka as part of Pantagruel, or have they? That'd be why they don't behave as I expect them to do... anyway clearly I need to do something else for that particular part. I should be back with it soon.

flass commented 4 years ago

This issue should now be fixed with commit 556b44c.

flass commented 4 years ago

also I found out where the top part of errors came from:

readlink: missing operand
Try 'readlink --help' for more information.
dirname: missing operand
Try 'dirname --help' for more information.
dirname: missing operand
Try 'dirname --help' for more information.

These unspecific errors occur in make_prokka_ref_genus_db.sh when prokka is not installed on the server. These errors are irrelevant when you provide you own annotations of the custom assemblies. Now they come as a specific warning first, and as a specific error when you really do need prokka (commit 4867c04).