Status: Closed (closed by mattbawn 4 years ago)
Hi Matt,
I'm glad you came back to Pantagruel, and sad that it is failing again at this stage, which should be stable :-/ Can you please indicate which build of the Docker image you are using? I'll try and have a look today. Florent
Hi Florent,
It is the latest version on dockerhub.
Thanks
Matt
Status | Tag | Commit Source | Created | Last Updated |
---|---|---|---|---|
Success | master-latest | ee0de31 | a month ago | a month ago |
Thank you Matt for the reference.
It seems that it is more a failure from another script, add_taxid_feature2prokkaGBK.py, which is supposed to add the tax_id information to the GenBank annotation file.
This is part of the annotation pipeline when custom assemblies are provided to Pantagruel and it has to run Prokka internally, or when custom annotations are provided (as they're assumed to come out of Prokka), because Prokka does not include a qualifier for the tax_id when it generates the GenBank flat file.
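To illustrate the idea (this is not the code of add_taxid_feature2prokkaGBK.py, only a minimal sketch), adding a missing taxon qualifier to the source feature of a GenBank flat file could look like the following. The function name add_taxon_qualifier is hypothetical, and the sketch assumes the standard layout in which qualifier lines are indented by 21 spaces.

```python
QUALIFIER_INDENT = " " * 21  # assumed standard GenBank qualifier indentation


def add_taxon_qualifier(lines, taxid):
    """Return a copy of `lines` (no trailing newlines) in which every source
    feature lacking a /db_xref="taxon:..." qualifier gets one appended."""
    new_qualifier = f'{QUALIFIER_INDENT}/db_xref="taxon:{int(taxid)}"'
    out = []
    in_source = False
    has_taxon = False
    for line in lines:
        # A line that is not a qualifier line closes the current source feature.
        if in_source and not line.startswith(QUALIFIER_INDENT):
            if not has_taxon:
                out.append(new_qualifier)
            in_source = False
        if line.startswith("     source "):
            in_source = True
            has_taxon = False
        elif in_source and line.strip().startswith('/db_xref="taxon:'):
            has_taxon = True
        out.append(line)
    # Handle a source feature that runs to the end of the input.
    if in_source and not has_taxon:
        out.append(new_qualifier)
    return out
```

Passing the taxid through int() means a non-numeric value is rejected at this point rather than being written into the file.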
So just to check, what kind of genome assembly did you provide as input? Was it RefSeq-formatted assemblies (using the -A option) or custom assemblies (using the -a option)?
In the latter case, did you provide already annotated assemblies (having a folder called annotation in the custom genome folder passed to -a) or were your genomes annotated within Pantagruel?
It would be helpful to have the command you used for the init and later (00, etc.) pipeline steps.
Can you please also have a look at one of the *_genomic.gbff.gz files in one of the output reformatted assembly folders, for instance this one:
# I'm assuming there should be one named like this, otherwise please find one equivalent that comes from one of your custom assemblies
00.input_data/assemblies/ragout_1/ragout_1.1_Salmonella_enterica_ERR024391/ragout_1.1_Salmonella_enterica_ERR024391_genomic.gbff.gz
and check if there is a line starting with /db_xref="taxon: ; it should be /db_xref="taxon:90371".
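If it is easier than reading the file by eye, the check described above can be sketched in a few lines of Python. The function name find_taxon_xrefs is made up for this example; it only relies on the standard gzip module.

```python
import gzip


def find_taxon_xrefs(gbff_gz_path):
    """Return every /db_xref="taxon:..." line (stripped of indentation)
    found in a gzipped GenBank flat file."""
    hits = []
    with gzip.open(gbff_gz_path, "rt") as handle:
        for line in handle:
            stripped = line.strip()
            if stripped.startswith('/db_xref="taxon:'):
                hits.append(stripped)
    return hits
```

An empty list would mean that the tax_id qualifier was never added to that assembly.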
Cheers,
Florent
Hi Florent,
As always, thank you for the reply.
I am running custom pre-annotated assemblies. My init command was:
pantagruel -d database -r . -a ../genomes/ -T /nbi/Research-Groups/IFR/Rob-Kingsley/R134_Pantagruel/Taxonomy/ncbi-taxonomy-2019-11-07 -I matt.bawn@earlham.ac.uk init
There is an annotation folder, and I used PROKKA to annotate:
ls ../genomes/
annotation contigs strain_infos_database.txt
To run the pipeline I did:
pantagruel -i database/environ_pantagruel_database.sh all
The reformatted assembly annotation yielded:
zcat ragout_1.1_PROKKA_11122019_genomic.gbff.gz
LOCUS chr 4893874 bp DNA linear 12-NOV-2019
DEFINITION Genus species strain strain.
ACCESSION
VERSION
KEYWORDS .
SOURCE Genus species
ORGANISM Genus species
Unclassified.
COMMENT Annotated using prokka 1.13.3 from
https://github.com/tseemann/prokka.
FEATURES Location/Qualifiers
source 1..4893874
/organism="Genus species"
/mol_type="genomic DNA"
/strain="strain"
/db_xref="taxon:90371"
CDS 337..2799
Thanks,
Matt
Hi Matt,
sorry for the delay in response.
I was a bit at a loss as to what was going on, as the GenBank file above looks OK, but I now realise that the GenBank file for the assembly ragout1.1 seems to have been processed fine; it's the processing of assembly ragout100.1 that leads to the bug. Can you please have a look at this one too?
Also, I noticed that you have placeholder information in the DEFINITION field and in the source feature (qualifiers /organism and /strain); I'd suggest you correct that in the input files, as it will otherwise be carried over into all Pantagruel result files.
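A quick way to spot such placeholders across many input files might be a small check like the one below. It is only a sketch: the function name is hypothetical, and the placeholder values ("Genus species", "strain") are simply the ones visible in the header posted earlier in this thread.

```python
import re

# Placeholder patterns taken from the GenBank header shown in this thread.
PLACEHOLDER_PATTERNS = {
    "DEFINITION": re.compile(r"^DEFINITION\s+Genus species strain strain\.", re.M),
    "/organism": re.compile(r'/organism="Genus species"'),
    "/strain": re.compile(r'/strain="strain"'),
}


def find_placeholder_fields(gbk_text):
    """Return the names of fields that still hold placeholder values."""
    return [name for name, pattern in PLACEHOLDER_PATTERNS.items()
            if pattern.search(gbk_text)]
```

Any non-empty result points at a field that should be corrected in the input files before running the pipeline.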
Hi Florent,
Thanks and sorry for my delay. I have been reannotating my genomes. I now have the following:
Traceback (most recent call last):
File "/pantagruel/scripts/allgenome_gff2db.py", line 266, in <module>
main()
File "/pantagruel/scripts/allgenome_gff2db.py", line 260, in main
parseAssemb(dirassemb, dfout, dtaxid2sciname=dtaxid2sciname, dmergedtaxid=dmergedtaxid, didentseq=didentseq)
File "/pantagruel/scripts/allgenome_gff2db.py", line 174, in parseAssemb
dgeneloctag, dgenenchild = indexRegionsAndGenes(fgff, dfout, assacc, assname, dtaxid2sciname=dtaxid2sciname, dmergedtaxid=dmergedtaxid)
File "/pantagruel/scripts/allgenome_gff2db.py", line 88, in indexRegionsAndGenes
tid = int(taxid)
ValueError: invalid literal for int() with base 10: 'Salmonella'
[2020-09-14 13:18:10]
ERROR: inconsistent propagation of the protein dataset:
present in aligned fasta proteome / absent in info table generated from input GFF:
Can you let me know what may be causing this please?
Thanks again,
Matt
Hi Matt,
no worries - I was off work myself anyway!
Your current error suggests that you have the string 'Salmonella' in the field meant for the NCBI Taxon ID (I would think something like /db_xref="taxon:Salmonella") in your new GenBank flat file.
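To see why this fails, and to find the offending files before running the pipeline, a check along these lines could help. This is a sketch under the assumption that the taxon value is read from the /db_xref qualifier; check_taxon_ids is a hypothetical helper, not part of Pantagruel.

```python
import re

# Captures whatever sits between 'taxon:' and the closing quote.
TAXON_XREF = re.compile(r'/db_xref="taxon:([^"]*)"')


def check_taxon_ids(gbk_text):
    """Split the taxon values found in the text into (valid_ids, invalid_values)."""
    valid, invalid = [], []
    for value in TAXON_XREF.findall(gbk_text):
        try:
            # Same conversion as the failing line in the traceback: int(taxid)
            valid.append(int(value))
        except ValueError:
            invalid.append(value)
    return valid, invalid
```

A value such as 'Salmonella' ends up in the second list, which corresponds to the ValueError raised by int() in the traceback.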
How did you annotate the genomes? Directly with Prokka or with Pantagruel (as a wrapper for Prokka)?
Can you please again provide the header of the GenBank flat file for the genome that gets the error?
Thanks,
Florent
The error was due to incorrect input files.
Hi Florent,
I have a new install from the docker image.
I was running all tasks and got the following:
When I look in the log I see:
Just in case, my strain_infos_database.txt looks like:
Any thoughts?
Thanks,
Matt