extracting metadata from GenBank dictionary error

mattbawn commented 4 years ago

Hi Florent,

I have a new install from the docker image.

I was running all tasks and got the following:

ERROR: something went wrong when extracting metadata from GenBank flat files; check errors in '/nbi/Research-Groups/IFR/Rob-Kingsley/R134_Pantagruel/New_Install/database/logs/extract_metadata_from_gbff.log'
ERROR: Pantagruel pipeline task 0: failed.

When I look in the log I see:

parsing genome annotation from GenBank flat files...
ragout_1.1 ragout_10.1 ragout_100.1 ragout_101.1 ragout_102.1 ragout_103.1 ragout_104.1 ragout_105.1 ragout_106.1 ragout_107.1 ragout_108.1 ragout_109.1 ragout_11.1 ragout_110.1 ragout_111.1 ragout_112.1 ragout_113.1 ragout_114.1 ragout_115.1 ragout_116.1 ragout_117.1 ragout_118.1 ragout_119.1 ragout_12.1 ragout_120.1 ragout_121.1 ragout_122.1 ragout_123.1 ragout_124.1 ragout_125.1 ragout_126.1 ragout_127.1 ragout_128.1 ragout_129.1 ragout_13.1 ragout_130.1 ragout_131.1 ragout_132.1 ragout_133.1 ragout_134.1 ragout_14.1 ragout_15.1 ragout_16.1 ragout_17.1 ragout_18.1 ragout_19.1 ragout_2.1 ragout_20.1 ragout_21.1 ragout_22.1 ragout_23.1 ragout_24.1 ragout_25.1 ragout_26.1 ragout_27.1 ragout_28.1 ragout_29.1 ragout_3.1 ragout_30.1 ragout_31.1 ragout_32.1 ragout_33.1 ragout_34.1 ragout_35.1 ragout_36.1 ragout_37.1 ragout_38.1 ragout_39.1 ragout_4.1 ragout_40.1 ragout_41.1 ragout_42.1 ragout_43.1 ragout_44.1 ragout_45.1 ragout_46.1 ragout_47.1 ragout_48.1 ragout_49.1 ragout_5.1 ragout_50.1 ragout_51.1 ragout_52.1 ragout_53.1 ragout_54.1 ragout_55.1 ragout_56.1 ragout_57.1 ragout_58.1 ragout_59.1 ragout_6.1 ragout_60.1 ragout_61.1 ragout_62.1 ragout_63.1 ragout_64.1 ragout_65.1 ragout_66.1 ragout_67.1 ragout_68.1 ragout_69.1 ragout_7.1 ragout_70.1 ragout_71.1 ragout_72.1 ragout_73.1 ragout_74.1 ragout_75.1 ragout_76.1 ragout_77.1 ragout_78.1 ragout_79.1 ragout_8.1 ragout_80.1 ragout_81.1 ragout_82.1 ragout_83.1 ragout_84.1 ragout_85.1 ragout_86.1 ragout_87.1 ragout_88.1 ragout_89.1 ragout_9.1 ragout_90.1 ragout_91.1 ragout_92.1 ragout_93.1 ragout_94.1 ragout_95.1 ragout_96.1 ragout_97.1 ragout_98.1 ragout_99.1  ...done
ragout_1.1
ragout_1.1; Genus species; "strain"; ; ; 
ragout_10.1
ragout_10.1; Genus species; "strain"; ; ; 
ragout_100.1
Traceback (most recent call last):
  File "/pantagruel/scripts/extract_metadata_from_gbff.py", line 386, in <module>
    main(nfldirassemb, dirassemblyinfo, output, defspename, nfdhandmetaraw, nfdhandmetacur, nfdhanddbxref, verbose=verbose)
  File "/pantagruel/scripts/extract_metadata_from_gbff.py", line 231, in main
    taxid = dict(dbxref.split(':') for dbxref in dmetadata.get('db_xref',{}).get(assemb,na).strip(' "').split(';'))['taxon']
ValueError: dictionary update sequence element #0 has length 1; 2 is required

Just in case, my strain_infos_database.txt looks like:

assembly_id genus   species strain  taxid   locus_tag_prefix
ragout_1    Salmonella  enterica    ERR024387   90371   C1
ragout_2    Salmonella  enterica    ERR024388   90371   C2
ragout_3    Salmonella  enterica    ERR024389   90371   C3
ragout_4    Salmonella  enterica    ERR024391   90371   C4
ragout_5    Salmonella  enterica    ERR024392   90371   C5
ragout_6    Salmonella  enterica    ERR024394   90371   C6
ragout_7    Salmonella  enterica    ERR024395   90371   C7
ragout_8    Salmonella  enterica    ERR024396   90371   C8
ragout_9    Salmonella  enterica    ERR024397   90371   C9
ragout_10   Salmonella  enterica    ERR024398   90371   C10
ragout_11   Salmonella  enterica    ERR024400   90371   C11
ragout_12   Salmonella  enterica    ERR024401   90371   C12
ragout_13   Salmonella  enterica    ERR024402   90371   C13
ragout_14   Salmonella  enterica    ERR024404   90371   C14

Any thoughts?

Thanks,

Matt

flass commented 4 years ago

Hi Matt,

I'm glad you came back to Pantagruel, and sad that it is failing again at this stage, which should be stable :-/ Can you please indicate which build of the docker image you are using? I'll try and have a look today. Florent

mattbawn commented 4 years ago

Hi Florent,

It is the latest version on dockerhub.

Thanks

Matt

mattbawn commented 4 years ago

Status	Tag	Commit Source	Created	Last Updated
Success	master-latest	ee0de31	a month ago	a month ago

flass commented 4 years ago

Thank you Matt for the reference. It seems to be more of a failure in another script, add_taxid_feature2prokkaGBK.py, which is supposed to add the tax_id information to the GenBank annotation file. This is part of the annotation pipeline when custom assemblies are provided to Pantagruel and it has to run Prokka internally, or when custom annotations are provided (as they're assumed to come out of Prokka), because Prokka does not include a qualifier for the tax_id when it generates the GenBank flat file.
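
To illustrate why a missing tax_id qualifier surfaces as the ValueError in the traceback above: the failing line builds a dict from 'key:value' items, so any item without a ':' yields a one-element list that dict() rejects. The snippet below is a minimal sketch, not the Pantagruel code. `naive_parse` mimics the pattern on line 231, `parse_dbxref` is a hypothetical, more tolerant alternative, and the fallback value 'NA' is an assumption about what the `na` variable holds.

```python
def naive_parse(raw):
    """Mimic the pattern of the failing line: every item must contain ':'."""
    return dict(item.split(':') for item in raw.strip(' "').split(';'))


def parse_dbxref(raw):
    """Hypothetical tolerant parser: items without ':' are skipped."""
    pairs = {}
    for item in raw.strip(' "').split(';'):
        key, sep, value = item.partition(':')
        if sep:  # only keep well-formed 'key:value' items
            pairs[key.strip()] = value.strip()
    return pairs


if __name__ == "__main__":
    print(parse_dbxref('"taxon:90371"'))   # well-formed qualifier value
    print(parse_dbxref('NA'))              # assumed placeholder, no ':'
    try:
        naive_parse('NA')
    except ValueError as err:
        print("naive parse failed:", err)
```

With the tolerant version, a lookup such as `parse_dbxref(raw).get('taxon')` would return None for an assembly lacking the qualifier, which makes it easier to report which assembly is affected instead of stopping on an opaque error.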

So just to check, what kind of genome assembly did you provide as input? Was it RefSeq-formatted assemblies (using the -A option) or custom assemblies (using the -a option)? In the latter case, did you provide already annotated assemblies (having a folder called annotation in the custom genome folder passed to -a) or were your genomes annotated within Pantagruel?

It would be helpful to have the command you used for the init and later (00, etc.) pipeline steps.

Can you please also have a look at one of these *_genomic.gbff.gz files in one of the output reformatted assembly folders, for instance this one:

# I'm assuming there should be one named like this, otherwise please find one equivalent that comes from one of your custom assemblies
00.input_data/assemblies/ragout_1/ragout_1.1_Salmonella_enterica_ERR024391/ragout_1.1_Salmonella_enterica_ERR024391_genomic.gbff.gz

and check if there is a line starting with /db_xref="taxon: in it; the full line should read /db_xref="taxon:90371".
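
If it helps, the same check can be scripted over several files. This is only a sketch using the Python standard library; `find_taxon_xrefs` is a hypothetical helper name and is not part of Pantagruel.

```python
import gzip
import re

# Matches the qualifier line, e.g. /db_xref="taxon:90371"
TAXON_RE = re.compile(r'/db_xref="taxon:([^"]*)"')


def find_taxon_xrefs(gbff_gz_path):
    """Return every taxon value found in a gzipped GenBank flat file."""
    found = []
    with gzip.open(gbff_gz_path, "rt") as handle:
        for line in handle:
            match = TAXON_RE.search(line)
            if match:
                found.append(match.group(1))
    return found
```

An empty list for a given file would mean the tax_id qualifier was never added to it; a non-numeric value in the list would point to a problem in the input metadata.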

Cheers,

Florent

mattbawn commented 4 years ago

Hi Florent,

As always, thank you for the reply.

I am running custom pre-annotated assemblies. My init command was:

pantagruel -d database -r . -a ../genomes/ -T /nbi/Research-Groups/IFR/Rob-Kingsley/R134_Pantagruel/Taxonomy/ncbi-taxonomy-2019-11-07 -I matt.bawn@earlham.ac.uk init

There is an annotation folder, and I used PROKKA to annotate:

ls ../genomes/
annotation                   contigs                strain_infos_database.txt

To run the pipeline I did:

pantagruel -i database/environ_pantagruel_database.sh all

The reformatted assembly annotation yielded:

zcat ragout_1.1_PROKKA_11122019_genomic.gbff.gz 
LOCUS       chr                  4893874 bp    DNA     linear       12-NOV-2019
DEFINITION  Genus species strain strain.
ACCESSION   
VERSION
KEYWORDS    .
SOURCE      Genus species
  ORGANISM  Genus species
            Unclassified.
COMMENT     Annotated using prokka 1.13.3 from
            https://github.com/tseemann/prokka.
FEATURES             Location/Qualifiers
     source          1..4893874
                     /organism="Genus species"
                     /mol_type="genomic DNA"
                     /strain="strain"
                     /db_xref="taxon:90371"
     CDS             337..2799

Thanks,

Matt

flass commented 4 years ago

Hi Matt,

sorry for the delay in response.

I was a bit at a loss as to what was going on, as the GenBank file above looks OK, but I now realise that the GenBank file for assembly ragout_1.1 seems to have been processed fine; it's the processing of assembly ragout_100.1 that triggers the bug. Can you please have a look at that one too?

Also, I noticed that you have placeholder information in the DEFINITION field and in the source feature (qualifiers /organism and /strain); I'd suggest you correct that in the input files, as it will otherwise be carried over into all Pantagruel result files.

mattbawn commented 4 years ago

Hi Florent,

Thanks and sorry for my delay. I have been reannotating my genomes. I now have the following:


Traceback (most recent call last):
  File "/pantagruel/scripts/allgenome_gff2db.py", line 266, in <module>
    main()
  File "/pantagruel/scripts/allgenome_gff2db.py", line 260, in main
    parseAssemb(dirassemb, dfout, dtaxid2sciname=dtaxid2sciname, dmergedtaxid=dmergedtaxid, didentseq=didentseq)
  File "/pantagruel/scripts/allgenome_gff2db.py", line 174, in parseAssemb
    dgeneloctag, dgenenchild = indexRegionsAndGenes(fgff, dfout, assacc, assname, dtaxid2sciname=dtaxid2sciname, dmergedtaxid=dmergedtaxid)
  File "/pantagruel/scripts/allgenome_gff2db.py", line 88, in indexRegionsAndGenes
    tid = int(taxid)
ValueError: invalid literal for int() with base 10: 'Salmonella'
[2020-09-14 13:18:10]
ERROR: inconsistent propagation of the protein dataset:
present in aligned fasta proteome / absent in info table generated from input GFF:

Can you let me know what may be causing this please?

Thanks again,

Matt

flass commented 4 years ago

Hi Matt,

no worries - I was off work myself anyway! Your current error suggests that you have the string 'Salmonella' in the field meant for the NCBI Taxon ID (I would think something like /db_xref="taxon:Salmonella") in your new GenBank flat file. How did you annotate the genomes? Directly with Prokka, or with Pantagruel (as a wrapper for Prokka)? Can you please again provide the header of the GenBank flat file for the genome that triggers the error? Thanks, Florent
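
One possible way for a genus name to end up in the taxon field is a shifted or missing column in the strain information table; the thread does not confirm that this is what happened, so the following is only a sketch of a pre-flight check. It assumes the table is tab-delimited with the header shown earlier (assembly_id, genus, species, strain, taxid, locus_tag_prefix); `find_bad_taxids` is a hypothetical helper, not a Pantagruel function.

```python
import csv
import io


def find_bad_taxids(table_text):
    """Return (assembly_id, taxid) pairs whose taxid is not made of digits only."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    bad = []
    for row in reader:
        # A short row leaves the taxid column as None, treated here as empty
        taxid = (row.get("taxid") or "").strip()
        if not taxid.isdigit():
            bad.append((row.get("assembly_id"), taxid))
    return bad
```

Running it on the content of strain_infos_database.txt before the init step would list the rows for which int(taxid) would later fail, as in the 'Salmonella' error above.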

flass commented 4 years ago

The error was due to incorrect input files.

flass / pantagruel

extracting metadata from GenBank dictionary error #41