flass / pantagruel

a pipeline for reconciliation of phylogenetic histories within a bacterial pangenome
GNU General Public License v3.0

Cannot access .../*.gbk files after running prokka for custom genomes #27

Closed megaptera-helvetiae closed 4 years ago

megaptera-helvetiae commented 4 years ago

Hi Florent, this is a more specific issue.

I have manually downloaded the Taxonomy files as described in issue #24 for Matt Bawn (https://github.com/flass/pantagruel/issues/24#issuecomment-550270892). I also had to manually install your module "tree2".

Then, I ran the following commands:

pantagruel -d database -r root_folder -a user_genomes -T NCBI/Taxonomy_2019-11-12/ init

# --> no error message!
pantagruel -i /root_folder/database/environ_pantagruel_test7.sh fetch

# --> some errors!

[wilkins@gorilla Panta]$ pantagruel -i /scratch/clamchatka/Panta/test7/environ_pantagruel_test7.sh fetch
This is Pantagruel pipeline version 9531df2e57fd032f1ff8e11b79091953833f978e using source code from repository '/apps/pantagruel/9531df2e57fd032f1ff8e11b79091953833f978e'
# will run tasks: 0
[2019-11-13 04:47:23] Pantagrel pipeline task 0: fetch public genome data from NCBI sequence databases and annotate private genomes.
Create new task folder '/scratch/clamchatka/Panta/test7/00.input_data'
[2019-11-13 04:47:24] extract assembly data from folder '/scratch/clamchatka/Panta/user_genomes'
found 14 contig files (raw genome assemblies) in /scratch/clamchatka/Panta/user_genomes/contigs/
[2019-11-13 04:47:24] Ctena_galapagana_StHelenaBay_001_SYM
will annotate contigs in '/scratch/clamchatka/Panta/user_genomes/contigs/Ctena_galapagana_StHelenaBay_001_SYM.fasta'
[2019-11-13 04:47:24]
### assembly: Ctena_galapagana_StHelenaBay_001_SYM; contig files from: /scratch/clamchatka/Panta/user_genomes/contigs/Ctena_galapagana_StHelenaBay_001_SYM.fasta
running Prokka...
done.
[2019-11-13 04:49:25]
fix annotation to integrate region information into GFF files
fix annotation to integrate taxid information into GBK files
ls: cannot access /scratch/clamchatka/Panta/test7/00.input_data/annotation/Ctena_galapagana_StHelenaBay_001_SYM/*.gbk: No such file or directory
done.

[2019-11-13 05:18:29]
fix annotation to integrate region information into GFF files
fix annotation to integrate taxid information into GBK files
ls: cannot access /scratch/clamchatka/Panta/test7/00.input_data/annotation/Ctena_mexicana_StHelenaBay_014_SYM/*.gbk: No such file or directory
done.
will create GenBank-like assembly folders for user-provided genomes
gzip: /scratch/clamchatka/Panta/test7/00.input_data/annotation/Ctena_galapagana_StHelenaBay_001_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test7/00.input_data/annotation/Ctena_imbricatula_STRI_051_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test7/00.input_data/annotation/Ctena_imbricatula_STRI_052_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test7/00.input_data/annotation/Ctena_imbricatula_STRI_065_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test7/00.input_data/annotation/Ctena_imbricatula_STRI_068_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test7/00.input_data/annotation/Ctena_imbricatula_STRI_070_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test7/00.input_data/annotation/Ctena_imbricatula_STRI_073_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test7/00.input_data/annotation/Ctena_imbricatula_STRI_074_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test7/00.input_data/annotation/Ctena_imbricatula_STRI_094_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test7/00.input_data/annotation/Ctena_mexicana_StHelenaBay_011_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test7/00.input_data/annotation/Ctena_mexicana_StHelenaBay_012_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test7/00.input_data/annotation/Ctena_mexicana_StHelenaBay_013_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test7/00.input_data/annotation/Ctena_mexicana_StHelenaBay_014_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test7/00.input_data/annotation/Ctena_mexicana_StHelenaBay_014_SYM2/ is a directory -- ignored
Traceback (most recent call last):
  File "/apps/pantagruel/9531df2e57fd032f1ff8e11b79091953833f978e/scripts/extract_metadata_from_gbff.py", line 366, in <module>
    main(nfldirassemb, dirassemblyinfo, output, defspename, nfdhandmetaraw, nfdhandmetacur, nfdhanddbxref)
  File "/apps/pantagruel/9531df2e57fd032f1ff8e11b79091953833f978e/scripts/extract_metadata_from_gbff.py", line 72, in main
    lassemb = [parse_assembly_name(assembname, reass=reass) for assembname in lassembname]
  File "/apps/pantagruel/9531df2e57fd032f1ff8e11b79091953833f978e/scripts/extract_metadata_from_gbff.py", line 60, in parse_assembly_name
    geass = seass.groups()
AttributeError: 'NoneType' object has no attribute 'groups'
[2019-11-13 05:20:08]
Pantagrel pipeline task 0: complete.

Is this a problem with permissions? The file it claims missing is there!

See here:

[wilkins@gorilla Panta]$ cd /scratch/clamchatka/Panta/test7/00.input_data/annotation/Ctena_mexicana_StHelenaBay_014_SYM2/
[wilkins@gorilla Ctena_mexicana_StHelenaBay_014_SYM2]$ ls -l
total 66506
-rw-r--r--. 1 wilkins login      115 Nov 13 05:16 Ctena_mexicana_Hele014b.ecn
-rw-r--r--. 1 wilkins login  1840709 Nov 13 05:16 Ctena_mexicana_Hele014b.err
-rw-r--r--. 1 wilkins login  1565273 Nov 13 05:16 Ctena_mexicana_Hele014b.faa
-rw-r--r--. 1 wilkins login  4256068 Nov 13 05:16 Ctena_mexicana_Hele014b.ffn
-rw-r--r--. 1 wilkins login     2947 Nov 13 05:16 Ctena_mexicana_Hele014b.fixedproducts
-rw-r--r--. 1 wilkins login  4884188 Nov 13 05:14 Ctena_mexicana_Hele014b.fna
-rw-r--r--. 1 wilkins login  4920543 Nov 13 05:16 Ctena_mexicana_Hele014b.fsa
-rw-r--r--. 1 wilkins login 10889519 Nov 13 05:16 Ctena_mexicana_Hele014b.gbf
-rw-r--r--. 1 wilkins login  7008613 Nov 13 05:16 Ctena_mexicana_Hele014b.gff
-rw-r--r--. 1 wilkins login    64441 Nov 13 05:16 Ctena_mexicana_Hele014b.log
-rw-r--r--. 1 wilkins login 10919925 Nov 13 05:16 Ctena_mexicana_Hele014b.ptg.gbk
-rw-r--r--. 1 wilkins login 19900768 Nov 13 05:16 Ctena_mexicana_Hele014b.sqn
-rw-r--r--. 1 wilkins login  1362720 Nov 13 05:16 Ctena_mexicana_Hele014b.tbl
-rw-r--r--. 1 wilkins login   477854 Nov 13 05:16 Ctena_mexicana_Hele014b.tsv
-rw-r--r--. 1 wilkins login      102 Nov 13 05:16 Ctena_mexicana_Hele014b.txt
-rw-r--r--. 1 wilkins login     3105 Nov 13 05:16 Ctena_mexicana_Hele014b.val
-rw-r--r--. 1 wilkins login      132 Nov 13 05:16 errorsummary.val
pantagruel -i /root_folder/database/environ_pantagruel_test7.sh homologous

# --> lots of errors!

[wilkins@gorilla Panta]$ pantagruel -i /scratch/clamchatka/Panta/test7/environ_pantagruel_test7.sh homologous
This is Pantagruel pipeline version 9531df2e57fd032f1ff8e11b79091953833f978e using source code from repository '/apps/pantagruel/9531df2e57fd032f1ff8e11b79091953833f978e'
# will run tasks: 1
[2019-11-13 05:23:00] Pantagrel pipeline task 1: classify protein sequences into homologous families.
Create new task folder '/scratch/clamchatka/Panta/test7/01.seqdb'
[2019-11-13 05:23:01] -- 14 proteomes in dataset
[2019-11-13 05:23:01] -- 56061 proteins in dataset
[2019-11-13 05:23:02] -- 56061 non-redundant protein ids in dataset
                      -- Perform first protein clustering step (100% prot identity clustering with clusthash algorithm)
                      -- First protein clustering step complete: 
Writing results 0h 0m 0s 14ms
Time for merging to all_proteomes.clusthashdb_minseqid100_clust: 0h 0m 0s 15ms
nfin = '/scratch/clamchatka/Panta/test7/01.seqdb/all_proteomes.clusthashdb_minseqid100_clusters' ; famprefix = 'NRPROT' ; dirout = '/scratch/clamchatka/Panta/test7/01.seqdb/all_proteomes.clusthashdb_minseqid100_families' ; padlen = 6 ; writeseq = False ; discardsingle = False
listed 34956 redundant sequences in dataset
generated hash index
parsing redundant sequence fasta
filtered 21105 non-redundant sequences
parse NCBI Taxonomy merged taxon ids from '/scratch/clamchatka/Panta/NCBI/Taxonomy_2019-11-12/merged.dmp'
parse NCBI Taxonomy taxon names from '/scratch/clamchatka/Panta/NCBI/Taxonomy_2019-11-12/names.dmp'
parse redundant protein names from '/scratch/clamchatka/Panta/test7/01.seqdb/all_proteomes.identicals.list'
parse assembly '/scratch/clamchatka/Panta/test7/00.input_data/assemblies/Ctena_galapagana_StHelenaBay_001_SYM.1_'
Traceback (most recent call last):
  File "/apps/pantagruel/9531df2e57fd032f1ff8e11b79091953833f978e/scripts/allgenome_gff2db.py", line 252, in <module>
    main()
  File "/apps/pantagruel/9531df2e57fd032f1ff8e11b79091953833f978e/scripts/allgenome_gff2db.py", line 246, in main
    parseAssemb(dirassemb, dfout, dtaxid2sciname=dtaxid2sciname, dmergedtaxid=dmergedtaxid, didentseq=didentseq)
  File "/apps/pantagruel/9531df2e57fd032f1ff8e11b79091953833f978e/scripts/allgenome_gff2db.py", line 156, in parseAssemb
    assacc, assname = assembsearch.groups()
AttributeError: 'NoneType' object has no attribute 'groups'
[2019-11-13 05:23:17]
ERROR: inconsistent propagation of the protein dataset:
present in aligned fasta proteome / absent in info table generated from input GFF:
>
> CteimSTRI094_00001
> CteimSTRI094_00015
.
.
> CteimSTRI094_04157
> CteimSTRI094_04161
> CteimSTRI094_04165
> CteimSTRI094_04170
> CteimSTRI094_04171
> CteimSTRI094_04188
present in info table generated from input GFF / absent in aligned fasta proteome:
ERROR: Pantagrel pipeline task 1: failed.

I cannot trace back what is going wrong here.

flass commented 4 years ago

Hi Laetitia,

(First of all, I allowed myself to edit the formatting of your post so it's more readable; I hope you don't mind. I suggest you use the GitHub Markdown syntax, e.g. enclosing commands and logs in triple backquotes ```, to make clear what is code and what is comment.)

A general comment: you should not try to run downstream pipeline tasks when upstream tasks have failed; they are bound to fail too. Only task 04 is not required for subsequent tasks (as explained here). So when you get an error, please try to fix the problem first.
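As a rough illustration of that advice, here is a minimal shell sketch that runs pipeline steps in order and stops at the first one that looks broken. The function name `run_until_failure` and the log file naming are hypothetical. Because the log above shows task 0 reporting "complete" right after a Python traceback, the sketch does not rely on the exit status alone: it also treats a `Traceback` or a line starting with `ERROR` in the captured output as a failure. That heuristic is an assumption, not something Pantagruel itself provides.

```shell
# Run each command string in order; stop at the first one that fails.
# Usage: run_until_failure LOGDIR "command 1" "command 2" ...
# (hypothetical helper; LOGDIR receives one log file per step)
run_until_failure() {
    logdir="$1"
    shift
    n=0
    for step in "$@"; do
        n=$((n + 1))
        log="$logdir/step_${n}.log"
        echo "running: $step"
        # capture stdout and stderr so the output can be inspected afterwards
        if sh -c "$step" > "$log" 2>&1; then status=0; else status=$?; fi
        cat "$log"
        # assumption: a traceback or an ERROR line means the task did not succeed
        if [ "$status" -ne 0 ] || grep -q -E 'Traceback|^ERROR' "$log"; then
            echo "stopping: step $n failed, later steps not run" >&2
            return 1
        fi
    done
}

# Example (paths taken from the report above):
# run_until_failure /tmp \
#   "pantagruel -i /scratch/clamchatka/Panta/test7/environ_pantagruel_test7.sh fetch" \
#   "pantagruel -i /scratch/clamchatka/Panta/test7/environ_pantagruel_test7.sh homologous"
```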

About your error now! The first error messages from ls are misleading, e.g.:

ls: cannot access /scratch/clamchatka/Panta/test7/00.input_data/annotation/Ctena_galapagana_StHelenaBay_001_SYM/*.gbk: No such file or directory

No files are missing; I think the script found an alternative file, in this case Ctena_mexicana_Hele014b.gbf, and could generate from it the modified GenBank file Ctena_mexicana_Hele014b.ptg.gbk. I suppressed that false error in commit 5c9db97.

The real problem seems to be the absence of the modified GFF file Ctena_mexicana_Hele014b.ptg.gff from the folder /scratch/clamchatka/Panta/test7/00.input_data/annotation/Ctena_mexicana_StHelenaBay_014_SYM2/. I think this issue, which somehow went unreported (but I should have fixed that too in commit 5c9db97), stems from an original failure in the Prokka annotation for those genomes, as indicated by a non-empty errorsummary.val file in that same folder. Could you have a look at the content of that Prokka error log file and/or post it here, please, so I can try to diagnose the problem?
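To check all assemblies at once rather than one folder at a time, something like the following shell sketch could be used. It only relies on the file names mentioned above (`*.ptg.gff` and `errorsummary.val`); the function name `check_annotation_dirs` is hypothetical. Note that a non-empty errorsummary.val may contain only warnings, so it is a hint to look closer rather than proof of a failed annotation.

```shell
# For each assembly folder under an annotation directory, report
#  - whether a modified GFF file (*.ptg.gff) is present
#  - whether errorsummary.val exists and is non-empty
# (hypothetical helper, not part of Pantagruel)
check_annotation_dirs() {
    annotdir="$1"
    for d in "$annotdir"/*/; do
        [ -d "$d" ] || continue
        name=$(basename "$d")
        # ls fails when the glob matches nothing
        if ls "$d"*.ptg.gff > /dev/null 2>&1; then gff="ok"; else gff="MISSING"; fi
        # -s is true only for an existing file of size greater than zero
        if [ -s "${d}errorsummary.val" ]; then err="non-empty"; else err="empty-or-absent"; fi
        printf '%s\tptg.gff=%s\terrorsummary.val=%s\n' "$name" "$gff" "$err"
    done
}

# Example (path taken from the log above):
# check_annotation_dirs /scratch/clamchatka/Panta/test7/00.input_data/annotation
```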

Cheers,

Florent

flass commented 4 years ago

PS: you seem to be running a dated version, 9531df2 (2 months old); I have made quite a few fixes to that part since, so you should definitely update the repository, cf. #26.

megaptera-helvetiae commented 4 years ago

Hi Florent,

here is the output from errorsummary.val


total 56667
-rw-r--r--. 1 wilkins login      169 Nov 13 05:09 Ctena_mexicana_Hele011.ecn
-rw-r--r--. 1 wilkins login   885124 Nov 13 05:09 Ctena_mexicana_Hele011.err
-rw-r--r--. 1 wilkins login  1459440 Nov 13 05:09 Ctena_mexicana_Hele011.faa
-rw-r--r--. 1 wilkins login  4015993 Nov 13 05:09 Ctena_mexicana_Hele011.ffn
-rw-r--r--. 1 wilkins login     3028 Nov 13 05:09 Ctena_mexicana_Hele011.fixedproducts
-rw-r--r--. 1 wilkins login  4348700 Nov 13 05:07 Ctena_mexicana_Hele011.fna
-rw-r--r--. 1 wilkins login  4352210 Nov 13 05:09 Ctena_mexicana_Hele011.fsa
-rw-r--r--. 1 wilkins login  9445929 Nov 13 05:09 Ctena_mexicana_Hele011.gbf
-rw-r--r--. 1 wilkins login  6110321 Nov 13 05:09 Ctena_mexicana_Hele011.gff
-rw-r--r--. 1 wilkins login    56487 Nov 13 05:09 Ctena_mexicana_Hele011.log
-rw-r--r--. 1 wilkins login  9448919 Nov 13 05:09 Ctena_mexicana_Hele011.ptg.gbk
-rw-r--r--. 1 wilkins login 16329641 Nov 13 05:09 Ctena_mexicana_Hele011.sqn
-rw-r--r--. 1 wilkins login  1160610 Nov 13 05:09 Ctena_mexicana_Hele011.tbl
-rw-r--r--. 1 wilkins login   402003 Nov 13 05:09 Ctena_mexicana_Hele011.tsv
-rw-r--r--. 1 wilkins login      108 Nov 13 05:09 Ctena_mexicana_Hele011.txt
-rw-r--r--. 1 wilkins login     3311 Nov 13 05:09 Ctena_mexicana_Hele011.val
-rw-r--r--. 1 wilkins login      132 Nov 13 05:09 errorsummary.val
[wilkins@gorilla Ctena_mexicana_StHelenaBay_011_SYM]$ cat errorsummary.val
     1 ERROR:   SEQ_FEAT.BadProteinName
     3 WARNING: SEQ_FEAT.BadEcNumberValue
     9 WARNING: SEQ_FEAT.ProteinNameEndsInBracket
megaptera-helvetiae commented 4 years ago

So here is a little update.

I started from scratch. Created a new repository using the following commands:

git clone https://github.com/flass/pantagruel.git
Cloning into 'pantagruel'...
remote: Enumerating objects: 41, done.
remote: Counting objects: 100% (41/41), done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 7243 (delta 21), reused 21 (delta 7), pack-reused 7202
Receiving objects: 100% (7243/7243), 6.98 MiB | 7.96 MiB/s, done.
Resolving deltas: 100% (5595/5595), done.
Checking out files: 100% (181/181), done.

[wilkins@gorilla scripts]$ git pull
Already up to date.

Then, I changed the paths in my environment file so that it would find the scripts at the new locations.

#!/bin/bash
## Pantagruel database 'test8'
## built with Pantagruel version '9531df2e57fd032f1ff8e11b79091953833f978e'; source code available at 'https://github.com/flass/pantagruel'

# the init command with which the config file was created
ptginitcmd='pantagruel -d test8 -r /scratch/clamchatka/Panta/ -a /scratch/clamchatka/Panta/user_genomes/ -T /scratch/clamchatka/Panta/NCBI/Taxonomy_2019-11-12/ init'

# location (folder) of Pantagruel software that was used
export ptgrepo='/apps/pantagruel/9531df2e57fd032f1ff8e11b79091953833f978e'                            
# derive paths to Pantagruel scripts and Python modules
export ptgscripts='/scratch/clamchatka/Panta/pantagruel/scripts'
export PYTHONPATH='/scratch/clamchatka/Panta/pantagruel/python_libs'
# database parameters (primary variables)
export ptgroot='/scratch/clamchatka/Panta'                            # root folder where to build the database
export ptgdbname='test8'                        # name of dataabse
export ptgversinit='9531df2e57fd032f1ff8e11b79091953833f978e'                    # current version of Pantagruel software
export myemail='undisclosed'                            # user identity (better use e-amil address)
export famprefix='PANTAG'                        # gene family prefix
export ncbitax='/scratch/clamchatka/Panta/NCBI/Taxonomy_2019-11-12'                            # folder of up-to-date NCBI Taxonomy database
export ncbiass=''                            # folder of RefSeq genomes to include in the study
export listncbiass=''                    # list of accessions of RefSeq genomes to include in the study
export customassemb='/scratch/clamchatka/Panta/user_genomes'                  # folder of custom genome assemblies to include in the study
export refass=''                              # folder of reference (RefSeq) genomes only to use as reference for the annotation of custom genome assemblies
export listrefass=''                      # list of accessions of reference (RefSeq) genomes only to use as reference for the annotation of custom genome assemblies
export coreseqtype='cds'                    # either 'cds' or 'protein'
export pseudocoremingenomes=    # the minimum number of genomes in which a gene family should be present to be included in the pseudo-core genome gene set
export userreftree=''                    # possible user-provided reference tree
export poplgthresh='default'                    # parameter to define populations of genomes in the reference tree (stem branch length threshold, default value depends on coreseqtype)
export poplgleafmul='1.5'                  # parameter to define populations of genomes in the reference tree (multiplier to the former in case it is a leaf, default 1.5)
export popbsthresh='80'                    # parameter to define populations of genomes in the reference tree (stem branch support threshold, default 80)
export rootingmethod='treebalance'
export chaintype='fullgenetree'                        # whether gene trees will be collapsed ('collapsed', if -c option enabled) or not ('fullgenetree', default)
export genefamlist=''                    # list of gene families for which computation of gene trees and all subsequent analyses will be restricted (default: no restriction)
# non-default parameters for gene trees collapsing derived from -C option value (passed to init script via ${collapseCladeParams}): 
export cladesupp=70                          # - clade criterion trheshold (int)
export subcladesupp=35                    # - wihtin-clade criterion trheshold (int)
export criterion='bs'                        # - criterion (branch support: 'bs', branch length 'lg')
export withinfun='median'                        # - aggregate function for testing within the clade ('min', 'max', 'mean', 'median')
export hpcremoteptgroot='none'          # if not empty nor 'none', will use this server address to send data and scripts to run heavy computions there 

# other parameters have default values defined in the generic source file environ_pantagruel_defaults.sh
source ${ptgscripts}/pipeline/environ_pantagruel_defaults.sh
# these defalts can be overriden by uncommenting the relevant line below and editing the variable's value
# default values are:
# Prokka annotation parameters (only relevant if custom genome assemblies are provided):
#~ export assembler="somesoftware"
#~ export seqcentre="somewhere"
#~ export refgenus="Reference"
# species tree inference parameters
#~ export ncorebootstrap=200
# gene tree inference parameters
#~ export mainresulttag='rootedTree'
# gene trees collapsing DEFAULT values (used when -C option is NOT present in init call)
#~ export cladesuppdef=70
#~ export subcladesuppdef=35
#~ export criteriondef='bs'
#~ export withinfundef='median'
# gene tree/species tree reconciliation inference parameters
#~ export ALEalgo='ALEml'
#~ export recsamplesize=1000
# gene tree/species tree reconciliation parsing parameters for co-evolution analysis
#~ export evtypeparse='ST'
#~ export minevfreqparse=0.1
#~ export minevfreqmatch=0.5
#~ export minjoinevfreqmatch=1.0
#~ export maxreftreeheight=0.25

# secondary vars are defined based on the above
source ${ptgscripts}/pipeline/environ_pantagruel_secondaryvars.sh
# load shared functions
source ${ptgscripts}/pipeline/pantagruel_pipeline_functions.sh

Particularly these two lines:

export ptgscripts='/scratch/clamchatka/Panta/pantagruel/scripts'
export PYTHONPATH='/scratch/clamchatka/Panta/pantagruel/python_libs'

Then, I tried to run Prokka again. This time the error occurred much earlier.

pantagruel -i /scratch/clamchatka/Panta/test8/environ_pantagruel_test8.sh fetch -F
This is Pantagruel pipeline version 9531df2e57fd032f1ff8e11b79091953833f978e using source code from repository '/apps/pantagruel/9531df2e57fd032f1ff8e11b79091953833f978e'
# will run tasks: 0
[2019-11-14 04:40:52] Pantagrel pipeline task 0: fetch public genome data from NCBI sequence databases and annotate private genomes.
Task folder '/scratch/clamchatka/Panta/test8/00.input_data' already exists; FORCE mode is on: ERASE and recreate the folder to write new result in its place
[2019-11-14 04:40:52] extract assembly data from folder '/scratch/clamchatka/Panta/user_genomes'
found 14 contig files (raw genome assemblies) in /scratch/clamchatka/Panta/user_genomes/contigs/
[2019-11-14 04:40:53] Ctena_galapagana_StHelenaBay_001_SYM
will annotate contigs in '/scratch/clamchatka/Panta/user_genomes/contigs/Ctena_galapagana_StHelenaBay_001_SYM.fasta'
[2019-11-14 04:40:53]
### assembly: Ctena_galapagana_StHelenaBay_001_SYM; contig files from: /scratch/clamchatka/Panta/user_genomes/contigs/Ctena_galapagana_StHelenaBay_001_SYM.fasta
running Prokka...
done.
[2019-11-14 04:42:29]
fix annotation to integrate region information into GFF files
fix annotation to integrate taxid information into GBK files
ls: cannot access /scratch/clamchatka/Panta/test8/00.input_data/annotation/Ctena_galapagana_StHelenaBay_001_SYM/*.gbk: No such file or directory
Traceback (most recent call last):
  File "/scratch/clamchatka/Panta/pantagruel/scripts/add_taxid_feature2prokkaGBK.py", line 19, in <module>
    strain = lsp[strainfield]
IndexError: list index out of range
ERROR: something went wrong when modifying the GenBank flat file /scratch/clamchatka/Panta/test8/00.input_data/annotation/Ctena_galapagana_StHelenaBay_001_SYM/Ctena_mexicana_Hele001.gbf
ERROR: Pantagrel pipeline task 0: failed.

The error is still the same:

cat errorsummary.val 
     1 ERROR:   SEQ_FEAT.BadProteinName
     3 WARNING: SEQ_FEAT.BadEcNumberValue
     7 WARNING: SEQ_FEAT.ProteinNameEndsInBracket
flass commented 4 years ago

Hi Laetitia,

I am confused as to why you would get the error earlier, as it seems you are still running the same version (9531df2)! I believe the pantagruel command links to the repository that was once installed by your admins. Indeed, the variable locating the active repository is defined as: ptgrepo='/apps/pantagruel/9531df2e57fd032f1ff8e11b79091953833f978e'

Instead, you should use the explicit path to the command:

/where/you/cloned/the/last/version/pantagruel/pantagruel

For instance, you can just refresh the config file using:

/where/you/cloned/the/last/version/pantagruel/pantagruel -i /scratch/clamchatka/Panta/test8/environ_pantagruel_test8.sh --refresh init

This should automatically set $ptgscripts and $PYTHONPATH correctly (and also, importantly, the variable $ptgrepo).

But before doing that, I strongly suggest you run those additional commands to set up the submodule tree2 (cf. install_dependencies.sh script), otherwise it will soon cause problems:

cd /where/you/cloned/the/last/version/pantagruel/
git submodule init
git submodule update

Then, for every routine update, please make sure you use:

git pull
git submodule update

Once you've done that we should see things a bit more clearly, as for the moment I have the impression you are running a mixture of old pipeline scripts with recent Python modules (as your $PYTHONPATH points to the recent repository version).
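One way to spot such a mixture is to print the three repository-related variables from the config file and check that they agree. The sketch below is only an illustration: the function name `check_ptg_paths` is hypothetical, and it assumes the config file defines `ptgrepo`, `ptgscripts` and `PYTHONPATH` on lines starting with `export`, as in the file posted above. It evaluates just those three lines in a subshell instead of sourcing the whole file.

```shell
# Print ptgrepo, ptgscripts and PYTHONPATH as defined in a Pantagruel
# config file, and report whether the last two sit under ptgrepo.
# (hypothetical helper; assumes 'export NAME=value' lines as in the file above)
check_ptg_paths() {
    envfile="$1"
    (
        # evaluate only the three relevant assignments, inside a subshell,
        # so the caller's environment is left untouched
        eval "$(grep -E '^export (ptgrepo|ptgscripts|PYTHONPATH)=' "$envfile")"
        echo "ptgrepo=$ptgrepo"
        echo "ptgscripts=$ptgscripts"
        echo "PYTHONPATH=$PYTHONPATH"
        if [ "$ptgscripts" = "$ptgrepo/scripts" ] && [ "$PYTHONPATH" = "$ptgrepo/python_libs" ]; then
            echo "consistent"
        else
            echo "MIXED"
            exit 1
        fi
    )
}

# Example (path taken from the report above):
# check_ptg_paths /scratch/clamchatka/Panta/test8/environ_pantagruel_test8.sh
```

Run on the test8 config file posted above, this reports MIXED, since ptgrepo still points to /apps/pantagruel/... while the other two variables point to the fresh clone.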

flass commented 4 years ago

Now regarding the problem underlying your errors, my first guess would be that you may not have set (correctly) the /scratch/clamchatka/Panta/user_genomes/strain_info_test8.txt file. Can I have a glimpse of how it looks please?
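A quick sanity check on such a file is to verify that every row has as many fields as the first row, since an `IndexError: list index out of range` when reading a field by position is typical of a short or wrongly delimited line. The sketch below assumes the file is tab-delimited with a header on its first line; that layout is an assumption here, and the function name `check_field_counts` is hypothetical.

```shell
# Report every line whose number of tab-separated fields differs from
# that of the first (header) line; exit status is non-zero if any is found.
# (hypothetical helper; assumes a tab-delimited file with a header row)
check_field_counts() {
    awk -F '\t' '
        NR == 1 { n = NF; next }
        NF != n { printf "line %d: %d fields (expected %d)\n", NR, NF, n; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example (path taken from the comment above):
# check_field_counts /scratch/clamchatka/Panta/user_genomes/strain_info_test8.txt
```

A file saved with spaces instead of tabs would show up here as rows with a single field.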

flass commented 4 years ago

Also, I suspect there is further information on the failure of Prokka in the file Ctena_mexicana_Hele011.val, as it is not part of the regular set of output files.

megaptera-helvetiae commented 4 years ago

To initialize I run:

pantagruel -d test9 -r /scratch/clamchatka/Panta/ -a /scratch/clamchatka/Panta/user_genomes/ -T /scratch/clamchatka/Panta/NCBI/Taxonomy_2019-11-12/ init

I did set up the repository and updated all the scripts and submodules to the newest version, as you suggested. This is now in /scratch/clamchatka/Panta/pantagruel.

Then I refresh the config_file:

/scratch/clamchatka/Panta/pantagruel/pantagruel -i /scratch/clamchatka/Panta/test9/environ_pantagruel_test9.sh --refresh init

Then I went into the config_file to check, but it did not update the paths to my updated scripts!

See for yourself:

#!/bin/bash
## Pantagruel database 'test9'
## built with Pantagruel version '9531df2e57fd032f1ff8e11b79091953833f978e'; source code available at 'https://github.com/flass/pantagruel'

# the init command with which the config file was created
ptginitcmd='pantagruel -d test9 -r /scratch/clamchatka/Panta/ -a /scratch/clamchatka/Panta/user_genomes/ -T /scratch/clamchatka/Panta/NCBI/Taxonomy_2019-11-12/ init'

# location (folder) of Pantagruel software that was used
export ptgrepo='/apps/pantagruel/9531df2e57fd032f1ff8e11b79091953833f978e'                            
# derive paths to Pantagruel scripts and Python modules
export ptgscripts=${ptgrepo}/scripts
export PYTHONPATH=${ptgrepo}/python_libs
# database parameters (primary variables)
export ptgroot='/scratch/clamchatka/Panta'                            # root folder where to build the database
export ptgdbname='test9'                        # name of dataabse
export ptgversinit='9531df2e57fd032f1ff8e11b79091953833f978e'                    # current version of Pantagruel software
export myemail='undisclosed'                            # user identity (better use e-amil address)
export famprefix='PANTAG'                        # gene family prefix
export ncbitax='/scratch/clamchatka/Panta/NCBI/Taxonomy_2019-11-12'                            # folder of up-to-date NCBI Taxonomy database
export ncbiass=''                            # folder of RefSeq genomes to include in the study
export listncbiass=''                    # list of accessions of RefSeq genomes to include in the study
export customassemb='/scratch/clamchatka/Panta/user_genomes'                  # folder of custom genome assemblies to include in the study
export refass=''                              # folder of reference (RefSeq) genomes only to use as reference for the annotation of custom genome assemblies
export listrefass=''                      # list of accessions of reference (RefSeq) genomes only to use as reference for the annotation of custom genome assemblies
export coreseqtype='cds'                    # either 'cds' or 'protein'
export pseudocoremingenomes=    # the minimum number of genomes in which a gene family should be present to be included in the pseudo-core genome gene set
export userreftree=''                    # possible user-provided reference tree
export poplgthresh='default'                    # parameter to define populations of genomes in the reference tree (stem branch length threshold, default value depends on coreseqtype)
export poplgleafmul='1.5'                  # parameter to define populations of genomes in the reference tree (multiplier to the former in case it is a leaf, default 1.5)
export popbsthresh='80'                    # parameter to define populations of genomes in the reference tree (stem branch support threshold, default 80)
export rootingmethod='treebalance'
export chaintype='fullgenetree'                        # whether gene trees will be collapsed ('collapsed', if -c option enabled) or not ('fullgenetree', default)
export genefamlist=''                    # list of gene families for which computation of gene trees and all subsequent analyses will be restricted (default: no restriction)
# non-default parameters for gene trees collapsing derived from -C option value (passed to init script via ${collapseCladeParams}): 
export cladesupp=70                          # - clade criterion trheshold (int)
export subcladesupp=35                    # - wihtin-clade criterion trheshold (int)
export criterion='bs'                        # - criterion (branch support: 'bs', branch length 'lg')
export withinfun='median'                        # - aggregate function for testing within the clade ('min', 'max', 'mean', 'median')
export hpcremoteptgroot='none'          # if not empty nor 'none', will use this server address to send data and scripts to run heavy computions there 

# other parameters have default values defined in the generic source file environ_pantagruel_defaults.sh
source ${ptgscripts}/pipeline/environ_pantagruel_defaults.sh
# these defalts can be overriden by uncommenting the relevant line below and editing the variable's value
# default values are:
# Prokka annotation parameters (only relevant if custom genome assemblies are provided):
#~ export assembler="somesoftware"
#~ export seqcentre="somewhere"
#~ export refgenus="Reference"
# species tree inference parameters
#~ export ncorebootstrap=200
# gene tree inference parameters
#~ export mainresulttag='rootedTree'
# gene trees collapsing DEFAULT values (used when -C option is NOT present in init call)
#~ export cladesuppdef=70
#~ export subcladesuppdef=35
#~ export criteriondef='bs'
#~ export withinfundef='median'
# gene tree/species tree reconciliation inference parameters
#~ export ALEalgo='ALEml'
#~ export recsamplesize=1000
# gene tree/species tree reconciliation parsing parameters for co-evolution analysis
#~ export evtypeparse='ST'
#~ export minevfreqparse=0.1
#~ export minevfreqmatch=0.5
#~ export minjoinevfreqmatch=1.0
#~ export maxreftreeheight=0.25

# secondary vars are defined based on the above
source ${ptgscripts}/pipeline/environ_pantagruel_secondaryvars.sh
# load shared functions
source ${ptgscripts}/pipeline/pantagruel_pipeline_functions.sh

Should I now change the paths to ptgrepo, scripts and python libraries by hand???

I did this here to test:

#!/bin/bash
## Pantagruel database 'test9'
## built with Pantagruel version '9531df2e57fd032f1ff8e11b79091953833f978e'; source code available at 'https://github.com/flass/pantagruel'

# the init command with which the config file was created
ptginitcmd='pantagruel -d test9 -r /scratch/clamchatka/Panta/ -a /scratch/clamchatka/Panta/user_genomes/ -T /scratch/clamchatka/Panta/NCBI/Taxonomy_2019-11-12/ init'

# location (folder) of Pantagruel software that was used
export ptgrepo='/scratch/clamchatka/Panta/pantagruel'                            
# derive paths to Pantagruel scripts and Python modules
export ptgscripts='/scratch/clamchatka/Panta/pantagruel/scripts'
export PYTHONPATH='/scratch/clamchatka/Panta/pantagruel/python_libs'
# database parameters (primary variables)
export ptgroot='/scratch/clamchatka/Panta'                            # root folder where to build the database
export ptgdbname='test9'                        # name of dataabse
export ptgversinit='4867c048788ba7ec92dfd5ae9148d0349411151c'                    # current version of Pantagruel software
export myemail='undisclosed'                            # user identity (better use e-amil address)
export famprefix='PANTAG'                        # gene family prefix
export ncbitax='/scratch/clamchatka/Panta/NCBI/Taxonomy_2019-11-12'                            # folder of up-to-date NCBI Taxonomy database
export ncbiass=''                            # folder of RefSeq genomes to include in the study
export listncbiass=''                    # list of accessions of RefSeq genomes to include in the study
export customassemb='/scratch/clamchatka/Panta/user_genomes'                  # folder of custom genome assemblies to include in the study
export refass=''                              # folder of reference (RefSeq) genomes only to use as reference for the annotation of custom genome assemblies
export listrefass=''                      # list of accessions of reference (RefSeq) genomes only to use as reference for the annotation of custom genome assemblies
export coreseqtype='cds'                    # either 'cds' or 'protein'
export pseudocoremingenomes=    # the minimum number of genomes in which a gene family should be present to be included in the pseudo-core genome gene set
export userreftree=''                    # possible user-provided reference tree
export poplgthresh='default'                    # parameter to define populations of genomes in the reference tree (stem branch length threshold, default value depends on coreseqtype)
export poplgleafmul='1.5'                  # parameter to define populations of genomes in the reference tree (multiplier to the former in case it is a leaf, default 1.5)
export popbsthresh='80'                    # parameter to define populations of genomes in the reference tree (stem branch support threshold, default 80)
export rootingmethod='treebalance'
export chaintype='fullgenetree'                        # whether gene trees will be collapsed ('collapsed', if -c option enabled) or not ('fullgenetree', default)
export genefamlist=''                    # list of gene families for which computation of gene trees and all subsequent analyses will be restricted (default: no restriction)
# non-default parameters for gene trees collapsing derived from -C option value (passed to init script via ${collapseCladeParams}): 
export cladesupp=70                          # - clade criterion trheshold (int)
export subcladesupp=35                    # - wihtin-clade criterion trheshold (int)
export criterion='bs'                        # - criterion (branch support: 'bs', branch length 'lg')
export withinfun='median'                        # - aggregate function for testing within the clade ('min', 'max', 'mean', 'median')
export hpcremoteptgroot='none'          # if not empty nor 'none', will use this server address to send data and scripts to run heavy computions there 

# other parameters have default values defined in the generic source file environ_pantagruel_defaults.sh
source ${ptgscripts}/pipeline/environ_pantagruel_defaults.sh
# these defalts can be overriden by uncommenting the relevant line below and editing the variable's value
# default values are:
# Prokka annotation parameters (only relevant if custom genome assemblies are provided):
#~ export assembler="somesoftware"
#~ export seqcentre="somewhere"
#~ export refgenus="Reference"
# species tree inference parameters
#~ export ncorebootstrap=200
# gene tree inference parameters
#~ export mainresulttag='rootedTree'
# gene trees collapsing DEFAULT values (used when -C option is NOT present in init call)
#~ export cladesuppdef=70
#~ export subcladesuppdef=35
#~ export criteriondef='bs'
#~ export withinfundef='median'
# gene tree/species tree reconciliation inference parameters
#~ export ALEalgo='ALEml'
#~ export recsamplesize=1000
# gene tree/species tree reconciliation parsing parameters for co-evolution analysis
#~ export evtypeparse='ST'
#~ export minevfreqparse=0.1
#~ export minevfreqmatch=0.5
#~ export minjoinevfreqmatch=1.0
#~ export maxreftreeheight=0.25

# secondary vars are defined based on the above
source ${ptgscripts}/pipeline/environ_pantagruel_secondaryvars.sh
# load shared functions
source ${ptgscripts}/pipeline/pantagruel_pipeline_functions.sh

You see, I even changed the version manually!

Then I run Prokka:

pantagruel -i /scratch/clamchatka/Panta/test9/environ_pantagruel_test9.sh fetch

And I get this error:

This is Pantagruel pipeline version 9531df2e57fd032f1ff8e11b79091953833f978e using source code from repository '/apps/pantagruel/9531df2e57fd032f1ff8e11b79091953833f978e'
# will run tasks: 0
[2019-11-14 20:27:45] Pantagrel pipeline task 0: fetch public genome data from NCBI sequence databases and annotate private genomes.
ERROR: the current version of pantagruel (commit 9531df2) is different from the one used to generate the config file '/scratch/clamchatka/Panta/test9/environ_pantagruel_test9.sh' (commit 4867c04).
Please regenerate the config file with `pantagruel init` to ensure compatibility; for the same parameters to be set, just run the same command with same options as previously.
ERROR: Pantagrel pipeline task 0: failed.

So I refresh my config file:

/scratch/clamchatka/Panta/pantagruel/pantagruel -i /scratch/clamchatka/Panta/test9/environ_pantagruel_test9.sh --refresh init

Note, this put it back to the older version of pantagruel (export ptgrepo='/apps/pantagruel/9531df2e57fd032f1ff8e11b79091953833f978e').

Here you see what happened to the config file:

#!/bin/bash
## Pantagruel database 'test9'
## built with Pantagruel version '9531df2e57fd032f1ff8e11b79091953833f978e'; source code available at 'https://github.com/flass/pantagruel'

# the init command with which the config file was created
ptginitcmd='pantagruel -d test9 -r /scratch/clamchatka/Panta/ -a /scratch/clamchatka/Panta/user_genomes/ -T /scratch/clamchatka/Panta/NCBI/Taxonomy_2019-11-12/ init'

# location (folder) of Pantagruel software that was used
export ptgrepo='/apps/pantagruel/9531df2e57fd032f1ff8e11b79091953833f978e'                            
# derive paths to Pantagruel scripts and Python modules
export ptgscripts=${ptgrepo}/scripts
export PYTHONPATH=${ptgrepo}/python_libs
# database parameters (primary variables)
export ptgroot='/scratch/clamchatka/Panta'                            # root folder where to build the database
export ptgdbname='test9'                        # name of dataabse
export ptgversinit='9531df2e57fd032f1ff8e11b79091953833f978e'                    # current version of Pantagruel software
export myemail='undisclosed'                            # user identity (better use e-amil address)
export famprefix='PANTAG'                        # gene family prefix
export ncbitax='/scratch/clamchatka/Panta/NCBI/Taxonomy_2019-11-12'                            # folder of up-to-date NCBI Taxonomy database
export ncbiass=''                            # folder of RefSeq genomes to include in the study
export listncbiass=''                    # list of accessions of RefSeq genomes to include in the study
export customassemb='/scratch/clamchatka/Panta/user_genomes'                  # folder of custom genome assemblies to include in the study
export refass=''                              # folder of reference (RefSeq) genomes only to use as reference for the annotation of custom genome assemblies
export listrefass=''                      # list of accessions of reference (RefSeq) genomes only to use as reference for the annotation of custom genome assemblies
export coreseqtype='cds'                    # either 'cds' or 'protein'
export pseudocoremingenomes=    # the minimum number of genomes in which a gene family should be present to be included in the pseudo-core genome gene set
export userreftree=''                    # possible user-provided reference tree
export poplgthresh='default'                    # parameter to define populations of genomes in the reference tree (stem branch length threshold, default value depends on coreseqtype)
export poplgleafmul='1.5'                  # parameter to define populations of genomes in the reference tree (multiplier to the former in case it is a leaf, default 1.5)
export popbsthresh='80'                    # parameter to define populations of genomes in the reference tree (stem branch support threshold, default 80)
export rootingmethod='treebalance'
export chaintype='fullgenetree'                        # whether gene trees will be collapsed ('collapsed', if -c option enabled) or not ('fullgenetree', default)
export genefamlist=''                    # list of gene families for which computation of gene trees and all subsequent analyses will be restricted (default: no restriction)
# non-default parameters for gene trees collapsing derived from -C option value (passed to init script via ${collapseCladeParams}): 
export cladesupp=70                          # - clade criterion trheshold (int)
export subcladesupp=35                    # - wihtin-clade criterion trheshold (int)
export criterion='bs'                        # - criterion (branch support: 'bs', branch length 'lg')
export withinfun='median'                        # - aggregate function for testing within the clade ('min', 'max', 'mean', 'median')
export hpcremoteptgroot='none'          # if not empty nor 'none', will use this server address to send data and scripts to run heavy computions there 

# other parameters have default values defined in the generic source file environ_pantagruel_defaults.sh
source ${ptgscripts}/pipeline/environ_pantagruel_defaults.sh
# these defalts can be overriden by uncommenting the relevant line below and editing the variable's value
# default values are:
# Prokka annotation parameters (only relevant if custom genome assemblies are provided):
#~ export assembler="somesoftware"
#~ export seqcentre="somewhere"
#~ export refgenus="Reference"
# species tree inference parameters
#~ export ncorebootstrap=200
# gene tree inference parameters
#~ export mainresulttag='rootedTree'
# gene trees collapsing DEFAULT values (used when -C option is NOT present in init call)
#~ export cladesuppdef=70
#~ export subcladesuppdef=35
#~ export criteriondef='bs'
#~ export withinfundef='median'
# gene tree/species tree reconciliation inference parameters
#~ export ALEalgo='ALEml'
#~ export recsamplesize=1000
# gene tree/species tree reconciliation parsing parameters for co-evolution analysis
#~ export evtypeparse='ST'
#~ export minevfreqparse=0.1
#~ export minevfreqmatch=0.5
#~ export minjoinevfreqmatch=1.0
#~ export maxreftreeheight=0.25

# secondary vars are defined based on the above
source ${ptgscripts}/pipeline/environ_pantagruel_secondaryvars.sh
# load shared functions
source ${ptgscripts}/pipeline/pantagruel_pipeline_functions.sh

My strain info file looks like this:

[wilkins@gorilla user_genomes]$ cat strain_infos_test9.txt
assembly_id genus   species strain  taxid   locus_tax_prefix
Ctena_galapagana_StHelenaBay_001_SYM    Ctena   mexicana    Hele001 1655433 CtegalHel001    
Ctena_imbricatula_STRI_051_SYM  Ctena   mexicana    STRI051 1655433 CteimSTRI051
Ctena_imbricatula_STRI_052_SYM  Ctena   mexicana    STRI052 1655433 CteimSTRI052
Ctena_imbricatula_STRI_065_SYM  Ctena   mexicana    STRI065 1655433 CteimSTRI065
Ctena_imbricatula_STRI_068_SYM  Ctena   mexicana    STRI068 1655433 CteimSTRI068
Ctena_imbricatula_STRI_070_SYM  Ctena   mexicana    STRI070 1655433 CteimSTRI070
Ctena_imbricatula_STRI_073_SYM  Ctena   mexicana    STRI073 1655433 CteimSTRI073
Ctena_imbricatula_STRI_074_SYM  Ctena   mexicana    STRI074 1655433 CteimSTRI074
Ctena_imbricatula_STRI_094_SYM  Ctena   mexicana    STRI094 1655433 CteimSTRI094
Ctena_mexicana_StHelenaBay_011_SYM  Ctena   mexicana    Hele011 1655433 CtegalHel011
Ctena_mexicana_StHelenaBay_012_SYM  Ctena   mexicana    Hele012 1655433 CtegalHel012
Ctena_mexicana_StHelenaBay_013_SYM  Ctena   mexicana    Hele013 1655433 CtegalHel013
Ctena_mexicana_StHelenaBay_014_SYM2 Ctena   mexicana    Hele014b    1655433 CtegalHel014b
Ctena_mexicana_StHelenaBay_014_SYM  Ctena   mexicana    Hele014 1655433 CtegalHel014

So I try to run prokka now:

pantagruel -i /scratch/clamchatka/Panta/test9/environ_pantagruel_test9.sh fetch

We are still getting the same error:

running Prokka...
done.
[2019-11-14 20:46:46]
fix annotation to integrate region information into GFF files
fix annotation to integrate taxid information into GBK files
ls: cannot access /scratch/clamchatka/Panta/test9/00.input_data/annotation/Ctena_imbricatula_STRI_070_SYM/*.gbk: No such file or directory
done.

[2019-11-14 21:04:33]
fix annotation to integrate region information into GFF files
fix annotation to integrate taxid information into GBK files
ls: cannot access /scratch/clamchatka/Panta/test9/00.input_data/annotation/Ctena_mexicana_StHelenaBay_014_SYM/*.gbk: No such file or directory
done.
will create GenBank-like assembly folders for user-provided genomes
gzip: /scratch/clamchatka/Panta/test9/00.input_data/annotation/Ctena_galapagana_StHelenaBay_001_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test9/00.input_data/annotation/Ctena_imbricatula_STRI_051_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test9/00.input_data/annotation/Ctena_imbricatula_STRI_052_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test9/00.input_data/annotation/Ctena_imbricatula_STRI_065_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test9/00.input_data/annotation/Ctena_imbricatula_STRI_068_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test9/00.input_data/annotation/Ctena_imbricatula_STRI_070_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test9/00.input_data/annotation/Ctena_imbricatula_STRI_073_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test9/00.input_data/annotation/Ctena_imbricatula_STRI_074_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test9/00.input_data/annotation/Ctena_imbricatula_STRI_094_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test9/00.input_data/annotation/Ctena_mexicana_StHelenaBay_011_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test9/00.input_data/annotation/Ctena_mexicana_StHelenaBay_012_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test9/00.input_data/annotation/Ctena_mexicana_StHelenaBay_013_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test9/00.input_data/annotation/Ctena_mexicana_StHelenaBay_014_SYM/ is a directory -- ignored
gzip: /scratch/clamchatka/Panta/test9/00.input_data/annotation/Ctena_mexicana_StHelenaBay_014_SYM2/ is a directory -- ignored
Traceback (most recent call last):
  File "/apps/pantagruel/9531df2e57fd032f1ff8e11b79091953833f978e/scripts/extract_metadata_from_gbff.py", line 366, in <module>
    main(nfldirassemb, dirassemblyinfo, output, defspename, nfdhandmetaraw, nfdhandmetacur, nfdhanddbxref)
  File "/apps/pantagruel/9531df2e57fd032f1ff8e11b79091953833f978e/scripts/extract_metadata_from_gbff.py", line 72, in main
    lassemb = [parse_assembly_name(assembname, reass=reass) for assembname in lassembname]
  File "/apps/pantagruel/9531df2e57fd032f1ff8e11b79091953833f978e/scripts/extract_metadata_from_gbff.py", line 60, in parse_assembly_name
    geass = seass.groups()
AttributeError: 'NoneType' object has no attribute 'groups'
[2019-11-14 21:06:14]
Pantagrel pipeline task 0: complete.

The error file looks like this:

[wilkins@gorilla Ctena_mexicana_StHelenaBay_011_SYM]$ cat Ctena_mexicana_Hele011.val
WARNING: valid [SEQ_FEAT.ProteinNameEndsInBracket] Protein name ends with bracket and may contain organism name FEATURE: Prot: Superoxide dismutase [Fe] [gnl|somewhere|CtegalHel011_00826:1-193] [gnl|somewhere|CtegalHel011_00826: raw, aa len= 193]
WARNING: valid [SEQ_FEAT.ProteinNameEndsInBracket] Protein name ends with bracket and may contain organism name FEATURE: Prot: Nicotinate-nucleotide pyrophosphorylase [carboxylating] [gnl|somewhere|CtegalHel011_00899:1-280] [gnl|somewhere|CtegalHel011_00899: raw, aa len= 280]
WARNING: valid [SEQ_FEAT.ProteinNameEndsInBracket] Protein name ends with bracket and may contain organism name FEATURE: Prot: GMP synthase [glutamine-hydrolyzing] [gnl|somewhere|CtegalHel011_01114:1-239] [gnl|somewhere|CtegalHel011_01114: raw, aa len= 239]
ERROR: valid [SEQ_FEAT.BadProteinName] Unknown or hypothetical protein should not have EC number FEATURE: Prot: hypothetical protein [gnl|somewhere|CtegalHel011_01239:1-321] [gnl|somewhere|CtegalHel011_01239: raw, aa len= 321]
WARNING: valid [SEQ_FEAT.ProteinNameEndsInBracket] Protein name ends with bracket and may contain organism name FEATURE: Prot: GMP synthase [glutamine-hydrolyzing] [gnl|somewhere|CtegalHel011_01596:1-194] [gnl|somewhere|CtegalHel011_01596: raw, aa len= 194]
WARNING: valid [SEQ_FEAT.ProteinNameEndsInBracket] Protein name ends with bracket and may contain organism name FEATURE: Prot: GMP synthase [glutamine-hydrolyzing] [gnl|somewhere|CtegalHel011_01905:1-239] [gnl|somewhere|CtegalHel011_01905: raw, aa len= 239]
WARNING: valid [SEQ_FEAT.BadEcNumberValue] 2.4.1.345 is not a legal value for qualifier EC_number FEATURE: Prot: Phosphatidyl-myo-inositol mannosyltransferase [gnl|somewhere|CtegalHel011_02328:1-404] [gnl|somewhere|CtegalHel011_02328: raw, aa len= 404]
WARNING: valid [SEQ_FEAT.BadEcNumberValue] 1.17.1.9 is not a legal value for qualifier EC_number FEATURE: Prot: Formate dehydrogenase H [gnl|somewhere|CtegalHel011_02586:1-915] [gnl|somewhere|CtegalHel011_02586: raw, aa len= 915]
WARNING: valid [SEQ_FEAT.ProteinNameEndsInBracket] Protein name ends with bracket and may contain organism name FEATURE: Prot: Phosphoenolpyruvate carboxykinase [GTP] [gnl|somewhere|CtegalHel011_03044:1-619] [gnl|somewhere|CtegalHel011_03044: raw, aa len= 619]
WARNING: valid [SEQ_FEAT.BadEcNumberValue] 1.17.1.9 is not a legal value for qualifier EC_number FEATURE: Prot: Formate dehydrogenase-O major subunit [gnl|somewhere|CtegalHel011_03055:1-964] [gnl|somewhere|CtegalHel011_03055: raw, aa len= 964]
WARNING: valid [SEQ_FEAT.ProteinNameEndsInBracket] Protein name ends with bracket and may contain organism name FEATURE: Prot: GMP synthase [glutamine-hydrolyzing] [gnl|somewhere|CtegalHel011_03205:1-70] [gnl|somewhere|CtegalHel011_03205: raw, aa len= 70]
WARNING: valid [SEQ_FEAT.ProteinNameEndsInBracket] Protein name ends with bracket and may contain organism name FEATURE: Prot: Glutamine--fructose-6-phosphate aminotransferase [isomerizing] [gnl|somewhere|CtegalHel011_03459:1-608] [gnl|somewhere|CtegalHel011_03459: raw, aa len= 608]
WARNING: valid [SEQ_FEAT.ProteinNameEndsInBracket] Protein name ends with bracket and may contain organism name FEATURE: Prot: GMP synthase [glutamine-hydrolyzing] [gnl|somewhere|CtegalHel011_03858:1-158] [gnl|somewhere|CtegalHel011_03858: raw, aa len= 158]

Can you help me update the repo and link it to the config file? I think this will solve many of the issues I am having.

flass commented 4 years ago

it is annoying that using pantagruel --refresh init with your local install i.e.

/scratch/clamchatka/Panta/pantagruel/pantagruel -i /scratch/clamchatka/Panta/test9/environ_pantagruel_test9.sh --refresh init

you still end up with a reference to the old version of the repo. This might have to do with you having a system-wide installation of pantagruel that somehow takes precedence over your local one in determining the version / path to the repository folder.
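
A quick way to see which copy wins when the command is called by its bare name is to list every match on your PATH in lookup order. This is only a minimal sketch; the helper name list_on_path is made up for illustration and is not part of pantagruel:

```shell
# Print every executable named "$1" found on PATH, in lookup order.
# The first line printed is the copy the shell runs for the bare command name.
# The body runs in a subshell so the IFS and globbing changes do not leak out.
list_on_path() (
    name=$1
    IFS=:
    set -f                          # split PATH on ':' only, no glob expansion
    for dir in $PATH; do
        [ -n "$dir" ] || dir=.      # an empty PATH entry means the current directory
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
)

list_on_path pantagruel
```

If the system-wide copy is listed first, calling the local clone by its full path sidesteps the lookup altogether.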

Changing the config file by hand as you did above (and not running pantagruel --refresh init afterwards) should solve the issue of the version check and let you get on with using the new version.
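
To check that the two versions really agree before relaunching, something like the following could help. It is a sketch under assumptions: that the local clone is a git checkout, and that ptgversinit records the commit hash of the code in use (as the error message suggests). The helper names config_version and repo_version are made up for illustration:

```shell
# Print the commit hash recorded on the "export ptgversinit='...'" line of a config file.
config_version() {
    sed -n "s/^export ptgversinit='\([^']*\)'.*/\1/p" "$1"
}

# Print the commit currently checked out in a git clone (assumes git is installed).
repo_version() {
    git -C "$1" rev-parse HEAD
}

# Usage: sh check_version.sh CONFIG_FILE REPO_DIR
if [ "$#" -eq 2 ]; then
    cfg_commit=$(config_version "$1")
    repo_commit=$(repo_version "$2")
    if [ "$cfg_commit" = "$repo_commit" ]; then
        echo "config and repository agree: $cfg_commit"
    else
        echo "config records $cfg_commit but repository is at $repo_commit"
    fi
fi
```

Here the two arguments would be the environ_pantagruel_test9.sh file and the folder where the repository was cloned.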

about the error with Prokka: I am a bit confused about it not liking the names it gave to the proteins itself... this might be due to inconsistencies between the version of Prokka you run and the BioPerl libraries that are installed on your server.

can you have a look at what the command readlink -f $(which prokka) gives you please?

to avoid conflicts in perl libs, i would suggest installing prokka locally, e.g. using homebrew (which should pick up the right BioPerl libs), and using that local install. if that is difficult to do on your server, I would strongly suggest doing it on another machine like your laptop and then giving the prokka annotation as input to pantagruel. As explained [here](https://github.com/flass/pantagruel#input-data), you just need to put each genome annotation in its own folder, with these folders placed in user_genomes/annotation/.
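
Before handing the annotations to pantagruel, it may be worth checking that every assembly folder actually contains a .gbk file, since that is what the 'ls: cannot access .../*.gbk' message complains about. A minimal sketch, assuming one sub-folder per assembly as described above; the helper name missing_gbk is made up for illustration:

```shell
# Print the name of every sub-folder of "$1" that contains no *.gbk file.
# The body runs in a subshell so its variables stay local.
missing_gbk() (
    annot_dir=$1
    for d in "$annot_dir"/*/; do
        [ -d "$d" ] || continue          # unmatched glob: no sub-folders at all
        found=no
        for f in "$d"*.gbk; do
            if [ -e "$f" ]; then
                found=yes
                break
            fi
        done
        if [ "$found" = no ]; then
            basename "$d"
        fi
    done
)

# Usage: sh check_gbk.sh /path/to/user_genomes/annotation
if [ "$#" -eq 1 ]; then
    missing_gbk "$1"
fi
```

An empty output means every folder has at least one .gbk file; any name printed is an assembly whose annotation needs to be redone.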

megaptera-helvetiae commented 4 years ago

Hi Florent,

Here is the version of Prokka: /apps/prokka/1.14/bin/prokka

There is one BIG problem with your solution. If I update the config file manually, Pantagruel won't run anymore. See error message below.

[wilkins@gorilla Panta]$ pantagruel -i /scratch/clamchatka/Panta/test8/environ_pantagruel_test8.sh fetch
This is Pantagruel pipeline version 9531df2e57fd032f1ff8e11b79091953833f978e using source code from repository '/apps/pantagruel/9531df2e57fd032f1ff8e11b79091953833f978e'
# will run tasks: 0
[2019-11-15 19:33:26] Pantagrel pipeline task 0: fetch public genome data from NCBI sequence databases and annotate private genomes.
ERROR: the current version of pantagruel (commit 9531df2) is different from the one used to generate the config file '/scratch/clamchatka/Panta/test9/environ_pantagruel_test9.sh' (commit 4867c04).
Please regenerate the config file with `pantagruel init` to ensure compatibility; for the same parameters to be set, just run the same command with same options as previously.
ERROR: Pantagrel pipeline task 0: failed.

And here is the config file itself:

#!/bin/bash
## Pantagruel database 'test9'
## built with Pantagruel version '9531df2e57fd032f1ff8e11b79091953833f978e'; source code available at 'https://github.com/flass/pantagruel'

# the init command with which the config file was created
ptginitcmd='pantagruel -d test9 -r /scratch/clamchatka/Panta/ -a /scratch/clamchatka/Panta/user_genomes/ -T /scratch/clamchatka/Panta/NCBI/Taxonomy_2019-11-12/ init'

# location (folder) of Pantagruel software that was used
export ptgrepo='/scratch/clamchatka/Panta/pantagruel'                            
# derive paths to Pantagruel scripts and Python modules
export ptgscripts='/scratch/clamchatka/Panta/pantagruel/scripts'
export PYTHONPATH='/scratch/clamchatka/Panta/pantagruel/python_libs'
# database parameters (primary variables)
export ptgroot='/scratch/clamchatka/Panta'                            # root folder where to build the database
export ptgdbname='test9'                        # name of dataabse
export ptgversinit='4867c048788ba7ec92dfd5ae9148d0349411151c'                    # current version of Pantagruel software
export myemail='undisclosed'                            # user identity (better use e-amil address)
export famprefix='PANTAG'                        # gene family prefix
export ncbitax='/scratch/clamchatka/Panta/NCBI/Taxonomy_2019-11-12'                            # folder of up-to-date NCBI Taxonomy database
export ncbiass=''                            # folder of RefSeq genomes to include in the study
export listncbiass=''                    # list of accessions of RefSeq genomes to include in the study
export customassemb='/scratch/clamchatka/Panta/user_genomes'                  # folder of custom genome assemblies to include in the study
export refass=''                              # folder of reference (RefSeq) genomes only to use as reference for the annotation of custom genome assemblies
export listrefass=''                      # list of accessions of reference (RefSeq) genomes only to use as reference for the annotation of custom genome assemblies
export coreseqtype='cds'                    # either 'cds' or 'protein'
export pseudocoremingenomes=    # the minimum number of genomes in which a gene family should be present to be included in the pseudo-core genome gene set
export userreftree=''                    # possible user-provided reference tree
export poplgthresh='default'                    # parameter to define populations of genomes in the reference tree (stem branch length threshold, default value depends on coreseqtype)
export poplgleafmul='1.5'                  # parameter to define populations of genomes in the reference tree (multiplier to the former in case it is a leaf, default 1.5)
export popbsthresh='80'                    # parameter to define populations of genomes in the reference tree (stem branch support threshold, default 80)
export rootingmethod='treebalance'
export chaintype='fullgenetree'                        # whether gene trees will be collapsed ('collapsed', if -c option enabled) or not ('fullgenetree', default)
export genefamlist=''                    # list of gene families for which computation of gene trees and all subsequent analyses will be restricted (default: no restriction)
# non-default parameters for gene trees collapsing derived from -C option value (passed to init script via ${collapseCladeParams}): 
export cladesupp=70                          # - clade criterion trheshold (int)
export subcladesupp=35                    # - wihtin-clade criterion trheshold (int)
export criterion='bs'                        # - criterion (branch support: 'bs', branch length 'lg')
export withinfun='median'                        # - aggregate function for testing within the clade ('min', 'max', 'mean', 'median')
export hpcremoteptgroot='none'          # if not empty nor 'none', will use this server address to send data and scripts to run heavy computions there 

# other parameters have default values defined in the generic source file environ_pantagruel_defaults.sh
source ${ptgscripts}/pipeline/environ_pantagruel_defaults.sh
# these defalts can be overriden by uncommenting the relevant line below and editing the variable's value
# default values are:
# Prokka annotation parameters (only relevant if custom genome assemblies are provided):
#~ export assembler="somesoftware"
#~ export seqcentre="somewhere"
#~ export refgenus="Reference"
# species tree inference parameters
#~ export ncorebootstrap=200
# gene tree inference parameters
#~ export mainresulttag='rootedTree'
# gene trees collapsing DEFAULT values (used when -C option is NOT present in init call)
#~ export cladesuppdef=70
#~ export subcladesuppdef=35
#~ export criteriondef='bs'
#~ export withinfundef='median'
# gene tree/species tree reconciliation inference parameters
#~ export ALEalgo='ALEml'
#~ export recsamplesize=1000
# gene tree/species tree reconciliation parsing parameters for co-evolution analysis
#~ export evtypeparse='ST'
#~ export minevfreqparse=0.1
#~ export minevfreqmatch=0.5
#~ export minjoinevfreqmatch=1.0
#~ export maxreftreeheight=0.25

# secondary vars are defined based on the above
source ${ptgscripts}/pipeline/environ_pantagruel_secondaryvars.sh
# load shared functions
source ${ptgscripts}/pipeline/pantagruel_pipeline_functions.sh

flass commented 4 years ago

Hi Laetitia,

sorry for this long ping-pong of error tracking. I think this 'wrong version' error is due to the fact that you used the system-wide installed version by calling just pantagruel instead of /scratch/clamchatka/Panta/pantagruel/pantagruel (if I am correct, that is where you cloned the latest version of the repository). Please try again with the full path to the executable.

megaptera-helvetiae commented 4 years ago

It is fixed and running.

Thank you so much for your patience, Florent. I hope others can benefit from our 20-ball-long ping-pong rally. https://www.youtube.com/watch?v=jlBOsGASWPI