Task 5 failed; Tasks 0,1,2,3 were completed but their output files do not look complete.

Hi Florent,

I am trying to run your pipeline with 9 genomes. I didn't prepare annotation files externally, so as to rely on Pantagruel's built-in annotation. Here is what my input directory and files looked like.

$ ls /mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_genomes/
contigs strain_infos_Buniformis_pantagruel.txt
$ ls /mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_genomes/contigs/
Bu_1.fasta Bu_2.fasta Bu_3.fasta Bu_4.fasta Bu_5.fasta Bu_6.fasta Bu_7.fasta Bu_8.fasta Bu_9.fasta

$ head /mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_genomes/strain_infos_Buniformis_pantagruel.txt
assembly_id genus species strain taxid locus_tag_prefix
Bu_1 Bacteroides uniformis Bu_1 820 BUBU1
Bu_2 Bacteroides uniformis Bu_2 820 BUBU2
Bu_3 Bacteroides uniformis Bu_3 820 BUBU3
Bu_4 Bacteroides uniformis Bu_4 820 BUBU4
Bu_5 Bacteroides uniformis Bu_5 820 BUBU5
Bu_6 Bacteroides uniformis Bu_6 820 BUBU6
Bu_7 Bacteroides uniformis Bu_7 820 BUBU7
Bu_8 Bacteroides uniformis Bu_8 820 BUBU8
Bu_9 Bacteroides uniformis Bu_9 820 BUBU9

With the above directory as the -a input, I ran the "init" task.

$ pantagruel -d Buniformis_pantagruel -r /mnt/disks/permanentDisk/genomics/pantagruel_testruns -a /mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_genomes -I kihyunee@gmail.com init
This is Pantagruel pipeline version 9a5ebb0a88d882def46be2509964974d9a6ece66 using source code from repository '/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel'
set custom (raw) genome assembly source folder to '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_genomes'
set identity to 'kihyunee@gmail.com'
will run tasks: init
...
created init file at '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/environ_pantagruel_Buniformis_pantagruel.sh'
Custom strain info file detected; format validated
Pantagrel pipeline task init: complete.

init was completed and new files appeared in the user genome input directory.

$ du -sh /mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_genomes/*
592M /mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_genomes/annotation
42M /mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_genomes/contigs
64M /mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_genomes/genbank-format_assemblies
4.0K /mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_genomes/strain_infos_Buniformis_pantagruel.txt

Task 00 (fetch) initially returned an error message:

$ pantagruel -i Buniformis_pantagruel/environ_pantagruel_Buniformis_pantagruel.sh fetch
ERROR: the current version of pantagruel (commit 9a5ebb0) is different from the one used to generate the config file '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/environ_pantagruel_Buniformis_pantagruel.sh' (commit ). Please regenerate the config file with pantagruel init to ensure compatibility; for the same parameters to be set, just run the same command with same options as previously.
ERROR: Pantagrel pipeline task 0: failed.

I avoided that error by manually entering the version value in the ptgversinit field of the environment file (environ_pantagruel_Buniformis_pantagruel.sh). Before the change, the field was empty:

$ grep "version" environ_pantagruel_Buniformis_pantagruel.sh
built with Pantagruel version '9a5ebb0a88d882def46be2509964974d9a6ece66'; source code available at 'https://github.com/flass/pantagruel'
export ptgversinit='' # current version of Pantagruel software
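The empty `ptgversinit` value above can also be checked programmatically before launching a task. The following is a minimal sketch, assuming only that the environment file contains a line of the form `export ptgversinit='...'` as shown in the grep output; the function name `read_ptgversinit` is made up for this illustration and is not part of Pantagruel.

```python
import re

# Matches a shell line such as: export ptgversinit='9a5ebb0...'  # comment
# The quote character is optional and must be the same on both sides.
_PTGVERSINIT = re.compile(
    r"^export\s+ptgversinit=(['\"]?)(.*?)\1(?:\s|$)", re.MULTILINE
)


def read_ptgversinit(env_text):
    """Return the value assigned to ptgversinit, or None if the line is absent.

    An empty string means the field exists but was left blank, which is the
    situation that triggered the version mismatch error in this report.
    """
    match = _PTGVERSINIT.search(env_text)
    return match.group(2) if match else None
```

For example, passing the text of environ_pantagruel_Buniformis_pantagruel.sh to `read_ptgversinit` would return an empty string for the file shown above.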

So task 00 (fetch) was run again,

$ pantagruel -i Buniformis_pantagruel/environ_pantagruel_Buniformis_pantagruel.sh fetch
This is Pantagruel pipeline version 9a5ebb0a88d882def46be2509964974d9a6ece66 using source code from repository '/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel'
will run tasks: 0
...
Create new task folder '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/00.input_data'
did not find the relevant taxonomy flat files in '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/NCBI/Taxonomy_2019-07-25/'; download the from NCBI Taxonomy FTP
cd ok, cwd=/pub/taxonomy
57388654 bytes transferred in 2 seconds (34.31 MiB/s)
Total 6 files transferred
taxcat.tar.gz: OK
taxdump.tar.gz: OK
/mnt/disks/permanentDisk/genomics/pantagruel_testruns
extract assembly data from folder '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_genomes'
ls: cannot access '//_genomic.gbff.gz': No such file or directory
parallel: Error: Cannot open input file `/assemblies_genomic_gbffgz_list': No such file or directory.
removed '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/tmp/Reference.faa'
removed '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/tmp/Reference_representative.faa.clstr'

Building a new DB, current time: 07/25/2019 07:43:08
New DB name: /home/linuxbrew/.linuxbrew/Cellar/prokka/1.13/db/genus/Reference
New DB title: Reference
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
BLAST options error: File Reference is empty
/mnt/disks/permanentDisk/genomics/pantagruel_testruns
will annotate contigs in '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_genomes/contigs/Bu_1.fasta'
[2019-07-25 07:43:08]

assembly: Bu_1; contig files from: /mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_genomes/contigs/Bu_1.fasta
running Prokka... done.
[2019-07-25 07:47:01]
fix annotation to integrate region information into GFF files
/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel/scripts/pipeline/pantagruel_pipeline_00_fetch_data.sh: line 154: [: too many arguments
gnl|somewhere|BUBU1_1 somesoftware region 1 513255 . + . ID=id0;Dbxref=taxon:820;Is_circular=false;gbkey=Src;genome=contig;mol_type=genomic DNA;strain=Bu_1
... LINES PASSED FOR OTHER CONTIGS ...
gnl|somewhere|BUBU1_33 somesoftware region 1 2374 . + . ID=id32;Dbxref=taxon:820;Is_circular=false;gbkey=Src;genome=contig;mol_type=genomic DNA;strain=Bu_1
fix annotation to integrate taxid information into GBK files
done.
... MANY LINES PASSED FOR OTHER SAMPLES ...

will create GenBank-like assembly folders for user-provided genomes
grep: /_assembly_stats.txt: No such file or directory
grep: /_assembly_stats.txt: No such file or directory
grep: /_assembly_stats.txt: No such file or directory
grep: /_assembly_stats.txt: No such file or directory
grep: /_assembly_stats.txt: No such file or directory
grep: /_assembly_stats.txt: No such file or directory
parsing genome annotation from genBank flat files... Bu_1.1 Bu_2.1 Bu_3.1 Bu_4.1 Bu_5.1 Bu_6.1 Bu_7.1 Bu_8.1 Bu_9.1 ...done
Bu_1.1 Bu_1.1; Bacteroides uniformis; "Bu_1"; ; ;
...
Bu_9.1 Bu_9.1; Bacteroides uniformis; "Bu_9"; ; ;
Pantagrel pipeline task 0: complete.

Task 00 completed, although the output files seem too small:

$ du -sh Buniformis_pantagruel/00.input_data/*
40K Buniformis_pantagruel/00.input_data/assemblies
0 Buniformis_pantagruel/00.input_data/assemblies_genomic_gbffgz_list
4.0K Buniformis_pantagruel/00.input_data/assembly_stats
4.0K Buniformis_pantagruel/00.input_data/extracted_cds_from_genomic_fasta
3.7M Buniformis_pantagruel/00.input_data/genome_infos

Task 01 (homologous) was then run,

$ pantagruel -i Buniformis_pantagruel/environ_pantagruel_Buniformis_pantagruel.sh homologous
MMseqs Version: 9-d36de
...
Total time: 0h 0m 4s 407ms
Size of the sequence database: 35701
Size of the alignment database: 35701
Number of clusters: 17006
Writing results 0h 0m 0s 6ms
Time for merging files: 0h 0m 0s 3ms
Time for processing: 0h 0m 4s 432ms
...
nfin = '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/01.seqdb/all_proteomes.clusthashdb_minseqid100_clusters' ; famprefix = 'NRPROT' ; dirout = '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/01.seqdb/all_proteomes.clusthashdb_minseqid100_families' ; padlen = 6 ; writeseq = False ; discardsingle = False
Traceback (most recent call last):
  File "/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel/scripts/split_mmseqs_clustdb_fasta.py", line 58, in <module>
    with open(nfin, 'r') as fin:
IOError: [Errno 2] No such file or directory: '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/01.seqdb/all_proteomes.clusthashdb_minseqid100_clusters'
listed 0 redundant sequences in dataset
generated hash index
parsing redundant sequence fasta
filtered 35701 non-redundant sequences
parse NCBI Taxonomy merged taxon ids from '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/NCBI/Taxonomy_2019-07-25/merged.dmp'
parse NCBI Taxonomy taxon names from '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/NCBI/Taxonomy_2019-07-25/names.dmp'
parse redundant protein names from '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/01.seqdb/all_proteomes.identicals.list'
parse assembly '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/00.input_data/assemblies/Bu_1.1_Bacteroides_uniformis_Bu_1'
...
parse assembly '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/00.input_data/assemblies/Bu_9.1_Bacteroides_uniformis_Bu_9'

createseqfiledb /mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/01.seqdb/all_proteomes.nr.mmseqsdb /mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default /mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters

MMseqs Version: 9-d36de
...
Time for processing: 0h 0m 0s 43ms
nfin = '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters' ; famprefix = 'PANTAGP' ; dirout = '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters_fasta' ; padlen = 6 ; writeseq = True ; discardsingle = False
Traceback (most recent call last):
  File "/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel/scripts/split_mmseqs_clustdb_fasta.py", line 58, in <module>
    with open(nfin, 'r') as fin:
IOError: [Errno 2] No such file or directory: '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters'
[2019-07-25 09:08:52] -- 0 non-redundant proteins
[2019-07-25 09:08:52] -- classified into 1 clusters
-- including artificial cluster PANTAGP000000 gathering 0 ORFan nr proteins
-- (NB: some are not true ORFans as can be be present as identical sequences in several genomes)
Pantagrel pipeline task 1: complete.

Task 01 completed but produced some error messages. I strongly suspect that this step did not run correctly, because only one cluster was generated and its FASTA file is empty.

$ ls Buniformis_pantagruel/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters_fasta/
PANTAGP000000.fasta

$ du Buniformis_pantagruel/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters_fasta/*
0 Buniformis_pantagruel/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters_fasta/PANTAGP000000.fasta

Subsequently, task 02 (align) and 03 (sqldb) reported error messages like these:

$ pantagruel -i Buniformis_pantagruel/environ_pantagruel_Buniformis_pantagruel.sh align
Traceback (most recent call last):
  File "/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel/scripts/extract_full_prot_and_cds_family_alignments.py", line 469, in <module>
    main(dirnrprotaln, nfsingletonfasta, nfprotinfotab, nfreplinfotab, dirassemb, dirout, fam_prefix, dirlogs, nfidentseq, nbcores, verbose)
  File "/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel/scripts/extract_full_prot_and_cds_family_alignments.py", line 217, in main
    lastfam = int(allprotfams[-1].split(prefixprotfam)[-1])
IndexError: list index out of range

cat: /mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/02.gene_alignments/full_families_genome_counts-noORFans.mat: No such file or directory
tail: cannot open '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/02.gene_alignments/PANTAGC000000_genome_counts-ORFans.mat' for reading: No such file or directory

Pantagrel pipeline task 2: complete.

$ pantagruel -i Buniformis_pantagruel/environ_pantagruel_Buniformis_pantagruel.sh sqldb
Traceback (most recent call last):
  File "/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel/scripts/pantagruel_sqlitedb_genome_populate.py", line 329, in <module>
    main(dbname, protorfanclust, cdsorfanclust, nfspeclist, nfusergenomeinfo, usergenomefinalassdir)
  File "/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel/scripts/pantagruel_sqlitedb_genome_populate.py", line 119, in main
    assert nprotrecords>0
AssertionError
/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/03.database
Error: no such column: code
Error: no such column: code
Error: no such column: cds_code
Pantagrel pipeline task 3: complete.

Subsequently, task 05 (core) failed with the following messages.

Loading matrix of gene families counts in genomes...
Error in file(file, "rt") : cannot open the connection
Calls: data.matrix -> is.data.frame -> read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/02.gene_alignments/full_families_genome_counts-noORFans.mat': No such file or directory
Execution halted
cat: /mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/05.core_genome/pseudo-coregenome_sets/strict-core-unicopy_families.tab: No such file or directory
Traceback (most recent call last):
  File "/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel/scripts/concat.py", line 93, in <module>
    main()
  File "/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel/scripts/concat.py", line 56, in main
    files=utilitaires.fileToLines(aln_files)
  File "/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel/scripts/utilitaires.py", line 24, in fileToLines
    fich=open(filename,'r')
IOError: [Errno 2] No such file or directory: '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/05.core_genome/pseudo-coregenome_sets/strict-core-unicopy_cds_aln_list'
ERROR: failed to produce concatenated (pseudo)core-genome alignment
ERROR: Pantagrel pipeline task 5: failed.

I can't figure out which step went wrong or how to avoid this error, though I suspect that task 00 or 01 was already unsuccessful. Would you mind looking at those error messages?
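Since tasks 0 to 3 each printed "complete" despite tracebacks, one way to find the earliest failure is to save the output of every task and scan it for error-like lines. The following is a rough sketch, not part of Pantagruel: the marker list is taken only from the messages quoted in this report, and `find_error_lines` is a made-up helper name.

```python
import re

# Error-like markers seen in the logs quoted in this report (an assumption:
# other runs may print different messages, so extend the list as needed).
ERROR_MARKERS = re.compile(
    r"Traceback \(most recent call last\)"
    r"|No such file or directory"
    r"|^ERROR:"
    r"|^Error"
    r"|AssertionError"
    r"|Execution halted"
)


def find_error_lines(log_lines):
    """Return (line_number, text) pairs for log lines that look like errors."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if ERROR_MARKERS.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Assuming each task's output was captured to a file (for example with `2>&1 | tee task00.log`, where the file name is hypothetical), calling `find_error_lines(open("task00.log"))` lists the suspicious lines in order, so the first hit in the earliest task is a natural place to start.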

Thanks, Kihyun

Hi Kihyun,

Thank you very much for the thorough bug report. As you suggested, the error likely lies in the first steps, which will necessarily affect the results of the downstream tasks. The protein clustering (task 01|homologous) clearly seems to fail, but I am not sure whether the problem arises within this task or during the previous one.

Scanning the first error reports, I could diagnose and fix some problems:

0) (on your side!) it seems that your custom strain information file strain_infos_Buniformis_pantagruel.txt is space-delimited. It has to be delimited with TABS! Try creating your database again with that changed, as it could have far-reaching consequences (many scripts read this file, and they expect tab delimiters).

1) the error related to the package version (involving ptgversinit) was a small thing to fix (commit 7f7485a), but anyway you worked around it so it is not the real problem.

2) the second set of errors seems to stem from the initial failure to locate the genome assembly file in the (normally facultative) process of building the Prokka reference database:

ls: cannot access '//_genomic.gbff.gz': No such file or directory
parallel: Error: Cannot open input file `/assemblies_genomic_gbffgz_list': No such file or directory.

It looks like this fails because the environment variable ${refass} was not defined (it is set through the facultative option --refseq_ass4annot), but as you did not specify this option, the step should not have run at all. I think I fixed the reference database issue in commits 2ebb161 and ac74e9f. In any case, this should not cause the subsequent errors, and indeed Prokka did run.

3) the error grep: /_assembly_stats.txt: No such file or directory is linked to the absence of the assembly statistics files, but is not supposed to have consequences. I fixed this unsightly report of absence in commit 6ab91a3.

At this point everything should be fine, but I need to know whether the task actually completed the way it should have. For that, I would need you to run the command:

ls -lA Buniformis_pantagruel/00.input_data/assemblies/*

and please report the output. Also can you please attach your environment file environ_pantagruel_Buniformis_pantagruel.sh.

If the genome assembly files are not correct, that would explain the subsequent errors in the protein clustering. If that is the case, you can try repeating the run with the latest version (which includes the fixes in the 00 task) and see whether it fixes the problem downstream.
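Beyond listing the files, a quick sanity check is to count the protein records in each assembly's protein FASTA file. This is a minimal sketch assuming the `*_protein.faa.gz` files are ordinary gzip-compressed FASTA; the helper names are made up for this example and are not Pantagruel functions.

```python
import gzip


def count_fasta_records(lines):
    """Count FASTA records, i.e. lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def count_records_in_gz(path):
    """Count FASTA records in a gzip-compressed FASTA file."""
    # "rt" opens the compressed file in text mode so lines are str, not bytes.
    with gzip.open(path, "rt") as handle:
        return count_fasta_records(handle)
```

Applied to every file matching 00.input_data/assemblies/*/*_protein.faa.gz, a count of zero for any genome would point to a problem in task 00 rather than in the clustering.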

best wishes, Florent

Hi Florent, First of all, thank you so much!

'strain_infos_Buniformis_pantagruel.txt' looked space-delimited in my earlier post, but it is actually tab-delimited. The tabs turned into spaces when I drag-copied those lines from the shell screen and pasted them into the text editor. Just in case, I attach the file here. strain_infos_Buniformis_pantagruel.txt
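Because copy-paste can silently convert tabs to spaces, the delimiter is easier to verify on the file itself than on pasted text. The sketch below assumes that every non-empty row should split into the six tab-separated fields shown in the header (assembly_id, genus, species, strain, taxid, locus_tag_prefix); the exact column count Pantagruel requires is an assumption here, and `rows_not_tab_delimited` is a made-up name.

```python
EXPECTED_FIELDS = 6  # assumption: the six columns shown in the file header


def rows_not_tab_delimited(lines, expected_fields=EXPECTED_FIELDS):
    """Return (line_number, text) for rows that lack the expected tab-separated fields."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        # Skip blank lines; flag rows whose tab-split field count is off.
        if text and len(text.split("\t")) != expected_fields:
            bad.append((number, text))
    return bad
```

Calling `rows_not_tab_delimited(open("strain_infos_Buniformis_pantagruel.txt"))` should return an empty list for a correctly tab-delimited file, while a space-delimited row would be reported with its line number.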

1-3. It sounds like you have made some changes to the code through these "commits".
So now I should re-install the most recent Pantagruel (because you have updated its code) and see if it behaves differently -- is this correct? That may be a strange question, but I am not familiar with how "commits" work. I also want to apologize in advance, because what I write below might rest on a misunderstanding.

You asked me to share a few things ('ls -lA ', 'env....sh'), and I am not sure whether you meant the 'ls -lA ...' output and the 'env...sh' file in their current state (from the previous failed run), or as they will be generated by the newly updated version. For now I attach them as they are in the current state (from the previous failed run).

(A) ls -lA Buniformis_pantagruel/00.input_data/assemblies/*

reply_1.ls_lA.result.txt

(B) environ_pantagruel_Buniformis_pantagruel.sh (just added .txt to enable attachment) environ_pantagruel_Buniformis_pantagruel.sh.txt

(C) Just in case, as you mentioned a possibility that genome assembly files could be wrong, I also attach here a link to one of the nine genome contig files that I used as input. All other contig fasta files have the same structure, except for the numbering part in Bu_1 Bu_2 ... https://www.dropbox.com/s/al3r9czyw28fppo/Bu_1.fasta?dl=0

Thank you again for taking care of these lengthy questions. If any other file or information is worth a look, please tell me. I'll try to repeat the process after installing the newest version and see what happens.

Best, Kihyun

Hi Kihyun,

No problem at all; answering these questions helps me maintain my program! Also, thank you for the files and clarifications, which will help me see where the issue(s) lie.

Yes, I updated the code, so I suggest you run the following commands:

# go into the code repository 
cd pantagruel_pipeline/pantagruel
# then update the code
git pull
# then refresh the pantagruel database configuration file
pantagruel -i pantagruel_testruns/environ_pantagruel_Buniformis_pantagruel.sh --refresh init

cf. the guidance described here: https://github.com/flass/pantagruel#usage-example

About the output of the ls -lA command you attached in (A): I am a bit surprised that it does not list the content of these folders. Is there anything in them (in the actual folders behind the symbolic links)? Please send me a list of the files you have (at least in one of the assembly folders) under 00.input_data/assemblies/*/. If possible, I'd like to see it as it was at the time of your first bug report; then you could re-run the pipeline tasks init, 00 and 01 (either erase the whole database folder, or give it another name, to start afresh).

I'll run tests on my side (I was actually doing so with the test dataset provided, and some different errors showed up as well...)

Best wishes Florent

Hi Florent, Thank you for the code update instructions! I'll try to run the init, 00 and 01 tasks with the updated code and will report on that later, but regarding your last questions I can immediately tell you which files are there, unchanged since my bug report.

What is in the actual folders behind the symbolic links:

$ ls Buniformis_genomes/genbank-format_assemblies/Bu_*/

Buniformis_genomes/genbank-format_assemblies/Bu_1.1_Bacteroides_uniformis_Bu_1/:
Bu_1.1_Bacteroides_uniformis_Bu_1_cds_from_genomic.fna.gz Bu_1.1_Bacteroides_uniformis_Bu_1_genomic.gff.gz Bu_1.1_Bacteroides_uniformis_Bu_1_genomic.gbff.gz Bu_1.1_Bacteroides_uniformis_Bu_1_protein.faa.gz

Buniformis_genomes/genbank-format_assemblies/Bu_2.1_Bacteroides_uniformis_Bu_2/:
Bu_2.1_Bacteroides_uniformis_Bu_2_cds_from_genomic.fna.gz Bu_2.1_Bacteroides_uniformis_Bu_2_genomic.gff.gz Bu_2.1_Bacteroides_uniformis_Bu_2_genomic.gbff.gz Bu_2.1_Bacteroides_uniformis_Bu_2_protein.faa.gz

Buniformis_genomes/genbank-format_assemblies/Bu_3.1_Bacteroides_uniformis_Bu_3/:
Bu_3.1_Bacteroides_uniformis_Bu_3_cds_from_genomic.fna.gz Bu_3.1_Bacteroides_uniformis_Bu_3_genomic.gff.gz Bu_3.1_Bacteroides_uniformis_Bu_3_genomic.gbff.gz Bu_3.1_Bacteroides_uniformis_Bu_3_protein.faa.gz

Buniformis_genomes/genbank-format_assemblies/Bu_4.1_Bacteroides_uniformis_Bu_4/:
Bu_4.1_Bacteroides_uniformis_Bu_4_cds_from_genomic.fna.gz Bu_4.1_Bacteroides_uniformis_Bu_4_genomic.gff.gz Bu_4.1_Bacteroides_uniformis_Bu_4_genomic.gbff.gz Bu_4.1_Bacteroides_uniformis_Bu_4_protein.faa.gz

Buniformis_genomes/genbank-format_assemblies/Bu_5.1_Bacteroides_uniformis_Bu_5/:
Bu_5.1_Bacteroides_uniformis_Bu_5_cds_from_genomic.fna.gz Bu_5.1_Bacteroides_uniformis_Bu_5_genomic.gff.gz Bu_5.1_Bacteroides_uniformis_Bu_5_genomic.gbff.gz Bu_5.1_Bacteroides_uniformis_Bu_5_protein.faa.gz

Buniformis_genomes/genbank-format_assemblies/Bu_6.1_Bacteroides_uniformis_Bu_6/:
Bu_6.1_Bacteroides_uniformis_Bu_6_cds_from_genomic.fna.gz Bu_6.1_Bacteroides_uniformis_Bu_6_genomic.gff.gz Bu_6.1_Bacteroides_uniformis_Bu_6_genomic.gbff.gz Bu_6.1_Bacteroides_uniformis_Bu_6_protein.faa.gz

Buniformis_genomes/genbank-format_assemblies/Bu_7.1_Bacteroides_uniformis_Bu_7/:
Bu_7.1_Bacteroides_uniformis_Bu_7_cds_from_genomic.fna.gz Bu_7.1_Bacteroides_uniformis_Bu_7_genomic.gff.gz Bu_7.1_Bacteroides_uniformis_Bu_7_genomic.gbff.gz Bu_7.1_Bacteroides_uniformis_Bu_7_protein.faa.gz

What is under 00.input_data/assemblies/*/? --> Essentially the same files as above (the actual folders behind the links). For example:

$ ls Buniformis_pantagruel/00.input_data/assemblies/*/
Buniformis_pantagruel/00.input_data/assemblies/Bu_1.1_Bacteroides_uniformis_Bu_1/:
Bu_1.1_Bacteroides_uniformis_Bu_1_cds_from_genomic.fna.gz Bu_1.1_Bacteroides_uniformis_Bu_1_genomic.gff.gz Bu_1.1_Bacteroides_uniformis_Bu_1_genomic.gbff.gz Bu_1.1_Bacteroides_uniformis_Bu_1_protein.faa.gz

Buniformis_pantagruel/00.input_data/assemblies/Bu_2.1_Bacteroides_uniformis_Bu_2/:
Bu_2.1_Bacteroides_uniformis_Bu_2_cds_from_genomic.fna.gz Bu_2.1_Bacteroides_uniformis_Bu_2_genomic.gff.gz Bu_2.1_Bacteroides_uniformis_Bu_2_genomic.gbff.gz Bu_2.1_Bacteroides_uniformis_Bu_2_protein.faa.gz

Buniformis_pantagruel/00.input_data/assemblies/Bu_3.1_Bacteroides_uniformis_Bu_3/:
Bu_3.1_Bacteroides_uniformis_Bu_3_cds_from_genomic.fna.gz Bu_3.1_Bacteroides_uniformis_Bu_3_genomic.gff.gz Bu_3.1_Bacteroides_uniformis_Bu_3_genomic.gbff.gz Bu_3.1_Bacteroides_uniformis_Bu_3_protein.faa.gz

Buniformis_pantagruel/00.input_data/assemblies/Bu_4.1_Bacteroides_uniformis_Bu_4/:
Bu_4.1_Bacteroides_uniformis_Bu_4_cds_from_genomic.fna.gz Bu_4.1_Bacteroides_uniformis_Bu_4_genomic.gff.gz Bu_4.1_Bacteroides_uniformis_Bu_4_genomic.gbff.gz Bu_4.1_Bacteroides_uniformis_Bu_4_protein.faa.gz

Buniformis_pantagruel/00.input_data/assemblies/Bu_5.1_Bacteroides_uniformis_Bu_5/:
Bu_5.1_Bacteroides_uniformis_Bu_5_cds_from_genomic.fna.gz Bu_5.1_Bacteroides_uniformis_Bu_5_genomic.gff.gz Bu_5.1_Bacteroides_uniformis_Bu_5_genomic.gbff.gz Bu_5.1_Bacteroides_uniformis_Bu_5_protein.faa.gz

Buniformis_pantagruel/00.input_data/assemblies/Bu_6.1_Bacteroides_uniformis_Bu_6/:
Bu_6.1_Bacteroides_uniformis_Bu_6_cds_from_genomic.fna.gz Bu_6.1_Bacteroides_uniformis_Bu_6_genomic.gff.gz Bu_6.1_Bacteroides_uniformis_Bu_6_genomic.gbff.gz Bu_6.1_Bacteroides_uniformis_Bu_6_protein.faa.gz

Buniformis_pantagruel/00.input_data/assemblies/Bu_7.1_Bacteroides_uniformis_Bu_7/:
Bu_7.1_Bacteroides_uniformis_Bu_7_cds_from_genomic.fna.gz Bu_7.1_Bacteroides_uniformis_Bu_7_genomic.gff.gz Bu_7.1_Bacteroides_uniformis_Bu_7_genomic.gbff.gz Bu_7.1_Bacteroides_uniformis_Bu_7_protein.faa.gz

Buniformis_pantagruel/00.input_data/assemblies/Bu_8.1_Bacteroides_uniformis_Bu_8/:
Bu_8.1_Bacteroides_uniformis_Bu_8_cds_from_genomic.fna.gz Bu_8.1_Bacteroides_uniformis_Bu_8_genomic.gff.gz Bu_8.1_Bacteroides_uniformis_Bu_8_genomic.gbff.gz Bu_8.1_Bacteroides_uniformis_Bu_8_protein.faa.gz

Buniformis_pantagruel/00.input_data/assemblies/Bu_9.1_Bacteroides_uniformis_Bu_9/:
Bu_9.1_Bacteroides_uniformis_Bu_9_cds_from_genomic.fna.gz Bu_9.1_Bacteroides_uniformis_Bu_9_genomic.gff.gz Bu_9.1_Bacteroides_uniformis_Bu_9_genomic.gbff.gz Bu_9.1_Bacteroides_uniformis_Bu_9_protein.faa.gz

Are these files (faa, fna, gff, gbff, ...) insanely small? --> No, their file sizes seem to be fine. They are not empty...

$ du -sh Buniformis_pantagruel/00.input_data/assemblies/*/*
1.4M Buniformis_pantagruel/00.input_data/assemblies/Bu_1.1_Bacteroides_uniformis_Bu_1/Bu_1.1_Bacteroides_uniformis_Bu_1_cds_from_genomic.fna.gz
3.1M Buniformis_pantagruel/00.input_data/assemblies/Bu_1.1_Bacteroides_uniformis_Bu_1/Bu_1.1_Bacteroides_uniformis_Bu_1_genomic.gbff.gz
1.7M Buniformis_pantagruel/00.input_data/assemblies/Bu_1.1_Bacteroides_uniformis_Bu_1/Bu_1.1_Bacteroides_uniformis_Bu_1_genomic.gff.gz
900K Buniformis_pantagruel/00.input_data/assemblies/Bu_1.1_Bacteroides_uniformis_Bu_1/Bu_1.1_Bacteroides_uniformis_Bu_1_protein.faa.gz
1.4M Buniformis_pantagruel/00.input_data/assemblies/Bu_2.1_Bacteroides_uniformis_Bu_2/Bu_2.1_Bacteroides_uniformis_Bu_2_cds_from_genomic.fna.gz
3.2M Buniformis_pantagruel/00.input_data/assemblies/Bu_2.1_Bacteroides_uniformis_Bu_2/Bu_2.1_Bacteroides_uniformis_Bu_2_genomic.gbff.gz
1.7M Buniformis_pantagruel/00.input_data/assemblies/Bu_2.1_Bacteroides_uniformis_Bu_2/Bu_2.1_Bacteroides_uniformis_Bu_2_genomic.gff.gz
932K Buniformis_pantagruel/00.input_data/assemblies/Bu_2.1_Bacteroides_uniformis_Bu_2/Bu_2.1_Bacteroides_uniformis_Bu_2_protein.faa.gz
1.5M Buniformis_pantagruel/00.input_data/assemblies/Bu_3.1_Bacteroides_uniformis_Bu_3/Bu_3.1_Bacteroides_uniformis_Bu_3_cds_from_genomic.fna.gz
3.2M Buniformis_pantagruel/00.input_data/assemblies/Bu_3.1_Bacteroides_uniformis_Bu_3/Bu_3.1_Bacteroides_uniformis_Bu_3_genomic.gbff.gz
1.7M Buniformis_pantagruel/00.input_data/assemblies/Bu_3.1_Bacteroides_uniformis_Bu_3/Bu_3.1_Bacteroides_uniformis_Bu_3_genomic.gff.gz
940K Buniformis_pantagruel/00.input_data/assemblies/Bu_3.1_Bacteroides_uniformis_Bu_3/Bu_3.1_Bacteroides_uniformis_Bu_3_protein.faa.gz
1.4M Buniformis_pantagruel/00.input_data/assemblies/Bu_4.1_Bacteroides_uniformis_Bu_4/Bu_4.1_Bacteroides_uniformis_Bu_4_cds_from_genomic.fna.gz
3.1M Buniformis_pantagruel/00.input_data/assemblies/Bu_4.1_Bacteroides_uniformis_Bu_4/Bu_4.1_Bacteroides_uniformis_Bu_4_genomic.gbff.gz
1.7M Buniformis_pantagruel/00.input_data/assemblies/Bu_4.1_Bacteroides_uniformis_Bu_4/Bu_4.1_Bacteroides_uniformis_Bu_4_genomic.gff.gz
924K Buniformis_pantagruel/00.input_data/assemblies/Bu_4.1_Bacteroides_uniformis_Bu_4/Bu_4.1_Bacteroides_uniformis_Bu_4_protein.faa.gz
1.4M Buniformis_pantagruel/00.input_data/assemblies/Bu_5.1_Bacteroides_uniformis_Bu_5/Bu_5.1_Bacteroides_uniformis_Bu_5_cds_from_genomic.fna.gz
3.4M Buniformis_pantagruel/00.input_data/assemblies/Bu_5.1_Bacteroides_uniformis_Bu_5/Bu_5.1_Bacteroides_uniformis_Bu_5_genomic.gbff.gz
1.8M Buniformis_pantagruel/00.input_data/assemblies/Bu_5.1_Bacteroides_uniformis_Bu_5/Bu_5.1_Bacteroides_uniformis_Bu_5_genomic.gff.gz
932K Buniformis_pantagruel/00.input_data/assemblies/Bu_5.1_Bacteroides_uniformis_Bu_5/Bu_5.1_Bacteroides_uniformis_Bu_5_protein.faa.gz
1.5M Buniformis_pantagruel/00.input_data/assemblies/Bu_6.1_Bacteroides_uniformis_Bu_6/Bu_6.1_Bacteroides_uniformis_Bu_6_cds_from_genomic.fna.gz
3.3M Buniformis_pantagruel/00.input_data/assemblies/Bu_6.1_Bacteroides_uniformis_Bu_6/Bu_6.1_Bacteroides_uniformis_Bu_6_genomic.gbff.gz
1.8M Buniformis_pantagruel/00.input_data/assemblies/Bu_6.1_Bacteroides_uniformis_Bu_6/Bu_6.1_Bacteroides_uniformis_Bu_6_genomic.gff.gz
960K Buniformis_pantagruel/00.input_data/assemblies/Bu_6.1_Bacteroides_uniformis_Bu_6/Bu_6.1_Bacteroides_uniformis_Bu_6_protein.faa.gz
1.2M Buniformis_pantagruel/00.input_data/assemblies/Bu_7.1_Bacteroides_uniformis_Bu_7/Bu_7.1_Bacteroides_uniformis_Bu_7_cds_from_genomic.fna.gz
3.1M Buniformis_pantagruel/00.input_data/assemblies/Bu_7.1_Bacteroides_uniformis_Bu_7/Bu_7.1_Bacteroides_uniformis_Bu_7_genomic.gbff.gz
1.6M Buniformis_pantagruel/00.input_data/assemblies/Bu_7.1_Bacteroides_uniformis_Bu_7/Bu_7.1_Bacteroides_uniformis_Bu_7_genomic.gff.gz
800K Buniformis_pantagruel/00.input_data/assemblies/Bu_7.1_Bacteroides_uniformis_Bu_7/Bu_7.1_Bacteroides_uniformis_Bu_7_protein.faa.gz
1.2M Buniformis_pantagruel/00.input_data/assemblies/Bu_8.1_Bacteroides_uniformis_Bu_8/Bu_8.1_Bacteroides_uniformis_Bu_8_cds_from_genomic.fna.gz
3.1M Buniformis_pantagruel/00.input_data/assemblies/Bu_8.1_Bacteroides_uniformis_Bu_8/Bu_8.1_Bacteroides_uniformis_Bu_8_genomic.gbff.gz
1.6M Buniformis_pantagruel/00.input_data/assemblies/Bu_8.1_Bacteroides_uniformis_Bu_8/Bu_8.1_Bacteroides_uniformis_Bu_8_genomic.gff.gz
800K Buniformis_pantagruel/00.input_data/assemblies/Bu_8.1_Bacteroides_uniformis_Bu_8/Bu_8.1_Bacteroides_uniformis_Bu_8_protein.faa.gz
1.4M Buniformis_pantagruel/00.input_data/assemblies/Bu_9.1_Bacteroides_uniformis_Bu_9/Bu_9.1_Bacteroides_uniformis_Bu_9_cds_from_genomic.fna.gz
3.2M Buniformis_pantagruel/00.input_data/assemblies/Bu_9.1_Bacteroides_uniformis_Bu_9/Bu_9.1_Bacteroides_uniformis_Bu_9_genomic.gbff.gz
1.7M Buniformis_pantagruel/00.input_data/assemblies/Bu_9.1_Bacteroides_uniformis_Bu_9/Bu_9.1_Bacteroides_uniformis_Bu_9_genomic.gff.gz
936K Buniformis_pantagruel/00.input_data/assemblies/Bu_9.1_Bacteroides_uniformis_Bu_9/Bu_9.1_Bacteroides_uniformis_Bu_9_protein.faa.gz

Oh, it is nice to know that you are also doing test runs. I hope that things will eventually get resolved!

Thanks, Kihyun

All right, so there definitely does not seem to be a (major) problem in task 00; it must be in task 01. I'll investigate and keep you updated. Cheers, Florent

Hi Kihyun, I'm glad to say that the program runs smoothly up to and including task 03 on the test dataset. I cannot replicate the problem with your data, as I do not have the full dataset (please don't post the whole thing here, I understand it can be sensitive data!), but I encourage you to try again with the latest version of the software. Best wishes, Florent

UPDATE: on my side, based on the test dataset, the pipeline runs smoothly through tasks init and 00 to 05. Looking forward to reports of your runs.

Hi Florent, I am glad to hear that you got rid of the problems and that the pipeline runs smoothly. That motivated me to run Pantagruel on the test dataset you provide, instead of my own, in case my data has unexpected faults. This way, I thought I might be able to see whether or not my installation/environment is intact.

So I updated the package to the most recent version (7857f9d52dc4ab41603d53a8bb463b030a4dcf20) and copied the test dataset from the installation path's pantagruel/data/ to my location, under testdata/:

$ ls testdata/

NCBI_Assembly_accession_ids_test_10Brady custom_genomes

I followed the test run commands, except that I ran init, 00, 01, 02, ... sequentially instead of all, just to see the messages from each task more easily.

init

$ pantagruel -d testPTGdatabase -r ./ -f PANTAGFAM -I kihyunee@gmail.com -L testdata/NCBI_Assembly_accession_ids_test_10Brady -a testdata/custom_genomes init

This was successful; I didn't get any error-like message.

00

$ pantagruel -i testPTGdatabase/environ_pantagruel_testPTGdatabase.sh 00

No error messages. What files were created?

$ ls testPTGdatabase/00.input_data/assemblies/*/

There were eleven directories under 00.input_data/assemblies/: ten directories, one per input RefSeq genome accession (-L input), and one directory for the manual input genome (-a input). For example:

testPTGdatabase/00.input_data/assemblies/FOQJ01.1_Bradyrhizobium_sp_cf659/: FOQJ01.1_Bradyrhizobium_sp_cf659_cds_from_genomic.fna.gz FOQJ01.1_Bradyrhizobium_sp_cf659_genomic.gff.gz FOQJ01.1_Bradyrhizobium_sp_cf659_genomic.gbff.gz FOQJ01.1_Bradyrhizobium_sp_cf659_protein.faa.gz ... testPTGdatabase/00.input_data/assemblies/GCF_000472425.1_ASM47242v1/: GCF_000472425.1_ASM47242v1_assembly_report.txt GCF_000472425.1_ASM47242v1_genomic.fna.gz GCF_000472425.1_ASM47242v1_protein.gpff.gz annotation_hashes.txt GCF_000472425.1_ASM47242v1_assembly_stats.txt GCF_000472425.1_ASM47242v1_genomic.gbff GCF_000472425.1_ASM47242v1_rna_from_genomic.fna.gz assembly_status.txt GCF_000472425.1_ASM47242v1_cds_from_genomic.fna.gz GCF_000472425.1_ASM47242v1_genomic.gbff.gz GCF_000472425.1_ASM47242v1_translated_cds.faa.gz md5checksums.txt GCF_000472425.1_ASM47242v1_feature_count.txt.gz GCF_000472425.1_ASM47242v1_genomic.gff.gz GCF_000472425.1_ASM47242v1_wgsmaster.gbff.gz GCF_000472425.1_ASM47242v1_feature_table.txt.gz GCF_000472425.1_ASM47242v1_protein.faa.gz README.txt

ls testPTGdatabase/00.input_data/assembly_stats/

assembly_level.tab contig-N50 contig-count sequencing_technology.tab

ls testPTGdatabase/00.input_data/extracted_cds_from_genomic_fasta/

GCF_000465325.1_Brad_ente_CordColits_V1

The above part caught my eyes, because there was EXTRACTED CDS for ONLY ONE INPUT genome

ls testPTGdatabase/00.input_data/genome_infos/

assemblies_list assembly_metadata manual_input_metadata

ls testPTGdatabase/00.input_data/genome_infos/assembly_metadata/

dbxrefs.tab metadata.tab metadata_curated.tab

ls testPTGdatabase/00.input_data/genome_infos/manual_input_metadata/

manual_curated_metadata_dictionary.tab manual_dbxrefs.tab manual_metadata_dictionary.tab

Did task 00 finish successfully? I am not sure, because 00.input_data/extracted_cds_from_genomic_fasta/ had a file for only one genome.

Task 01

$ pantagruel -i testPTGdatabase/environ_pantagruel_testPTGdatabase.sh 01

From the standard output, I got some lines that sound problematic; here they are:

... [2019-07-29 05:30:49] -- 11 proteomes in dataset [2019-07-29 05:30:49] -- 80555 proteins in dataset [2019-07-29 05:30:49] -- 80223 non-redundant protein ids in dataset ... Size of the sequence database: 80223 Size of the alignment database: 80223 Number of clusters: 80053 ... createseqfiledb ... nfin = '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_clusters' ; famprefix = 'NRPROT' ; dirout = '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_families' ; padlen = 6 ; writeseq = False ; discardsingle = False Traceback (most recent call last): File "/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel/scripts/split_mmseqs_clustdb_fasta.py", line 58, in with open(nfin, 'r') as fin: IOError: [Errno 2] No such file or directory: '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_clusters' listed 0 redundant sequences in dataset ... parsing redundant sequence fasta filtered 80223 non-redundant sequences multiline feature: WP_085964153.1 multiline feature: WP_021081430.1 Warning: multiline feature not pointing at the same product: WP_021081429.1 and WP_021081430.1 Create new locus_tag to disembiguate loci: C207_RS28875_1 -> WP_021081430.1 createseqfiledb ... 
nfin = '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/testPTGdatabase/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters' ; famprefix = 'PANTAGFAMP' ; dirout = '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/testPTGdatabase/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters_fasta' ; padlen = 6 ; writeseq = True ; discardsingle = False Traceback (most recent call last): File "/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel/scripts/split_mmseqs_clustdb_fasta.py", line 58, in with open(nfin, 'r') as fin: IOError: [Errno 2] No such file or directory: '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/testPTGdatabase/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters' [2019-07-29 05:32:21] -- 0 non-redundant proteins [2019-07-29 05:32:21] -- classified into 1 clusters -- including artificial cluster PANTAGFAMP000000 gathering 0 ORFan nr proteins -- (NB: some are not true ORFans as can be be present as identical sequences in several genomes) Pantagrel pipeline task 1: complete.

In short: (1) there is no such file '01.seqdb/all_proteomes.clusthashdb_minseqid100_clusters'; (2) there is no such file '01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters'; (3) I eventually got 0 non-redundant proteins, classified into 1 cluster.

These look like a failure.

In the 01.seqdb directory I got lots of files: ls testPTGdatabase/01.seqdb/

all_proteomes.clusthashdb_minseqid100.0 all_proteomes.clusthashdb_minseqid100_clusters.1 all_proteomes.mmseqsdb.dbtype all_proteomes.nr.mmseqsdb.lookup all_proteomes.clusthashdb_minseqid100.1 all_proteomes.clusthashdb_minseqid100_clusters.2 ... all_proteomes.clusthashdb_minseqid100_clusters.0 all_proteomes.mmseqsdb all_proteomes.nr.mmseqsdb.index

but only one protein family fasta appeared in ls testPTGdatabase/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters_fasta/

PANTAGFAMP000000.fasta

Task 02

I proceeded to task 02 despite the results of task 01:

$ pantagruel -i testPTGdatabase/environ_pantagruel_testPTGdatabase.sh 02

Traceback (most recent call last): File "/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel/scripts/extract_full_prot_and_cds_family_alignments.py", line 469, in main(dirnrprotaln, nfsingletonfasta, nfprotinfotab, nfreplinfotab, dirassemb, dirout, fam_prefix, dirlogs, nfidentseq, nbcores, verbose) File "/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel/scripts/extract_full_prot_and_cds_family_alignments.py", line 217, in main lastfam = int(allprotfams[-1].split(prefixprotfam)[-1]) IndexError: list index out of range cat: /mnt/disks/permanentDisk/genomics/pantagruel_testruns/testPTGdatabase/02.gene_alignments/full_families_genome_counts-noORFans.mat: No such file or directory tail: cannot open '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/testPTGdatabase/02.gene_alignments/PANTAGFAMC000000_genome_counts-ORFans.mat' for reading: No such file or directory

In short, the messages were saying that: (1) in extract_full_prot_and_cds_family_alignments.py (line 469, then line 217), list index out of range; (2) no such file '02.gene_alignments/full_families_genome_counts-noORFans.mat'; (3) no such file '02.gene_alignments/PANTAGFAMC000000_genome_counts-ORFans.mat'.
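The IndexError in (1) comes from taking the last element of a list that turned out to be empty (`allprotfams[-1]` in the traceback). As a minimal sketch only, and not Pantagruel's actual code, a guard like the following would report the upstream problem more directly; the function name `last_family_number` and the error message are hypothetical.

```python
def last_family_number(family_ids, prefix):
    """Return the highest numeric suffix among ids such as 'PANTAGFAMP000012'.

    Hypothetical helper for illustration: it raises a descriptive error
    instead of an IndexError when no family was produced by the previous task.
    """
    if not family_ids:
        raise ValueError(
            "no protein family found with prefix %r; check the output of task 01" % prefix
        )
    # The text after the prefix is the zero-padded family number.
    return max(int(fam.split(prefix)[-1]) for fam in family_ids)
```

With an empty list this fails with a message pointing at task 01, rather than with "list index out of range" further down the pipeline.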

At this point I concluded that the run on the test dataset failed on my side. Because I used the same test dataset as you, I suspect that the problem lies in my installation and environment here; is that right? I wonder if you can, based on the messages, diagnose or narrow down where the fault is.

I also tried with my data again, using the latest installation, to see if I get the same error messages. From my own data, init produced all the other things in the same way except that, interestingly, 00.input_data/extracted_cds_from_genomic_fasta/ WAS EMPTY:

$ ls Buniformis_pantagruel/00.input_data/extracted_cds_from_genomic_fasta/

01 and 02 reported exactly the same messages as the test run did.

To summarize, on my side even the most recent version failed to run on both the test dataset and my own dataset. What is your view on these failures?

If you also suspect that something is wrong in my installation/ environment/ dependencies you can also take a look at the log generated from install_dependencies.sh. I've just repeated "install_dependencies.sh" to check what it says. log_from_pantagruel_install_dependencies.sh.txt

Thanks, Kihyun

Hi Kihyun, thanks for running Pantagruel on the test dataset and reporting your installation logs, indeed it is helping to find the issues.

1) about the extracted CDS folder:

The above part caught my eyes, because there was EXTRACTED CDS for ONLY ONE INPUT genome

This is normal, as this is only done on genome assemblies where the file *_cds_from_genomic.fna.gz is absent, i.e. the assemblies not obtained from NCBI RefSeq - so in the case of the test dataset, and in general, the custom assemblies annotated by prokka within pipeline task 00. In that respect, your run of task 00 on the test dataset appears to have completed OK. What is NOT normal is that when applied on your dataset (made of custom unannotated genome assemblies) nothing was produced in that folder...
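To make the rule above concrete, here is a minimal sketch (Python 3) that lists the assembly folders lacking a *_cds_from_genomic.fna.gz file at the time it is run, i.e. the ones for which CDS extraction would be expected. The function name `assemblies_missing_cds` is hypothetical and this is not part of the pipeline; note that once task 00 has created or linked the file into a folder, that folder no longer shows up.

```python
import glob
import os


def assemblies_missing_cds(assemblies_dir):
    """Return names of assembly folders without a *_cds_from_genomic.fna.gz file."""
    missing = []
    for name in sorted(os.listdir(assemblies_dir)):
        folder = os.path.join(assemblies_dir, name)
        if not os.path.isdir(folder):
            continue  # skip stray files next to the assembly folders
        if not glob.glob(os.path.join(folder, "*_cds_from_genomic.fna.gz")):
            missing.append(name)
    return missing
```

For example, `assemblies_missing_cds("testPTGdatabase/00.input_data/assemblies")` could be compared against the content of 00.input_data/extracted_cds_from_genomic_fasta/.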

2) the errors then start to accumulate from task 01, where it seems that the mmseqs createseqfiledb after the first clustering step of MMseqs (100% prot identity clustering with clusthash algorithm) has failed. You can check the normal output in the attached log: pantagruel_testPTGdatabase_01.log Because of that, nothing further can work properly.

3) one reason for this failure on your computer might stem, as you suggested, from an incomplete install. Indeed, your install log reports missing Debian packages; the installation script aborts at this point - sorry this was not clear from the log, I just made that explicit in the standard output of the script (c265450) - and other dependencies that are installed subsequently (like MMSeqs) could be missing. Somehow this could have caused the mmseqs createseqfiledb call to fail... but you seem to have MMseqs installed (as the first lines of the clusthash step worked), so I'm not sure why this one would have failed. Actually if you could spell out these specific log lines that you omitted after createseqfiledb ... it would be very helpful!

I hope we can work through that quickly. However, I expect the incomplete installation to be a recurring problem for many users. This is why we have planned to release a Docker image with all the dependencies statically coded in it (see #11), but this release is unfortunately being delayed.

In the meantime, I can only suggest you make sure that you have the dependencies installed. The Debian packages you are missing are R packages, which will be needed in the later tasks. If you can't install them, at least try and run the code that comes after the apt install lines of the install script (you can copy-paste the code lines 0-96 and then 138-end).

best wishes, Florent

What is NOT normal is that when applied on your dataset (made of custom unannotated genome assemblies) nothing was produced in that folder...

One reason why that would be is if you re-ran the script on a pre-existing database that would already have produced the file Bu_x.1_Bacteroides_uniformis_Bu_x_cds_from_genomic.fna.gz (it was located somewhere else in previous versions) and linked it to the 00.input_data/assemblies/Bu_x.1_Bacteroides_uniformis_Bu_x/ folder, in which case the script won't find it necessary to recreate it. Is that the case?

2. Actually if you could spell out these specific log lines that you omitted after createseqfiledb ... it would be very helpful!

if you are re-running this step, note that I just made the standard output on task 01 more verbose; mmseqs logs are now globally captured and redirected to $ptglogs/mmseqs/mmseqs-*.log (a1fa017).

Hi Florent,

What happened in 01 in your normal vs. my failed runs

it seems that the mmseqs createseqfiledb after the first clustering step of MMseqs (100% prot identity clustering with clusthash algorithm) has failed. You can check the normal output in the attached log: pantagruel_testPTGdatabase_01.log https://github.com/flass/pantagruel/files/3441184/pantagruel_testPTGdatabase_01.log

I compared the normal log and mine. These lines were different though I am not sure if these differences have significance:

[Normal] MMseqs Version: 7-4e23d [Mine] MMseqs Version: 9-d36de

[Normal] - [Mine] Database type 0

[Normal] - [Mine] Compressed 0

[Normal] Touch data file /pantagruel_databases/testPTGdatabase/01.seqdb/all_proteomes.mmseqsdb ... Done. [Mine] -

[Normal] Touch data file /pantagruel_databases/testPTGdatabase/01.seqdb/all_proteomes.mmseqsdb_h ... Done. [Mine] -

[Normal] Reduced amino acid alphabet: A F X [Mine] Reduced amino acid alphabet: (A B C D E G K N P Q R S T Z) (F H I J L M V W Y) (X)

[Normal] Touch data file /pantagruel_databases/testPTGdatabase/01.seqdb/all_proteomes.mmseqsdb ... Done. [Mine] -

[Normal] listed 170 redundant sequences in dataset [Mine] listed 0 redundant sequences in dataset
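A comparison like the one above can be automated. The following is a minimal sketch (Python 3, standard library only) that masks each run's database root before comparing, so that lines differing only by path are treated as equal; the function name `diff_logs` and the `<ROOT>` placeholder are hypothetical choices for illustration.

```python
import difflib


def diff_logs(normal_text, mine_text, normal_root, mine_root):
    """Return the lines that differ between two logs, labelled by origin.

    Each run's root path is replaced by '<ROOT>' first, so that purely
    path-related differences do not show up.
    """
    normal = normal_text.replace(normal_root, "<ROOT>").splitlines()
    mine = mine_text.replace(mine_root, "<ROOT>").splitlines()
    changes = []
    matcher = difflib.SequenceMatcher(None, normal, mine, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        changes.extend("[Normal] " + line for line in normal[i1:i2])
        changes.extend("[Mine] " + line for line in mine[j1:j2])
    return changes
```

Here `normal_root` would be '/pantagruel_databases' and `mine_root` would be '/mnt/disks/permanentDisk/genomics/pantagruel_testruns', as seen in the two logs.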

Part of the log where my error appeared:

[Normal] nfin = '/pantagruel_databases/testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_clusters' ; famprefix = 'NRPROT' ; dirout = '/pantagruel_databases/testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_families' ; padlen = 6 ; writeseq = False ; discardsingle = False [Mine] nfin = '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_clusters' ; famprefix = 'NRPROT' ; dirout = '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_families' ; padlen = 6 ; writeseq = False ; discardsingle = False Traceback (most recent call last): File "/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel/scripts/split_mmseqs_clustdb_fasta.py", line 58, in with open(nfin, 'r') as fin: IOError: [Errno 2] No such file or directory: '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_clusters'

createseqfiledb command in the logs

Comparing normal and mine: there is no difference.

[Normal] createseqfiledb /pantagruel_databases/testPTGdatabase/01.seqdb/all_proteomes.mmseqsdb /pantagruel_databases/testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_clust /pantagruel_databases/testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_clusters [Mine]
createseqfiledb /mnt/disks/permanentDisk/genomics/pantagruel_testruns/testPTGdatabase/01.seqdb/all_proteomes.mmseqsdb /mnt/disks/permanentDisk/genomics/pantagruel_testruns/testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_clust /mnt/disks/permanentDisk/genomics/pantagruel_testruns/testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_clusters

[Normal] createseqfiledb /pantagruel_databases/testPTGdatabase/01.seqdb/all_proteomes.nr.mmseqsdb /pantagruel_databases/testPTGdatabase/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default /pantagruel_databases/testPTGdatabase/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters [Mine]
createseqfiledb /mnt/disks/permanentDisk/genomics/pantagruel_testruns/testPTGdatabase/01.seqdb/all_proteomes.nr.mmseqsdb /mnt/disks/permanentDisk/genomics/pantagruel_testruns/testPTGdatabase/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default /mnt/disks/permanentDisk/genomics/pantagruel_testruns/testPTGdatabase/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters

This time I also attached the log files that were written in logs/mmseqs/: mmseqs-0-identicalprot-clusthash.log mmseqs-1-cluster.log

One reason why that would be is if you re-ran the script on a pre-existing database that would already have produced the file Bu_x.1_Bacteroides_uniformis_Bu_x_cds_from_genomic.fna.gz (it was located somewhere else in previous versions) and linked it to the 00.input_data/assemblies/Bu_x.1_Bacteroides_uniformis_Bu_x/ folder, in which case the script won't find it necessary to recreate it. Is that the case?

I believe that this was not the case, because I renamed the database directory to {database}_garbage_1 before each re-run. Still, I re-used the input directory every time; might that be a problem?

The Debian packages you are missing are R packages, which will be needed in the later tasks. If you can't install them, at least try and run the code that comes after the apt install lines of the install script (you can copy-paste the code lines 0-96 and then 138-end).

Thanks for this advice! I extracted lines 0-96 and 138-end of the installation script, and the installation went well, I guess. The output from the installation script:

Installation of Pantagruel and dependencies: ...

get/update git repositories for Pantagruel pipeline remote: Enumerating objects: 45, done. remote: Counting objects: 100% (45/45), done. remote: Compressing objects: 100% (21/21), done. remote: Total 36 (delta 28), reused 23 (delta 15), pack-reused 0 Unpacking objects: 100% (36/36), done. From https://github.com/flass/pantagruel ce0cbf0..16863db master -> origin/master Updating ce0cbf0..16863db Fast-forward README.md | 20 ++++++++++++++++++-- install_dependencies.sh | 59 +++++++++++++++++++++++++++++++++++++++++------------------ scripts/pipeline/environ_pantagruel_template.sh | 1 + scripts/pipeline/pantagruel_pipeline_05_core_genome_ref_tree.sh | 8 ++++---- scripts/pipeline/pantagruel_pipeline_init.sh | 14 ++++++++------ scripts/pipeline/pantagruel_pipeline_master.sh | 32 ++++++++++++++++++++++++++++++-- 6 files changed, 102 insertions(+), 32 deletions(-) /mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline /mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline

Requirement already satisfied: bcbio-gff in /usr/local/lib/python2.7/dist-packages Requirement already satisfied: six in /usr/lib/python2.7/dist-packages (from bcbio-gff) Requirement already satisfied: bioscripts.convert in /usr/local/lib/python2.7/dist-packages Requirement already satisfied: biopython>=1.49 in /usr/lib/python2.7/dist-packages (from bioscripts.convert) Requirement already satisfied: setuptools in /usr/lib/python2.7/dist-packages (from bioscripts.convert) Requirement already satisfied: numpy in /usr/lib/python2.7/dist-packages (from biopython>=1.49->bioscripts.convert)

R version 3.6.1 (2019-07-05) -- "Action of the Toes" Copyright (C) 2019 The R Foundation for Statistical Computing Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R.

source("https://bioconductor.org/biocLite.R") Error: With R version 3.5 or greater, install Bioconductor packages using BiocManager; see https://bioconductor.org/install Execution halted WARNING: Could not install R package 'topGO'; testing of GO term enrichment in clade-specific gene sets will not be available

found Prokka already installed with Brew: ==> Formulae brewsci/bio/prokka ✔

found mmseqs2 already installed with Brew: ==> Formulae mmseqs2 ✔

found pal2nal.pl executable: /home/kihyunee/bin/pal2nal.pl linking to /mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pal2nal.v14/pal2nal.pl

found MAD executable: /home/kihyunee/bin/mad linking to /mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/mad/mad

found LSD executable: /home/kihyunee/bin/lsd linking to /mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/lsd_unix

found ALE suite already installed with Docker: boussau/alesuite latest b543e47b2462 24 months ago 911MB found up-to-date version of Interproscan at /mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/interproscan-5.36-75.0/interproscan.sh Succesfully linked pantagruel executable to /home/kihyunee/bin/ Installation of Pantagruel and dependencies: complete

I did this partial installation, but it didn't change the errors resulting from task 01.

Can you find any useful information in the createseqfiledb log?

The error log tells me that these files are not there: testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_clusters and testPTGdatabase/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters. When I check whether files with similar names are present, I find:

testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_clusters.0
testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_clusters.3
testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_clusters.1
testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_clusters.dbtype
testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_clusters.2
testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_clusters.index

and

testPTGdatabase/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters.0
testPTGdatabase/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters.1
testPTGdatabase/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters.2
testPTGdatabase/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters.3
testPTGdatabase/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters.dbtype
testPTGdatabase/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters.index
testPTGdatabase/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters_fasta.tab
testPTGdatabase/01.seqdb/protein_families/all_proteomes.nr.mmseqs_clusterdb_default_clusters_fasta/PANTAGFAMP000000.fasta
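In the listings above, the exact path the script tried to open does not exist, but files with the same name plus a numeric suffix (.0, .1, ...) do. A minimal sketch (Python 3) of how one might check for this situation, with a hypothetical helper name `find_db_parts`:

```python
import glob
import os


def find_db_parts(prefix):
    """Return (single_file_exists, numbered_parts) for an output path prefix.

    numbered_parts lists the files named '<prefix>.<integer>', sorted by
    that integer; '.index' and '.dbtype' companions are ignored.
    """
    parts = []
    for path in glob.glob(glob.escape(prefix) + ".*"):
        suffix = path[len(prefix) + 1:]
        if suffix.isdecimal():
            parts.append((int(suffix), path))
    return os.path.isfile(prefix), [path for _, path in sorted(parts)]
```

Called on 'testPTGdatabase/01.seqdb/all_proteomes.clusthashdb_minseqid100_clusters', it would be expected to report that the single file is absent while four numbered parts are present, matching the listing above.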

Thank you, as always, for your kind advice, Kihyun

Dear Kihyun,

Thanks for your report.

I think you pointed at the main cause of failure: the versions of mmseqs differ. The change log of release 8-fac81 notably mentions that there would be 'breaking changes' due to change in output format. I guess that explains the bugs. I'll have a look at either integrating the newest version of MMseqs (that should be done ultimately) or making the install script select the release of MMseqs that is currently supported by Pantagruel (i.e. 7-4e23d).
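As an illustration of the version check discussed here, the following minimal sketch (Python 3) extracts the release number from a log line such as 'MMseqs Version: 9-d36de' and flags releases from 8 onwards. The function names are hypothetical, and the assumption that the version string always has the form '<number>-<hash>' is taken only from the two logs in this thread; other builds may print something different, in which case the sketch returns None.

```python
import re


def mmseqs_release_number(log_line):
    """Extract the leading release number from 'MMseqs Version: 9-d36de'.

    Returns None when the line does not match that form.
    """
    match = re.search(r"MMseqs Version:\s*(\d+)-[0-9a-f]+", log_line)
    if match is None:
        return None
    return int(match.group(1))


def has_new_output_format(log_line, first_breaking_release=8):
    """True if the logged release is at or after the one with breaking changes."""
    release = mmseqs_release_number(log_line)
    return release is not None and release >= first_breaking_release
```

On the two logs compared earlier, this would distinguish release 7 (the supported one at the time) from release 9 (the one installed through Homebrew).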

Just had a look and specifying the older version may not be possible via Homebrew so maybe I'll just do the changes for the newest version; I'll keep you (and @pveber for Docker integration) updated.

Best wishes, Florent

Hi Kihyun,

I'm pleased to say that it was a very easy fix to provide support for the version 8+ of MMseqs (commits 7c96fb4 and 42bb679). The output from prior versions (v7 and earlier) should still be supported too. So now Pantagruel is up-to-date with respect to the version of MMSeqs v9 as provided through Homebrew.

@pveber can we have MMseqs release 9-d36de installed in the dockerfile now please?

@kihyunee can you please run the test again (on the test data or your own dataset) when you can, to see if this has fixed the issue for good?

@flass thank you a lot for the encouraging news. I will run the test data and my own test data with the recent version. I have been, and still am, out of office, but I will come back with the result as soon as possible. Fingers crossed!

No hurry! Let me know when you've got the test outcome.

Just a point regarding the minor bug in prokka annotation that prevented the compilation of a custom reference protein blast database:

'/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformisgenomes' ls: cannot access '//__genomic.gbff.gz': No such file or directory parallel: Error: Cannot open input file `/assemblies_genomic_gbffgz_list': No such file or directory. removed '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/tmp/Reference.faa' removed '/mnt/disks/permanentDisk/genomics/pantagruel_testruns/Buniformis_pantagruel/tmp/Reference_representative.faa.clstr'

This is due to an issue with Homebrew's prokka v1.13 that cannot find Bioperl libs cf. https://github.com/tseemann/prokka/issues/403. This bug is circumvented by commit 43bcab5. In addition @tseemann has corrected the bug in newly released prokka v1.14, which is currently being packaged for Homebrew (and will then automatically become the version used by Pantagruel).

@kihyunee let me know if prokka 1.14 brew didn't solve the problems, and i'll make a 1.14.1 release!

Thank you @tseemann; using prokka 1.14.0 solved the issue on my side.

Hi Florent, I came here to say that the test runs are successful (so far from 00 to the middle of 06) after updating both Prokka (1.13 to 1.14.0) and Pantagruel.

First I want to say how things went wrong before the most recent update. Weeks ago you said that

I'm pleased to say that it was a very easy fix to provide support for the version 8+ of MMseqs (commits 7c96fb4 and 42bb679). The output from prior versions (v7 and earlier) should still be supported too. So now Pantagruel is up-to-date with respect to the version of MMSeqs v9 as provided through Homebrew.

Using the updated pantagruel (at that moment, e9fc0518511c2781cf4c5f9f0e00f23a3108e4aa), I ran a test run with your test data. By the way, the prokka version was 1.13 at that moment.

Tasks 00 to 03 completed smoothly. There was no error message and I got nice output files; for example, unlike before, I got many gene families, with alignment files for each.

7236 final non-ORFan CDS families, including 43 homogeneous (singleton protein derived) families 8146 ORFan / 83304 total CDSs

Then I tried 05 06 07 as the tasks 00 01 02 03 were completed. Error messages appeared:

Final choice of 1293 pseudo-core unicopy gene families (present in at least 11 genomes). pseudocoremingenomes=11 Traceback (most recent call last): File "/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel/scripts/concat.py", line 93, in main() File "/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel/scripts/concat.py", line 66, in main aln=lib_util.AlnGenerator(f) File "/mnt/disks/permanentDisk/install_on_pd/pantagruel_pipeline/pantagruel/scripts/lib_util.py", line 597, in AlnGenerator raise InvalidFile, "The specified alignment file %s is of unknown format. "%(filename) lib_util.InvalidFile: The specified alignment file /mnt/disks/permanentDisk/genomics/pantagruel_testruns/testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC004050.codes.aln is of unknown format. ERROR: failed to produce concatenated (pseudo)core-genome alignment ERROR: Pantagrel pipeline task 5: failed.

Interestingly, that particular gene alignment file mentioned in the error message was uniquely empty, while other gene family files were not empty. Like this:

$ grep -c ">" testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC00405*
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC004050.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC004053.codes.aln:10
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC004054.codes.aln:17
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC004057.codes.aln:8
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC004059.codes.aln:11

Actually, there were 20 empty alignment files in total in that directory:

$ grep -c ">" testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC00* | grep ":0" 
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC000273.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC000501.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC000920.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC001446.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC001471.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC002874.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC002930.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC003493.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC004050.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC004300.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC004592.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC005627.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC005777.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC006073.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC006985.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC007070.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC007074.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC007201.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC007207.codes.aln:0
testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code/PANTAGFAMC007222.codes.aln:0
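The same check can be scripted. The following minimal sketch (Python 3) counts FASTA headers per alignment file and lists the files with none; the function names are hypothetical. Unlike `grep -c ">"`, it only counts lines that start with '>', which should be equivalent for well-formed FASTA files.

```python
import glob
import os


def count_fasta_records(path):
    """Count the lines starting with '>' (FASTA headers) in a file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def empty_alignments(directory, pattern="*.codes.aln"):
    """Return the sorted base names of alignment files holding no sequence."""
    return sorted(
        os.path.basename(path)
        for path in glob.glob(os.path.join(directory, pattern))
        if count_fasta_records(path) == 0
    )
```

Running `empty_alignments("testPTGdatabase/02.gene_alignments/full_cdsfam_alignments_species_code")` before starting task 05 would flag such files early.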

The above situation was before updating to the most recent pantagruel version and prokka 1.14.0. As I saw your comment on a fixable bug related to the previous prokka version, I tried the second test run (tasks 00-03 + 05-07) with the latest pantagruel version (0c450bce8eadccf9d29a7c4611e674eb83d3e302) and prokka 1.14.0 (updated by brew upgrade prokka)

I don't know whether the error that I saw in 05 is related to the issue with the previous prokka 1.13 or to the previous pantagruel version, but the error is gone now.

Test data went smoothly through tasks 00, 01, 02, 03 and 05, and 06 (gene trees) is running smoothly so far. MrBayes outputs are accumulating under testPTGdatabase/06.gene_trees/fullgenetree_mrbayes_trees/ and I guess that it will take a while for the remaining 06 and 07 steps to finish. Without waiting until then, I just want to say that this test run looks successful. Then I tested my own test data containing 9 user-provided genomes, and got to the same point: tasks 00-03 and 05 completed successfully and task 06 is running without any error message.

So, I can say that all the issues are resolved now, do you agree? Thanks to you and @tseemann. If something happens later in the remaining steps, such as 08 and 09, I will report it to you here.

Thanks! Kihyun

Hi Kihyun,

Happy to hear that the tests and actual use of the pipeline are going well now.

The bug you report about empty alignments is not related to prokka; it was due to an incomplete small edit in task 02 introduced in commit c448f11 and recently corrected in commits 9f3b83a, fbe916c, fbe916c (after private bug reports when we were proofing the dockerfile integration).

Steps 07-09 are not unlikely to bring errors, first because ALE (reconciliation method used in task 07) has its own issues (reported here), which are hopefully to be fixed soon by @Boussau or @ssolo; and second because the final downstream tasks have not been extensively tested against varied datasets.

So please report any other problems, but please do so in a separate issue, for the sake of clarity to other users that might encounter similar problems. I will close the current issue as the original problem has been solved.

Many thanks Florent

flass / pantagruel

Task 5 failed; Tasks 0,1,2,3 were completed but their output files do not look complete. #12