Closed fredjaya closed 3 days ago
This is what the current results/ folder looks like for a single barcode:
results
├── annotations
│ ├── barcode01
│ │ ├── abricate
│ │ │ └── barcode01_consensus_chr.txt
│ │ ├── amrfinderplus
│ │ │ └── barcode01_consensus_chr.tsv
│ │ ├── bakta
│ │ │ ├── barcode01_consensus_chr.faa
│ │ │ └── barcode01_consensus_chr.txt
│ │ └── plasmids
│ │ └── barcode01_bakta
├── assemblies
│ ├── barcode01_consensus
│ │ └── consensus.fasta
│ ├── barcode01_flye
│ │ └── consensus.fasta
│ ├── barcode01_plassembler
│ │ ├── flye_output
│ │ ├── logs
│ │ ├── plassembler_1730446699.3410256.log
│ │ ├── plassembler_plasmids.fasta
│ │ ├── plassembler_plasmids.gfa
│ │ ├── plassembler_summary.tsv
│ │ └── unicycler_output
│ ├── barcode01_unicycler
│ │ └── consensus.fasta
├── quality_control
│ ├── barcode01
│ │ ├── barcode01_consensus
│ │ ├── barcode01_consensus_busco
│ │ ├── barcode01_consensus.tsv
│ │ ├── barcode01_flye
│ │ ├── barcode01_flye_busco
│ │ ├── barcode01_flye.tsv
│ │ ├── barcode01_unicycler
│ │ ├── barcode01_unicycler_busco
│ │ └── barcode01_unicycler.tsv
│ ├── barcode01_kraken2
│ │ └── barcode01.k2report
├── report
│ ├── barcode01_consensus
│ ├── barcode01_flye
│ ├── barcode01_unicycler
├── run_info
│ ├── dag.svg
│ ├── gadi-nf-core-trace-*.txt
│ ├── report.html
│ └── timeline.html
├── taxonomy
│ ├── abricate_vfdb_output.txt
│ ├── amrfinderplus_output.txt
│ ├── barcode_species_table_mqc.txt
│ ├── combined_plot_mqc.png
│ └── phylogeny
└── tree
350 directories, 234 files
Suggestions:
All works lovely until run_orthofinder:
Run script:
#!/bin/bash
#PBS -P er01
#PBS -l walltime=10:00:00
#PBS -l ncpus=1
#PBS -l mem=5GB
#PBS -W umask=022
#PBS -q copyq
#PBS -l wd
#PBS -l storage=scratch/tj48
#PBS -l jobfs=100GB
## RUN FROM PROJECT DIRECTORY WITH: bash test/run_test.sh
# Load version of nextflow with plug-in functionality enabled
module load nextflow/24.04.1
module load singularity
# Define inputs
samplesheet=/scratch/tj48/gs5517/ONT-bacpac-nf/samplesheet.csv
k2db=/scratch/tj48/databases/kraken2_db/
sequencing_summary=/scratch/tj48/fj9712/00_raw/sequencing_summary.txt
gadi_account=er01 #e.g. aa00
gadi_storage=scratch/tj48+scratch/er01
# Unhash this command to run pipeline with samplesheet
nextflow run main.nf \
--samplesheet ${samplesheet} \
--kraken2_db ${k2db} \
--sequencing_summary ${sequencing_summary} \
--gadi_account ${gadi_account} \
--gadi_storage ${gadi_storage} \
-resume -profile gadi #you can remove ,high_accuracy if you want to run fast basecalling samples
Error message:
ERROR ~ Error executing process > 'run_orthofinder (GENERATE PHYLOGENY)'
Caused by:
Process `run_orthofinder (GENERATE PHYLOGENY)` terminated with an error exit status (1)
Command executed:
# Description: Generate a phylogeny tree with orthofinder tool
# Using mafft and fastree
orthofinder \
-f phylogeny \
-o phylogeny_tree \
-n tree \
-t 16 \
-a 16
Command exit status:
1
Command output:
OrthoFinder version 2.5.5 Copyright (C) 2014 David Emms
2024-11-11 22:29:50 : Starting OrthoFinder 2.5.5
16 thread(s) for highly parallel tasks (BLAST searches etc.)
16 thread(s) for OrthoFinder algorithm
Checking required programs are installed
----------------------------------------
Test can run "mcl -h" - ok
Test can run "fastme -i phylogeny_tree/Results_tree/WorkingDirectory/dependencies/SimpleTest.phy -o phylogeny_tree/Results_tree/WorkingDirectory/dependencies/SimpleTest.tre" - ok
WARNING: Files have been ignored as they don't appear to be FASTA files:
Escherichia_coli_REF_GCF_000005845.2_ASM584v2.fna
OrthoFinder expects FASTA files to have one of the following extensions: fas, fasta, pep, fa, faa
ERROR: At least two species are required
ERROR: An error occurred, ***please review the error messages*** they may contain useful information about the problem.
Command error:
/usr/local/bin/scripts_of/tree.py:367: SyntaxWarning: invalid escape sequence '\-'
"""
/usr/local/bin/scripts_of/tree.py:1422: SyntaxWarning: invalid escape sequence '\-'
"""
/usr/local/bin/scripts_of/newick.py:54: SyntaxWarning: invalid escape sequence '\['
_ILEGAL_NEWICK_CHARS = ":;(),\[\]\t\n\r="
/usr/local/bin/scripts_of/newick.py:57: SyntaxWarning: invalid escape sequence '\['
_NHX_RE = "\[&&NHX:[^\]]*\]"
/usr/local/bin/scripts_of/newick.py:58: SyntaxWarning: invalid escape sequence '\d'
_FLOAT_RE = "[+-]?\d+\.?\d*(?:[eE][-+]\d+)?"
/usr/local/bin/scripts_of/newick.py:60: SyntaxWarning: invalid escape sequence '\['
_NAME_RE = "[^():,;\[\]]+"
/usr/local/bin/scripts_of/newick.py:337: SyntaxWarning: invalid escape sequence '\s'
MATCH = '%s\s*%s\s*(%s)?' % (FIRST_MATCH, SECOND_MATCH, _NHX_RE)
/usr/local/bin/scripts_of/probroot.py:10: SyntaxWarning: invalid escape sequence '\i'
"""
/usr/local/bin/scripts_of/probroot.py:201: SyntaxWarning: invalid escape sequence '\l'
"""
/usr/local/bin/scripts_of/probroot.py:267: SyntaxWarning: invalid escape sequence '\l'
"""
Work dir:
/scratch/tj48/gs5517/ONT-bacpac-nf/work/ee/062a5772a8ab3e36b326533776093a
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
-- Check '.nextflow.log' file for details
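The warning in the log above suggests that no file in the `phylogeny` input directory carried one of the extensions OrthoFinder lists, so it saw fewer than two species. A minimal pre-flight sketch that counts such files before OrthoFinder runs is given here. The function names are hypothetical and not part of the pipeline, and the extension list is copied from the warning rather than from the OrthoFinder source.

```bash
#!/bin/bash
# Sketch only: count_orthofinder_inputs and check_orthofinder_inputs are
# hypothetical helpers, not part of ONT-bacpac-nf or OrthoFinder.

# Count files in a directory whose extension appears in the OrthoFinder warning.
count_orthofinder_inputs() {
    local dir="$1"
    local count=0
    local ext f
    for ext in fas fasta pep fa faa; do
        for f in "$dir"/*."$ext"; do
            # An unmatched glob stays literal, so confirm the file exists
            if [ -e "$f" ]; then
                count=$((count + 1))
            fi
        done
    done
    echo "$count"
}

# Return non-zero when fewer than two recognised FASTA files are present.
check_orthofinder_inputs() {
    local dir="$1"
    local n
    n=$(count_orthofinder_inputs "$dir")
    if [ "$n" -lt 2 ]; then
        echo "Only $n recognised FASTA file(s) in $dir; OrthoFinder needs at least two" >&2
        return 1
    fi
    echo "$n recognised FASTA files in $dir"
}
```

Running `check_orthofinder_inputs phylogeny` from the process work dir would show whether the sample files were staged at all, or whether only the `.fna` reference was present.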
Tested and works on all Vibrio and Tenacibaculum barcodes. Currently on /scratch/er01/fj9712/2411_wholetest - to be moved.
Things to discuss and address either in this PR or later ones:
Does it work on the cat and dog data? (Fred currently testing) Yes, except for barcode24 - /scratch/tj48/fj9712/02_work/2411_catdogs
Implementation
Reference-free chromosome assembly selection
Addresses #54, #23
For the chromosomal assembly, every barcode is assembled by flye and unicycler, and each assembly is polished. The single "best" polished assembly out of flye, unicycler, and optionally trycycler (consensus assembly) is selected for downstream annotation and analyses.
To avoid biasing assembly selection towards published references, the assembly with the most complete BUSCOs is considered the best one. This allows unicycler assemblies to be considered too. QUAST is also run, but its output is not used for selecting the best assembly.
There is now a single implementation for chromosomal assembly, instead of two independent ones, which makes it easier to update the criteria for selecting an assembly, for example to incorporate QUAST outputs or to add tools like Merqury.
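The selection rule described above (most complete BUSCOs wins) can be sketched in a few lines of shell. This is an illustration only, not the pipeline's code: the two-column input layout, the function name `select_best_assembly`, the counts in the usage line, and the tie-breaking rule (first listed wins) are all assumptions.

```bash
#!/bin/bash
# Sketch only: select_best_assembly is a hypothetical helper.
# Input on stdin (assumed layout): tab-separated lines of
#   <assembly_name>	<complete_busco_count>
# Output: the name with the highest count; ties keep the first one listed.
select_best_assembly() {
    awk -F'\t' '
        NF >= 2 && (winner == "" || $2 + 0 > best) { best = $2 + 0; winner = $1 }
        END { if (winner != "") print winner }
    '
}

# Usage with made-up counts for one barcode:
printf 'flye\t120\nunicycler\t123\nconsensus\t122\n' | select_best_assembly
# prints: unicycler
```

Keeping the rule in one place like this is what makes it straightforward to swap in a different score later, such as one that also weighs QUAST metrics.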
Trycycler implementation
Addresses #43, #60
Trycycler processes are now self-contained. Additional assemblers can be added more easily to generate better consensus assemblies if required.
Added more error handling for the too-few-contigs case (trycycler cluster can filter more contigs out). If trycycler correctly fails at any point, the pipeline will still continue and select either the flye or unicycler assembly for downstream processes.
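The fallback described above can be pictured as building the candidate list only from assemblies that actually exist. The sketch below is a shell illustration, not the Nextflow logic in this PR: the directory layout follows the results/ tree, while the function name and the "skip missing or empty files" rule are assumptions.

```bash
#!/bin/bash
# Sketch only: list_candidate_assemblies is a hypothetical helper.
# Prints "<tool>	<path>" for every assembly that exists and is non-empty.
list_candidate_assemblies() {
    local assemblies_dir="$1"
    local barcode="$2"
    local tool fasta
    for tool in flye unicycler consensus; do
        fasta="${assemblies_dir}/${barcode}_${tool}/consensus.fasta"
        # -s is true only for files that exist and are not empty, so a
        # failed trycycler run (no consensus.fasta) simply drops out
        if [ -s "$fasta" ]; then
            printf '%s\t%s\n' "$tool" "$fasta"
        fi
    done
}
```

For example, `list_candidate_assemblies results/assemblies barcode01` would list only the flye and unicycler assemblies when the trycycler consensus is absent, leaving the selection step to choose between those two.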
Input/output process definitions are more explicit (i.e. specific files instead of globs) for better error handling. As a result, there are a lot more operators and Groovy code in the workflow scope.