Revising trycycler and select assembly implementations

fredjaya commented 3 weeks ago

Tested and works on all Vibrio and Tenacibaculum barcodes. Currently on /scratch/er01/fj9712/2411_wholetest - to be moved.

Things to discuss and address either in this PR or later ones:

How should the results directory be structured? What files should be output here? (e.g. #28)
Should the select assembly stats be reported?
What should go in the MultiQC report?
~~Does it work on the cat and dog data? (Fred currently testing)~~ Yes, except for barcode24 - /scratch/tj48/fj9712/02_work/2411_catdogs

Implementation

Reference-free chromosome assembly selection

Addresses #54, #23

For the chromosomal assembly, every barcode is assembled with both flye and unicycler, and each assembly is polished. The single "best" polished assembly out of flye, unicycler, and optionally trycycler (the consensus assembly) is then selected for downstream annotation and analyses.

To avoid biasing assembly selection towards published references, the assembly with the most complete BUSCOs is considered the best one. This now allows unicycler assemblies to be considered too. QUAST is also run, but its output is not used for selecting the best assembly.

There is now a single implementation for chromosomal assembly selection, instead of two independent ones, which makes it easier to update the selection criteria. For example, to incorporate QUAST outputs, or to add additional tools like Merqury.
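
The selection rule described above can be sketched roughly as follows. This is only an illustration, not the pipeline's code: the function name `select_best_assembly` and the headerless two-column TSV input (assembler name, number of complete BUSCOs) are assumptions, and the tie-break (first row wins) is a choice made for the sketch rather than something decided in this PR.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the assembler whose assembly has the most complete BUSCOs.
# Input: a headerless TSV with columns <assembler> <complete_buscos>
# (an assumed format, not the pipeline's actual file layout).
select_best_assembly() {
    local table="$1"
    # Keep the first row seen with the highest count; ties go to the earlier row.
    # Exit non-zero if the table has no usable rows.
    awk -F'\t' '
        NF >= 2 && (best == "" || $2 + 0 > max) { best = $1; max = $2 + 0 }
        END { if (best == "") exit 1; print best }
    ' "$table"
}
```

If trycycler did not produce a consensus for a barcode, its row would simply be absent from the table, so the choice falls to flye or unicycler without any special casing.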

Trycycler implementation

Addresses #43, #60

Trycycler processes are now self-contained. Additional assemblers can now be added more easily to generate better consensus assemblies if required.

Added more error handling for the too-few-contigs case (trycycler cluster filters more contigs out). If trycycler correctly fails at any point, the pipeline still continues and selects either the flye or unicycler assembly for downstream processes.
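
As a rough illustration of the too-few-contigs check, a guard along these lines could decide whether a cluster is worth passing on. The function name `has_enough_contigs` and the idea of passing the minimum as an argument are assumptions for this sketch; the actual threshold and where the check lives in the workflow are not stated here.

```bash
#!/usr/bin/env bash
# Hypothetical guard: succeed only if a FASTA file holds at least <min_contigs> records.
has_enough_contigs() {
    local fasta="$1" min_contigs="$2" n
    # A missing or empty file never has enough contigs.
    [ -s "$fasta" ] || return 1
    # Count header lines; grep -c exits non-zero on zero matches, so tolerate that.
    n=$(grep -c '^>' "$fasta" || true)
    [ "$n" -ge "$min_contigs" ]
}
```

A caller could then skip the remaining trycycler steps for that barcode when the guard fails, leaving the flye and unicycler assemblies as the only candidates.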

Input/output process definitions are more explicit (i.e. specific files instead of globs) for better error handling. As a result, the workflow scope now contains many more operators and more Groovy code.

fredjaya commented 3 weeks ago

This is what the current results/ folder looks like for a single barcode:

results
├── annotations
│   ├── barcode01
│   │   ├── abricate
│   │   │   └── barcode01_consensus_chr.txt
│   │   ├── amrfinderplus
│   │   │   └── barcode01_consensus_chr.tsv
│   │   ├── bakta
│   │   │   ├── barcode01_consensus_chr.faa
│   │   │   └── barcode01_consensus_chr.txt
│   │   └── plasmids
│   │       └── barcode01_bakta
├── assemblies
│   ├── barcode01_consensus
│   │   └── consensus.fasta
│   ├── barcode01_flye
│   │   └── consensus.fasta
│   ├── barcode01_plassembler
│   │   ├── flye_output
│   │   ├── logs
│   │   ├── plassembler_1730446699.3410256.log
│   │   ├── plassembler_plasmids.fasta
│   │   ├── plassembler_plasmids.gfa
│   │   ├── plassembler_summary.tsv
│   │   └── unicycler_output
│   ├── barcode01_unicycler
│   │   └── consensus.fasta
├── quality_control
│   ├── barcode01
│   │   ├── barcode01_consensus
│   │   ├── barcode01_consensus_busco
│   │   ├── barcode01_consensus.tsv
│   │   ├── barcode01_flye
│   │   ├── barcode01_flye_busco
│   │   ├── barcode01_flye.tsv
│   │   ├── barcode01_unicycler
│   │   ├── barcode01_unicycler_busco
│   │   └── barcode01_unicycler.tsv
│   ├── barcode01_kraken2
│   │   └── barcode01.k2report
├── report
│   ├── barcode01_consensus
│   ├── barcode01_flye
│   ├── barcode01_unicycler
├── run_info
│   ├── dag.svg
│   ├── gadi-nf-core-trace-*.txt
│   ├── report.html
│   └── timeline.html
├── taxonomy
│   ├── abricate_vfdb_output.txt
│   ├── amrfinderplus_output.txt
│   ├── barcode_species_table_mqc.txt
│   ├── combined_plot_mqc.png
│   └── phylogeny
└── tree

350 directories, 234 files

Suggestions:

[ ] publish only the selected chromosomal assembly, remove assembler from name
[ ] ...

georgiesamaha commented 2 weeks ago

All works lovely until run_orthofinder:

Run script:

#!/bin/bash

#PBS -P er01
#PBS -l walltime=10:00:00
#PBS -l ncpus=1
#PBS -l mem=5GB
#PBS -W umask=022
#PBS -q copyq
#PBS -l wd
#PBS -l storage=scratch/tj48
#PBS -l jobfs=100GB

## RUN FROM PROJECT DIRECTORY WITH: bash test/run_test.sh

# Load version of nextflow with plug-in functionality enabled 
module load nextflow/24.04.1 
module load singularity 

# Define inputs 
samplesheet=/scratch/tj48/gs5517/ONT-bacpac-nf/samplesheet.csv 
k2db=/scratch/tj48/databases/kraken2_db/ 
sequencing_summary=/scratch/tj48/fj9712/00_raw/sequencing_summary.txt 
gadi_account=er01 #e.g. aa00
gadi_storage=scratch/tj48+scratch/er01 

# Run the pipeline with the samplesheet
nextflow run main.nf \
    --samplesheet ${samplesheet} \
    --kraken2_db ${k2db} \
    --sequencing_summary ${sequencing_summary} \
    --gadi_account ${gadi_account} \
    --gadi_storage ${gadi_storage} \
    -resume -profile gadi

Error message:

ERROR ~ Error executing process > 'run_orthofinder (GENERATE PHYLOGENY)'

Caused by:
  Process `run_orthofinder (GENERATE PHYLOGENY)` terminated with an error exit status (1)

Command executed:

  # Description: Generate a phylogeny tree with orthofinder tool 

  # Using mafft and fastree
   orthofinder \
        -f phylogeny \
        -o phylogeny_tree \
        -n tree \
        -t 16 \
        -a 16

Command exit status:
  1

Command output:

  OrthoFinder version 2.5.5 Copyright (C) 2014 David Emms

  2024-11-11 22:29:50 : Starting OrthoFinder 2.5.5
  16 thread(s) for highly parallel tasks (BLAST searches etc.)
  16 thread(s) for OrthoFinder algorithm

  Checking required programs are installed
  ----------------------------------------
  Test can run "mcl -h" - ok
  Test can run "fastme -i phylogeny_tree/Results_tree/WorkingDirectory/dependencies/SimpleTest.phy -o phylogeny_tree/Results_tree/WorkingDirectory/dependencies/SimpleTest.tre" - ok

  WARNING: Files have been ignored as they don't appear to be FASTA files:
  Escherichia_coli_REF_GCF_000005845.2_ASM584v2.fna
  OrthoFinder expects FASTA files to have one of the following extensions: fas, fasta, pep, fa, faa
  ERROR: At least two species are required
  ERROR: An error occurred, ***please review the error messages*** they may contain useful information about the problem.

Command error:
  /usr/local/bin/scripts_of/tree.py:367: SyntaxWarning: invalid escape sequence '\-'
    """
  /usr/local/bin/scripts_of/tree.py:1422: SyntaxWarning: invalid escape sequence '\-'
    """
  /usr/local/bin/scripts_of/newick.py:54: SyntaxWarning: invalid escape sequence '\['
    _ILEGAL_NEWICK_CHARS = ":;(),\[\]\t\n\r="
  /usr/local/bin/scripts_of/newick.py:57: SyntaxWarning: invalid escape sequence '\['
    _NHX_RE = "\[&&NHX:[^\]]*\]"
  /usr/local/bin/scripts_of/newick.py:58: SyntaxWarning: invalid escape sequence '\d'
    _FLOAT_RE = "[+-]?\d+\.?\d*(?:[eE][-+]\d+)?"
  /usr/local/bin/scripts_of/newick.py:60: SyntaxWarning: invalid escape sequence '\['
    _NAME_RE = "[^():,;\[\]]+"
  /usr/local/bin/scripts_of/newick.py:337: SyntaxWarning: invalid escape sequence '\s'
    MATCH = '%s\s*%s\s*(%s)?' % (FIRST_MATCH, SECOND_MATCH, _NHX_RE)
  /usr/local/bin/scripts_of/probroot.py:10: SyntaxWarning: invalid escape sequence '\i'
    """
  /usr/local/bin/scripts_of/probroot.py:201: SyntaxWarning: invalid escape sequence '\l'
    """
  /usr/local/bin/scripts_of/probroot.py:267: SyntaxWarning: invalid escape sequence '\l'
    """

Work dir:
  /scratch/tj48/gs5517/ONT-bacpac-nf/work/ee/062a5772a8ab3e36b326533776093a

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details
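
From the output above, the only file in `phylogeny` that OrthoFinder mentions is the `.fna` reference, which it ignores because of its extension, so fewer than two usable inputs remain. A pre-flight check like the one below could surface this before OrthoFinder starts. It is a sketch only: the function names are hypothetical, the extension list is copied from the OrthoFinder message, and it assumes inputs sit directly in the directory (as regular files or symlinks).

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: count files OrthoFinder would accept as FASTA.
# Accepted extensions per the message above: fas, fasta, pep, fa, faa.
count_orthofinder_inputs() {
    local dir="$1"
    find "$dir" -maxdepth 1 \( -type f -o -type l \) \
        \( -name '*.fas' -o -name '*.fasta' -o -name '*.pep' \
           -o -name '*.fa' -o -name '*.faa' \) \
        | wc -l | tr -d ' '
}

# Fail early, with a readable message, if fewer than two inputs are usable.
check_orthofinder_inputs() {
    local dir="$1" n
    n=$(count_orthofinder_inputs "$dir")
    if [ "$n" -lt 2 ]; then
        echo "Only $n usable FASTA file(s) in $dir; OrthoFinder needs at least two species" >&2
        return 1
    fi
}
```

Calling `check_orthofinder_inputs phylogeny` before the `orthofinder` command would stop the process with a clearer message; it does not by itself address why only the reference file reached the directory.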

Sydney-Informatics-Hub / ONT-bacpac-nf