NBISweden / Earth-Biogenome-Project-pilot

Assembly and Annotation workflows for analysing data in the Earth Biogenome Project pilot project.
https://www.earthbiogenome.org/
GNU General Public License v3.0

Pipeline fail at BUSCO and PURGE_DUPS #67

gbdias closed this issue 10 months ago

gbdias commented 10 months ago

Describe the bug

The pipeline execution trace shows a failed status for the process EVALUATE_ASSEMBLY:BUSCO and an aborted status for the process PURGE_DUPLICATES:MINIMAP2_ALIGN_READS.

To Reproduce

Steps to reproduce the behavior:

RESULTS="${PWD/analyses/data/outputs}"
NEXTFLOW_OPTS=${NEXTFLOW_OPTS:-"-resume -ansi-log false"}
export NXF_SINGULARITY_CACHEDIR=${NXF_SINGULARITY_CACHEDIR:-"/proj/snic2021-6-194/nobackup/ebp-singularity-cache"}

source activate nextflow-env

nextflow run /home/guibo205/git/NBIS/Earth-Biogenome-Project-pilot $NEXTFLOW_OPTS \
    -profile uppmax,execution_report \
    --input assembly_parameters.yml \
    --outdir "${RESULTS}" \
    --project 'naiss2023-5-307' \
    -c custom.config

nextflow clean -f -before $( nextflow log -q | tail -n 1 )

assembly_parameters.yml:

# Mandatory - sample metadata
sample:
  id: 'Gomphus_clavatus'
  kmer_size: 31
  ploidy: 2
  busco_linages:
    - 'bacteria_odb10'
    - 'basidiomycota_odb10'
    - 'agaricomycetes_odb10'
# Optional - frozen/finalized assemblies
#assembly:
#  - id: 'prefix-buildID'
#    pri_fasta: '/path/to/data'
#    alt_fasta: '/path/to/data'
# Optional - Hi-C data if available
#hic:
#  - read1: ''
#    read2: '/path/to/data'
# Optional - HiFi data if available
hifi:
  - reads: '/proj/snic2021-6-194/VREBP-Gomphus_clavatus-2023-AsmAnno/data/raw-data/PacBio-HiFi-WGS/hifiwgs.fastq.gz'
#  - reads: '/path/to/data'
# Optional - RNASeq data if available
rnaseq:
  - read1: '/proj/snic2021-6-194/VREBP-Gomphus_clavatus-2023-AsmAnno/data/raw-data/Illumina-RNAseq/rnaseq_R1.fastq.gz'
    read2: '/proj/snic2021-6-194/VREBP-Gomphus_clavatus-2023-AsmAnno/data/raw-data/Illumina-RNAseq/rnaseq_R2.fastq.gz'
# Optional - Isoseq data if available
isoseq:
  - reads: '/proj/snic2021-6-194/VREBP-Gomphus_clavatus-2023-AsmAnno/data/raw-data/PacBio-HiFi-ISOSEQ/hq_transcripts.fasta'

Expected behavior

Purge_dups and BUSCO should complete and generate the expected outputs.

Screenshots

#! /usr/bin/env bash
#SBATCH -A naiss2023-5-307
#SBATCH -p core
#SBATCH -n 1
#SBATCH -t 3-00:00:00
#SBATCH -J gc_ebp

RESULTS="${PWD/analyses/data/outputs}"
NEXTFLOW_OPTS=${NEXTFLOW_OPTS:-"-resume -ansi-log false"}
export NXF_SINGULARITY_CACHEDIR=${NXF_SINGULARITY_CACHEDIR:-"/proj/snic2021-6-194/nobackup/ebp-singularity-cache"}

source activate nextflow-env

nextflow run /home/guibo205/git/NBIS/Earth-Biogenome-Project-pilot $NEXTFLOW_OPTS \
    -profile uppmax,execution_report \
    --input assembly_parameters.yml \

N E X T F L O W  ~  version 23.04.1
WARN: It appears you have never run this project before -- Option `-resume` is ignored
Launching `/home/guibo205/git/NBIS/Earth-Biogenome-Project-pilot/main.nf` [happy_hilbert] DSL2 - revision: c23d5c5980

    Running NBIS Earth Biogenome Project Assembly workflow.

Pulling Singularity image docker://quay.io/biocontainers/hifiasm:0.19.8--h43eeafb_0 [cache /proj/snic2021-6-194/nobackup/ebp-singularity-cache/quay.io-biocontainers-hifiasm-0.19.8--h43eeafb_0.img]
[40/ca3078] Submitted process > BUILD_HIFI_DATABASES:FASTK_FASTK (Gomphus_clavatus)
Staging foreign file: https://gembox.cbcb.umd.edu/mash/refseq.genomes%2Bplasmid.k21s1000.msh
[ee/835901] Submitted process > HIFIASM (Gomphus_clavatus)
[6f/50f0ad] Submitted process > SCREEN_READS:MASH_SCREEN (Gomphus_clavatus)
[59/8de15b] Submitted process > GENOME_PROPERTIES:MERQURYFK_PLOIDYPLOT (Gomphus_clavatus)
[56/5808a4] Submitted process > GENOME_PROPERTIES:MERQURYFK_KATGC (Gomphus_clavatus)
[2d/ebf96a] Submitted process > GENOME_PROPERTIES:FASTK_HISTEX (Gomphus_clavatus)
[9f/32940b] Submitted process > GENOME_PROPERTIES:GENESCOPEFK (Gomphus_clavatus)
[cb/44927f] Submitted process > SCREEN_READS:MASH_FILTER (Gomphus_clavatus)
[bb/58374f] Submitted process > GFASTATS (Gomphus_clavatus)
[45/eed2ad] Submitted process > GFATOOLS_GFA2FA (Gomphus_clavatus)
[20/36f3b4] Submitted process > GFATOOLS_GFA2FA (Gomphus_clavatus)
[1b/9ad4a6] Submitted process > GFASTATS (Gomphus_clavatus)
[e9/dd136b] Submitted process > EVALUATE_ASSEMBLY:BUSCO (hifiasm-auto)
[b6/dcd967] Submitted process > PURGE_DUPLICATES:PURGEDUPS_SPLITFA_PRIMARY (hifiasm)
[05/981ea1] Submitted process > PURGE_DUPLICATES:MINIMAP2_ALIGN_READS (hifiasm)
[05/a16ade] Submitted process > COMPARE_ASSEMBLIES:QUAST (Gomphus_clavatus)
[ff/ff9cc9] Submitted process > PURGE_DUPLICATES:MINIMAP2_ALIGN_ASSEMBLY_PRIMARY (hifiasm)
ERROR ~ Error executing process > 'EVALUATE_ASSEMBLY:BUSCO (hifiasm-auto)'

Caused by:
  Missing output file(s) `*-busco/*/run_*/busco_sequences` expected by process `EVALUATE_ASSEMBLY:BUSCO (hifiasm-auto)`

Command executed:

  # Nextflow changes the container --entrypoint to /bin/bash (container default entrypoint: /usr/local/env-execute)
  # Check for container variable initialisation script and source it.
  if [ -f "/usr/local/env-activate.sh" ]; then
      set +u  # Otherwise, errors out because of various unbound variables
      . "/usr/local/env-activate.sh"
      set -u
  fi

  # If the augustus config directory is not writable, then copy to writeable area
  if [ ! -w "${AUGUSTUS_CONFIG_PATH}" ]; then
      # Create writable tmp directory for augustus
      AUG_CONF_DIR=$( mktemp -d -p $PWD )
      cp -r $AUGUSTUS_CONFIG_PATH/* $AUG_CONF_DIR
      export AUGUSTUS_CONFIG_PATH=$AUG_CONF_DIR
      echo "New AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH}"
  fi

  # Ensure the input is uncompressed
  INPUT_SEQS=input_seqs
  mkdir "$INPUT_SEQS"
  cd "$INPUT_SEQS"
  for FASTA in ../tmp_input/*; do
      if [ "${FASTA##*.}" == 'gz' ]; then
          gzip -cdf "$FASTA" > $( basename "$FASTA" .gz )
      else
          ln -s "$FASTA" .
      fi
  done
  cd ..

  busco \
      --cpu 6 \
      --in "$INPUT_SEQS" \
      --out hifiasm-auto-busco \
      --auto-lineage \
       \
       \
      --mode genome

  # clean up
  rm -rf "$INPUT_SEQS"

  # Move files to avoid staging/publishing issues
  mv hifiasm-auto-busco/batch_summary.txt hifiasm-auto-busco.batch_summary.txt
  mv hifiasm-auto-busco/*/short_summary.*.{json,txt} . || echo "Short summaries were not available: No genes were found."

  cat <<-END_VERSIONS > versions.yml
  "EVALUATE_ASSEMBLY:BUSCO":
      busco: $( busco --version 2>&1 | sed 's/^BUSCO //' )
  END_VERSIONS

Command exit status:
  0

Command output:
  2023-11-14 00:51:40 INFO:     [hmmsearch]     51 of 255 task(s) completed
  2023-11-14 00:51:41 INFO:     [hmmsearch]     77 of 255 task(s) completed
  2023-11-14 00:51:41 INFO:     [hmmsearch]     102 of 255 task(s) completed
  2023-11-14 00:51:41 INFO:     [hmmsearch]     128 of 255 task(s) completed
  2023-11-14 00:51:42 INFO:     [hmmsearch]     153 of 255 task(s) completed
  2023-11-14 00:51:43 INFO:     [hmmsearch]     179 of 255 task(s) completed
  2023-11-14 00:51:44 INFO:     [hmmsearch]     204 of 255 task(s) completed
  2023-11-14 00:51:44 INFO:     [hmmsearch]     230 of 255 task(s) completed
  2023-11-14 00:51:47 INFO:     [hmmsearch]     255 of 255 task(s) completed
  2023-11-14 00:51:48 INFO:     Results:        C:94.1%[S:93.7%,D:0.4%],F:4.3%,M:1.6%,n:255

  2023-11-14 00:51:48 INFO:     Extracting missing and fragmented buscos from the file refseq_db.faa...
  2023-11-14 00:51:50 INFO:     Running 1 job(s) on metaeuk, starting at 11/14/2023 00:51:50
  2023-11-14 00:57:32 INFO:     [metaeuk]       1 of 1 task(s) completed
  2023-11-14 00:57:33 INFO:     ***** Run HMMER on gene sequences *****
  2023-11-14 00:57:33 INFO:     Running 15 job(s) on hmmsearch, starting at 11/14/2023 00:57:33
  2023-11-14 00:57:34 INFO:     [hmmsearch]     2 of 15 task(s) completed
  2023-11-14 00:57:34 INFO:     [hmmsearch]     3 of 15 task(s) completed
  2023-11-14 00:57:34 INFO:     [hmmsearch]     5 of 15 task(s) completed
  2023-11-14 00:57:34 INFO:     [hmmsearch]     6 of 15 task(s) completed
  2023-11-14 00:57:34 INFO:     [hmmsearch]     8 of 15 task(s) completed
  2023-11-14 00:57:34 INFO:     [hmmsearch]     9 of 15 task(s) completed
  2023-11-14 00:57:34 INFO:     [hmmsearch]     11 of 15 task(s) completed
  2023-11-14 00:57:34 INFO:     [hmmsearch]     12 of 15 task(s) completed
  2023-11-14 00:57:34 INFO:     [hmmsearch]     14 of 15 task(s) completed
  2023-11-14 00:57:34 INFO:     [hmmsearch]     15 of 15 task(s) completed
  2023-11-14 00:57:41 INFO:     Validating exons and removing overlapping matches
  2023-11-14 00:57:42 INFO:     Results:        C:96.1%[S:95.7%,D:0.4%],F:2.7%,M:1.2%,n:255

  2023-11-14 00:57:43 INFO:     eukaryota_odb10 selected

  2023-11-14 00:57:43 INFO:     ***** Searching tree for chosen lineage to find best taxonomic match *****

  2023-11-14 00:57:44 INFO:     Extract markers...
  2023-11-14 00:57:44 INFO:     Downloading file 'https://busco-data.ezlab.org/v5/data/placement_files/list_of_reference_markers.eukaryota_odb10.2019-12-16.txt.tar.gz'
  2023-11-14 00:57:44 INFO:     Decompressing file '/scratch/42565899/nxf.TXgxyRoRSA/busco_downloads/placement_files/list_of_reference_markers.eukaryota_odb10.2019-12-16.txt.tar.gz'
  2023-11-14 00:57:44 INFO:     Downloading file 'https://busco-data.ezlab.org/v5/data/placement_files/tree.eukaryota_odb10.2019-12-16.nwk.tar.gz'
  2023-11-14 00:57:45 INFO:     Decompressing file '/scratch/42565899/nxf.TXgxyRoRSA/busco_downloads/placement_files/tree.eukaryota_odb10.2019-12-16.nwk.tar.gz'
  2023-11-14 00:57:45 INFO:     Downloading file 'https://busco-data.ezlab.org/v5/data/placement_files/tree_metadata.eukaryota_odb10.2019-12-16.txt.tar.gz'
  2023-11-14 00:57:46 INFO:     Decompressing file '/scratch/42565899/nxf.TXgxyRoRSA/busco_downloads/placement_files/tree_metadata.eukaryota_odb10.2019-12-16.txt.tar.gz'
  2023-11-14 00:57:46 INFO:     Downloading file 'https://busco-data.ezlab.org/v5/data/placement_files/supermatrix.aln.eukaryota_odb10.2019-12-16.faa.tar.gz'
  2023-11-14 00:57:49 INFO:     Decompressing file '/scratch/42565899/nxf.TXgxyRoRSA/busco_downloads/placement_files/supermatrix.aln.eukaryota_odb10.2019-12-16.faa.tar.gz'
  2023-11-14 00:57:49 INFO:     Downloading file 'https://busco-data.ezlab.org/v5/data/placement_files/mapping_taxids-busco_dataset_name.eukaryota_odb10.2019-12-16.txt.tar.gz'
  2023-11-14 00:57:49 INFO:     Decompressing file '/scratch/42565899/nxf.TXgxyRoRSA/busco_downloads/placement_files/mapping_taxids-busco_dataset_name.eukaryota_odb10.2019-12-16.txt.tar.gz'
  2023-11-14 00:57:49 INFO:     Downloading file 'https://busco-data.ezlab.org/v5/data/placement_files/mapping_taxid-lineage.eukaryota_odb10.2019-12-16.txt.tar.gz'
  2023-11-14 00:57:50 INFO:     Decompressing file '/scratch/42565899/nxf.TXgxyRoRSA/busco_downloads/placement_files/mapping_taxid-lineage.eukaryota_odb10.2019-12-16.txt.tar.gz'
  2023-11-14 00:57:50 INFO:     Place the markers on the reference tree...
  2023-11-14 00:57:50 INFO:     Running 1 job(s) on sepp, starting at 11/14/2023 00:57:50
  2023-11-14 01:00:52 INFO:     [sepp]  1 of 1 task(s) completed
  Short summaries were not available: No genes were found.

Command error:
  INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
  INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
  INFO:    Environment variable SINGULARITYENV_SNIC_TMP is set, but APPTAINERENV_SNIC_TMP is preferred
  2023-11-14 01:00:52 ERROR:    Placements failed. Try to rerun increasing the memory or select a lineage manually.
  mv: cannot stat 'hifiasm-auto-busco/*/short_summary.*.json': No such file or directory
  mv: cannot stat 'hifiasm-auto-busco/*/short_summary.*.txt': No such file or directory

Work dir:
  /crex/proj/snic2021-6-194/VREBP-Gomphus_clavatus-2023-AsmAnno/analyses/01_assembly-workflow_fourth-run_rackham/work/e9/dd136be530ebb9845f40d19f627a84

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details

        The workflow completed unsuccessfully.

        Please read over the error message. If you are unable to solve it, please
        post an issue at https://github.com/NBISweden/Earth-Biogenome-Project-pilot/issues
        where we will do our best to help.

WARN: Killing running tasks (1)


mahesh-panchal commented 10 months ago

There's something strange here. BUSCO should be using the lineages you specified, but the command above ran with --auto-lineage instead. Ah, there's a spelling error in the parameter file: busco_linages should be busco_lineages.
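
A misspelled key like this is easy to miss, because the run above started without any complaint and only failed later during lineage placement. As a rough sketch (not part of the pipeline), a pre-flight check along these lines could catch it before submitting the job. The function name `check_busco_key` is hypothetical, and the check only looks for the key name; it does not validate the YAML.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of the pipeline): confirm that the
# parameter file contains the key 'busco_lineages'.
check_busco_key() {
    local params_file="$1"
    # Match the key at any indentation; commented-out lines start with '#'
    # and therefore do not match.
    if grep -Eq '^[[:space:]]*busco_lineages[[:space:]]*:' "$params_file"; then
        echo "OK: busco_lineages found in $params_file"
    else
        # In the run reported here, the misspelled key coincided with BUSCO
        # being launched with --auto-lineage.
        echo "WARNING: busco_lineages not found in $params_file" >&2
        return 1
    fi
}
```

It could be called as `check_busco_key assembly_parameters.yml` just before the `nextflow run` line in the submission script.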

gbdias commented 10 months ago

Thanks! Any idea why PURGE_DUPS may have aborted?

mahesh-panchal commented 10 months ago

It aborted because BUSCO failed. By default, Nextflow terminates any queued or running jobs as soon as a task fails. We can change this behaviour, but it's nice to fail early if something goes wrong.
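
For reference, one way to change that behaviour is Nextflow's `errorStrategy` process directive: setting it to `'finish'` lets already-submitted jobs complete before the run shuts down. The sketch below writes that setting to a small config file. The file name `finish_on_error.config` is hypothetical, and I haven't checked whether the pipeline's own configuration would override it.

```shell
#!/usr/bin/env bash
# Sketch: create a small Nextflow config file that switches the error
# strategy from the default ('terminate') to 'finish'.
# The file name 'finish_on_error.config' is hypothetical.
cat > finish_on_error.config <<'EOF'
// Let already-submitted tasks complete before shutting down after an error.
process.errorStrategy = 'finish'
EOF
```

The same line could instead be added to the custom.config that the run already passes with `-c`.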

mahesh-panchal commented 10 months ago

Let me know whether the fix works, and then we can close the issue.

gbdias commented 10 months ago

Confirmed: the typo was the cause of BUSCO failing and PURGE_DUPS aborting.