AVR-biosecurity-bioinformatics / freyr

A Nextflow-based metabarcoding pipeline for agricultural biosecurity and biosurveillance
0 stars 0 forks source link

Handle case where no samples exceed `min_sample_reads` loci parameter #3

Open jackscanlan opened 4 months ago

jackscanlan commented 4 months ago

Why this matters

Ideal behaviour

Would be good if PHYLOSEQ_FILTER still produces the expected output files when all samples are removed from a locus, but they're just empty files

Detail

Error currently occurs when params.subsample = 1 using test_data/dual (-profile test), as min_sample_reads = 1000:

ERROR ~ Error executing process > 'FREYR:PHYLOSEQ_FILTER (fwhF2-fwhR2nDac)'

Caused by:
  Process `FREYR:PHYLOSEQ_FILTER (fwhF2-fwhR2nDac)` terminated with an error exit status (1)

Command executed:

  #!/usr/bin/env Rscript

  ### defining Nextflow environment variables as R variables
  ## input channel variables
  pcr_primers =           "fwhF2-fwhR2nDac"
  ps =                    "ps_unfiltered_fwhF2-fwhR2nDac.rds"
  target_kingdom =        "Metazoa"
  target_phylum =         "Arthropoda"
  target_class =          "Insecta"
  target_order =          "Diptera"
  target_family =         "NA"
  target_genus =          "NA"
  target_species =        "NA"
  min_sample_reads =      "1000"
  min_taxa_reads =        "NA"
  min_taxa_ra =           "1e-4"

  ## global variables
  projectDir = "/group/pathogens/IAWS/Personal/JackS/dev/freyr"
  params_dict = "[help:null, data_folder:test_data/dual, refdir:reference, samplesheet:test_data/dual/samplesheet_read_dir.csv, loci_params:test_data/dual/loci_params.csv, extension:null, illumina:true, pacbio:false, nanopore:false, paired:true, high_sensitivity:true, primer_n_trim:false, threads:null, rdata:true, subsample:1, slurm_account:pathogens, max_memory:2.GB, max_cpus:1, max_time:10.m]"

  tryCatch({
  ### source functions and themes, load packages, and import Nextflow params
  ### from "bin/process_start.R"
  sys.source("/group/pathogens/IAWS/Personal/JackS/dev/freyr/bin/process_start.R", envir = .GlobalEnv)

  ### run module code
  sys.source(
      "/group/pathogens/IAWS/Personal/JackS/dev/freyr/bin/phyloseq_filter.R", # run script
      envir = .GlobalEnv # this allows import of existing objects like projectDir
  )
  }, finally = {
  ### save R environment for debugging
  if ("true" == "true") { save.image(file = "FREYR:PHYLOSEQ_FILTER_2.rda") } 
  })

Command exit status:
  1

Command output:

  2024-07-16T15:22:30 Pulling Image: docker:jackscanlan/piperline-multi:0.0.1, status: READY
  2a3f21b84278741a6d7226093a69f1463064515cf6d5dbdd3fe172039b8b6c05
  2a3f21b84278741a6d7226093a69f1463064515cf6d5dbdd3fe172039b8b6c05
  [1] "Input variable 'projectDir' = /group/pathogens/IAWS/Personal/JackS/dev/freyr"
  [1] "Input variable 'pcr_primers' = fwhF2-fwhR2nDac"
  [1] "Input variable 'ps' = ps_unfiltered_fwhF2-fwhR2nDac.rds"
  [1] "Input variable 'target_kingdom' = Metazoa"
  [1] "Input variable 'target_phylum' = Arthropoda"
  [1] "Input variable 'target_class' = Insecta"
  [1] "Input variable 'target_order' = Diptera"
  [1] "Input variable 'target_family' = NA"
  [1] "Input variable 'target_genus' = NA"
  [1] "Input variable 'target_species' = NA"
  [1] "Input variable 'min_sample_reads' = 1000"
  [1] "Input variable 'min_taxa_reads' = NA"
  [1] "Input variable 'min_taxa_ra' = 1e-4"

Command error:
  Loading required package: BiocGenerics

  Attaching package: ‘BiocGenerics’

  The following objects are masked from ‘package:stats’:

      IQR, mad, sd, var, xtabs

  The following objects are masked from ‘package:base’:

      Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
      as.data.frame, basename, cbind, colnames, dirname, do.call,
      duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
      lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
      pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
      tapply, union, unique, unsplit, which.max, which.min

  Loading required package: S4Vectors
  Loading required package: stats4

  Attaching package: ‘S4Vectors’

  The following objects are masked from ‘package:base’:

      I, expand.grid, unname

  Loading required package: IRanges
  Loading required package: XVector
  Loading required package: GenomeInfoDb
  Error in validObject(.Object) : invalid class “otu_table” object: 
   OTU abundance data must have non-zero dimensions.
  Calls: tryCatch ... .nextMethod -> callNextMethod -> .nextMethod -> validObject
  In addition: There were 23 warnings (use warnings() to see them)
  Execution halted

Work dir:
  /group/pathogens/IAWS/Personal/JackS/dev/freyr/work/31/d78d5fc68ccb75f471371649ac5a1d

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

This occurs because the following code does not produce an "empty" phyloseq object when pruning out all samples, instead returning an error:

ps1 %>%
        phyloseq::prune_samples(phyloseq::sample_sums(.)>=min_sample_reads, .)

Have added a more informative error to phyloseq_filter.R, but this doesn't really solve the issue (ie. produce empty outputs and allow the run to continue):

if(min_sample_reads > 0){
    if (all(phyloseq::sample_sums(ps1)>=min_sample_reads) == FALSE) {
        stop(paste0("ERROR: No samples contained reads above the minimum threshold of ", min_sample_reads, " -- consider lowering this value"))
        }
...
}

To-do:

Need a way to either:

  1. produce empty phyloseq objects when pruning, which could be used to produce empty output files, or
  2. produce empty output files directly
alexpiper commented 3 months ago

There was a little bug in the error handling you added which caused it to stop when any samples were below min_sample_reads, addressed in https://github.com/AVR-biosecurity-bioinformatics/freyr/commit/cf917fbf3d9aab66a2cb2f35a34db7edfa06aeb5

alexpiper commented 3 months ago

Not sure there will be a way to generate an empty phyloseq object, with the way phyloseq::otu_table() is set up.

It may be worth just using base R to filter the matrices, rather than relying on phyloseq. The guts of the phyloseq function are doing this anyway: https://github.com/joey711/phyloseq/blob/master/R/transform_filter-methods.R