Handle case where no samples exceed `min_sample_reads` loci parameter

jackscanlan commented 4 months ago

Why this matters

Users could accidentally set min_sample_reads too high and need to know this is why the run fails
Could occur when doing extreme subsampling, as it is more likely very small numbers of samples will all fail the filter

Ideal behaviour

Would be good if PHYLOSEQ_FILTER still produces the expected output files when all samples are removed from a locus, but they're just empty files

Detail

Error currently occurs when params.subsample = 1 using test_data/dual (-profile test), as min_sample_reads = 1000:

ERROR ~ Error executing process > 'FREYR:PHYLOSEQ_FILTER (fwhF2-fwhR2nDac)'

Caused by:
  Process `FREYR:PHYLOSEQ_FILTER (fwhF2-fwhR2nDac)` terminated with an error exit status (1)

Command executed:

  #!/usr/bin/env Rscript

  ### defining Nextflow environment variables as R variables
  ## input channel variables
  pcr_primers =           "fwhF2-fwhR2nDac"
  ps =                    "ps_unfiltered_fwhF2-fwhR2nDac.rds"
  target_kingdom =        "Metazoa"
  target_phylum =         "Arthropoda"
  target_class =          "Insecta"
  target_order =          "Diptera"
  target_family =         "NA"
  target_genus =          "NA"
  target_species =        "NA"
  min_sample_reads =      "1000"
  min_taxa_reads =        "NA"
  min_taxa_ra =           "1e-4"

  ## global variables
  projectDir = "/group/pathogens/IAWS/Personal/JackS/dev/freyr"
  params_dict = "[help:null, data_folder:test_data/dual, refdir:reference, samplesheet:test_data/dual/samplesheet_read_dir.csv, loci_params:test_data/dual/loci_params.csv, extension:null, illumina:true, pacbio:false, nanopore:false, paired:true, high_sensitivity:true, primer_n_trim:false, threads:null, rdata:true, subsample:1, slurm_account:pathogens, max_memory:2.GB, max_cpus:1, max_time:10.m]"

  tryCatch({
  ### source functions and themes, load packages, and import Nextflow params
  ### from "bin/process_start.R"
  sys.source("/group/pathogens/IAWS/Personal/JackS/dev/freyr/bin/process_start.R", envir = .GlobalEnv)

  ### run module code
  sys.source(
      "/group/pathogens/IAWS/Personal/JackS/dev/freyr/bin/phyloseq_filter.R", # run script
      envir = .GlobalEnv # this allows import of existing objects like projectDir
  )
  }, finally = {
  ### save R environment for debugging
  if ("true" == "true") { save.image(file = "FREYR:PHYLOSEQ_FILTER_2.rda") } 
  })

Command exit status:
  1

Command output:

  2024-07-16T15:22:30 Pulling Image: docker:jackscanlan/piperline-multi:0.0.1, status: READY
  2a3f21b84278741a6d7226093a69f1463064515cf6d5dbdd3fe172039b8b6c05
  2a3f21b84278741a6d7226093a69f1463064515cf6d5dbdd3fe172039b8b6c05
  [1] "Input variable 'projectDir' = /group/pathogens/IAWS/Personal/JackS/dev/freyr"
  [1] "Input variable 'pcr_primers' = fwhF2-fwhR2nDac"
  [1] "Input variable 'ps' = ps_unfiltered_fwhF2-fwhR2nDac.rds"
  [1] "Input variable 'target_kingdom' = Metazoa"
  [1] "Input variable 'target_phylum' = Arthropoda"
  [1] "Input variable 'target_class' = Insecta"
  [1] "Input variable 'target_order' = Diptera"
  [1] "Input variable 'target_family' = NA"
  [1] "Input variable 'target_genus' = NA"
  [1] "Input variable 'target_species' = NA"
  [1] "Input variable 'min_sample_reads' = 1000"
  [1] "Input variable 'min_taxa_reads' = NA"
  [1] "Input variable 'min_taxa_ra' = 1e-4"

Command error:
  Loading required package: BiocGenerics

  Attaching package: ‘BiocGenerics’

  The following objects are masked from ‘package:stats’:

      IQR, mad, sd, var, xtabs

  The following objects are masked from ‘package:base’:

      Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
      as.data.frame, basename, cbind, colnames, dirname, do.call,
      duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
      lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
      pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
      tapply, union, unique, unsplit, which.max, which.min

  Loading required package: S4Vectors
  Loading required package: stats4

  Attaching package: ‘S4Vectors’

  The following objects are masked from ‘package:base’:

      I, expand.grid, unname

  Loading required package: IRanges
  Loading required package: XVector
  Loading required package: GenomeInfoDb
  Error in validObject(.Object) : invalid class “otu_table” object: 
   OTU abundance data must have non-zero dimensions.
  Calls: tryCatch ... .nextMethod -> callNextMethod -> .nextMethod -> validObject
  In addition: There were 23 warnings (use warnings() to see them)
  Execution halted

Work dir:
  /group/pathogens/IAWS/Personal/JackS/dev/freyr/work/31/d78d5fc68ccb75f471371649ac5a1d

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

This occurs because the following code does not produce an "empty" phyloseq object when pruning out all samples, instead returning an error:

ps1 %>%
        phyloseq::prune_samples(phyloseq::sample_sums(.)>=min_sample_reads, .)

Have added a more informative error to phyloseq_filter.R, but this doesn't really solve the issue (ie. produce empty outputs and allow the run to continue):

if(min_sample_reads > 0){
    if (all(phyloseq::sample_sums(ps1)>=min_sample_reads) == FALSE) {
        stop(paste0("ERROR: No samples contained reads above the minimum threshold of ", min_sample_reads, " -- consider lowering this value"))
        }
...
}

To-do:

Need a way to either:

produce empty phyloseq objects when pruning, which could be used to produce empty output files, or
produce empty output files directly

alexpiper commented 3 months ago

There was a little bug in the error handling you added which caused it to stop when any samples were below min_sample_reads, addressed in https://github.com/AVR-biosecurity-bioinformatics/freyr/commit/cf917fbf3d9aab66a2cb2f35a34db7edfa06aeb5

alexpiper commented 3 months ago

Not sure there will be a way to generate an empty phyloseq object, with the way phyloseq::otu_table() is set up.

It may be worth just using base R to filter the matrices, rather than relying on phyloseq. The guts of the phyloseq function are doing this anyway: https://github.com/joey711/phyloseq/blob/master/R/transform_filter-methods.R

AVR-biosecurity-bioinformatics / freyr