ChristofferFlensburg / superFreq

Analysis pipeline for cancer sequencing data
MIT License
110 stars 33 forks source link

VariantAnnotation step failing in v1.5.1 #116

Closed samleenz closed 11 months ago

samleenz commented 11 months ago

https://github.com/ChristofferFlensburg/superFreq/blob/dd98abd82c183b9faafacca9a244766ed2df3c24/R/runVariantAnnotation.R#L13

Hi Chris, I'm having some issues with the latest version (v1.5.1) in RNA mode, I've added the error below and some details about the run. Running on R 4.3

it's failing at the VariantAnnotation step - at a guess for some reason q is a vector and not a data-frame?

I've had the same issue whether I run with multiple individuals or a single individual (with multiple samples),

Do you have any ideas where I should start with trying to debug this?

Thanks!


This is the tail of the slurm log for the job

Running VariantAnnotation.
Loading annotation dump...done.
Splitting up 130602 variants for parallelisation into 13 batches.
Setting up data bases..done.
Running annotation by batch.............done.
Matching to known cancer variants by batch............done.
Error in q$severity : $ operator is invalid for atomic vectors
Calls: <Anonymous> ... superFreq -> analyse -> annotateSomaticQs -> <Anonymous>
In addition: Warning message:
In mclapply(qList, function(q) { :
  all scheduled cores encountered errors in user code
Execution halted

here's the contents of the R script I was submitting to slurm

library(superFreq)

params <- list(
  metadata_file = "/path/to/analysis/dir/meta_tables/BRO001_metadata.tsv",
  normal_dir = "/path/to/analysis/dir/normals/",
  r_dir = "/path/to/analysis/dir/R",
  plot_dir = "/path/to/analysis/dir/plot",
  reference = "/some/path/ensembl108/Homo_sapiens.GRCh38.dna.primary_assembly.fa",
  genome = "hg38",
  cpus = 12,
  resource_dir = "/path/to/analysis/dir/superFreqResources",
  mode = "RNA"
)

superFreq::superFreq(
  metaDataFile = params$metadata_file, ### <- not default
  normalDirectory = params$normal_dir, ### <- not default
  Rdirectory = params$r_dir, ### <- not default
  plotDirectory = params$plot_dir, ### <- not default
  reference = params$reference, ### <- not default
  genome = params$genome, ### <- not default
  cpus = params$cpus, ### <- not default
  # forceRedo = superFreq::forceRedoNothing(),
  forceRedo = superFreq::forceRedoEverything(), ### <- tried both strategies
  resourceDirectory = params$resource_dir, ### <- not default
  mode = params$mode, ### <- not default
  BQoffset = 33,
  # captureRegions = "",
  outputToTerminalAsWell = T,
  normalCoverageDirectory = "",
  systematicVariance = 0.03,
  maxCov = 150,
  cloneDistanceCut = -qnorm(0.01),
  splitRun = T,
  participants = "all",
  manualStoryMerge = F,
  vepCall = "vep",
  correctReferenceBias = T,
  filterOffTarget = T,
  rareGermline = T,
  exacPopulation = "all",
  cosmicSalvageRate = 0.001,
  annotationMethod = "VariantAnnotation",
  ploidyPriors = NULL,
  isPairedEnd = TRUE,
  countReadPairs = TRUE
)

Variants were precalled with the below code on samtools v1.18 and varscan v2.3.9

samtools mpileup -d 1000 -q 15 -Q 15 -A -f $genome  $bam | \
    varscan  mpileup2cns - --variants --strand-filter 0 --p-value 0.01 \
    --min-var-freq 0.02 --output-vcf 1 > $vcf
ChristofferFlensburg commented 11 months ago

Hi!

Yeah seems there's an error in the annotation. The error is inside the parallelisation, and then R doesn't pass the error message properly, only gives a warning that the runs "encountered errors", and returns NA or NULL or something, which typically leads to a downstream error soon after, here because the NA counts as a vector, as opposed to the expected data frame.

I think easiest would be if you rerun with cpus=1, then it won't parallelise, and it'll return a (hopefully) informative error message. Also send through the output in "/path/to/analysis/dir/R/runtimeTracking.log", which may have clues.

I have two suspects at this point. Could be a version issue. I tested on R 4.2.3, while you run on a later version 4.3. VariantAnnotation, the package I rely on for variant annotation, is sometimes making quite drastic changes between versions and it might be that the R 4.3 version of it breaks some compatibility with superFreq 1.5.1 (which is for R 4.2.3). Otherwise, I have been making some small changes to that part of the code recently, so while it passed my tests, there might still be bugs. The second is probably the easier to fix.

On a different note, over 100k somatic variants sounds very high, especially for an RNA-Seq sample with a matched normal. You might want to keep it in mind, but that'll be easier to QC once the run is through. For now, maybe just double check that "/path/to/analysis/dir/meta_tables/BRO001_metadata.tsv" has the cancer sample and the correct matched normal in it.

samleenz commented 11 months ago

Thanks for getting back so quickly!

As it turns out the error was an oversight on my part - I'd downloaded the resource directory using the previous version of superfreq! I emptied that to force it to re-download the resources and it's just finished running without issue on a couple of different patients :) It could be an idea for the resources to get re-downloaded if forceRedoEverything()is used -- or a warning to be printed if you don't want to replace the files without prompting?

Re the variant number -> I'll have to look more into the QC to see if something has gone wrong there. I'll follow up with you soon about that if its ok? Unfortunately we don't actually have matched normal samples at this point. It may be an option to generate some down the track but at the moment we're working with multiple metastatic sites per patient