ChristofferFlensburg / superFreq

Analysis pipeline for cancer sequencing data
MIT License
109 stars 33 forks source link

findCorrespondingNormal returns missing value #102

Closed hosseinvk closed 1 year ago

hosseinvk commented 1 year ago

Hi Christoffer, When running superFreq, after about 12 minutes I get the following message:

Quitting from lines 331-366 (/tmp/Rtmp4d58zG/annette_report.Rmd) 
Error in if (!hasNormal) return(NA) : 
  missing value where TRUE/FALSE needed
Calls: <Anonymous> ... markSomatics -> <Anonymous> -> sapply -> sapply -> lapply -> FUN

Does that mean normal bam is not found?

Here is how I generated the metadata and the arguments passed to superFreq function:

bam <- list.files("../results_dir_HaplotyeCaller/preprocessing/recalibrated", pattern = "bam", full.names = TRUE, recursive=TRUE)
bam <- bam[!grepl("bai", bam)]
vcf <- list.files("../results_dir_HaplotyeCaller/variant_calling/haplotypecaller", pattern = "vcf.gz", full.names = TRUE, recursive=TRUE)
vcf <- vcf[!grepl("tbi", vcf)]
vcf <- vcf[grepl("filtered", vcf)]
name <- basename(vcf)
name <- gsub(".haplotypecaller.filtered.vcf.gz", "", name)
name <- name[!grepl("tbi", name)]

metadata <- tibble(BAM = bam, VCF = vcf, NAME = name, INDIVIDUAL = rep("IR9", 4),
                   NORMAL = c(rep("NO", 2), "YES", "NO"), 
                   TIMEPOINT = c(1, 2, 0, 4))

write_tsv(metadata, "metadata.tsv")

superFreq("metadata.tsv",
          captureRegions = "agilent/S30409818_Covered.bed",
          normalDirectory = "data/bam",
          Rdirectory = "data/Rdirectory",
          plotDirectory = "data/plotDirectory",
          reference = "../iGenomes/references/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta",
          genome = "hg38",
          mode = 'exome',
          cpus = 12)

The normalDirectory has the normal .bam and .bam.bai which are copied from ../results_dir_HaplotyeCaller/preprocessing/recalibrated

Thanks for help.

ChristofferFlensburg commented 1 year ago

Hi!

That part where it crashed is just matching up samples to their matched normal samples, ie another sample with the same INDIVIDUAL in the metadata, and YES in the NORMAL column. hasNormal is just a check if a sample has a matched normal or not, and guides how it'll do the somatic variant calling, with respect to the matched normal or guided by popluation frequencies from exac and dbSNP. It seems that hasNormal is NA in your run, which is causing the error.

It has already read in data from the bam and vcf files here, so that should not be the problem.

First thing that comes to mind, and easiest solution, is that there is something wrong in the meta data. The code you copied seems fine, but maybe you can paste the metaData.tsv file here? In particular we want to look at INDIVIDUAL and NORMAL columns. I know that some R functions like to add quotation marks around character entries when writing to file, which could confuse superFreq when reading in "YES" instead of just YES.

If that's not the cause, then I'd need to look closer. I recently pushed a new version to github, so send through a sessionInfo() together with the metaData.

hosseinvk commented 1 year ago

Dear Christoffer,

Attached are the runtimeTracking.log and metedata.tsv files. I need to confirm that, my run with Mutect2 variant caller output was successful but the issue here is with the haplotypecaller output. metadata.txt runtimeTracking.txt

I have changed the suffixes to .txt so to attache them here. Thanks.

ChristofferFlensburg commented 1 year ago

Hmm, so superFreq saves a lot of information to the R directory, and will re-use it possible. The log file you sent shows that all of the variant calls were imported from a previous run. So if you changed the vcf input files, but reran with the same R directory, then it'd re-use the old variants from the previous vcf files. So if you run twice with different vcf files, you need to use different R directories and different plot directories. For example Rdirectory_mutect and Rdirectory_haplotypecaller, and same for plots, and probably also have two version of the meta data files. In general, you want the VCF files to include both (liberal) somatic and germline variants, so you could here give the somatic variants from mutect VCF for the cancer samples (with open settings, superFreq will filter noisy variants), and haplotypecaller variants for the normal sample. SueprFreq cross-checks all variants across all samples, so a variant only has to be in one of the VCFs for the INDIVIDUAL.

I'm not sure how that'd cause this error, but as your mutect run went through, I'd suggest rerunning in a fresh Rdirectory and plotdirectory, either with all haplotypecaller VCF, or with the mixed VCFs as suggested above. You probably want to do that anyway. With some luck that'll magically work, and if not, then at least you can send me a log file that includes all of the variant calling logs that might have more clues for me.

hosseinvk commented 1 year ago

Hi Christoffer. I will redo my analyses following your instruction and update you accordingly. Many thanks.

hosseinvk commented 1 year ago

Hi Christoffer, I confirm that renaming Rdirectory, plotDirectory and metadata has let superFreq run and finish the job. Thank you for your kind support.