ChristofferFlensburg / superFreq

Analysis pipeline for cancer sequencing data
MIT License

Error running SuperFreq script - Error in load(RsaveFile) #59

Closed dsampath31 closed 4 years ago

dsampath31 commented 4 years ago

Hi,

I started running my superfreq.R script with all the metadata and two reference normals. The variants DB file is present in my directory, so why does it fail to load?

>>>>Rscript test_superfreq_run.R
Loading required package: WriteXLS
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, append, as.data.frame, basename, cbind, colnames,
dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
union, unique, unsplit, which, which.max, which.min

Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

expand.grid

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: biomaRt
Loading required package: Rsamtools
Loading required package: Biostrings
Loading required package: XVector

Attaching package: ‘Biostrings’

The following object is masked from ‘package:base’:

strsplit

Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.1 (2020-08-26 16:20:06 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.24.0 (2020-08-26 16:11:58 UTC) successfully loaded. See ?R.oo for help.

Attaching package: ‘R.oo’

The following object is masked from ‘package:R.methodsS3’:

throw

The following object is masked from ‘package:GenomicRanges’:

trim

The following object is masked from ‘package:IRanges’:

trim

The following objects are masked from ‘package:methods’:

getClasses, getMethods

The following objects are masked from ‘package:base’:

attach, detach, load, save

Loading required package: Rsubread
Loading required package: limma

Attaching package: ‘limma’

The following object is masked from ‘package:BiocGenerics’:

plotMA

Loading required package: MutationalPatterns
Loading required package: NMF
Loading required package: pkgmaker
Loading required package: registry

Attaching package: ‘pkgmaker’

The following object is masked from ‘package:S4Vectors’:

new2

Loading required package: rngtools
Loading required package: cluster
NMF - BioConductor layer [OK] | Shared memory capabilities [NO: bigmemory] | Cores 15/16
To enable shared memory capabilities, try: install.extras('NMF')

Attaching package: ‘NMF’

The following object is masked from ‘package:S4Vectors’:

nrun

Loading required package: BSgenome.Mmusculus.UCSC.mm10
Loading required package: BSgenome
Loading required package: rtracklayer
Loading required package: BSgenome.Hsapiens.UCSC.hg19
Loading required package: BSgenome.Hsapiens.UCSC.hg38

Attaching package: ‘BSgenome.Hsapiens.UCSC.hg38’

The following object is masked from ‘package:BSgenome.Hsapiens.UCSC.hg19’:

Hsapiens

Attaching package: ‘superFreq’

The following object is masked from ‘package:limma’:

plotMA

The following object is masked from ‘package:BiocGenerics’:

plotMA

Splitting meta data into participants.
Loading sample meta data from file...done.
Planning to run over these participants:
sim_0.2
Now running:
Fri Sep 4 06:33:22 2020 : sim_0.2 ...

2020-09-04 06:33:22
######################################################################
Running superFreq version 1.4.1
Testing samtools...
samtools 1.10
Using htslib 1.10
Copyright (C) 2019 Genome Research Ltd.
Found samtools 1.10 . Seems ok.
Runtime tracking and QC information printed to /aws-storage/rcecloud/ra/tumor_clonality/dsampath/superfreq_results/sim_0.2/runtimeTracking.log.
Starting run with input files:
sampleMetaDataFile: /aws-storage/rcecloud/ra/tumor_clonality/dsampath/splitMetaData/sim_0.2.tsv
vcfFiles:

Normal directory: /aws-storage/rcecloud/ra/tumor_clonality/dsampath/outputs/bam
Normal coverage directory: /aws-storage/rcecloud/ra/tumor_clonality/dsampath/outputs/bam
dbSNP directory: superFreqResources/dbSNP
capture regions: will be downloaded from superFreq server.
Plotting to /aws-storage/rcecloud/ra/tumor_clonality/dsampath/plots/sim_0.2
Saving R files to /aws-storage/rcecloud/ra/tumor_clonality/dsampath/superfreq_results/sim_0.2
Genome is hg38
Running in exome mode.
exacPopulation is all
Running on at most 4 cpus.
Rare germline variants are shown in output.

Parameters for this run are:
maxCov: 150
systematicVariance: 0.03
cloneDistanceCut: 2.326348
cosmicSalvageRate: 0.001

Normal bamfiles are:
/aws-storage/rcecloud/ra/tumor_clonality/dsampath/outputs/bam/TC_normal.bam
/aws-storage/rcecloud/ra/tumor_clonality/dsampath/outputs/bam/TC_normal2.bam
Normal bamfiles are:
/aws-storage/rcecloud/ra/tumor_clonality/dsampath/outputs/bam/TC_normal.bam
/aws-storage/rcecloud/ra/tumor_clonality/dsampath/outputs/bam/TC_normal2.bam
Loading capture regions..done.
Imported capture regions with 233285 regions and 23567 unique gene names.
Mean GC content is 0.481.
Loading sample meta data from file...done.
Deciding which pairs to scatter plot..done.
Deciding which time series to plot..done.
##################################################################################################

2020-09-04 06:33:23
Imported and sanity checked meta data. Looking good so far!
metadata:
BAM VCF INDIVIDUAL NAME TIMEPOINT NORMAL
/aws-storage/rcecloud/ra/tumor_clonality/dsampath/outputs/TC_sim_0.2.bam /aws-storage/rcecloud/ra/tumor_clonality/dsampath/outputs/TC_sim_0.2-null.vcf.gz sim_0.2 TC_sim_0.2 diagnosis NO

timeSeries: TC_sim_0.2

##################################################################################################

2020-09-04 06:33:24
Current memory use (Mb): 632.3, max use (Mb): 632.3
Starting differential coverage analysis by sample.
Loading saved differential coverage results.
Loaded saved differential coverage results
2020-09-04 06:33:26
Current memory use (Mb): 691.2, max use (Mb): 691.2
Plotting volcanoes to /aws-storage/rcecloud/ra/tumor_clonality/dsampath/plots/sim_0.2/volcanoes/..done!
2020-09-04 06:33:27
Current memory use (Mb): 691.2, max use (Mb): 691.2
Using variants by individual.

Variants for sim_0.2
Loading saved variants for sim_0.2
Importing dbSNP allele frequencies from superFreqResources/dbSNP/hg38/dbAFnew.Rdata...Error in load(RsaveFile) : error reading from connection
Calls: superFreq ... analyse -> getVariantsByIndividual -> matchTodbSNPs -> load
Execution halted

ChristofferFlensburg commented 4 years ago

Hi!

The file may be corrupted in some way, or (less likely) there may be a permissions issue. Try reading it in manually in R to confirm that you get the same message: load('superFreqResources/dbSNP/hg38/dbAFnew.Rdata') (it's pretty big, so it takes ~20s or so).

It might be worth deleting the file and rerunning superFreq, which should automatically redownload it. The file is 678MB, so there may have been problems with the download. If the automated download through R is failing, you can download the file manually from https://gitlab.wehi.edu.au/flensburg.c/superFreq/-/tree/master/dbSNP/hg38, and you can also check that the file size matches.
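For anyone hitting the same error, the manual check described above can be sketched roughly as follows. This is only an illustration: checkRdataFile and its minSizeMB argument are hypothetical names, not part of superFreq, and the size threshold is an assumption based on the 678MB figure mentioned here. It calls base::load explicitly because R.oo masks load when superFreq is attached (see the log above).

```r
# Hypothetical helper (not part of superFreq): checks that a saved R data
# file exists, is at least minSizeMB megabytes, and can actually be loaded.
checkRdataFile <- function(path, minSizeMB = NULL) {
  if (!file.exists(path)) {
    return(list(ok = FALSE, sizeMB = NA_real_, reason = "file does not exist"))
  }
  sizeMB <- file.size(path) / 1e6
  if (!is.null(minSizeMB) && sizeMB < minSizeMB) {
    return(list(ok = FALSE, sizeMB = sizeMB,
                reason = "file is smaller than expected"))
  }
  # Load into a scratch environment so nothing in the workspace is overwritten.
  scratch <- new.env()
  tryCatch({
    base::load(path, envir = scratch)
    list(ok = TRUE, sizeMB = sizeMB, reason = "loaded without error")
  }, error = function(e) {
    # A truncated or corrupted download typically ends up here.
    list(ok = FALSE, sizeMB = sizeMB, reason = conditionMessage(e))
  })
}
```

A call such as checkRdataFile('superFreqResources/dbSNP/hg38/dbAFnew.Rdata', minSizeMB = 600) would then report the size and whether the load succeeded. If ok is FALSE, deleting the file and rerunning superFreq, as suggested above, is the next step.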

Let me know if that doesn't solve it.

ChristofferFlensburg commented 4 years ago

Actually, I see that you are running on the cloud, so the problem might have been that downloads are not allowed. I haven't run it on the cloud myself, but you may have to do the one-time download of the resources locally (just run superFreq, and it will download what you need from the GitLab at the very start of the run; just make sure to match genome hg38 and mode exome to get the correct resources), and then upload that to the cloud together with your data. You can point to the single instance of the resource directory as an option in the superFreq() R function.

dsampath31 commented 4 years ago

Hi,

Thank you very much for mentioning the file size of the database. Mine was only 200 MB, so I removed the downloaded files and ran the script again, and it finished successfully :) A runtime of 40 mins for an 8 GB BAM file is awesome! Thanks again.

ChristofferFlensburg commented 4 years ago

Great to hear! 👍

I take it I can close the issue then? Let me know if not.