ChristofferFlensburg / superFreq

Analysis pipeline for cancer sequencing data
MIT License

Failed annotation - "data set ‘entrez2symbol_hs’ not found" #113

Closed: gbss2 closed this issue 9 months ago

gbss2 commented 9 months ago

Hi @ChristofferFlensburg,

We are running into errors in superFreq while trying to infer CNAs from RNA data.

The first error was related to issue #100 (R 4.3.1 and superFreq 1.4.4):

[E::bcf_hdr_parse_sample_line] Could not parse the "#CHROM.." line, either FORMAT is missing or spaces are present instead of tabs:

#CHROM POS ID REF ALT QUAL FILTER INFO COVERAGE VARIANTREADS VAF

Error in 'purrr::map()':
ℹ In index: 1.
Caused by error in 'h()':
! error in evaluating the argument 'x' in selecting a method for function 'granges': error in evaluating the argument 'x' in selecting a method for function 'seqinfo': no 'header' line "#CHROM POS ID..."?
Backtrace:
▆

  1. ├─superFreq::superFreq(...)
  2. │ └─superFreq::superFreq(...)
  3. │ └─superFreq::analyse(...)
  4. │ └─superFreq:::plotProfiles(...)
  5. │ └─superFreq:::plot104profilesBySample(...)
  6. │ ├─BiocGenerics::lapply(...)
  7. │ └─base::lapply(...)
  8. │ └─superFreq (local) FUN(X[[i]], ...)
  9. │ └─superFreq:::get104profile(...)
 10. │ └─superFreq:::get96signature(...)
 11. │ └─MutationalPatterns::read_vcfs_as_granges(...)
 12. │ ├─... %>% GenomicRanges::GRangesList()
 13. │ └─purrr::map(...)
 14. │ └─purrr:::map_("list", .x, .f, ..., .progress = .progress)
 15. │ ├─purrr:::with_indexed_errors(...)
 16. │ │ └─base::withCallingHandlers(...)
 17. │ ├─purrr:::call_with_cleanup(...)
 18. │ └─MutationalPatterns (local) .f(.x[[i]], ...)
 19. │ ├─base::withCallingHandlers(...)
 20. │ ├─GenomicRanges::granges(VariantAnnotation::readVcf(vcf_file))
 21. │ ├─VariantAnnotation::readVcf(vcf_file)
 22. │ └─VariantAnnotation::readVcf(vcf_file)
 23. │ └─VariantAnnotation (local) .local(file, genome = genome, param = param, ...)
 24. │ └─VariantAnnotation:::.readVcf(...)
 25. │ ├─GenomeInfoDb::seqinfo(scanVcfHeader(file))
 26. │ ├─VariantAnnotation::scanVcfHeader(file)
 27. │ └─VariantAnnotation::scanVcfHeader(file)
 28. │ ├─Rsamtools::scanBcfHeader(file[[1]], ...)
 29. │ └─Rsamtools::scanBcfHeader(file[[1]], ...)
 30. │ └─BiocGenerics::Map(...)
 31. │ ├─BiocGenerics (local) standardGeneric("Map")
 32. │ │ ├─BiocGenerics::eval(mc, env)
 33. │ │ └─base::eval(mc, env)
 34. │ │ └─base::eval(mc, env)
 35. │ └─base::Map(f = f, ...)
 36. │ └─base::mapply(FUN = f, ..., SIMPLIFY = FALSE)
 37. │ └─Rsamtools (local) <fn>(dots[[1L]][[1L]])
 38. │ ├─Rsamtools::scanBcfHeader(bf)
 39. │ └─Rsamtools::scanBcfHeader(bf)
 40. ├─GenomicRanges::GRangesList(.)
 41. ├─base::.handleSimpleError(...)
 42. │ └─base (local) h(simpleError(msg, call))
 43. ├─base::.handleSimpleError(...)
 44. │ └─base (local) h(simpleError(msg, call))
 45. └─base::.handleSimpleError(...)
 46. └─purrr (local) h(simpleError(msg, call))
 47. └─cli::cli_abort(...)
 48. └─rlang::abort(...)
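For anyone hitting the same htslib message, the header line can be checked by hand. The snippet below is a minimal base R sketch, not the check that htslib or superFreq performs; the helper name `check_vcf_header` is hypothetical. It tests whether a `#CHROM` line is tab-delimited with the eight fixed VCF columns, and whether any extra columns are preceded by `FORMAT`. The example header reuses the column names from the error above and assumes tab separators, which cannot be confirmed from the pasted log.

```r
# Hypothetical helper: inspect a single "#CHROM ..." header line from a VCF.
check_vcf_header <- function(header_line) {
  fixed_cols <- c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")
  # Split on literal tabs only; a space-separated header stays as one field.
  fields <- strsplit(header_line, "\t", fixed = TRUE)[[1]]
  tab_ok <- length(fields) >= 8 && identical(fields[1:8], fixed_cols)
  # Columns after INFO are only valid if the ninth column is FORMAT.
  format_ok <- tab_ok && (length(fields) == 8 || identical(fields[9], "FORMAT"))
  c(tab_delimited = tab_ok, format_ok = format_ok)
}

# Column names taken from the error message, joined with tabs (assumption).
header <- paste(
  c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
    "COVERAGE", "VARIANTREADS", "VAF"),
  collapse = "\t"
)
print(check_vcf_header(header))
```

To apply this to a real file, one option is to read the header with `readLines()` and pass the line that starts with `#CHROM`.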

To solve this issue, we updated to the latest version on GitHub (superFreq 1.5.0). However, another error, this time related to the annotation, popped up:

Error in superFreq:::addMostSevereHit(q, allvar, coding, genome) :
  object 'entrez2symbol_hs' not found
Calls: superFreq ... -> lapply -> lapply -> FUN ->
In addition: Warning message:
In data(entrez2symbol_hs) : data set ‘entrez2symbol_hs’ not found
Execution halted

I looked up the code and found out that there is no previous reference to the object entrez2symbol_hs, and this may be generating the error. Is there a way to install version 1.4.5, cited on #100, or a workaround to this new error?

The run log from the last attempt is posted below:

R version 4.3.1 (2023-06-16) -- "Beagle Scouts" Copyright (C) 2023 The R Foundation for Statistical Computing Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R.

library(superFreq)

# maximum number of threads. Limited speed up above ~5 cpus for exomes and RNA-Seq and ~10-20 for genomes.
# Better to parallelise across individuals for cohorts, see the cohort section in the github README.
cpus=10

# this is the meta data input.
metaDataFile = 'new_analysis2023/dataTest.txt'

# This directory with (links to) the reference normals needs to be created and set up.
normalDirectory = 'new_analysis2023/refNormals/bam/'

# The reference fasta and name. Only hg19, hg38 and mm10 available atm.
reference = 'new_analysis2023/Homo_sapiens.GRCh38.dna.primary_assembly.fa'
genome = 'hg38'

# The directory where the log file and saved .Rdata is stored.
Rdirectory = 'new_analysis2023/superfreq2/myAnalysis/R'

# The directory where all the plots and tables from the analysis go.
plotDirectory = 'new_analysis2023/superfreq2/myAnalysis/plots'

# The mode. Default 'exome' is for exomes, while 'RNA' has some minor changes when running on RNA.
# There is also a "genome" mode for genomes: ~24h for cancer-normal at 10 cpus, 200GB memory.
mode = 'RNA'

# this performs the actual analysis. output goes to Rdirectory and plotDirectory.
# runtime is typically less than 6 hours at 4 cpus for a cancer-normal exome, but can vary significantly depending on input.
# For a typical cancer-normal exome, 5-10GB of memory is used per cpus, but again, can vary significantly depending on input.
# later runs typically a bit faster as the setup and part of the analysis on the reference normals can be reused.
data = superFreq(metaDataFile, normalDirectory=normalDirectory, Rdirectory=Rdirectory, plotDirectory=plotDirectory,
                 reference=reference, genome=genome, cpus=cpus, mode=mode)

Splitting meta data into participants.
Loading sample meta data from file...done.
Planning to run over these participants: SRR8518131 SRR8518008
Now running: Wed Nov 8 20:07:46 2023 : SRR8518131 ...

2023-11-08 20:07:46.125483 ###################################################################### Running superFreq version 1.5.0 SessionInfo():

R version 4.3.1 (2023-06-16) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.6 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3; LAPACK version 3.9.0

locale: [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

time zone: America/Sao_Paulo tzcode source: system (glibc)

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base

other attached packages: [1] superFreq_1.5.0 BSgenome.Hsapiens.UCSC.hg38_1.4.5 [3] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome.Mmusculus.UCSC.mm10_1.4.3 [5] BSgenome_1.70.1 rtracklayer_1.62.0
[7] BiocIO_1.12.0 MutationalPatterns_3.12.0
[9] NMF_0.26 bigmemory_4.6.1
[11] Biobase_2.62.0 cluster_2.1.4
[13] rngtools_1.5.2 registry_0.5-1
[15] limma_3.58.1 Rsubread_2.16.0
[17] R.oo_1.25.0 R.methodsS3_1.8.2
[19] Rsamtools_2.18.0 Biostrings_2.70.1
[21] XVector_0.42.0 biomaRt_2.58.0
[23] GenomicRanges_1.54.1 GenomeInfoDb_1.38.0
[25] IRanges_2.36.0 S4Vectors_0.40.1
[27] BiocGenerics_0.48.1 WriteXLS_6.4.0

loaded via a namespace (and not attached): [1] DBI_1.1.3 bitops_1.0-7
[3] rlang_1.1.2 magrittr_2.0.3
[5] gridBase_0.4-7 matrixStats_1.1.0
[7] compiler_4.3.1 RSQLite_2.3.3
[9] GenomicFeatures_1.54.1 png_0.1-8
[11] vctrs_0.6.4 reshape2_1.4.4
[13] ggalluvial_0.12.5 stringr_1.5.0
[15] pkgconfig_2.0.3 crayon_1.5.2
[17] fastmap_1.1.1 dbplyr_2.4.0
[19] utf8_1.2.4 pracma_2.4.2
[21] bit_4.0.5 zlibbioc_1.48.0
[23] cachem_1.0.8 progress_1.2.2
[25] blob_1.2.4 DelayedArray_0.28.0
[27] uuid_1.1-1 BiocParallel_1.36.0
[29] prettyunits_1.2.0 VariantAnnotation_1.48.0
[31] R6_2.5.1 stringi_1.7.12
[33] RColorBrewer_1.1-3 Rcpp_1.0.11
[35] SummarizedExperiment_1.32.0 iterators_1.0.14
[37] Matrix_1.6-1.1 tidyselect_1.2.0
[39] abind_1.4-5 yaml_2.3.7
[41] doParallel_1.0.17 codetools_0.2-19
[43] curl_5.1.0 lattice_0.21-8
[45] tibble_3.2.1 plyr_1.8.9
[47] KEGGREST_1.42.0 BiocFileCache_2.10.1
[49] xml2_1.3.5 pillar_1.9.0
[51] BiocManager_1.30.22 filelock_1.0.2
[53] MatrixGenerics_1.14.0 foreach_1.5.2
[55] generics_0.1.3 RCurl_1.98-1.13
[57] hms_1.1.3 ggplot2_3.4.4
[59] munsell_0.5.0 scales_1.2.1
[61] glue_1.6.2 tools_4.3.1
[63] GenomicAlignments_1.38.0 XML_3.99-0.14
[65] grid_4.3.1 AnnotationDbi_1.64.1
[67] colorspace_2.1-0 GenomeInfoDbData_1.2.11
[69] restfulr_0.0.15 cli_3.6.1
[71] rappdirs_0.3.3 bigmemory.sri_0.1.6
[73] fansi_1.0.5 S4Arrays_1.2.0
[75] dplyr_1.1.3 gtable_0.3.4
[77] digest_0.6.33 SparseArray_1.2.1
[79] rjson_0.2.21 memoise_2.0.1
[81] lifecycle_1.0.4 httr_1.4.7
[83] statmod_1.5.0 bit64_4.0.5

Testing samtools... samtools 1.12 Using htslib 1.12 Copyright (C) 2021 Genome Research Ltd.

Samtools compilation details: Features: build=configure curses=yes CC: gcc CPPFLAGS:
CFLAGS: -Wall -g -O2 LDFLAGS:
HTSDIR: htslib-1.12 LIBS:
CURSES_LIB: -lncursesw

HTSlib compilation details: Features: build=configure plugins=no libcurl=yes S3=yes GCS=yes libdeflate=no lzma=yes bzip2=yes htscodecs=1.0 CC: gcc CPPFLAGS:
CFLAGS: -Wall -g -O2 -fvisibility=hidden LDFLAGS: -fvisibility=hidden

HTSlib URL scheme handlers present: built-in: preload, data, file S3 Multipart Upload: s3w, s3w+https, s3w+http Amazon S3: s3+https, s3+http, s3 Google Cloud Storage: gs+http, gs+https, gs libcurl: imaps, pop3, http, smb, gopher, sftp, ftps, imap, smtp, smtps, rtsp, scp, ftp, telnet, rtmp, ldap, https, ldaps, smbs, tftp, pop3s, dict crypt4gh-needed: crypt4gh mem: mem Found samtools 1.12 . Seems ok. Runtime tracking and QC information printed to new_analysis2023/superfreq2/myAnalysis/R/SRR8518131/runtimeTracking.log. Starting run with input files: sampleMetaDataFile: new_analysis2023/splitMetaData/SRR8518131.tsv vcfFiles:

Normal directory: new_analysis2023/refNormals/bam Normal coverage directory: new_analysis2023/refNormals/bam dbSNP directory: superFreqResources/dbSNP capture regions: will be downloaded from superFreq server. Plotting to new_analysis2023/superfreq2/myAnalysis/plots/SRR8518131 Saving R files to new_analysis2023/superfreq2/myAnalysis/R/SRR8518131 Genome is hg38 Running in RNA mode. exacPopulation is all Running on at most 10 cpus. Rare germline variants are shown in output.

Parameters for this run are: maxCov: 150 systematicVariance: 0.03 cloneDistanceCut: 2.326348 cosmicSalvageRate: 0.001

Normal bamfiles are: new_analysis2023/bam_normal/SRR8518131.ord.bam new_analysis2023/bam_normal/SRR8518132.ord.bam new_analysis2023/bam_normal/SRR8518134.ord.bam new_analysis2023/bam_normal/SRR8518136.ord.bam new_analysis2023/bam_normal/SRR8518140.ord.bam new_analysis2023/bam_normal/SRR8518142.ord.bam new_analysis2023/bam_normal/SRR8518147.ord.bam new_analysis2023/bam_normal/SRR8518153.ord.bam new_analysis2023/bam_normal/SRR8518176.ord.bam Normal bamfiles are: new_analysis2023/bam_normal/SRR8518131.ord.bam new_analysis2023/bam_normal/SRR8518132.ord.bam new_analysis2023/bam_normal/SRR8518134.ord.bam new_analysis2023/bam_normal/SRR8518136.ord.bam new_analysis2023/bam_normal/SRR8518140.ord.bam new_analysis2023/bam_normal/SRR8518142.ord.bam new_analysis2023/bam_normal/SRR8518147.ord.bam new_analysis2023/bam_normal/SRR8518153.ord.bam new_analysis2023/bam_normal/SRR8518176.ord.bam Loading capture regions...done. Imported capture regions with 233285 regions and 23567 unique gene names. Mean GC content is 0.481. Loading sample meta data from file...done. Deciding which pairs to scatter plot..done. Deciding which time series to plot..done. ##################################################################################################

2023-11-08 20:08:41.237889 Imported and sanity checked meta data. Looking good so far! metadata: BAM VCF INDIVIDUAL NAME TIMEPOINT NORMAL
new_analysis2023/bam_normal/SRR8518131.ord.bam new_analysis2023/bam_normal/SRR8518131.ord.vcf SRR8518131 Normal unrelated YES

timeSeries: Normal

##################################################################################################

2023-11-08 20:08:42.091514 Current memory use (Mb): 744.2, max use (Mb): 744.2 Free memory (Mb): 359.2877 Starting differential coverage analysis by sample. Preparing capture regions for featureCounts..done. Counting reads over capture regions.

    ==========     _____ _    _ ____  _____  ______          _____  
    =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
      =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
        ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
          ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
    ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
   Rsubread 2.16.0
//========================== featureCounts setting ===========================\ Input files : 1 BAM file
SRR8518131.ord.bam
Paired-end : yes
Count read pairs : yes
Annotation : R data.frame
Dir for temp files : .
Threads : 10
Level : feature level
Multimapping reads : counted
Multi-overlapping reads : counted
Min overlapping bases : 1

\============================================================================//

//================================= Running ==================================\ Load annotation file .Rsubread_UserProvidedAnnotation_pid2170617 ... Features : 233285 Meta-features : 23567 Chromosomes/contigs : 24
Process BAM file SRR8518131.ord.bam...
Paired-end reads are included.
Total alignments : 56974975
Successfully assigned alignments : 14702372 (25.8%)
Running time : 0.93 minutes
Write the final count table.
Write the read assignment summary.

\============================================================================//

Got a sample count matrix of size 233285 1 , with total counts: SRR8518131.ord.bam : 21619555 Saving sample counts to new_analysis2023/superfreq2/myAnalysis/R/SRR8518131/fCsExon.Rdata..done. Loading normals counts from file..done. Loaded normal counts of dimension 233285 9 Merging sample and normals counts..done. Determining sex..done. SAMPLE SCORE SEX
Normal -0.1313548 female
SRR8518131.ord -0.1313548 female
SRR8518132.ord -0.1329781 female
SRR8518134.ord -0.131437 female
SRR8518136.ord -0.1391745 female
SRR8518140.ord -0.1321561 female
SRR8518142.ord -0.1314838 female
SRR8518147.ord -0.1325124 female
SRR8518153.ord -0.1310214 female
SRR8518176.ord -0.1262411 female
Setting up design matrix for linear analysis..done. Design matrix is normal Normal
0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 Making MA plots of coverage in diagnostics directory..Normal..SRR8518131.ord..SRR8518132.ord..SRR8518134.ord..SRR8518136.ord..SRR8518140.ord..SRR8518142.ord..SRR8518147.ord..SRR8518153.ord..SRR8518176.ord..done. Correcting for sex chromsome coverage..done. Loess normalising counts to normals..done. Correcting for binding strength bias..Normal..SRR8518131.ord..SRR8518132.ord..SRR8518134.ord..SRR8518136.ord..SRR8518140.ord..SRR8518142.ord..SRR8518147.ord..SRR8518153.ord..SRR8518176.ord..done! Second round of loess normalisation..done. Returning sex effects to non-normal samples..done. Corrections changes average LFC between sample by a factor 1.1878 . Making MA plots of coverage after loess and BS correction in diagnostics directory..Normal..SRR8518131.ord..SRR8518132.ord..SRR8518134.ord..SRR8518136.ord..SRR8518140.ord..SRR8518142.ord..SRR8518147.ord..SRR8518153.ord..SRR8518176.ord..done. Running voom on exons..limma..XRank..Preparing empirical priors... done. Calculating posteriors: Normal-normal...done. Calculating expected ranks...Normal-normal..done. Calculating best guess...Normal-normal..done. done. Importing stats about feature counts..done. Running voom on genes..limma..XRank..Preparing empirical priors... done. Calculating posteriors: Normal-normal...done. Calculating expected ranks...Normal-normal..done. Calculating best guess...Normal-normal..done. done. Saving fit..done. Returning fit of dimension 23446 1 and 232303 1 2023-11-08 20:18:14.163565 Current memory use (Mb): 793.9, max use (Mb): 793.9 Plotting volcanoes to new_analysis2023/superfreq2/myAnalysis/plots/SRR8518131/volcanoes/..Normal-normal..Normal-normal..done! Writing different regions to new_analysis2023/superfreq2/myAnalysis/plots/SRR8518131/differentRegionsSamples.xls..Normal-normal..done! 
Writing different regions to new_analysis2023/superfreq2/myAnalysis/plots/SRR8518131/differentRegionsSamples.exons.xls..Normal-normal..outputting top 65k DE regions only, for excel...done! 2023-11-08 20:18:36.527005 Current memory use (Mb): 798.3, max use (Mb): 798.3 Using variants by individual.

Variants for SRR8518131 Reading file new_analysis2023/bam_normal/SRR8518131.ord.vcf...done. Processing data...done. Returning data frame of dimension 192680 6 Keeping 43259 out of 192680 (22.5%) SNVs that are inside capture regions. saving positions...done. 2023-11-08 20:18:38.629277 Examining 43259 positions from new_analysis2023/bam_normal/SRR8518131.ord.bam in 44 batches. 1.2.3.4.5.6.7.8.9.10.....11...12..13...14.15.16.17.18.19.......20..21.22.23.24.25.26.....27....28.29.30.31..32.33...34..35.36......37.38.39.40.41..42...43..44.........done! done. Variants: 49962 Unflagged : 45983 Unflagged over 0% freq : 42773 Unflagged over 20% freq : 40194 Median coverage over unflagged variants: 17 Repeat flags: 331 Mapping quality flags: 0 Base quality flags: 134 Strand bias flags: 127 Single variant read flags: 2068 Minor variant flags: 3487 Stutter flags: 0 Adding uncalled variants..done! saving variants to new_analysis2023/superfreq2/myAnalysis/R/SRR8518131/variants.SRR8518131.Rdata ...done. Importing dbSNP allele frequencies from superFreqResources/dbSNP/hg38/dbAFnew.Rdata...done. Match against variants...done. Importing exac allele frequencies from superFreqResources/dbSNP/hg38/exac.Rdata...done. Match against variants...Saving variants..done. Plotting frequency distributions to new_analysis2023/superfreq2/myAnalysis/plots/SRR8518131/diagnostics/frequencyDistribution/..Normal..done. Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518131.ord_4503177866.Rdata . Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518131.ord_6011526748.Rdata . Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518131.ord_7643629334.Rdata . Can reuse 47467 calls. Filling in missing normal variants from SRR8518131.ord. 2023-11-08 20:27:03.402528 Examining 2495 positions from new_analysis2023/bam_normal/SRR8518131.ord.bam in 10 batches. 1.2.3.4.5.6.7.8.9.10...........done! done. 
Variants: 4746 Unflagged : 2457 Unflagged over 0% freq : 2450 Unflagged over 20% freq : 1367 Median coverage over unflagged variants: 38 Repeat flags: 24 Mapping quality flags: 0 Base quality flags: 47 Strand bias flags: 18 Single variant read flags: 1227 Minor variant flags: 2249 Stutter flags: 0 Saving new normal variants back to new_analysis2023/refNormals/bam/R/qSRR8518131.ord_3034316223.Rdata ..done. saving variants...done. Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518132.ord_2043492248.Rdata . Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518132.ord_3707472133.Rdata . Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518132.ord_6937955683.Rdata . Can reuse 45671 calls. Filling in missing normal variants from SRR8518132.ord. 2023-11-08 20:27:59.01335 Examining 3175 positions from new_analysis2023/bam_normal/SRR8518132.ord.bam in 10 batches. 1.2.3.4.5.6.7.8.9.10...........done! done. Variants: 5788 Unflagged : 3085 Unflagged over 0% freq : 2176 Unflagged over 20% freq : 1192 Median coverage over unflagged variants: 45 Repeat flags: 64 Mapping quality flags: 0 Base quality flags: 79 Strand bias flags: 60 Single variant read flags: 1466 Minor variant flags: 2409 Stutter flags: 0 Saving new normal variants back to new_analysis2023/refNormals/bam/R/qSRR8518132.ord_4706342805.Rdata ..done. saving variants...done. Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518134.ord_1252283567.Rdata . Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518134.ord_887333342.Rdata . Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518134.ord_978487389.Rdata . Can reuse 45873 calls. Filling in missing normal variants from SRR8518134.ord. 2023-11-08 20:28:47.761948 Examining 3145 positions from new_analysis2023/bam_normal/SRR8518134.ord.bam in 10 batches. 1.2.3.4.5.6.7.8.9.10...........done! done. 
Variants: 6000 Unflagged : 3155 Unflagged over 0% freq : 2355 Unflagged over 20% freq : 1230 Median coverage over unflagged variants: 42 Repeat flags: 77 Mapping quality flags: 0 Base quality flags: 53 Strand bias flags: 32 Single variant read flags: 1537 Minor variant flags: 2599 Stutter flags: 0 Saving new normal variants back to new_analysis2023/refNormals/bam/R/qSRR8518134.ord_7498392703.Rdata ..done. saving variants...done. Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518136.ord_1899043322.Rdata . Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518136.ord_467246874.Rdata . Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518136.ord_6668271746.Rdata . Can reuse 45676 calls. Filling in missing normal variants from SRR8518136.ord. 2023-11-08 20:29:42.361304 Examining 3167 positions from new_analysis2023/bam_normal/SRR8518136.ord.bam in 10 batches. 1.2.3.4.5.6.7.8.9.10...........done! done. Variants: 5782 Unflagged : 3063 Unflagged over 0% freq : 2165 Unflagged over 20% freq : 1199 Median coverage over unflagged variants: 42 Repeat flags: 87 Mapping quality flags: 0 Base quality flags: 76 Strand bias flags: 64 Single variant read flags: 1457 Minor variant flags: 2383 Stutter flags: 0 Saving new normal variants back to new_analysis2023/refNormals/bam/R/qSRR8518136.ord_7924861310.Rdata ..done. saving variants...done. Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518140.ord_1648863802.Rdata . Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518140.ord_4566281559.Rdata . Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518140.ord_8940873337.Rdata . Can reuse 45732 calls. Filling in missing normal variants from SRR8518140.ord. 2023-11-08 20:30:34.644956 Examining 3187 positions from new_analysis2023/bam_normal/SRR8518140.ord.bam in 10 batches. 1.2.3.4.5.6.7.8.9.10...........done! done. 
Variants: 5842 Unflagged : 3155 Unflagged over 0% freq : 2266 Unflagged over 20% freq : 1232 Median coverage over unflagged variants: 38 Repeat flags: 81 Mapping quality flags: 0 Base quality flags: 54 Strand bias flags: 33 Single variant read flags: 1494 Minor variant flags: 2418 Stutter flags: 0 Saving new normal variants back to new_analysis2023/refNormals/bam/R/qSRR8518140.ord_1081257743.Rdata ..done. saving variants...done. Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518142.ord_1284121895.Rdata . Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518142.ord_5132220083.Rdata . Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518142.ord_8155551157.Rdata . Can reuse 45662 calls. Filling in missing normal variants from SRR8518142.ord. 2023-11-08 20:31:27.40926 Examining 3188 positions from new_analysis2023/bam_normal/SRR8518142.ord.bam in 10 batches. 1.2.3.4.5.6.7.8.9.10...........done! done. Variants: 5775 Unflagged : 3168 Unflagged over 0% freq : 2256 Unflagged over 20% freq : 1250 Median coverage over unflagged variants: 38 Repeat flags: 70 Mapping quality flags: 0 Base quality flags: 54 Strand bias flags: 36 Single variant read flags: 1471 Minor variant flags: 2338 Stutter flags: 0 Saving new normal variants back to new_analysis2023/refNormals/bam/R/qSRR8518142.ord_493059285.Rdata ..done. saving variants...done. Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518147.ord_2083143195.Rdata . Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518147.ord_2974107650.Rdata . Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518147.ord_9796609331.Rdata . Can reuse 45713 calls. Filling in missing normal variants from SRR8518147.ord. 2023-11-08 20:32:15.158514 Examining 3170 positions from new_analysis2023/bam_normal/SRR8518147.ord.bam in 10 batches. 1.2.3.4.5.6.7.8.9.10...........done! done. 
Variants: 5865 Unflagged : 3141 Unflagged over 0% freq : 2287 Unflagged over 20% freq : 1202 Median coverage over unflagged variants: 43 Repeat flags: 76 Mapping quality flags: 0 Base quality flags: 60 Strand bias flags: 47 Single variant read flags: 1430 Minor variant flags: 2467 Stutter flags: 0 Saving new normal variants back to new_analysis2023/refNormals/bam/R/qSRR8518147.ord_6317722220.Rdata ..done. saving variants...done. Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518153.ord_1375931623.Rdata . Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518153.ord_4232118771.Rdata . Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518153.ord_45586153.Rdata . Can reuse 45662 calls. Filling in missing normal variants from SRR8518153.ord. 2023-11-08 20:33:06.649217 Examining 3174 positions from new_analysis2023/bam_normal/SRR8518153.ord.bam in 10 batches. 1.2.3.4.5.6.7.8.9.10...........done! done. Variants: 5698 Unflagged : 3106 Unflagged over 0% freq : 2180 Unflagged over 20% freq : 1190 Median coverage over unflagged variants: 38 Repeat flags: 72 Mapping quality flags: 0 Base quality flags: 68 Strand bias flags: 51 Single variant read flags: 1428 Minor variant flags: 2274 Stutter flags: 0 Saving new normal variants back to new_analysis2023/refNormals/bam/R/qSRR8518153.ord_3570079643.Rdata ..done. saving variants...done. Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518176.ord_3608418052.Rdata . Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518176.ord_4108602661.Rdata . Loading normal variants from new_analysis2023/refNormals/bam/R/qSRR8518176.ord_8866930313.Rdata . Can reuse 45580 calls. Filling in missing normal variants from SRR8518176.ord. 2023-11-08 20:33:53.505817 Examining 3224 positions from new_analysis2023/bam_normal/SRR8518176.ord.bam in 10 batches. 1.2.3.4.5.6.7.8.9.10...........done! done. 
Variants: 5624 Unflagged : 3150 Unflagged over 0% freq : 2063 Unflagged over 20% freq : 1128 Median coverage over unflagged variants: 26 Repeat flags: 69 Mapping quality flags: 0 Base quality flags: 41 Strand bias flags: 52 Single variant read flags: 1417 Minor variant flags: 2117 Stutter flags: 0 Saving new normal variants back to new_analysis2023/refNormals/bam/R/qSRR8518176.ord_949321122.Rdata ..done. saving variants...done. Adding uncalled variants..Found 2495..Found 4291..Found 4089..Found 4286..Found 4230..Found 4299..Found 4249..Found 4300..Found 4382..done! Importing dbSNP allele frequencies from superFreqResources/dbSNP/hg38/dbAFnew.Rdata...done. Match against variants...done. Importing exac allele frequencies from superFreqResources/dbSNP/hg38/exac.Rdata...done. Match against variants...Saving variants..done. Plotting frequency distributions to new_analysis2023/superfreq2/myAnalysis/plots/SRR8518131/diagnostics/frequencyDistribution/..SRR8518131.ord..SRR8518132.ord..SRR8518134.ord..SRR8518136.ord..SRR8518140.ord..SRR8518142.ord..SRR8518147.ord..SRR8518153.ord..SRR8518176.ord..done. Matching 49962 against 49962 variants..to 49962 variants. Matching 49962 against 49962 variants..to 49962 variants. Flagging variants with large minor variants: 0 ..Flagging variants with large minor variants: 0 ..0 ..0 ..0 ..0 ..0 ..0 ..0 ..0 ..Keeping 44271 out of 49962 (88.6%) SNVs that are present at 5% frequency in at least one sample. Average variant loss is 0.07974219 Flagged 3947 out of 44271 variants that are recurrently and consistently noisy in normals. Flagged another 7309 out of 44271 variants that are not db, but significantly present in at least one normal sample. Not flagging based on abnormal coverage in RNA mode. Flagging variants not above normal background level: 7387..Flagged 230 out of 3474 non-db variants that are consistently polymorphic in normals. Flagging SNPs that are in noisy regions. New flags by sample: 0 done. 
Marking somatic mutations in Normal.. No matched normal, or normal sample: selecting somatic variants based on population frequencies. Selecting dbSNPs and ExAC below 0.1% population frequency as somatic candidates. These will include rare germline variants, which is desired for normals, but not for cancer samples without matched normals. Salvaging 1 sites that have high frequency in dbSNP or ExAC, but also high frequency in COSMIC. got roughly 4 somatic variants. Trimming uninformative variants by individual...done. Saving final version of combined variants..done. Average variant loss is 0.07974219 2023-11-08 20:42:03.852691 Current memory use (Mb): 1833.2, max use (Mb): 1833.2 Running VariantAnnotation. Loading annotation dump...done. Splitting up 215 variants for parallelisation into 1 batches. Setting up data bases..done. Running annotation by batch.

Many thanks for your work; I appreciate it. Let me know if you need any more information or if I can assist in any way. Thank you!

Regards,

ChristofferFlensburg commented 9 months ago

Hi!

Hmm, the data(entrez2symbol_hs) call that throws the error in 1.5.0 loads a built-in data set belonging to superFreq. I honestly don't know how the back end of data() works, but I'd guess that either the superFreq installation didn't install that data set, or R has trouble finding it. You can try just

library(superFreq)
data(entrez2symbol_hs)

in a fresh R session, and see if that works. You can also try

data(package='superFreq')

which returns

Data sets in package ‘superFreq’:

entrez2symbol_hs        
entrez2symbol_mm        

for me. Might be clues to the cause of the error in there.
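The same check can be scripted. This is a rough sketch using base R's `data(package = ...)` listing; the helper name `has_dataset` is made up for illustration and is not part of superFreq. Note that `data(package = ...)` raises an error if the package is not installed, so the superFreq call is guarded.

```r
# Hypothetical helper: TRUE if `name` is listed among the data sets of `pkg`.
has_dataset <- function(pkg, name) {
  listing <- utils::data(package = pkg)$results  # matrix with an "Item" column
  name %in% listing[, "Item"]
}

# Sanity check against a package that ships with R.
print(has_dataset("datasets", "iris"))

# Only query superFreq if it is installed (system.file() returns "" otherwise).
if (nzchar(system.file(package = "superFreq"))) {
  print(has_dataset("superFreq", "entrez2symbol_hs"))
}
```

If the second call prints FALSE, the installed copy of the package does not list the data set, which points at the installation rather than at the analysis inputs.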

If you can't solve it, then you should be able to install an older version of superFreq. Use the ref setting in install_github, and point to the 1.4.5 commit, I believe this one: https://github.com/ChristofferFlensburg/superFreq/commit/99584742099b33310f96a4bfb3a9fd179274d5cb In general I'll be less keen to support older versions, but we all do whatever makes things work, so no judgement if that turns out to be the easiest way to get your data analysed.
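As a sketch of what that could look like, assuming the `remotes` package is used for `install_github` (`devtools` re-exports the same function): the wrapper name `install_superfreq_at` is hypothetical, and the commit hash is the one linked above, which is only believed to correspond to 1.4.5.

```r
# Hypothetical wrapper around remotes::install_github() with a pinned ref.
install_superfreq_at <- function(ref, repo = "ChristofferFlensburg/superFreq") {
  if (!requireNamespace("remotes", quietly = TRUE)) {
    stop("Package 'remotes' is needed; install it with install.packages('remotes').")
  }
  # `ref` can be a commit SHA, a branch name or a tag.
  remotes::install_github(repo, ref = ref)
}

# Not run here, since it downloads and installs the package:
# install_superfreq_at("99584742099b33310f96a4bfb3a9fd179274d5cb")
```

After installing, it is safest to restart R before loading superFreq so that the previously loaded version is not reused.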

Good luck, let me know if you need more help.

ChristofferFlensburg commented 9 months ago

Just an update, I had a local user report the same problem with the missing data set, so I'll be able to look into that closer. Good chance I'll push a fix for that out this week, but no promises ofc.

ChristofferFlensburg commented 9 months ago

Ok, seems I just forgot to git add the data directory to my local repo, so the data didn't get pushed to github. Oopsie. 😬

But it explains why it passed my (local) tests but didn't work for you, or others, and why you didn't get any hits on a search... I pushed the data to live in version 1.5.1, and it worked for at least one other user, so re-install and rerun, and it should work. 🤞

Let me know if it doesn't.
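After re-installing, a quick way to confirm that the fixed version is the one R picks up is to compare the installed version against 1.5.1. This is a small base R sketch; the helper name `is_at_least` is made up for illustration.

```r
# Hypothetical helper: TRUE if the installed version of `pkg` is >= `min_version`.
is_at_least <- function(pkg, min_version) {
  utils::packageVersion(pkg) >= min_version
}

# Only check superFreq if it is installed (system.file() returns "" otherwise).
if (nzchar(system.file(package = "superFreq"))) {
  print(is_at_least("superFreq", "1.5.1"))
}
```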