ChristofferFlensburg / superFreq

Analysis pipeline for cancer sequencing data
MIT License
110 stars 33 forks source link

simpleLoess error #67

Closed mkinnaman closed 1 month ago

mkinnaman commented 4 years ago

Hi Christoff-

I am getting a new error with my latest run that I am not sure how to troubleshoot. Any thoughts?

Error in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, : invalid 'x' Calls: superFreq ... runDE -> lapply -> lapply -> FUN -> loess -> simpleLoess

ChristofferFlensburg commented 4 years ago

Hey. THe error message isn't very informative, I need the logfile in R/[individual]/runtimeTracking.log. Note that this file includes your metadata, but no identifying variants.

mkinnaman commented 4 years ago

Hi Christoff,

I think it has something to do with one particular sample needing MACorrection? It crashes while trying to correct for binding strength bias in sample T7. I am not savvy enough to figure out how to turn on MACorrection in my Rscript to see if that changes anything...

Runtimetracking.log is attached for the run. I had turned on forceredoeverything to see if it fixed anything but the crash happened at the same timepoint. runtimeTracking.log

ChristofferFlensburg commented 4 years ago

Hi!

Thanks for sending through, and sorry for delay. Busy days this week... I don't see any red flags in the log files, except that you're not running on the very latest version of superFreq. I have a couple offline bug fixes that I want to run through the testing and push online (testing runs overnight as I am doing a test-run on genomes), and I'll have you rerun on the new version once available. So hang on another day, and I'll come back to you tomorrow (hopefully) with a fresh version to try with. If the new version still fails, then I'll ask you to send me the counts matrix so I can reproduce the error locally, but with some luck we won't have to go there.

You may be the first person to try samtools 1.11 as well. There was an issue in samtools 1.10 that swapped column in the output compared to 1.9 and earlier that I had to react to, and they said it'd be fixed in 1.11 and I told superFreq to use the old column order for 1.11+, so hopefully that should work, but keep an eye out for warnings/errors during variant calling when it gets to that.

mkinnaman commented 4 years ago

Great - looking forward to the update and as always appreciate the prompt and thorough response!

ChristofferFlensburg commented 4 years ago

Tests failed because of a typo on my end, %in$ instead of %in%. 🙄 You'll have to wait one more day it seems. Oof.

ChristofferFlensburg commented 4 years ago

Ok, tests passed well enough. One sample failed, but that's because I had deleted the reference normals, so I'll forgive superFreq for that. I got an informative error message at lest. :D

Try updating to 1.4.2 and rerun. If you get the same error again, can you send me the count matrix located in the Rdirectory, fCsExon.Rdata) and I'll try to reproduce. I'm taking a couple days off next week though, so may be delays.

Enjoy weekend!

mkinnaman commented 4 years ago

Chirstoff,

Couple of issues. When I tried to install the new update - I got the following error: ERROR: this R is version 3.6.2, package 'superFreq' requires R >= 4.0.0 Error: Failed to install 'superFreq' from GitHub: (converted from warning) installation of package ‘/tmp/RtmprXdLq3/file12fc5b8fd467/superFreq_1.4.2.tar.gz’ had non-zero exit status

I then started fresh in a new environment with R-4, installed it without issue, and during trial runs on two different datasets got the same errors:

|| Process BAM file I-H-134708-N3-1-D1-1.bam... || ERROR: Paired-end reads were detected in single-end read library : /juno/work/isabl/data/analyses/65/73/166573/I-H-134708-N3-1-D1-1.bam No counts were generated. Error in featureCounts. Input was bamFiles: /juno/work/isabl/data/analyses/65/73/166573/I-H-134708-N3-1-D1-1.bam /juno/work/isabl/data/analyses/41/48/224148/IID_H134708_T11_01_WG01.bam /juno/work/isabl/data/analyses/18/78/231878/IID_H134708_T15_01_WG01.bam /juno/work/isabl/data/analyses/18/79/231879/IID_H134708_T16_01_WG01.bam /juno/work/isabl/data/analyses/65/67/166567/I-H-134708-T5-1-D1-1.bam /juno/work/isabl/data/analyses/65/70/166570/I-H-134708-T7-1-D1-1.bam /juno/work/isabl/data/analyses/65/71/166571/I-H-134708-T8-1-D1-1.bam captureAnnotation[1:10,]: ? WASH7P WASH7P FAM138A ? OR4G4P ? ? ? ENST00000466430 1 10000 20000 30000 40000 50000 60000 70000 80000 90000 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000 * * * * * * * * * * 1 1 1 1 1 1 1 1 1 1

runtimeTracking.log

ChristofferFlensburg commented 4 years ago

Ok, yep, the version error is expected. I added the 4.0.0 requirement as I've only been testing and supporting on that for a while, so high time to make it an official requirement.

The error is from featureCounts and how it handles single/paired end reads. In short, with paired end reads, featureCounts tries to re-sort the entire bamfile, which is prohibitively slow for genomes, so superFreq tells featureCounts to ignore read pairs and just count reads for genomes, which cuts it down to manageable runtime. This has been the case for years and has been working fine, and the featureCount documentation also indicate that running as single end on paired end data should be ok. The fact that you get an error could be due to

1) You have an older or newer version of featureCounts that behaves differently. 2) There is something wrong or unusual with the BAM file. 3) some other error elsewhere, such as bug in superFreq code.

I see in the first logfile that you passed this step (in genome mode, but other samples) in earlier versions of R and superFreq, so to try to narrow down the problem, could you:

1) check version of featureCounts. If not the latest, update to latest and try again. 2) If the same error persists, try running on one of the old genome BAM files that did work earlier.

cheers, good luck!

mkinnaman commented 4 years ago

Christoff,

My inclination is that it might be a bug in superfreq. I have had no trouble running old superfreq on the same bams that are giving me an issue with the most current release.

ChristofferFlensburg commented 4 years ago

As this is a conflict with featureCounts, I need to know which version of featureCounts (Rsubread package, look in sessionInfo()) you are running so that I can try to reproduce it. The latest version ran on the genome in our local tests, so not useful for debugging. Can you check your version of featureCounts please, and update if not the latest? Thanks.

mkinnaman commented 4 years ago

I am running the most current release of Rsubread:

'R version 4.0.0 (2020-04-24) Platform: x86_64-conda_cos6-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)

Matrix products: default BLAS/LAPACK: /home/kinnamam/miniconda3/envs/superfreq3/lib/libopenblasp-r0.3.10.so

Random number generation: RNG: Mersenne-Twister Normal: Inversion Sample: Rounding

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base

other attached packages: [1] superFreq_1.4.2 BSgenome.Hsapiens.UCSC.hg38_1.4.3 [3] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome.Mmusculus.UCSC.mm10_1.4.0 [5] BSgenome_1.56.0 rtracklayer_1.50.0
[7] MutationalPatterns_3.0.0 NMF_0.23.0
[9] Biobase_2.50.0 cluster_2.1.0
[11] rngtools_1.5 pkgmaker_0.32.2
[13] registry_0.5-1 limma_3.46.0
[15] Rsubread_2.4.0 R.oo_1.24.0
[17] R.methodsS3_1.8.1 Rsamtools_2.6.0
[19] Biostrings_2.58.0 XVector_0.30.0
[21] biomaRt_2.46.0 GenomicRanges_1.42.0
[23] GenomeInfoDb_1.26.0 IRanges_2.24.0
[25] S4Vectors_0.28.0 BiocGenerics_0.36.0
[27] WriteXLS_6.0.0

loaded via a namespace (and not attached): [1] bitops_1.0-6 matrixStats_0.57.0
[3] bit64_4.0.5 doParallel_1.0.16
[5] RColorBrewer_1.1-2 progress_1.2.2
[7] httr_1.4.2 tools_4.0.0
[9] R6_2.5.0 DBI_1.1.0
[11] colorspace_1.4-1 withr_2.3.0
[13] tidyselect_1.1.0 prettyunits_1.1.1
[15] bit_4.0.4 curl_4.3
[17] compiler_4.0.0 xml2_1.3.2
[19] DelayedArray_0.16.0 scales_1.1.1
[21] askpass_1.1 rappdirs_0.3.1
[23] stringr_1.4.0 digest_0.6.27
[25] pkgconfig_2.0.3 MatrixGenerics_1.2.0
[27] dbplyr_1.4.4 rlang_0.4.8
[29] RSQLite_2.2.1 generics_0.0.2
[31] BiocParallel_1.24.0 dplyr_1.0.2
[33] VariantAnnotation_1.34.0 RCurl_1.98-1.2
[35] magrittr_1.5 GenomeInfoDbData_1.2.4
[37] Matrix_1.2-18 Rcpp_1.0.5
[39] munsell_0.5.0 lifecycle_0.2.0
[41] stringi_1.5.3 ggalluvial_0.12.2
[43] SummarizedExperiment_1.20.0 zlibbioc_1.36.0
[45] plyr_1.8.6 BiocFileCache_1.14.0
[47] grid_4.0.0 blob_1.2.1
[49] crayon_1.3.4 lattice_0.20-41
[51] GenomicFeatures_1.40.0 hms_0.5.3
[53] pillar_1.4.6 reshape2_1.4.4
[55] codetools_0.2-16 XML_3.99-0.5
[57] glue_1.4.2 vctrs_0.3.4
[59] foreach_1.5.1 gtable_0.3.0
[61] openssl_1.4.3 purrr_0.3.4
[63] assertthat_0.2.1 ggplot2_3.3.2
[65] gridBase_0.4-7 xtable_1.8-4
[67] pracma_2.2.9 tibble_3.0.4
[69] iterators_1.0.13 GenomicAlignments_1.26.0
[71] AnnotationDbi_1.52.0 memoise_1.1.0
[73] ellipsis_0.3.1 `

ChristofferFlensburg commented 4 years ago

Thanks. That version worked in our tests, so not sure where the error is. I checked the featureCounts documentaiton and doesn't seem anything has changed in the way superFreq is using it.

We can try to isolate the crashing featureCounts call:

#laod superFreq, which includes Rsubread
library(superFreq)

#load in the capture regions, which is 10kbp bins for genomes
load('/juno/res/iacobuzc/share/kinnaman/superfreq/OSCE4_Superfreq/R/OSCE4/captureRegions.Rdata')
#transform to featureCounts input format
captureAnnotation = superFreq:::captureRegionToAnnotation(captureRegions)

#check that annotation is ok
#should be something like
#            GeneID Start    End Strand Chr
#1                ?     1  10000      *   1
#2           WASH7P 10000  20000      *   1
#3           WASH7P 20000  30000      *   1
#4          FAM138A 30000  40000      *   1
#5                ? 40000  50000      *   1
captureAnnotation[1:5,]

bamFile = "/juno/work/isabl/data/analyses/65/73/166573/I-H-134708-N3-1-D1-1.bam"

#run featureCounts. This is the crash.
fc = featureCounts(bamFile, annot.ext=captureAnnotation, useMetaFeatures=F, allowMultiOverlap=T, isPairedEnd= F, minMQS=10, nthreads=10)

#to exclude issues with annotation, try this:
fc = featureCounts(bamFile, annot.inbuilt="hg19", useMetaFeatures=F, allowMultiOverlap=T, isPairedEnd= F, minMQS=10, nthreads=10)

Can you confirm that those crash with the same error?

mkinnaman commented 4 years ago

Thanks Christoff,

I can confirm I did as you described above and recreated the same error:

ERROR: Paired-end reads were detected in single-end read library : /juno/work/isabl/data/analyses/65/73/166573/I-H-134708-N3-1-D1-1.bam No counts were generated. Execution halted

I also ran the same commands in my older superfreq setup using R3.6 and Rsubread2.0.1 and it ran without issue.

So I am thinking its either a bug with SF or something with version control between R and Rsubread.

ChristofferFlensburg commented 4 years ago

Hi!

Sorry for the delay.

Hmm, seems it's a Rsubread version conflict with your BAM file. :/ The last featureCounts call doesn't have anything to do with superFreq, so very difficult for me to troubleshoot. You would have to ask the Rsubread devs to look into it. Sorry I can't help with this, hope you understand.

As workaround, you could make a branch of superFreq and relax the R 4.0.0 requirement (in DESCRIPTION, "Depends"), and then you can run on R 3.6 and backdate Rsubread but still use the latest superFreq version. I'm hesitant to relax it on the official release, as quite a lot changed in R 4.0.0 and I am not testing on older versions any longer.

If it's only this one run that is causing you trouble, you can also use the count file (fCsExon.Rdata) from the R 3.6 run and copy that into the R directory of the R 4.0.0 run, and superFreq will reuse the count matrix without rerunning featureCounts. Just make sure the two runs use exactly the same meta data, or there will be conflicts. That solution of course doesn't scale well if you have more BAM files with this issue...

I am not really sure how to solve this otherwise, let me know how it goes!

mkinnaman commented 3 years ago

Christoff,

I forked superfreq and made the modifications to DESCRIPTION to relax the R4.0 requirement which fixed the issues I was having with the newest update. I tried to rerun on the set of samples that triggered the error that started the initial post and got the same error.

Would turning on "loess normalising" for genomes maybe fix the problem? If so how would I do that?

Thanks, Michael

ChristofferFlensburg commented 3 years ago

Hi!

Ok, so we are finally back to the original problem! 🎉 I will keep an eye out for the featureCounts issue in R 4.0.0+ though...

If you can you send me the counts matrix from the analysed samples (something like R/myIndividual/fCexon.Rdata) and from the reference normals (something like normalDirectory/R/fCsExon.Rdata), then I can try to reproduce the bug. Maybe you can also have a look at and maybe post some of the MA plots (in plots/myIndividual/diagnostics/MAplots).

I'd be surprised if turning on loess normalisation for genomes would fix it, but as I don't know what causes the bugs, who knows... It'd be a check for mode == "genome" somewhere close to where the log entry is made that you could change, but if I were you I wouldn't bother with that and just send the counts matrices.

Oh and send the log file from the latest run as well please, thanks!

mkinnaman commented 3 years ago

SF_OSCE3.zip Christoff -

Many apologies in the delay of getting these files to you. Back working on this project. I have attached all the requested files.

Please let me know if you need anything else.

Kind Regards, Michael

mkinnaman commented 3 years ago

Christoff- Any thoughts on the above?