BCM-Lupskilab / HMZDelFinder

CNV calling algorithm for detection of homozygous and hemizygous deletions from whole exome sequencing data
GNU General Public License v2.0
11 stars 12 forks source link

calcRPKMsFromBAMs() #2

Closed NJ2012 closed 9 months ago

NJ2012 commented 7 years ago

Dear sir, madam, I keep running into the following problem. It is not to me clear what this means and how to resolve. I hope you can help me further.

command: calcRPKMsFromBAMs(bedFile,bamFiles , sampleNames, rpkmDir,4)

In fread(bedFile) : Bumped column 1 to type character on data row 213577, field contains 'X'. Coercing previously read values in this column from logical, integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE. 2: In mclapply(1:length(bamFiles), function(i) { : all scheduled cores encountered errors in user code

tgambin commented 7 years ago

Assuming that you have already defined the following objects: bedFile, bamFiles, sampleNames, outputDir Could you please, try to run the following lines one by one and see if/when you get error?:

`

    library(Rsubread)
library(data.table)
library(parallel)
bed <- fread(bedFile)
df <- data.frame(cbind(1:nrow(bed), bed))
colnames(df) <- c("GeneID", "Chr", "Start", "End", "Strand")
if (!file.exists(outputDir)){dir.create(outputDir)}

    i <- 1
    file <- bamFiles[i]
prefix <- sampleNames[i]
outputFile <- paste0(outputDir, prefix, "_rpkm2.txt")
res <-  featureCounts(files=file, annot.ext=df, allowMultiOverlap=TRUE, nthreads=3)
total <- sum(res$stat[2])
counts <- data.frame(res$counts)
colnames(counts) <- "count"
counts$RPKM <-  as.numeric(1e9 * as.numeric(counts$count)) / as.numeric(as.numeric(df$End - df$Start) * as.numeric(total))
write.table(counts, file=outputFile, sep="\t", col.names=T, row.names=F, quote=F)`
NJ2012 commented 7 years ago

Hi, Thank you for your help. I get the error with the following command:

bed <- fread(bedFile)

fread(bedFile) : Bumped column 1 to type character on data row 213577, field contains 'X'. Coercing previously read values in this column from logical, integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.

From: tgambin notifications@github.com Reply-To: BCM-Lupskilab/HMZDelFinder reply@reply.github.com Date: Tuesday, 21 March 2017 at 14:22 To: BCM-Lupskilab/HMZDelFinder HMZDelFinder@noreply.github.com Cc: "Lahrouchi, Najim" NLAHROUCHI@mgh.harvard.edu, Author author@noreply.github.com Subject: Re: [BCM-Lupskilab/HMZDelFinder] calcRPKMsFromBAMs() (#2)

i <- 1

file <- bamFiles[i]

prefix <- sampleNames[i]

outputFile <- paste0(outputDir, prefix, "_rpkm2.txt")

res <- featureCounts(files=file, annot.ext=df, allowMultiOverlap=TRUE, nthreads=3)

total <- sum(res$stat[2])

counts <- data.frame(res$counts)

colnames(counts) <- "count"

counts$RPKM <- as.numeric(1e9 as.numeric(counts$count)) / as.numeric(as.numeric(df$End - df$Start) as.numeric(total))

write.table(counts, file=outputFile, sep="\t", col.names=T, row.names=F, quote=F)`

The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Partners Compliance HelpLine at http://www.partners.org/complianceline . If the e-mail was sent to you in error but does not contain patient information, please contact the sender and properly dispose of the e-mail.

NJ2012 commented 7 years ago

The example on Github work 1000G data work fine by the way.

From: tgambin notifications@github.com Reply-To: BCM-Lupskilab/HMZDelFinder reply@reply.github.com Date: Tuesday, 21 March 2017 at 14:22 To: BCM-Lupskilab/HMZDelFinder HMZDelFinder@noreply.github.com Cc: "Lahrouchi, Najim" NLAHROUCHI@mgh.harvard.edu, Author author@noreply.github.com Subject: Re: [BCM-Lupskilab/HMZDelFinder] calcRPKMsFromBAMs() (#2)

i <- 1

file <- bamFiles[i]

prefix <- sampleNames[i]

outputFile <- paste0(outputDir, prefix, "_rpkm2.txt")

res <- featureCounts(files=file, annot.ext=df, allowMultiOverlap=TRUE, nthreads=3)

total <- sum(res$stat[2])

counts <- data.frame(res$counts)

colnames(counts) <- "count"

counts$RPKM <- as.numeric(1e9 as.numeric(counts$count)) / as.numeric(as.numeric(df$End - df$Start) as.numeric(total))

write.table(counts, file=outputFile, sep="\t", col.names=T, row.names=F, quote=F)`

The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Partners Compliance HelpLine at http://www.partners.org/complianceline . If the e-mail was sent to you in error but does not contain patient information, please contact the sender and properly dispose of the e-mail.

tgambin commented 7 years ago

This is just a warning and should not affect the results. Could you please try to execute next lines?

Rt-2017 commented 7 years ago

Hi there, This is regarding the calcRPKMsFromBAMs() function. I keep on getting the following error message : stack smashing detected : /opt/apps/r/3.3.3/lib/R/bin/exec/R terminated I have been struggling with this for days and I just can't seem to figure it out. I ran the code line by line and everything seems to be working fine before this part. The path to my bam files and bed file, I have checked everything. The required packages are loaded as well. I would really appreciate some help.

Thanks

Hoxus commented 7 years ago

Hi,

I tested HMZ, sadly with too few exome. I think I already seen this (last) error, but I can't remember how I solved it. @Rt-2017: 1/ could you post your commands ? 2/ example file works fine ?

Thx

Rt-2017 commented 7 years ago

Hey, Thanks for responding. I have a number of bam files and I need to calculate the rpkm. Yes the example works fine,but in the example rpkm is already calculated. Just have to download and run the code. For me, I need to do from the scratch,calculate the rpkm and then move forward. Unfortunately,I am stuck there :( . So I specified the path to my bam and bed files and then from the source code function calculate rpkm.

pathToBams <- '/path/to/bam/' bamFiles <- paste0(pathToBams, dir(pathToBams, "bam$")) pathToBed <- '/path/to/bed/' bedFile <- paste0(pathToBed, dir(pathToBed, "bed$")) rpkmDir <- dataDir
sampleNames <- sapply(strsplit(dir(pathToBams, "bam$"), "[/\.]"), function(x){x[length(x)-1]}) calcRPKMsFromBAMs(bedFile,bamFiles,sampleNames,rpkmDir,4)

I tried running line by line the calcRPKMsFromBAMs() function from the source code and as soon as I reach to this particular line,the program terminates.

res <- featureCounts(files=file, annot.ext=df, allowMultiOverlap=TRUE, nthreads=3)

I am guessing something to do with the data frame for the bed file but I am not sure. Any thoughts?

Thanks

tgambin commented 7 years ago

Do you have Rsubread package installed? Also could you print few lines from df object?

Rt-2017 commented 7 years ago

Yes I have it installed. As for my bed file,the first column for the geneID is as follows

469_358463_79501(OR4F5)_1 469_360600_729759(OR4F29)_1 469_358708_81399(OR4F16)_1 473_403203_148398(SAMD11)_3 473_403204_148398(SAMD11)_4 473_403205_148398(SAMD11)_5 473_403206_148398(SAMD11)_6 473_403207_148398(SAMD11)_7 473_403208_148398(SAMD11)_8 473_403209_148398(SAMD11)_9 473_403210_148398(SAMD11)_10 473_403211_148398(SAMD11)_11 473_403212_148398(SAMD11)_12 473_403213_148398(SAMD11)_13 473_403214_148398(SAMD11)_14 469_357140_26155(NOC2L)_19

Could this be the problem?

tgambin commented 7 years ago

I think the first column should be chromosome followed by start, stop. gene name should be in fourth column

Rt-2017 commented 7 years ago

colnames(df) <- c("GeneID", "Chr", "Start", "End", "Strand")

This is as per the source code,but I think the order does not matter based on what I read from here https://www.rdocumentation.org/packages/Rsubread/versions/1.22.2/topics/featureCounts

My worry is with the format for the GeneID, can it take both character strings and integers or just one?

tgambin commented 7 years ago

Yes, sorry you are right. Could you try to replace this column with something else. Eg

df$GeneID <- 1:nrow (df)

Rt-2017 commented 7 years ago

Ok, so I add this extra bit in my code ? I am sorry,I am a little lost here :(

tgambin commented 7 years ago

Yes please add this line before executing featureCounts function

Rt-2017 commented 7 years ago

I tried,still getting the same error :( stack smashing detected : /opt/apps/r/3.3.3/lib/R/bin/exec/R terminated Aborted

tgambin commented 7 years ago

So as you said it looks the problem is with featureCounts. Maybe you can try to check if other people had similar error, see eg

https://www.biostars.org/p/168962/

It looks that some people had problems with bam headers. Also I don't remember now but index files could be required so make sure they are in the same folder as bam files.

Alternatively you can try to compute rpkm files using other software.

Rt-2017 commented 7 years ago

Yes the index files are required and I have them in the same folder. I guess it's the bed file anyways I will check the link. Thanks for your time :)

vmrosario commented 7 years ago

Hello I've been trying to run hmzdelfinder on our data and when I try to generate rpkm files I get the following error:

caught segfault address (nil), cause 'unknown'

Traceback: 1: .C("R_readSummary_wrapper", as.integer(n), as.character(cmd), PACKAGE = "Rsubread") 2: featureCounts(files = file, annot.ext = df, allowMultiOverlap = TRUE, nthreads = 3) 3: FUN(X[[i]], ...) 4: lapply(X = S, FUN = FUN, ...) 5: doTryCatch(return(expr), name, parentenv, handler) 6: tryCatchOne(expr, names, parentenv, handlers[[1L]]) 7: tryCatchList(expr, classes, parentenv, handlers) 8: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") LONG <- 75L msg <- conditionMessage(e) sm <- strsplit(msg, "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") if (w > LONG) prefix <- paste0(prefix, "\n ") } else prefix <- "Error : " msg <- paste0(prefix, conditionMessage(e), "\n") .Internal(seterrmessage(msg[1L])) if (!silent && identical(getOption("show.error.messages"), TRUE)) { cat(msg, file = stderr()) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error", condition = e))}) 9: try(lapply(X = S, FUN = FUN, ...), silent = TRUE) 10: sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE)) 11: FUN(X[[i]], ...) 12: lapply(seq_len(cores), inner.do) 13: mclapply(1:length(bamFiles), function(i) { gc() gc() file <- bamFiles[i] prefix <- sampleNames[i] outputFile <- paste0(outputDir, prefix, "_rpkm2.txt") res <- featureCounts(files = file, annot.ext = df, allowMultiOverlap = TRUE, nthreads = 3) total <- sum(res$stat[2]) counts <- data.frame(res$counts) colnames(counts) <- "count" counts$RPKM <- as.numeric(1e+09 as.numeric(counts$count))/as.numeric(as.numeric(df$End - df$Start) as.numeric(total)) write.table(counts, file = outputFile, sep = "\t", col.names = T, row.names = F, quote = F)}, mc.cores = mc.cores) 14: calcRPKMsFromBAMs(bedFile, bamFiles, sampleNames, rpkmDir, 4)

Am I doing something wrong? How can I generate RPKMs?

tgambin commented 7 years ago

It looks like the problem is in featureCounts function from Rsubread package. Could you please, provide first few lines of your bed file?

vmrosario commented 7 years ago

The bed file looks like this:

chr start stop name chr1 14361 14829 WASH7P chr1 14969 15038 WASH7P chr1 15795 15947 WASH7P chr1 16606 16765 WASH7P chr1 16857 17055 WASH7P chr1 17232 17368 WASH7P chr1 17605 17742 WASH7P chr1 17914 18061 WASH7P chr1 18267 18366 WASH7P

tgambin commented 7 years ago

Please, remove header line and make sure the file is tab separated, and try to re-run the script.

NJ2012 commented 7 years ago

Hi, Is it possible to provide a an example that runs fully on your exam data? I keep getting the same error: command: calcRPKMsFromBAMs(bedFile,bamFiles , sampleNames, rpkmDir,4)

In fread(bedFile) : Bumped column 1 to type character on data row 213577, field contains 'X'. Coercing previously read values in this column from logical, integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE. 2: In mclapply(1:length(bamFiles), function(i) { : all scheduled cores encountered errors in user code

An full example would be helpful. Kind regards and thank you.

tgambin commented 7 years ago

OK, I will prepare full working example that works on small BAM files. I will get back to you soon.

vmrosario commented 7 years ago

Hello, I've removed the header and I still get the same error. Is there another way to generate the rpkm files? Can I use conifer to generate it, because I can't get to work.

tgambin commented 7 years ago

Please, see minimal working example for calcRPKMsFromBAMs():

https://gist.github.com/tgambin/f43937ce98a0300b11a15647e24f2318

I used toy BAM data available in WES.1KG.WUGSC package. As a next step I will integrate this example with the rest of the pipeline and then I will move it to the main repository.

Regarding other method to calculate RPKMs. Yes, you can use Conifer script, however you may get some additional false positive calls in small exons (Conifer counts only reads that start within the target region, so may report zero reads for exons that are actually well covered).

Rt-2017 commented 6 years ago

Hi there, I have been trying to run this algorithm for some time,unfortunately I keep on getting a lot of errors. I was not able to calculate the RPKM as I kept on getting the error message as stack smashing detected : /opt/apps/r/3.3.3/lib/R/bin/exec/R terminated. I guess my bam files were quite big because it worked when I tried with one chr from a single file. So I opted to calculated the RPKM with Conifer, but when I use this RPKM with HMZDelFinder, I get a new error: Error in rownames<-(*tmp*, value = c("edit-1", "edit-10", "edit-11", : attempt to set 'rownames' on an object with no dimensions

I double checked the format of the rpkm files,it has 2 columns with count and rpkm as per the example. I am not able to locate the error, is it the files or the values calculated by Conifer?

tgambin commented 6 years ago

Could you send me few lines of your file RPKM file ?

Rt-2017 commented 6 years ago

So Conifer calculated this PTgerald_HFLK7BBXX_1_AACAACCG-AACAACCG_recal_FINAL.bam.rpkm.txt

And this is the edited one which I use edit-1.rpkm2.txt

tgambin commented 6 years ago

At which stage you got this error ? i.e. is it after "Reading RPKM files ..."

Rt-2017 commented 6 years ago

yes ... "Removing empty elements ..." "Creating matrix ..." and then the error appears

tgambin commented 6 years ago

Ok I think the problem is that second column in your txt file should be RPKM (not rpkm). Could you please check and let me know if this helps ?

Rt-2017 commented 6 years ago

yes.... :)...thank you so much.. I really feel dumb right now cuz of that error... :P

tgambin commented 6 years ago

no problem :) hope it helps. Let me know if you were able to obtain final results.

Rt-2017 commented 6 years ago

Yeah...I got my results but I did it without AOH analysis...just wanted to make sure the pipeline runs without any errors before complicating it ...thanks alot for your help :)

tgambin commented 6 years ago

Thanks for the update. Some of these results maybe already useful, i.e. hemizygous deletions in males and homozygous deletions with low z-RPKM score, but it is worth trying to add AOH filtering step.

Let me know if you need any help in the future.

Hoxus commented 6 years ago

@Rt-2017

Hi there, This is regarding the calcRPKMsFromBAMs() function. I keep on getting the following error message : stack smashing detected : /opt/apps/r/3.3.3/lib/R/bin/exec/R terminated I have been struggling with this for days and I just can't seem to figure it out. I ran the code line by line and everything seems to be working fine before this part. The path to my bam files and bed file, I have checked everything. The required packages are loaded as well. I would really appreciate some help.

Just to say:

I already seen this error ; and once again, few days ago, when I tested CNV detection in a new pool of data. I investigated a lot (I think), and it not come from the bed file: all previously OK bed file I tested cause bug. I seen this bug only on some bam files, always the same. I don't know why.

I upgrade Rsubread from 1.26 to 1.28 (last release), and it work !

Rt-2017 commented 6 years ago

Thanks , I will try it. I tried with my data a couple of months ago and I had around 300 bam files so I used conifer to calculate rpkm.

Rt-2017 commented 6 years ago

@tgambin

Am trying to calculate AOH, and the code for reading vcf is not working. I have one compressed vcf which has all the samples . Do I need to have individual files?

tgambin commented 6 years ago

Yes, the current code requires VCFs for single samples