LTLA / csaw

Clone of the Bioconductor repository for the csaw package.
https://bioconductor.org/packages/devel/bioc/html/csaw.html
3 stars 3 forks source link

filterWindowsGlobal() function running is extremely long #16

Open RacheliRap opened 3 years ago

RacheliRap commented 3 years ago

Hi, I'm trying for the first time to use the package. I want to analysis H3k27me3 CHIP-Seq data, and therefore I use the same parameters shown in the tutorial for this histone modification. When I run the function filterWindowsGlobal for filtering of low-abundance windows, it takes more than 12 hours. Any idea way? my bam file contain ~5 million reads each. My code:

library(BiocFileCache)
library(csaw)

## read the metadata file
h3k27me3data <- 
  read.csv("../CSAW_df.csv")
h3k27me3data <- data.frame(apply(h3k27me3data, 2, as.character), stringsAsFactors = F)

## define the blacklist using the predicted repeats from the RepeatMasker software
bfc <- BiocFileCache("local", ask=FALSE)
black.path <- bfcrpath(bfc, file.path("http://hgdownload.cse.ucsc.edu",
                                      "goldenPath/mm10/bigZips/chromOut.tar.gz"))
tmpdir <- tempfile()
dir.create(tmpdir)
untar(black.path, exdir=tmpdir)

## Iterate through all chromosomes.
collected <- list()
for (x in list.files(tmpdir, full=TRUE)) {
  f <- list.files(x, full=TRUE, pattern=".fa.out")
  to.get <- vector("list", 15)
  to.get[[5]] <- "character"
  to.get[6:7] <- "integer"
  collected[[length(collected)+1]] <- read.table(f, skip=3, 
                                                 stringsAsFactors=FALSE, colClasses=to.get)
}

collected <- do.call(rbind, collected)
blacklist <- GRanges(collected[,1], IRanges(collected[,2], collected[,3]))
blacklist

## minimum mapping quality score to 10. 
## We also restrict ourselves to the standard chromosomes
param <- readParam(minq=10, discard=blacklist,
                   restrict=paste0("chr", c(1:19, "X", "Y")))

## Counting reads into windows
win.data <- windowCounts(h3k27me3data$Path, param=param, width=2000,
                         spacing=500, ext=200)

bins <- windowCounts(h3k27me3data$Path, bin=TRUE, width=10000, param=param)
filter.stat <- filterWindowsGlobal(win.data, bins)

Thank you!!

LTLA commented 3 years ago

I don't know. There is nothing particularly complex inside the function that would cause it to behave that way. Perhaps you could debug(filterWindowsGlobal) and step through the function (press Enter to go through line-by-line) and identify the step that stalls.