gevaertlab / MethylMix

Identification of differentially methylated genes in biomedical data

Stuck at Finding nonparametric adjustments #2

Open kokyriakidis opened 5 years ago

kokyriakidis commented 5 years ago

Hello, MethylMix is stuck at

Done cluster 190189 
    Batch correction for the cancer samples.
Removing 4 samples because their batches are too small.
Reading Sample Information File
Reading Expression Data File
Found 18 batches
Found 0 covariate(s)
Standardizing Data across genes
Fitting L/S model and finding priors
Finding nonparametric adjustments

for about 30+ hours. Is there a problem? My code is:

library(MethylMix)
library(doParallel)

cancerSite <- "LIHC"

targetDirectory <- paste0(getwd(), "/")

# Downloading methylation data
METdirectories <- Download_DNAmethylation(cancerSite, targetDirectory)

# Processing methylation data
METProcessedData <- Preprocess_DNAmethylation(cancerSite, METdirectories)

# Saving methylation processed data
saveRDS(METProcessedData, file = paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))

# Downloading gene expression data
GEdirectories <- Download_GeneExpression(cancerSite, targetDirectory)

# Processing gene expression data
GEProcessedData <- Preprocess_GeneExpression(cancerSite, GEdirectories)

# Saving gene expression processed data
saveRDS(GEProcessedData, file = paste0(targetDirectory, "GE_", cancerSite, "_Processed.rds"))

# Clustering probes to genes methylation data
METProcessedData <- readRDS(paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))
res <- ClusterProbes(METProcessedData[[1]], METProcessedData[[2]])

# Putting everything together in one file
toSave <- list(METcancer = res[[1]], METnormal = res[[2]],
               GEcancer = GEProcessedData[[1]], GEnormal = GEProcessedData[[2]],
               ProbeMapping = res$ProbeMapping)
saveRDS(toSave, file = paste0(targetDirectory, "data_", cancerSite, ".rds"))

lucaspatel commented 5 years ago

Hi Konstantinos,

Thanks for reaching out. This is in fact typical: certain cancer types take substantially longer to compute than others, especially on the "Finding nonparametric adjustments" step. If you are able to let the script sit, and have provided sufficient memory, it should eventually complete. If not, let me know and I can see if I still have the processed data files lying around from when I worked on this project.

Sorry for the hassle; ways to reduce this computation for the user are being considered for future versions of MethylMix.

Best,

Lucas

kokyriakidis commented 5 years ago

Hello Lucas,

If you could share the LIHC files, I would be grateful.

Best,

Konstantinos

lucaspatel commented 5 years ago

Konstantinos,

I will keep you posted.

Best,

Lucas

kokyriakidis commented 5 years ago

Hello, did you check it? I noticed that the problem is that, although I use doParallel, MethylMix uses only 1 core on my PC. I used cl <- makeCluster(20); registerDoParallel(cl) but it still uses only 1 core.
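
For reference, a %dopar% loop only runs in parallel once a backend has been registered, and registration only helps code paths that actually call foreach. A minimal sketch (the worker count of 20 is just the example from the comment above):

    library(doParallel)
    library(foreach)

    cl <- parallel::makeCluster(20)     # one worker process per requested core
    doParallel::registerDoParallel(cl)
    foreach::getDoParWorkers()          # should report 20 if registration worked

    # ... run the MethylMix pipeline here ...

    parallel::stopCluster(cl)

At the time of this thread, the non-parametric batch correction apparently contained no foreach loop at all (see the patches further down), which would explain why a registered backend made no difference for this step.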

sudo-yum-sid commented 4 years ago

Hello, I am using the MethylMix package for gastric cancer and I am facing the same problem. I am also using the doParallel package, but it still uses only 1 core. It has been more than 2 days and the job is still running.

anca-a commented 4 years ago

Hi, I am experiencing the same problem: the script has been running for 66+ hours for one cancer type, and I was planning to analyze multiple cancer types. Is there any recommended workaround?

yi6kim commented 4 years ago

Same issue here for Rectal Cancer.

cancerSite <- "READ"  # TCGA code for rectal cancer
targetDirectory <- paste0(getwd(), "/")
GetData(cancerSite, targetDirectory)

I tried running the above three lines, and it's been stuck forever at 'Finding nonparametric adjustments' in my console.

Will this issue be fixed at some point?

wp1g19 commented 4 years ago

Exact same problem here. It's a great shame the package hasn't been updated in the past 7 months or so. It also uses only 1 core.

wp1g19 commented 4 years ago

I ran this for 9 days straight and it did not complete. Please fix your package.

hamma2 commented 3 years ago

So regarding the parallelization of the non-parametric adjustments: I added a foreach loop in ComBat_NoFiles.R. It starts a worker for each batch of your data, so 16 batches = 16 workers. A better way would probably be to parallelize the Monte Carlo implementation directly, but I am no expert, so I'll leave that to someone else. If this finds approval, something like it will hopefully make it into the Bioconductor git.

    # Run the non-parametric adjustment for each batch on its own worker
    final_temp <- foreach::foreach(i = 1:n.batch, .combine = rbind,
                                   .export = c("int.eprior")) %dopar% {
        int.eprior(as.matrix(s.data[, batches[[i]]]), gamma.hat[i, ], delta.hat[i, ])
    }

    # Split the stacked per-batch results back into the two result matrices
    gamma.star <- delta.star <- NULL
    for (i in 1:nrow(final_temp)) {
        if (rownames(final_temp)[i] == "g.star") {
            gamma.star <- rbind(gamma.star, final_temp[i, ])
        } else {
            delta.star <- rbind(delta.star, final_temp[i, ])
        }
    }
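
Note that this splits the work per batch, so with the 18 batches reported in the log above it can use at most 18 workers; the per-row version in the next comment scales with the number of probe clusters instead.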

hamma2 commented 3 years ago

Alright, so I think I figured out a way to parallelize the Monte Carlo function. I think this should do the trick:

int.eprior <- function(sdat, g.hat, d.hat) {
    r <- nrow(sdat)
    print(r)
    # One parallel task per row: leave-one-out empirical Bayes estimates
    adj <- foreach::foreach(i = 1:r, .combine = cbind) %dopar% {
        g <- g.hat[-i]
        d <- d.hat[-i]
        x <- sdat[i, !is.na(sdat[i, ])]
        n <- length(x)
        j <- numeric(n) + 1
        dat <- matrix(as.numeric(x), length(g), n, byrow = TRUE)
        resid2 <- (dat - g)^2
        sum2 <- resid2 %*% j
        # Gaussian likelihood of row i under every other row's parameters
        LH <- 1 / (2 * pi * d)^(n / 2) * exp(-sum2 / (2 * d))
        LH[is.nan(LH)] <- 0   # guard against numeric over/underflow
        g.star <- sum(g * LH) / sum(LH)
        d.star <- sum(d * LH) / sum(LH)
        list(g.star, d.star)
    }
    # Reassemble the per-row results into the 2-row matrix ComBat expects
    adjust <- rbind(unlist(adj[1, ]), unlist(adj[2, ]))
    rownames(adjust) <- c("g.star", "d.star")
    adjust
}
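
To actually use the patched function without rebuilding the package, one option is to hot-swap it into the installed namespace. A sketch, assuming int.eprior is an unexported function of that name in the MethylMix namespace (as the reference to ComBat_NoFiles.R suggests) and reusing GetData from earlier in the thread:

    library(MethylMix)
    library(doParallel)

    # Hot-swap the patched implementation over the packaged one (assumption:
    # the function is named int.eprior inside the MethylMix namespace)
    utils::assignInNamespace("int.eprior", int.eprior, ns = "MethylMix")

    # %dopar% still needs a registered backend to use more than 1 core
    cl <- parallel::makeCluster(parallel::detectCores() - 1)
    doParallel::registerDoParallel(cl)
    GetData(cancerSite, targetDirectory)
    parallel::stopCluster(cl)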

ayueme commented 3 years ago

Hi, I am experiencing the same problem. Will it be solved?

pdufog commented 2 years ago

Any update on this problem? It must have worked before since there are multiple publications using this package.

yangxue244 commented 1 year ago

> Alright, so I think I figured out a way to parallelize the Monte Carlo function. I think this should do the trick: […]

Hello, I have tried this method, but an error comes out: Error in unserialize(socklist[[n]]): error reading from connection, or Error in serialize(data, node$con, xdr = FALSE): error writing to connection.

The 27k dataset preprocessed fine, but this error comes up halfway through preprocessing the 450k dataset.

I have tried 2-3 times. Please help me!
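
Those serialize/unserialize errors usually mean a worker process died mid-task, often from running out of memory on the much larger 450k array, so the master loses its socket connection to that worker. A debugging sketch (the worker count of 4 and the outfile value are just examples):

    # Fewer workers lowers total memory pressure; outfile = "" streams each
    # worker's output, including crash messages, to the master console.
    cl <- parallel::makeCluster(4, outfile = "")
    doParallel::registerDoParallel(cl)

    GetData(cancerSite, targetDirectory)

    parallel::stopCluster(cl)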