colomemaria / epiAneufinder

R package to detect breakpoints and assign somies to scATAC-seq data
GNU General Public License v3.0

Issues with "Correcting for GC bias" step #22

Closed ngdog7 closed 4 months ago

ngdog7 commented 5 months ago

Hello, I am excited to use your package.

I have been working with the provided dataset (sample.tsv) and successfully ran the script up to the GC bias step. However, I am unable to complete this specific step as the script seems to be stuck:

```r
corrected_counts <- peaks[, mclapply(.SD, function(x) {
  # LOESS correction for GC
  fit <- stats::loess(x ~ peaks$GC)
  correction <- mean(x) / fit$fitted
  as.integer(round(x * correction))
}, mc.cores = 4), .SDcols = patterns("cell-")]
```

Have you encountered this problem before?

Thanks, Colin

thek71 commented 5 months ago

Hi Colin,

so far we have not seen any problem like that with the sample dataset. The GC correction step is computationally quite intensive, and for a big dataset (like 10k cells) it can take a day or so to complete, depending also on the resources it was given. Can you please tell me the memory and number of cores you are using? Is there an error produced, or does it just stop without any errors?

Best, Katia

ngdog7 commented 5 months ago

Hi Katia,

I appreciate the quick response and the feedback. I'm currently using 64GB RAM and 32 cores. The algorithm keeps running without stopping (>12 hours), but I'm worried there may be another issue.

To help with my troubleshooting, would you mind sharing the average time for the GC correction step on the sample dataset with your computational resources?

Thanks again for the support, Colin

ngdog7 commented 5 months ago

Solved! Thank you. It took about 2-3 hours with 128GB RAM and 32 cores.

thek71 commented 4 months ago

Hi Colin,

I just ran the GC correction of the sample file on my laptop, using just one core and 16GB RAM, and it took 1.5 minutes. I think you should check the cluster configuration or something like that, because for just 16 cells, as in the sample, it should not require so much time and so many resources.

I am closing the issue now, but if you have any other questions come back ;).

Best, Katia

jojobrew commented 3 months ago

Hi, I have the same issue where it stalls on "Correcting for GC bias". It's very strange because it works on my local machine but not in an HPC environment, and I'm confused why that's the case, even though I specify very large memory and CPU requirements. Even the demo run with the sample dataset, which should be very quick, sometimes fails (as in, it never completes, with no error) and sometimes succeeds, and I cannot figure out why. Any advice would be appreciated.

Thanks, Joseph

thek71 commented 3 months ago

Hi Joseph,

unfortunately I cannot really help without knowing anything about the HPC environment you are using. In our cluster it works perfectly. With an input of 14k cells it took about 36 hours, maybe less. Usually I run it with the following parameters on the cluster:

```
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu 64000
#SBATCH -t 0
```
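For context, directives like these would sit at the top of a SLURM submission script that calls R in batch mode. The sketch below is illustrative only: the module name and the driver script `run_epianeufinder.R` are placeholders, not files from this repository.

```shell
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu 64000
#SBATCH -t 0

# Load R however your cluster provides it (module name is a placeholder)
module load R

# Placeholder for your own driver script that calls epiAneufinder
Rscript run_epianeufinder.R
```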

Did the installation process give you any errors or warnings? Maybe there is something in the local environment that was not installed. But I can only speculate.

Best, Katia

jojobrew commented 3 months ago

Thank you Katia, I will try a few more things, but can I ask, why do you use ntasks=4? Also, do you have any advice on how many cores to use? At the moment I'm trying to use 16 for a 6k cell job. Thanks so much for the help!

Best, Joseph

thek71 commented 3 months ago

What I usually use is 4 cores; for the 14k dataset I used the same configuration. The ntasks option is actually a leftover from another submission that I am too lazy to change ;).

Best, Katia