Issue about the cgroup out-of-memory handler and taking a long time

HLHsieh commented 1 year ago

Hi there,

Thanks for this great tool. I have used it on several sets of data. Most of them ran smoothly, but three sets ran into problems.

I executed the following command for these three sets:

python $script -i ${myseq}.fasta -t fasta -r $genome -b $predefined -c 12 --samtools $samtools_path --minimap2 $minimap2_path -o ${myseq}

For two of them, I got "slurmstepd: error: Detected 1 oom-kill event(s) in StepId=50190480.batch. Some of your processes may have been killed by the cgroup out-of-memory handler." I have tried requesting as much memory as I can, but it did not help.

Here are the resources requested for each job:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=3
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=15GB
#SBATCH --time=24:00:00
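
As a rough check of what this request amounts to, here is a minimal sketch. It assumes Slurm grants memory as `--mem-per-cpu` multiplied by the number of allocated CPUs (tasks per node times CPUs per task); the function name is made up for illustration.

```python
def total_request(nodes, ntasks_per_node, cpus_per_task, mem_per_cpu_gb):
    """Return (total CPUs, total memory in GB) implied by the #SBATCH lines.

    Assumption: memory is granted as mem-per-cpu times the allocated CPUs.
    """
    cpus = nodes * ntasks_per_node * cpus_per_task
    return cpus, cpus * mem_per_cpu_gb


cpus, mem_gb = total_request(nodes=1, ntasks_per_node=3, cpus_per_task=4, mem_per_cpu_gb=15)
print(f"{cpus} CPUs, {mem_gb} GB in total")
```

Under that assumption the job asks for 12 CPUs (matching `-c 12` in the command) and 180 GB of memory.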

The third one took much longer (24 hours) than the other datasets (usually about 2 hours).

I would appreciate it if you could advise.

Many Thanks, Hsin

fangli80 commented 1 year ago

It might be that there are very large repeats in the datasets. Can you dig out the last command that ran before it was killed? You can find it in the stderr output, such as:

HLHsieh commented 1 year ago

Here it is:

[04/02/2023 18:44:37] NOTICE: Running command: /sw/spack/bio/pkgs/gcc-10.3.0/minimap2/2.14-jvscyilw/bin/minimap2 -c -t 12  -x map-ont  /gpfs/accounts/home/NanoRepeat/C9ORF72_p3_NanoSim_100x.NanoRepeat_temp_dir.chr9-27573494-27573708-GGCCCC/round1_ref.fasta /gpfs/accounts/home/NanoRepeat/C9ORF72_p3_NanoSim_100x.NanoRepeat_temp_dir.chr9-27573494-27573708-GGCCCC/core_sequences.fastq > /gpfs/accounts/home/NanoRepeat/C9ORF72_p3_NanoSim_100x.NanoRepeat_temp_dir.chr9-27573494-27573708-GGCCCC/round1.paf 2> /dev/null
[04/02/2023 18:46:38] NOTICE: Step 2 finished
[04/02/2023 18:46:38] NOTICE: Step 3: round 3 estimation
sh: line 1: 4093652 Killed                  /sw/spack/bio/pkgs/gcc-10.3.0/minimap2/2.14-jvscyilw/bin/minimap2 -x map-ont -N 100 -c --eqx -t 12 /gpfs/accounts/home/NanoRepeat/C9ORF72_p3_NanoSim_100x.NanoRepeat_temp_dir.chr9-27573494-27573708-GGCCCC/round3_ref.fasta /gpfs/accounts/home/NanoRepeat/C9ORF72_p3_NanoSim_100x.NanoRepeat_temp_dir.chr9-27573494-27573708-GGCCCC/core_sequences.fastq > /gpfs/accounts/home/NanoRepeat/C9ORF72_p3_NanoSim_100x.NanoRepeat_temp_dir.chr9-27573494-27573708-GGCCCC/round3.paf 2> /dev/null
[04/02/2023 18:47:07] ERROR: Failed to run command: /sw/spack/bio/pkgs/gcc-10.3.0/minimap2/2.14-jvscyilw/bin/minimap2  -x map-ont  -N 100 -c --eqx -t 12 /gpfs/accounts/home/NanoRepeat/C9ORF72_p3_NanoSim_100x.NanoRepeat_temp_dir.chr9-27573494-27573708-GGCCCC/round3_ref.fasta /gpfs/accounts/home/NanoRepeat/C9ORF72_p3_NanoSim_100x.NanoRepeat_temp_dir.chr9-27573494-27573708-GGCCCC/core_sequences.fastq > /gpfs/accounts/home/NanoRepeat/C9ORF72_p3_NanoSim_100x.NanoRepeat_temp_dir.chr9-27573494-27573708-GGCCCC/round3.paf 2> /dev/null
[04/02/2023 18:47:07] Return value is: 35072
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=50199088.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

They all do indeed have very large repeats. The largest one took 24 hours but did finish. The other two hit the issue above.

Best, Hsin

SeAudet commented 1 year ago

Hello,

Just wanted to mention that I encountered the same issue, where round 3 estimation never ends, although it apparently doesn't run out of memory according to our resource manager. That single step ran for over 6 days with 8 cores and 32GB of RAM (using only 17.5GB) before timing out. It did initially report the oom-kill event when running with only 16GB, but increasing the allocated memory seemingly fixed that issue.

From what I can gather, the nature of the repeat is not the issue; rather, the amount of available data seems to be bottlenecking the process. It ran in less than a day with around 10K reads, but for samples where over 100K reads are available (good quality reads with nice repeats), the processing is perhaps too slow and doesn't speed up with more cores/memory. I removed the 2> /dev/null to see if there was a hidden error, but the command line doesn't seem to produce one.

I'll probably just randomly subset my data into technical replicates for it to run (output is generally overall very nice from my tests!), but thought it was a good idea to mention I also got that problem! Thank you in advance for your time!

Sincerely, Seb

fangli80 commented 1 year ago

Hello @SeAudet Thanks for letting me know. NanoRepeat was tested on datasets of 50-200X coverage. If there are 10K-100K reads, speed could be an issue.

Usually I can get an accurate estimate of repeat sizes from < 1000X coverage, so it's okay to sub-sample the dataset.
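
For anyone who wants to sub-sample before running, here is a minimal sketch of uniform random sub-sampling of a FASTA file using reservoir sampling. It is not part of NanoRepeat; the function names and the file names in the usage note are hypothetical.

```python
import random


def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subsample_fasta(in_path, out_path, n_reads, seed=0):
    """Write a uniform random sample of at most n_reads records.

    Uses reservoir sampling, so the input is read once and only
    n_reads records are held in memory at any time.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(read_fasta(in_path)):
        if i < n_reads:
            reservoir.append(record)
        else:
            # Keep the new record with probability n_reads / (i + 1).
            j = rng.randint(0, i)
            if j < n_reads:
                reservoir[j] = record
    with open(out_path, "w") as out:
        for header, seq in reservoir:
            out.write(f"{header}\n{seq}\n")
    return len(reservoir)
```

For example, `subsample_fasta("all_reads.fasta", "subset.fasta", n_reads=10000, seed=1)` would write 10,000 randomly chosen reads; changing the seed gives a different technical replicate.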

I will work on improving speed for future versions.

Sincerely, Li

HLHsieh commented 1 year ago

Hi @fangli80 ,

I would like to provide some additional information about my analysis. Specifically, I would like to mention that my dataset was generated at a coverage of 100x.

Sincerely, Hsin

fangli80 commented 1 year ago

Thanks for letting me know. So you are working on a 100X dataset with multiple repeat regions? May I ask how many repeat regions there are?

Thanks, Li

HLHsieh commented 1 year ago

Hi @fangli80, this is the BED file I used for this 100X dataset:

chr9    27573494        27573708        GGCCCC

fangli80 commented 1 year ago

@HLHsieh

It seems that the location of the GGCCCC repeat is not accurate. If I extract the region chr9:27573494-27573708 from hg38 or hg19, I get the following sequences:

hg38_chr9:27573495-27573708: GGGCCCGCCCCCGGGCCCGCCCCGACCACGCCCCGGCCCCGGCCCCGGCCCCTAGCGCGCGACTCCTGAGTTCCAGAGCTTGCTACAGGCTGCGGTTGTTTCCCTCCTTGTTTTCTTCTGGTTAATCTTTATCAGGTCTTTTCTTGTTCACCCTCAGCGAGTACTGTGAGAGCAAGTAGTGGGGAGAGAGGGTGGGAAAAACAAAAACACACAC

hg19_chr9:27573495-27573708: GCCCGCCCCCGGGCCCGCCCCGACCACGCCCCGGCCCCGGCCCCGGCCCCTAGCGCGCGACTCCTGAGTTCCAGAGCTTGCTACAGGCTGCGGTTGTTTCCCTCCTTGTTTTCTTCTGGTTAATCTTTATCAGGTCTTTTCTTGTTCACCCTCAGCGAGTACTGTGAGAGCAAGTAGTGGGGAGAGAGGGTGGGAAAAACAAAAACACACACCT

Only about the first 28-30 bp are the repeat. I checked the RepeatMasker annotation (from here); the repeat position is chr9:27573485-27573546.
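
A note on coordinates: the one-base difference between the BED start (27573494) and the start in the sequence labels above (27573495) is what you would expect from BED intervals being 0-based and half-open, while `chr:start-end` labels are 1-based and inclusive. A minimal sketch of the conversion (the helper names are made up for illustration):

```python
def bed_to_region(chrom, bed_start, bed_end):
    """Convert a 0-based, half-open BED interval to a 1-based, inclusive region string."""
    return f"{chrom}:{bed_start + 1}-{bed_end}"


def bed_length(bed_start, bed_end):
    """Number of bases covered by a 0-based, half-open BED interval."""
    return bed_end - bed_start


print(bed_to_region("chr9", 27573494, 27573708))  # chr9:27573495-27573708
print(bed_length(27573494, 27573708))             # 214
```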

The latest version of NanoRepeat will check whether the repeat region is correct, and it will give a warning message if the repeat region is not accurate.
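
To sanity-check a region by hand, one simple approach is to look for the longest perfect tandem run of the motif in the extracted sequence and see how much of the region it covers. The sketch below illustrates that idea only; it is not the check that NanoRepeat itself performs, and the function name is hypothetical.

```python
import re


def longest_motif_run(sequence, motif):
    """Return (start, end) of the longest perfect tandem run of `motif`.

    All rotations of the motif are tried (e.g. GGCCCC, GCCCCG, ...).
    Coordinates are 0-based and half-open within `sequence`.
    """
    sequence = sequence.upper()
    motif = motif.upper()
    best = (0, 0)
    for shift in range(len(motif)):
        rotated = motif[shift:] + motif[:shift]
        for match in re.finditer(f"(?:{re.escape(rotated)})+", sequence):
            if match.end() - match.start() > best[1] - best[0]:
                best = (match.start(), match.end())
    return best


# Toy example: 3 flanking bases, three copies of the motif, 3 flanking bases.
start, end = longest_motif_run("AAAGGCCCCGGCCCCGGCCCCTTT", "GGCCCC")
print(start, end, (end - start) / 24)
```

If the longest run covers only a small fraction of the region, the BED interval probably includes a lot of flanking sequence.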

You can use the following command to install the latest version of NanoRepeat:

git clone https://github.com/WGLab/NanoRepeat.git
cd NanoRepeat
pip install .

If you supply the correct region, NanoRepeat can finish repeat quantification in a few minutes.

HLHsieh commented 1 year ago

Hi Li,

I have tried the latest version of the software you provided, but unfortunately, it did not work for me. I encountered an error message indicating an issue with running the command:

python /bin/NanoRepeat/src/NanoRepeat/nanoRepeat.py -i /DRD4_p2_stimulated/DRD4_p2_NanoSim_2x/DRD4_p2_NanoSim_2x.fasta -t fasta -r /Reference/Human/Genome/hg38/genome.fa -b /reference/myDefinedRepeat_NanoRepeat_chr11.bed -c 4 --samtools /sw/spack/bio/pkgs/gcc-10.3.0/samtools/1.13-fwwss5nm/bin/samtools --minimap2 /sw/spack/bio/pkgs/gcc-10.3.0/minimap2/2.14-jvscyilw/bin/minimap2 -o DRD4_p2_NanoSim_2x
[05/12/2023 14:47:49] NOTICE: Input file is: /DRD4_p2_stimulated/DRD4_p2_NanoSim_2x/DRD4_p2_NanoSim_2x.fasta
[05/12/2023 14:47:49] NOTICE: Input type is: fasta
[05/12/2023 14:47:49] NOTICE: Reference fasta file is: /Reference/Human/Genome/hg38/genome.fa
[05/12/2023 14:47:49] NOTICE: Output prefix is: /DRD4_p2_stimulated/DRD4_p2_NanoSim_2x/NanoRepeat_new/DRD4_p2_NanoSim_2x
[05/12/2023 14:47:49] NOTICE: Repeat region bed file is: /reference/myDefinedRepeat_NanoRepeat_chr11.bed
[05/12/2023 14:49:27] ERROR: Failed to run command: /sw/spack/bio/pkgs/gcc-10.3.0/samtools/1.13-fwwss5nm/bin/samtools view -hb -@ 4 /DRD4_p2_stimulated/DRD4_p2_NanoSim_2x/NanoRepeat_new/DRD4_p2_NanoSim_2x.minimap2.sam > /DRD4_p2_stimulated/DRD4_p2_NanoSim_2x/NanoRepeat_new/DRD4_p2_NanoSim_2x.minimap2.bam 2> /dev/null
[05/12/2023 14:49:27] Return value is: 256

However, the previous version of the software is still functional for me, so I will continue using that version. I wanted to bring this issue to your attention.

Additionally, I am still encountering the out-of-memory issue with the cgroup handler in several of my analyses, even after making the necessary corrections to the repeat bed file.

This particular analysis involves a 6 bp repeat unit with a repeat size of 3000; its coverage is 100x, with about 2400000 reads.

Many thanks, Hsin

fangli80 commented 1 year ago

Hello Hsin, Thanks for letting me know. It seems that the data is simulated. If it is not patient data, could you please email it to me so that I can test on my end?

By the way, why does it have 2400000 reads when the coverage is 100X? Is it because the 2400000 reads are from many different repeat regions?

Thanks, Li

fangli80 commented 1 year ago

To support pip install, the new version has changed its installation method.

1) If you want to install the latest version from GitHub, please run:

git clone https://github.com/WGLab/NanoRepeat.git
cd NanoRepeat
pip install .

If successful, nanoRepeat.py will be in a folder that is in the $PATH variable and you can run nanoRepeat.py directly.

Please don't run python ./NanoRepeat/src/NanoRepeat/nanoRepeat.py directly, because this is the source code and is no longer the installed path.

2) If you want to install a specific released version (e.g. v1.4.0), you can use:

pip install NanoRepeat==1.4.0

Same as above, nanoRepeat.py will be in a folder that is in the $PATH variable (usually a /bin folder) and you can directly type nanoRepeat.py without specifying the full path.
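
A quick way to confirm that the installed script is actually on $PATH is sketched below. It uses only the Python standard library; the helper name is made up for illustration.

```python
import shutil


def find_on_path(program="nanoRepeat.py"):
    """Return the full path of `program` if it is found on $PATH, else None."""
    return shutil.which(program)


location = find_on_path()
if location is None:
    print("nanoRepeat.py is not on $PATH; the pip install may not have completed")
else:
    print(f"nanoRepeat.py found at {location}")
```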

HLHsieh commented 1 year ago

Hi Li,

Thank you for your continued support. I wanted to let you know that I have tried the latest version 1.5 of the tool, and I'm happy to report that I did not encounter the issue with the cgroup out-of-memory handler that I had experienced during some of my previous analyses. I must say, this tool is truly amazing and incredibly easy to access. Thank you for your assistance.

Many thanks, Hsin

fangli80 commented 1 year ago

Thanks for reporting bugs to me. Please feel free to let me know if there are other issues.

Cheers, Li

HLHsieh commented 1 year ago

Hi Li,

Unfortunately, I have encountered the same issue again during my analysis.

I ran the following command using NanoRepeat v1.5, along with minimap2 v2.24 and samtools v1.13:

nanoRepeat.py -i /C9ORF72_3_stimulated/C9ORF72_3_NanoSim_30x/C9ORF72_3_NanoSim_30x.fasta -t fasta -r /Reference/Human/Genome/hg38/genome.fa -b /reference/myDefinedRepeat_NanoRepeat_chr9.bed -c 8 --samtools /sw/spack/bio/pkgs/gcc-10.3.0/samtools/1.13-fwwss5nm/bin/samtools --minimap2 /minimap2-2.24_x64-linux/minimap2 -o C9ORF72_3_NanoSim_30x

Here are some messages that appeared before the error message:

[05/25/2023 01:19:26] NOTICE: Step 3: round 3 estimation
[05/25/2023 01:19:26] NOTICE: Running command: /minimap2-2.24_x64-linux/minimap2  -x map-ont  -f 0.0 -N 100 -c --eqx -t 8 /C9ORF72_3_stimulated/C9ORF72_3_NanoSim_30x/NanoRepeat/C9ORF72_3_NanoSim_30x.NanoRepeat_temp_dir.chr9-27573528-27573546-GGCCCC/round3_ref.fasta /C9ORF72_3_stimulated/C9ORF72_3_NanoSim_30x/NanoRepeat/C9ORF72_3_NanoSim_30x.NanoRepeat_temp_dir.chr9-27573528-27573546-GGCCCC/core_sequences.fastq > /C9ORF72_3_stimulated/C9ORF72_3_NanoSim_30x/NanoRepeat/C9ORF72_3_NanoSim_30x.NanoRepeat_temp_dir.chr9-27573528-27573546-GGCCCC/round3.paf
[05/25/2023 01:24:52] ERROR: Failed to run command: /minimap2-2.24_x64-linux/minimap2  -x map-ont  -f 0.0 -N 100 -c --eqx -t 8 /C9ORF72_3_stimulated/C9ORF72_3_NanoSim_30x/NanoRepeat/C9ORF72_3_NanoSim_30x.NanoRepeat_temp_dir.chr9-27573528-27573546-GGCCCC/round3_ref.fasta /C9ORF72_3_stimulated/C9ORF72_3_NanoSim_30x/NanoRepeat/C9ORF72_3_NanoSim_30x.NanoRepeat_temp_dir.chr9-27573528-27573546-GGCCCC/core_sequences.fastq > /C9ORF72_3_stimulated/C9ORF72_3_NanoSim_30x/NanoRepeat/C9ORF72_3_NanoSim_30x.NanoRepeat_temp_dir.chr9-27573528-27573546-GGCCCC/round3.paf
[05/25/2023 01:24:52] Return value is: 35072
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=53724533.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

I would like to highlight that this error occurred during Step 3: round 3 estimation, and the return value was 35072. I'm curious to know if there are any specific reasons for this error.
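
For reference, here is one way to read these return values. It assumes (this is not confirmed anywhere in the thread) that the number NanoRepeat prints is the raw POSIX wait status of a command run through a shell, as `os.system` returns on Linux. Under that assumption the exit code sits in the high byte, and shells report a child killed by signal N as exit code 128 + N.

```python
import signal


def decode_wait_status(status):
    """Split a POSIX wait status into (exit_code, terminating_signal)."""
    exit_code = (status >> 8) & 0xFF  # high byte: exit code of the shell
    term_signal = status & 0x7F       # low 7 bits: signal that killed the shell itself, if any
    return exit_code, term_signal


def explain(status):
    """Describe a wait status in words."""
    exit_code, term_signal = decode_wait_status(status)
    if term_signal:
        return f"killed by signal {term_signal}"
    if exit_code > 128:
        # Shell convention: 128 + N means the command died from signal N.
        try:
            name = signal.Signals(exit_code - 128).name
        except ValueError:
            name = f"signal {exit_code - 128}"
        return f"exit code {exit_code} (128 + {name})"
    return f"exit code {exit_code}"


print(explain(35072))  # the value reported for the minimap2 step
print(explain(256))    # the value reported for the samtools step earlier in this thread
```

Read this way, 35072 is exit code 137, i.e. the command was killed with SIGKILL, which is consistent with the oom-kill message from slurmstepd. The earlier 256 is a plain exit code 1.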

I would greatly appreciate your insights or suggestions regarding this issue.

Thank you for your assistance, Hsin

fangli80 commented 1 year ago

Sorry for the late reply.
I noticed that the input data is C9ORF72_3_NanoSim_30x.fasta. Is it different from the data that you shared with me (C9ORF72_p2_NanoSim_30x.fasta)? If so, what is the difference? Thanks, Li

HLHsieh commented 1 year ago

Hi Li,

Thank you for your response. I have been persistently working on these data analyses, and I finally got a successful run this morning. I would like to share some information with you.

For the C9ORF72_p2_NanoSim_30x.fasta dataset, the memory consumption is approximately 15-20 GB, and the analysis takes around 1 hour to complete. On the other hand, for the C9ORF72_3_NanoSim_30x.fasta dataset, the memory consumption is considerably higher at around 110-120 GB, and the analysis takes approximately 5 hours to finish. It is important to note that both datasets have the same sequencing depth.
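
For anyone who wants to collect comparable numbers, here is a minimal sketch of measuring the peak memory of a command from Python on Linux. It is not how the figures above were obtained (that is not stated); the function name is hypothetical, and the command in the example is just a placeholder.

```python
import resource
import subprocess
import sys


def run_and_report_peak_memory(command):
    """Run `command` (a list of arguments) and return (return_code, peak_rss_gb).

    ru_maxrss for RUSAGE_CHILDREN is the largest resident set size among
    the child processes that have been waited for; on Linux it is in kilobytes.
    """
    completed = subprocess.run(command)
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return completed.returncode, peak_kb / (1024 ** 2)


# Placeholder command; replace with the real command line to measure.
code, peak_gb = run_and_report_peak_memory([sys.executable, "-c", "print('hello')"])
print(f"return code {code}, peak memory {peak_gb:.3f} GB")
```

Because the figure is the maximum over all waited-for children, call this once per command in a fresh Python process to avoid mixing measurements.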

If you are interested in investigating the reasons behind this discrepancy, I would be more than happy to share the data with you.

Thank you once again for your support.

Best regards, Hsin

fangli80 commented 1 year ago

Hello Hsin, It would be great if you can share the data with me.

Thanks! Li

Issue about the cgroup out-of-memory handler and taking a long time #7