DecodeGenetics / Ratatosk

Hybrid error correction of long reads using colored de Bruijn graphs
BSD 2-Clause "Simplified" License

Can I split stage 1 indexing onto subsequent jobs? #40

Closed. asan-emirsaleh closed this issue 1 year ago

asan-emirsaleh commented 1 year ago

Thank you for a really amazing tool.

I have run into some errors while trying to use Ratatosk, generally time or memory limit errors. Execution stopped at indexing stage 1 after days of computation. As I saw in #35, such behavior is not normal, so there may be some mistakes in my pipeline. I therefore need a way to split the indexing stage into separate jobs so that intermediate files are saved. If I simply rerun the indexing stage, it does not reuse the files already computed and starts from zero, so I would lose days of computation.

So I allocated 240 GB of RAM and 120 threads via the workload manager. The job ran for 12 days before it was stopped:

$ sacct -j {JobId} --format=CPUTimeRAW,Elapsed,NCPUS,State,AveVMSize
CPUTimeRAW    Elapsed      NCPUS      State  AveVMSize 
---------- ---------- ----------- ---------- ---------- 
   1070438 12-09:20:38          1 OUT_OF_ME+ 259099432K 

It resulted in two files:

$ls -lah
drwxr-xr-x 4 asan clugroup    4 Oct  2 16:28 .
drwxr-xr-x 4 asan clugroup   14 Oct  2 17:02 ..
-rw-r--r-- 1 asan clugroup  59G Oct  1 03:45 out_long_reads.2.pids.2
-rw-r--r-- 1 asan clugroup 7.7G Sep 21 07:49 out_long_reads.index.k63.fasta

As input data I used 150 bp PE reads (FASTQ) and ONT reads already corrected by overlap consensus (FASTA):

$echo "count of forward pe reads" ; <$SR1 bioawk -t -c fastx 'END {print NR}'
count of forward pe reads
235136088
$echo "count of reverse pe reads" ; <$SR2 bioawk -t -c fastx 'END {print NR}'
count of reverse pe reads
235136088
$echo "count of long reads" ; <$LR bioawk -t -c fastx 'END {print NR}'
count of long reads
597518

The genome whose reads I am trying to correct is approximately 1.4 Gb.
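
For reference, a rough short-read coverage estimate from these numbers (assuming all PE reads are exactly 150 bp):

$ echo "2 * 235136088 * 150 / 1400000000" | bc    # total short-read bases / genome size
50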

The command I used was:

Ratatosk index \
    -1 \
    -v \
    -c $THR \
    -s $SR_LIST \
    -l $LR \
    -i 247 \
    -o out_long_reads

The last lines reported were:

$ tail slurm-{JobId}.out
Ratatosk::addCoverage(): Subsampling reads (0.454545) coloring vertices with k-mer coverage [70, 100[
Ratatosk::addCoverage(): Subsampling reads (0.454545) coloring vertices with k-mer coverage [100, 156[
Ratatosk::addCoverage(): Subsampling reads (0.454545) coloring vertices with k-mer coverage [156, 285[
Ratatosk::addCoverage(): Subsampling reads (0.454545) coloring vertices with k-mer coverage [285, 744[
Ratatosk::addCoverage(): Subsampling reads (0.454545) coloring vertices with k-mer coverage [744, 2968043[
Ratatosk::addCoverage(): Compacting subsampled reads.
Ratatosk::addCoverage(): Coloring graph with subsampled reads.
Ratatosk::addCoverage(): Compacting colors.
/var/spool/slurm/{JobId}/slurm_script: line 43: 1688302 Killed                  Ratatosk index -1 -v -c $THR -s $SR_LIST -l $LR -i 247 -o out_long_reads
slurmstepd: error: Detected 1 oom-kill event(s) in StepId={JobId}.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler

What do you recommend I do?

GuillaumeHolley commented 1 year ago

Hi @asan-emirsaleh,

Wow, 12 days running time is indeed abnormal and there is clearly an issue here. In general, indexing is the most time and memory consuming step of Ratatosk and it is the only step that cannot be split between jobs. I would say that the issue in https://github.com/DecodeGenetics/Ratatosk/issues/35 is unrelated to yours since the problem there was that a very high coverage of short reads was used (170x) for the indexing which was making the running time exceedingly long. Also, since that issue, Ratatosk got a lot faster and now, for 60x coverage PE reads of a human genome, I wouldn't expect the indexing to go beyond 1000 CPU hours. You have only 50-60x coverage so coverage is not the issue here. I haven't had issues with that version of Ratatosk for as long as I can remember now.
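
For scale, a rough back-of-the-envelope (assuming the indexing scales reasonably well with threads):

# 1000 CPU hours spread over 120 threads ≈ 1000 / 120 ≈ 8.3 hours of wall-clock time
# your sacct output above shows CPUTimeRAW = 1070438 s ≈ 297 CPU hours, accrued on a single CPU over 12 days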

However, looking at the output of your sacct command, I see that NCPUS=1 but you said you have allocated 120 threads to this job. In general, "CPU" and "core" mean two different things but in the context of Slurm, I believe they mean the same. Or they might mean "CPU cores". The point is, I think that NCPUS should be equal to 120 here. Is it possible that Ratatosk was run with a single thread rather than 120, which would explain the very long running time? Or that you got a one-thread job from Slurm but tried to use 120 threads, which would also result in a long running time?
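
If it helps, here is a minimal sketch of a submission script where Slurm and Ratatosk agree on the thread count ($SR_LIST and $LR are your variables from the original command; everything else is a placeholder):

#!/bin/bash
#SBATCH --cpus-per-task=120            # CPUs Slurm actually reserves for the job
#SBATCH --mem=240G                     # memory for the whole job
THR=${SLURM_CPUS_PER_TASK:-120}        # reuse the allocation instead of hard-coding it
Ratatosk index -1 -v -c $THR -s $SR_LIST -l $LR -i 247 -o out_long_reads

Afterwards, sacct -j {JobId} --format=NCPUS,ReqCPUS should report 120 for the job.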

Guillaume

asan-emirsaleh commented 1 year ago

Hi @GuillaumeHolley, yes, you are right. The issue seems to be related to thread allocation. Normally I define threads directly in the script (not in the workload manager parameters) and SLURM counts how many threads I request by analyzing the script code. In the case of Ratatosk it could not understand my Ratatosk index -c 120 command. So I reran the script with the CPU threads allocated directly as parameters:

#SBATCH --mem 500GB
#SBATCH -c 122

The sacct output is different this time, but the computation appears to take long as well:

$sacct -j {JobId} --format=CPUTimeRAW,Elapsed,NCPUS,State,AveVMSize,ReqCPUs,ReqMem
CPUTimeRAW    Elapsed      NCPUS      State  AveVMSize  ReqCPUS     ReqMem  
---------- ---------- ---------- ---------- ---------- -------- ----------
  15241216 1-10:42:08        122    RUNNING                 122      500Gn
$tail slurm-{JobId}.out  
Ratatosk::addCoverage(): Processed 50000000 reads. 
Ratatosk::addCoverage(): Processed 51000000 reads.
Ratatosk::addCoverage(): Processed 52000000 reads.
Ratatosk::addCoverage(): Processed 53000000 reads.
Ratatosk::addCoverage(): Processed 54000000 reads.
Ratatosk::addCoverage(): Processed 55000000 reads.
Ratatosk::addCoverage(): Processed 56000000 reads.
Ratatosk::addCoverage(): Processed 57000000 reads.
Ratatosk::addCoverage(): Processed 58000000 reads.
Ratatosk::addCoverage(): Processed 59000000 reads.

So I am still waiting for the job to complete. As you can see, NCPUS is different now (and consistent with the CPU time consumed), but the indexing is still very slow. If there were time checkpoints in the logs it would be easy to compare the two outputs; for example, I would see how much time each subsequent stage consumed.
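
As a stopgap on my side, I could add the timestamps myself by wrapping the command, roughly like this (just a sketch with standard shell tools; the log file name is arbitrary):

stdbuf -oL Ratatosk index -1 -v -c $THR -s $SR_LIST -l $LR -i 247 -o out_long_reads 2>&1 \
    | while IFS= read -r line; do printf '%s %s\n' "$(date '+%F %T')" "$line"; done \
    | tee ratatosk_timestamped.log
# stdbuf -oL asks for line-buffered output so timestamps stay close to when each line was printed
# (whether it takes effect depends on how the program writes its output)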

I want to ask you again: is there any way to split the indexing step into consecutive jobs, for example "building graph", "splitting unitigs", "joining unitigs", "coloring graph", so that the script can resume after it has crashed? I am not familiar with C++ so I cannot propose a solution.

GuillaumeHolley commented 1 year ago

Hi @asan-emirsaleh,

Glad to hear there is a little bit of progress here. As I said before, the indexing stage is the only step of Ratatosk that cannot be split. There is currently no way of doing it. I am not saying it is impossible to implement but it would require some heavy work and unfortunately, I just do not have this kind of time available right now.

That being said, I think it is still very abnormal that after more than 4000 CPU hours, the indexing is only at this stage with that kind of input. There is something else going on. Memory is clearly not the issue since you gave the job 500 GB and I wouldn't expect Ratatosk to use that much.

Now that we know the job has 122 cores allocated, I am sorry to ask but are you 100% sure that Ratatosk runs with -c 120 as argument? Is there any way for you to ssh to the machine, run top and see how much CPU % Ratatosk is using?

Something else you can do is check the output folder of Ratatosk (where the index will be written). There should be a file there named *.index.k63.fasta or something like that. The timestamp on that file (ls -lh) will tell you when it was last modified, which is when Ratatosk wrote it to disk. Given that you know how long the job has been running (1 day and about 11 hours at the time you wrote your last answer), you can take the difference between those two dates and know how long it took Ratatosk to get there. It would be helpful for me to know this.

This is obviously not optimal and I agree with you that a proper logging system with timestamps would be more helpful. I'll add this to my todo list.
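
Concretely, something along these lines gives the two timestamps to compare (replace {JobId} with yours):

$ sacct -j {JobId} --format=Start,Elapsed                        # when the job started and how long it has run
$ ls -lh --time-style=full-iso out_long_reads.index.k63.fasta    # when Ratatosk last wrote the index file
# the gap between the job's Start time and the file's modification time is
# how long Ratatosk needed to reach the point where it wrote that file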

asan-emirsaleh commented 1 year ago

Hello @GuillaumeHolley, there is an unusual situation with the Ratatosk execution. Six days have passed and there are still no results. The sacct output is:

$ sacct -j {JobId} --format=CPUTimeRAW,Elapsed,NCPUS,State,AveVMSize,ReqCPUs,ReqMem
CPUTimeRAW    Elapsed      NCPUS      State  AveVMSize  ReqCPUS     ReqMem
---------- ---------- ---------- ---------- ---------- -------- ----------
  63384368 6-00:19:04        122    RUNNING                 122      500Gn

I have been saving the tail of the Ratatosk output log every 10 minutes and also monitoring the file sizes in the output directory. I noticed that the processed reads count increases in a jump-like manner while the file size increases linearly; you can see this in the plot below.

[Plot: processed reads count and output file size over time]

I have no ssh access to the compute nodes. The HPC administrator said the program did not seem to be using any resources.
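
A simple loop like the following is enough for this kind of monitoring (a sketch; the log name and the 10-minute interval are arbitrary):

while true; do
    date '+%F %T' >> monitor.log                  # timestamp for each sample
    tail -n 3 slurm-{JobId}.out >> monitor.log    # last lines of the Ratatosk log
    ls -l out_long_reads.* >> monitor.log         # sizes of the files in the output directory
    sleep 600                                     # sample every 10 minutes
done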

GuillaumeHolley commented 1 year ago

Hi @asan-emirsaleh,

At the stage of the indexing where you seem to be stuck, what Ratatosk does is a basic mapping of the short reads to the graph. Keeping all the mapping information in RAM can be too heavy, so in order not to use too much RAM, every 4 GB of input short reads processed, Ratatosk writes the mapping information to disk, clears it from RAM and processes the next 4 GB. Processing short reads is parallel but writing to disk is serial.

This most likely explains why your plot of the processed reads has this cascade-shaped curve (in orange): the parts of the curve that go up very suddenly are the parallel parts, while the flat parts of the curve are the serial writing to disk. And as you can see, there are a lot of flat parts on the orange curve. Based on your plot, I estimate that writing to disk is about 100 times slower than processing the reads, which is very abnormal.

Now that we know this, combined with the fact that your HPC admin said Ratatosk seemed to not be using any resources, I think there is an IO wait issue here. Basically, Ratatosk spends most of its time waiting to write to disk. There are a few things that could explain this IO issue.
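
One quick way to confirm the IO wait, if you or your admin can run anything on the node (a sketch, assuming the usual procps/sysstat tools are installed):

$ vmstat 5 3                          # a high "wa" column means the CPUs are mostly waiting on IO
$ ps -o stat,pcpu,comm -C Ratatosk    # a "D" state means uninterruptible sleep, usually disk IO
$ iostat -x 5 3                       # per-device utilization and await times (sysstat package)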

Guillaume

GuillaumeHolley commented 1 year ago

Hi @asan-emirsaleh,

Did you have a chance to look into the possible IO wait issue?

Thanks, Guillaume

asan-emirsaleh commented 1 year ago

Hi @GuillaumeHolley, yes, you are right. There was an IO issue in the compute cluster. After the admin repaired the storage subsystem, Ratatosk ran through without problems.

$ sacct -j {index 1} --format=CPUTimeRAW,Elapsed,NCPUS,State,AveVMSize,ReqCPUs,ReqMem,TresUsageInTot
CPUTimeRAW    Elapsed      NCPUS      State  AveVMSize  ReqCPUS     ReqMem TRESUsageInTot 
---------- ---------- ---------- ---------- ---------- -------- ---------- -------------- 
  11828022 1-02:55:51        122  COMPLETED                 122      500Gn                
  11828022 1-02:55:51        122  COMPLETED 346493000K      122      500Gn cpu=63-08:04:+ 
$ sacct -j {correction 1} --format=CPUTimeRAW,Elapsed,NCPUS,State,AveVMSize,ReqCPUs,ReqMem,TresUsageInTot
CPUTimeRAW    Elapsed      NCPUS      State  AveVMSize  ReqCPUS     ReqMem TRESUsageInTot 
---------- ---------- ---------- ---------- ---------- -------- ---------- -------------- 
    704428   01:36:14        122  COMPLETED                 122      440Gn                
    704428   01:36:14        122  COMPLETED 147831976K      122      440Gn cpu=5-17:13:3+ 
$ sacct -j {index 2} --format=CPUTimeRAW,Elapsed,NCPUS,State,AveVMSize,ReqCPUs,ReqMem,TresUsageInTot
CPUTimeRAW    Elapsed      NCPUS      State  AveVMSize  ReqCPUS     ReqMem TRESUsageInTot 
---------- ---------- ---------- ---------- ---------- -------- ---------- -------------- 
    521428   01:11:14        122  COMPLETED                 122      500Gn                
    521428   01:11:14        122  COMPLETED  58529680K      122      500Gn cpu=2-08:56:5+ 
$ sacct -j {correction 2} --format=CPUTimeRAW,Elapsed,NCPUS,State,AveVMSize,ReqCPUs,ReqMem,TresUsageInTot
CPUTimeRAW    Elapsed      NCPUS      State  AveVMSize  ReqCPUS     ReqMem TRESUsageInTot 
---------- ---------- ---------- ---------- ---------- -------- ---------- -------------- 
    195810   00:26:45        122  COMPLETED                 122      440Gn                
    195810   00:26:45        122  COMPLETED  47679736K      122      440Gn cpu=1-15:46:1+ 
$ ls -la
total 81388459
drwxr-xr-x 2 asan group           6 Oct 20 01:51 .
drwxr-xr-x 6 asan group          43 Oct 22 09:34 ..
-rw-r--r-- 1 asan group  5911261530 Oct 19 19:39 out_long_reads.2.fastq
-rw-r--r-- 1 asan group  5906800800 Oct 20 01:56 out_long_reads.fastq
-rw-r--r-- 1 asan group  5248526797 Oct 19 03:02 out_long_reads.index.k31.fasta
-rw-r--r-- 1 asan group 50481293054 Oct 19 03:13 out_long_reads.index.k31.rtsk
-rw-r--r-- 1 asan group  8166012081 Oct 20 00:10 out_long_reads.index.k63.fasta
-rw-r--r-- 1 asan group  7627884759 Oct 20 00:11 out_long_reads.index.k63.rtsk

I had a lot of RAM available but ran into a disk IO issue. If there were an option to keep the data in RAM until completion, IO would not have been delayed so much. Making the IO behavior customizable might be a good idea for a future improvement of the program. Thank you for the quick response and for the help with my issue!
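
A workaround on my side next time might be to write the index to node-local storage and copy it back afterwards, roughly like this (a sketch; whether /dev/shm or a local scratch path is usable, and how much space it has, depends on the cluster; the destination is a placeholder):

OUTDIR=/dev/shm/ratatosk_$SLURM_JOB_ID        # RAM-backed tmpfs; a node-local scratch disk would also work
mkdir -p $OUTDIR
Ratatosk index -1 -v -c $THR -s $SR_LIST -l $LR -i 247 -o $OUTDIR/out_long_reads
cp $OUTDIR/out_long_reads.* /path/to/shared/storage/    # copy results back to the shared filesystem
rm -rf $OUTDIR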

Best regards, Asan