Closed · spock closed this issue 6 years ago
RAM use seems to have started decreasing very slowly - looks like a false alarm, sorry :)
Is `draft_assembly` using a lot of system calls, e.g. for memory allocation and freeing? In my case it still consumes very little `USER` CPU time, but keeps consuming `SYS` CPU time...
Code execution is currently in this block, for about 4 hours already:
```cpp
for (int i = 0; i < n_read; i++) {
    reads[i]->active = maximal_read[i];
}
std::vector<int> range;
for (int i = 0; i < n_read; i++) {
    if (reads[i]->active) range.push_back(i);
}
std::sort(range.begin(), range.end());
std::vector<LOverlap *> aln; //Vector of pointers to all alignments
std::vector<LAlignment *> full_aln; //Vector of pointers to all alignments
if (strlen(name_las) > 0) {
    la.resetAlignment();
    la.getOverlap(aln, range);
    la.resetAlignment();
    la.getAlignment(full_aln, range);
}
```
Another question: if this behavior is not normal - then is it safe to kill and start the process again on a bigger machine, without damaging any of the assembly-related data?
Are you tracing this program? I mean, are you running it under `gdb` or `valgrind`? In that case it will do a ton of system calls. Otherwise `USER` CPU time will be the majority.
`draft` does not use multiple `.las` files, but it is quite memory efficient.
It is safe to kill and restart at any point.
@fxia22 , thanks again for a quick reply to all of my questions :)
More information on what I'm observing:
No `gdb`/`valgrind`/`strace` for this run:

```
$ ps -o comm,pid,ppid,user,time,etime,start,pcpu,state --sort=comm aH | grep '^COMMAND\|R$'
COMMAND         PID    PPID   USER   TIME      ELAPSED   STARTED   %CPU  S
draft_assembly  27296  27294  spock  02:15:28  08:32:43  12:42:27  26.4  R
```
In the output above, the `pcpu` column (%CPU) is the fraction of time_using_CPU / total_running_time. When I watch `top` for 10-15 seconds, I see that %CPU is usually zero, the state is `R`, and `TIME+` does not change:

```
 2551 spock  20  0  0.198t  0.195t  1812  R  0.000  39.53  58:16.59  draft_assembly
```
Occasionally, %CPU increases to 100% for a brief moment, and then it just sits there again :)

The number of reads given to `draft_assembly` is around 700k.

One possibly important detail: I have built Hinge on a machine with a much newer kernel and libs (Ubuntu xenial), compared to the quite old kernel and libs where I'm running Hinge.
I agree that it is quite memory efficient: on an almost-900GB .las file, peak memory use was 195GB so far.
The process is still in that same stage, I guess I need to try compiling it on the target machine to see if that fixes the problem. I'll provide more feedback when I have more information...
```
COMMAND         PID    PPID   USER   TIME      ELAPSED   STARTED   %CPU  S
draft_assembly  27296  27294  spock  05:12:56  21:42:03  12:42:27  24.0  R
```
Update 2017-07-10: the process is still running, but after ~24 hours of runtime it did start using more `user` time - so I have not recompiled it, just let it run for now (on 2 machines). Will provide an update/feedback when it finishes.
Update 2017-07-11: the original process is still running. I started another `draft_assembly` on a slightly bigger las file (~1.5TB), and after it finished reading all the data there was a similar pattern of 0% user time and full system time. So I have now recompiled `draft_assembly` (but none of the dependencies) on the target machine - let's see what happens.
Update 2, 2017-07-11: recompiling on the target machine does not seem to have any noticeable effect.
Hi @spock,
Did the process terminate? or do we have a bug there?
The process is still not finished:
```
$ ps -o comm,pid,ppid,user,time,etime,start,pcpu,state --sort=comm aH | grep '^COMMAND\|R$'
COMMAND         PID    PPID   USER   TIME        ELAPSED    STARTED  %CPU  S
draft_assembly  27296  27294  spock  5-15:24:12  9-00:53:36 Jul 04   62.4  R
```
Now it is consuming nearly 100% of `user` CPU time, and memory use during these 9 days of runtime has decreased from 195GB to 172GB - so I think it might be working just fine, but extremely slowly.
The output lines haven't changed, so I believe it is still in the same segment of code as shown above in https://github.com/HingeAssembler/HINGE/issues/121#issuecomment-312905062 :
```
[2017-07-04 12:42:27.365] [log] [info] # Alignments: 7472368456
[2017-07-04 12:42:27.365] [log] [info] # Reads: 701638
[2017-07-04 12:42:52.891] [log] [info] Total number of active reads: 11498/701638
```
The first ~24 hours of runtime (when `user` CPU time was nearly zero, and `sys` CPU time was very high) look like there was a lot of memory re-allocation going on; however, this is just a guess, as I haven't traced the program, and also haven't looked into the functions called from this block:
```cpp
if (strlen(name_las) > 0) {
    la.resetAlignment();
    la.getOverlap(aln, range);
    la.resetAlignment();
    la.getAlignment(full_aln, range);
}
```
I wonder if replacing `push_back` inside a loop with a pre-allocated array would shorten the high-`sys`-time phase of the program? Something like this:
```cpp
int active_reads = 0; // counter for the allocation below
for (int i = 0; i < n_read; i++) {
    reads[i]->active = maximal_read[i];
    if (reads[i]->active) active_reads++;
}
int *range = (int *) malloc(active_reads * sizeof(int));
int active_index = 0;
for (int i = 0; i < n_read; i++) {
    if (reads[i]->active) range[active_index++] = i;
}
// ....
free(range);
```
If you want to add a little bit more logging, or some other means of identifying the bottleneck, or a slightly different code - I'll be able to run updated code on my dataset (but I'm limited to maybe 1 or 2 more processes, can't consume more resources).
The problem could be with my PacBio data, because when I run Hinge only on the nanopore part of the dataset, I do not see any problems. I do not yet know what exactly is "wrong" with the PacBio dataset.
@fxia22 found a bug here wrt the file i/o, which is slowing things down. He will fix it by the middle of next week.
Thanks! I had to stop my hybrid assembly trials for now, but I can try again later :)
Hi @spock, we have resolved the issue of draft assembly, you are welcome to have a try :D
Thank you, @fxia22 !
I had to finish this project in a different way and move on to another project, but I'll give Hinge another try when I get more PB/ON data.
Following the E.coli demo, up until the `hinge draft` command I could use multiple `.las` files (the `--mlas` option); even for the DASqv, one can still use multiple `.las` files - there is no need to merge. I have now started `hinge draft` and found that it does not support the `--mlas` option; RAM use is now beyond 195 Gbytes, and keeps growing - I'm afraid it will run out of memory on a huge input `.las` file, even despite the message `[2017-07-04 12:42:52.891] [log] [info] Total number of active reads: 11498/701638`.

Is there anything that I can do to make it fit into 256..512 GB?
Update 1: maybe I panicked for no reason, it stopped growing at 80% of RAM (195 GBytes), just when I posted... Oh, actually I see a persistent 5% `sys` load on this 20-core machine, while the process is in `R` state but does not consume any CPU - strange.