Closed by scharch 5 years ago
This is not normal. From the output above, it seems IgDiscover is doing the IgBLAST step. Although it is the slowest, it should be finished within a couple of hours for one million sequences, especially if it is using 16 CPU cores.
Also, a progress report should be printed every minute, something like this:
...
INFO: Processed 500,000 sequences at 32.3 ms/sequence
Do you run any other CPU-intense jobs on the machine? You can check your load average by running `uptime`. If IgDiscover is still running, the last three numbers should be at your number of CPU cores (16) or higher. If the numbers are below 1, it’s probably doing nothing and you should cancel it. Just try `igdiscover run` again and see if it works this time. I don’t know how, yet, but I can try to help more after the weekend.
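For anyone who prefers to script the load check rather than read `uptime` by eye, here is a minimal Python sketch. The function name `load_status` and its three labels are made up for illustration and are not part of IgDiscover. The thresholds simply mirror the rule of thumb above: a load below 1 suggests nothing is running, and a load at or above the number of worker processes suggests they are all busy.

```python
import os


def load_status(load1, expected_workers):
    """Classify a 1-minute load average against the expected worker count.

    Hypothetical helper: 'idle' below 1, 'busy' at or above the number of
    workers, 'partial' in between.
    """
    if load1 < 1:
        return "idle"
    if load1 >= expected_workers:
        return "busy"
    return "partial"


def report(expected_workers=None):
    """Print the current load averages and a rough classification."""
    # os.getloadavg() returns the same three numbers that uptime prints
    # (1, 5 and 15 minute averages); it is only available on Unix.
    load1, load5, load15 = os.getloadavg()
    if expected_workers is None:
        expected_workers = os.cpu_count() or 1
    status = load_status(load1, expected_workers)
    print(f"load averages: {load1:.2f} {load5:.2f} {load15:.2f} -> {status}")


if __name__ == "__main__":
    report()
```

Note that the load average counts every process on the node, so on a shared cluster node a high value does not prove that IgDiscover itself is the one doing the work.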
I am running on a cluster where I have only reserved one core, though IgDiscover has autodetected that the node has 16 total cores. Could that be causing the issue? Is there a way to tell IgDiscover to restrict itself to N cores even if the machine has more?
Oh ok, that explains it. IgDiscover tries to be smart when it detects how many cores to use, so it should actually take into account how many cores are available vs how many cores the node actually has. It seems that this logic has failed here. You can tell it to use only one core with `igdiscover run -j 1`.
That indeed fixed it, thanks!
If you have time, could you run these two commands within a batch job for which you reserved only one core:
grep pus_allowed /proc/self/status
and
python3 -c 'import os; print(len(os.sched_getaffinity(0)))'
and send me the output?
schrammca@ai-hpcn021:~$ grep pus_allowed /proc/self/status
Cpus_allowed: 000000,00000000,00000000,0000ffff
Cpus_allowed_list: 0-15
schrammca@ai-hpcn021:~$ python3 -c 'import os; print(len(os.sched_getaffinity(0)))'
16
Also, it turns out that IgDiscover is still hanging after a few tens or hundreds of thousands of sequences. I tried reserving 8 cores and then using `-j 6`, which may possibly have resulted in the program getting a little farther before hanging, but it still said it had used 6 hrs of CPU time when I killed it 3.5 days later. It also seems to get through more sequences in a light chain sample than heavy chain, but again, I haven't tested that rigorously.
Ok, it seems that your cluster is not using the cpuset mechanism to restrict the number of CPU cores to use. You’ll have to continue to use `-j`.
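To illustrate why an explicit `-j` is needed on such a cluster, here is a rough Python sketch of affinity-based core detection. It is not IgDiscover's actual code; `usable_cores` is a hypothetical helper. `os.sched_getaffinity(0)` only reflects a reservation when the scheduler enforces it through CPU affinity or cpusets. In the output above it reported all 16 cores even though only one was reserved, so the only reliable limit is one the user passes in.

```python
import os


def usable_cores(requested=None):
    """Return how many worker processes to start.

    Hypothetical helper: uses the CPU affinity mask when the platform
    provides it, and otherwise falls back to the total core count.
    An explicit request (like a -j value) is capped at that number.
    """
    try:
        # Linux only: the set of CPUs this process is allowed to run on.
        available = len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity (e.g. macOS).
        available = os.cpu_count() or 1
    if requested is None:
        return available
    # Never return fewer than one worker or more than the allowed CPUs.
    return max(1, min(requested, available))


if __name__ == "__main__":
    print("auto-detected:", usable_cores())
    print("with an explicit limit of 1:", usable_cores(1))
```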
Regarding the hangs: Did you now get a progress report (`Processed ... sequences at ... ms/sequence`)? Or how did you assess that it got through more sequences?
I need to know at which stage the hang occurs. Is it consistently the `igdiscover igblast` step? If so, can you SSH into the node and tell how many `igblastn` subprocesses are running when IgDiscover appears to hang? Do they just sit there idle at 0% CPU (they shouldn’t)? `igdiscover igblast` spawns as many `igblastn` subprocesses as specified with `-j`. I could imagine that for some reason the `igblastn` subprocesses are the ones that hang.
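One way to answer these questions on the node is to take a snapshot with `ps`. The following Python sketch uses hypothetical helper names (`find_processes`, `snapshot`) and assumes a `ps` that supports `-eo pid,pcpu,comm`, which is standard on Linux. It lists the `igblastn` processes with their CPU usage. Be aware that the `pcpu` value from `ps` is averaged over the lifetime of the process, so for a process that worked for a while and then stalled, `top` gives a better view of what it is doing right now.

```python
import subprocess


def find_processes(ps_output, name="igblastn"):
    """Extract (pid, pcpu) pairs for processes called `name`.

    Expects the text printed by `ps -eo pid,pcpu,comm`, header included.
    """
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        pid, pcpu, comm = parts
        if comm.strip() != name:
            continue
        try:
            rows.append((int(pid), float(pcpu)))
        except ValueError:
            # Skip lines whose numbers cannot be parsed.
            continue
    return rows


def snapshot(name="igblastn"):
    """Run ps once and return the matching processes."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True, text=True, check=True,
    )
    return find_processes(result.stdout, name)


if __name__ == "__main__":
    procs = snapshot()
    print(f"{len(procs)} igblastn process(es) running")
    for pid, pcpu in procs:
        print(f"  pid {pid}: {pcpu:.1f}% CPU (lifetime average)")
```

If the count matches the `-j` value but the processes accumulate no CPU time between two snapshots taken a few minutes apart, that would point to the `igblastn` subprocesses being the ones that hang.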
It’s hard for me to debug this remotely without access to the cluster, but I’m guessing it’s either some peculiarity of your cluster system (possibly exposing an IgDiscover bug) or something in your data. I don’t know how much time you’re willing to spend on this since you obviously have your own pipeline doing similar things, but to find out which of the two is the problem, you could try to 1) run IgDiscover on a separate machine (a laptop with enough RAM should be fine) and/or 2) run IgDiscover on ERR1760498, which I’m using for testing all the time and which I’m very sure should work.
I can also prepare a new release of IgDiscover to see whether that fixes it, but this is going to take a bit.
Yes, getting the progress messages now. Yesterday it hung on ERR1760498, too. Today, it seems to be working on my sample after all; at least, I've made it to the second iteration... I've gone down now to `-j 3` (on a session with 4 cores reserved). I wonder if it's possible that threads are getting into conflict somehow if there are too many of them, regardless of the number of cores I've reserved. In any event, for now I think I've got what I need.
Ok, thanks. A couple of days ago, I got an out-of-memory error myself that made the igblast step hang, but I hadn’t seen that before, nor could I reproduce it afterwards. Anyway, I’ll keep this in mind for the future.
How long should IgDiscover take to run on ~1M sequences? The display hasn't changed in 24 hours now. Most recent output: