NBISweden / IgDiscover-legacy

Analyze antibody repertoires and discover new V genes from high-throughput sequencing reads
https://www.igdiscover.se
MIT License
17 stars 10 forks source link

How do I tell if IgDiscover has hung? #99

Closed scharch closed 5 years ago

scharch commented 5 years ago

How long should IgDiscover take to run on ~1M sequences? The display hasn't changed in the 24 hours now. Most recent output:

[Thu Jun  6 10:32:39 2019]
rule 20:
    input: reads/4-trimmed.fastq.gz, stats/trimming-successful
    output: reads/sequences.fasta.gz, stats/groups.json
    jobid: 45

igdiscover dereplicate --trim-g --json=stats/groups.json --minimum-length=300 reads/4-trimmed.fastq.gz | gzip > reads/sequences.fasta.gz
INFO: 1729883 sequences processed
INFO: 1078607 sequences long enough
INFO: 1048469 dereplicated sequences written
INFO: CPU time 00:00:37.4. Maximum memory usage 1.807 GB
[Thu Jun  6 10:33:46 2019]
Finished job 45.
20 of 63 steps (32%) done

[Thu Jun  6 10:33:46 2019]
rule igdiscover_igblast:
    input: reads/sequences.fasta.gz, iteration-01/database/V.fasta, iteration-01/database/D.fasta, iteration-01/database/J.fasta
    output: iteration-01/assigned.tab.gz, iteration-01/stats/assigned.json
    log: iteration-01/igblast.log
    jobid: 52
    wildcards: dir=iteration-01
    threads: 16

time igdiscover igblast --sequence-type=Ig --threads=16 --stats=iteration-01/stats/assigned.json iteration-01/database reads/sequences.fasta.gz | gzip 2>&1 > iteration-01/assigned.tab.gz | tee iteration-01/igblast.log >&2
WARNING: Length 104 of FR3 region in 'IGLV3-ACB*03' is not divisible by three; region info for this gene will not be available
WARNING: Length 104 of FR3 region in 'IGLV3-AAJ*01' is not divisible by three; region info for this gene will not be available
marcelm commented 5 years ago

This is not normal. From the output above, it seems IgDiscover is doing the IgBLAST step. Although it is the slowest, it should be finished within a couple of hours for one million sequences, especially if it is using 16 CPU cores.

Also, a progress report should be printed every minute, something like this:

...
INFO: Processed    500,000 sequences at 32.3 ms/sequence

Do you run any other CPU-intense jobs on the machine? You can check your load average by running uptime. If IgDiscover is still running, the last three numbers should be at your number of CPU cores (16) or higher. If the numbers are below 1, it’s probably doing nothing and you should cancel it. Just try igdiscover run again and see if it works this time. I don’t know how, yet, but I can try to help more after the weekend.

scharch commented 5 years ago

I am running on a cluster where I have only reserved one core, though IgDiscover has autodetected that the node has 16 total cores. Could that be causing the issue? Is there a way to tell IgDiscover to restrict itself to N cores even if the machine has more?

marcelm commented 5 years ago

Oh ok, that explains it. IgDiscover tries to be smart when it detects how many cores to use, so it should actually take into account how many cores are available vs how many cores the node actually has. It seems that this logic has failed here. You can tell it to use only one core with igdiscover run -j 1.

scharch commented 5 years ago

That indeed fixed it, thanks!

marcelm commented 5 years ago

If you have time, could you run these two commands within a batch job for which you reserved only one core:

grep pus_allowed /proc/self/status

and

python3 -c 'import os; print(len(os.sched_getaffinity(0)))'

and send me the output?

scharch commented 5 years ago
schrammca@ai-hpcn021:~$ grep pus_allowed /proc/self/status
Cpus_allowed:   000000,00000000,00000000,0000ffff
Cpus_allowed_list:  0-15
schrammca@ai-hpcn021:~$ python3 -c 'import os; print(len(os.sched_getaffinity(0)))'
16

Also, it turns out that IgDiscover is still hanging after a few tens or hundreds of thousands of sequences. I tried reserving 8 cores and then using -j 6, which may possibly have resulted in the program getting a little farther before hanging, but it still said it had used 6 hrs of CPU time when I killed it 3.5 days later. It also seems to get through more sequences in a light chain sample than heavy chain, but again, I haven't tested that rigorously.

marcelm commented 5 years ago

Ok, it seems that your cluster is not using the cpuset mechanism to restrict the number of CPU cores to use. You’ll have to continue to use -j.

Regarding the hangs: Did you now get a progress report (`Processed ... sequences at ... ms/sequence')? Or how did you assess that it got through more sequences?

I need to know at which stage the hang occurs. Is it consistently the igdiscover igblast step? If so, can you SSH into the node and tell how many igblastn subprocess are running when IgDiscover appears to hang? Do they just sit there idle at 0% CPU (they shouldn’t)? igdiscover igblast spawns as many igblastn subprocesses as specified with -j. I could imagine that for some reason the igblastn subprocesses are the ones that hang.

It’s hard for me to debug this remotely without access to the cluster, but I’m guessing it’s either some pecularity of your cluster system (possibly exposing an IgDiscover bug) or something in your data. I don’t know how much time you’re willing to spend on this since you obviously have your own pipeline doing similar things, but to find out which of the two is the problem, you could try to 1) run IgDiscover on a separate machine (a laptop with enough RAM should be fine) and/or 2) run IgDiscover on ERR1760498, which I’m using for testing all the time and which I’m very sure should work.

I can also prepare a new release of IgDiscover to see whether that fixes it, but this is going to take a bit.

scharch commented 5 years ago

Yes, getting the progress messages now. Yesterday it hung on ERR1760498, too. Today, it seems to be working on my sample after all --at least, I've made it to the second iteration... I've gone down now to -j 3 (on a session with 4 cores reserved). I wonder if it's possible that threads are getting into conflict somehow if there are too many of them, regardless of the number of cores I've reserved. In any event, for now I think I've got what I need.

marcelm commented 5 years ago

Ok, thanks. A couple of days ago, I got an out of memory error myself that made the igblast step hang, but I hadn’t seen that before nor could I reproduce it afterwards. Anyway, I’ll keep this in mind fo the future.