amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0
287 stars 66 forks source link

Snap usage causes kernel 'soft lockup' bugs on Linux; odd performance profile #81

Closed taltman closed 3 years ago

taltman commented 7 years ago

While completing an index of a 7.6GB metagenomic reference sequence FASTA file, I noticed two worrisome issues with Snap, possibly related:

  1. Running the Snap indexing job lead to a flurry of kernel warning messages echoing to the terminal (example below)
  2. Snap took almost a day to build the index, with much of that time seemingly idle

Full story:

Snap was running on a beefy machine with 80 cores and one Terabyte of RAM (yes, that's right), and the file I/O was all happening in a large RAM-disk, so there was no real contention possible, nor observed:

uname -a:
Linux host 2.6.32-642.6.2.el6.x86_64 #1 SMP Wed Oct 26 06:52:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

---

taltman$ free
             total       used       free     shared    buffers     cached
Mem:    1058724080 1021966988   36757092  141670332     604476  926110008
-/+ buffers/cache:   95252504  963471576
Swap:      4194300    4194300          0

---

taltman$ nproc
80

Snap was built from a recent Github clone:

taltman$ which snap-aligner
/home/taltman/repos/snap/snap-aligner
taltman$ snap-aligner 
Welcome to SNAP version 1.0beta.23.

Snap was executed as follows, with all of the disk I/O taking place to and from a RAM-disk:

time nice -n 10 snap-aligner index /dev/shm/taltman/large.fasta \
                                   /dev/shm/taltman/snap_db \
                                   -t60
                   -B\| \
                   -large \
                   -locationSize 5

While executing, the following error message was echoing to the terminal:

 Message from syslogd@host at Jan 29 06:43:09 ...
 kernel:BUG: soft lockup - CPU#2 stuck for 63s! [snap-aligner:30602]

For much of the time that Snap was running, I noticed that it alternated between running reasonably on a single core, to being idle, and not utilizing any CPU. It seemed to only honor the "-t60" argument for a single, relatively brief phase of the program's execution. There were other programs running, but none of them were using all of the machine's CPUs; there were plenty left idle while Snap was executing, and Snap declined to make use of them for most of the execution run. During the execution phase that triggered the kernel messages, there was 0% wait, so no observed I/O contention.

At the end of the indexing run, the program seemed to exit cleanly, and the output seemed healthy:

ls -lh
total 128G                                                 
-rw-r--r-- 1 taltman taltman 7.6G Jan 29 01:13 Genome
-rw-r--r-- 1 taltman taltman   43 Jan 29 20:26 GenomeIndex
-rw-r--r-- 1 taltman taltman  90G Jan 29 20:25 GenomeIndexHash
-rw-r--r-- 1 taltman taltman  31G Jan 29 20:26 OverflowTable                                                                                                                                                                                                      

his was the 'time' output:

real    1221m9.606s
user    698m37.541s
sys     746m12.868s

So, Snap's execution spanned 20 hours (with significant periods of idleness) to process a 7.6 GB FASTA file on a beefy machine with medium utilization, and no evidence of resource contention. Is that normal? Is there some connection between the slow execution time and the kernel warnings?

bolosky commented 7 years ago

That’s clearly not working properly. It should probably take less than an hour for that reference on that machine.

Essentially all of the development of SNAP is done on Windows, since the developers are at Microsoft (the Berkeley folks quit years ago). So, I haven’t got quite as much experience with the Linux version.

That message seems to indicate that the kernel is stuck spinning in a lock for an unreasonable amount of time. I have a guess as to what’s going on. I’m assuming that this happened in the hash table build phase (after it says “building hash tables.”) While that phase is parallel, in practice it’s only as parallel as the number of hash tables, since there’s a lock per-table that allows them to be written correctly. For the parameters that you provided, there should be 256 hash tables, which you’d think wouldn’t case all that much lock contention for 60 threads, but it may be that some of them are used much more than others because some sequences occur more in the reference than others. So, maybe it’s getting a decent amount of lock contention and Linux doesn’t deal with it well, especially on a NUMA machine like yours doubtless is.

You could try turning down the number of threads a lot, say to 10 or something and seeing if that helps. It will certainly reduce the lock contention.

--Bill

From: Tomer Altman [mailto:notifications@github.com] Sent: Sunday, January 29, 2017 10:40 PM To: amplab/snap snap@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [amplab/snap] Snap usage causes kernel 'soft lockup' bugs on Linux; odd performance profile (#81)

While completing an index of a 7.6GB metagenomic reference sequence FASTA file, I noticed two worrisome issues with Snap, possibly related:

  1. Running the Snap indexing job lead to a flurry of kernel warning messages echoing to the terminal (example below)
  2. Snap took almost a day to build the index, with much of that time seemingly idle

Full story:

Snap was running on a beefy machine with 80 cores and one Terabyte of RAM (yes, that's right), and the file I/O was all happening in a large RAM-disk, so there was no real contention possible, nor observed:

uname -a:

Linux host 2.6.32-642.6.2.el6.x86_64 #1 SMP Wed Oct 26 06:52:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux


taltman$ free

         total       used       free     shared    buffers     cached

Mem: 1058724080 1021966988 36757092 141670332 604476 926110008

-/+ buffers/cache: 95252504 963471576

Swap: 4194300 4194300 0


taltman$ nproc

80

Snap was built from a recent Github clone:

taltman$ which snap-aligner

/home/taltman/repos/snap/snap-aligner

taltman$ snap-aligner

Welcome to SNAP version 1.0beta.23.

Snap was executed as follows, with all of the disk I/O taking place to and from a RAM-disk:

time nice -n 10 snap-aligner index /dev/shm/taltman/large.fasta \

                               /dev/shm/taltman/snap_db \

                               -t60

                              -B\| \

                              -large \

                              -locationSize 5

While executing, the following error message was echoing to the terminal:

Message from syslogd@host at Jan 29 06:43:09 ...

kernel:BUG: soft lockup - CPU#2 stuck for 63s! [snap-aligner:30602]

For much of the time that Snap was running, I noticed that it alternated between running reasonably on a single core, to being idle, and not utilizing any CPU. It seemed to only honor the "-t60" argument for a single, relatively brief phase of the program's execution. There were other programs running, but none of them were using all of the machine's CPUs; there were plenty left idle while Snap was executing, and Snap declined to make use of them for most of the execution run. During the execution phase that triggered the kernel messages, there was 0% wait, so no observed I/O contention.

At the end of the indexing run, the program seemed to exit cleanly, and the output seemed healthy:

ls -lh

total 128G

-rw-r--r-- 1 taltman taltman 7.6G Jan 29 01:13 Genome

-rw-r--r-- 1 taltman taltman 43 Jan 29 20:26 GenomeIndex

-rw-r--r-- 1 taltman taltman 90G Jan 29 20:25 GenomeIndexHash

-rw-r--r-- 1 taltman taltman 31G Jan 29 20:26 OverflowTable

his was the 'time' output:

real 1221m9.606s

user 698m37.541s

sys 746m12.868s

So, Snap's execution spanned 20 hours (with significant periods of idleness) to process a 7.6 GB FASTA file on a beefy machine with medium utilization, and no evidence of resource contention. Is that normal? Is there some connection between the slow execution time and the kernel warnings?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F81&data=02%7C01%7Cbolosky%40microsoft.com%7C3655d10bf5154cff57a208d448dada60%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636213552178664987&sdata=yLmzRrByJuZr%2BCYO%2FkyzVKa8L4y%2B0e6JYyyqAG7uFqg%3D&reserved=0, or mute the threadhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAA752du2XXWQuYtIGbhyjGG6g5PeLibRks5rXYXOgaJpZM4LxFlm&data=02%7C01%7Cbolosky%40microsoft.com%7C3655d10bf5154cff57a208d448dada60%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636213552178674995&sdata=yVT1i4bUEEak93reAeTlmAtiNc6gN6qlMwmsjqzIvnI%3D&reserved=0.