Lookup process use 100% CPU in Debian

ghost commented 7 years ago

Good morning.

We use OpenSMTPD portable for relaying the e-mails for our LiveChat service. We have a configuration with two servers running OpenSMTPD (6.0.2p1) in Debian 8 (jessie). For some motivations (and it seems to be related to some bad DNS answers from AOL), the lookup process start to eat 100% CPU. The machine remain responsive with a load average around 1, there is no service outage, but it's maybe a problem somewhere else.

Here what's happening from an strace of the process:

epoll_ctl(11, EPOLL_CTL_ADD, 17, {EPOLLIN, {u32=17, u64=17}}) = 0 epoll_wait(11, {{EPOLLIN, {u32=17, u64=17}}}, 32, 5000) = 1 clock_gettime(CLOCK_MONOTONIC, {1124087, 364557639}) = 0 gettimeofday({1481540024, 38296}, NULL) = 0 epoll_ctl(11, EPOLL_CTL_DEL, 17, 7ffc22a147e0) = 0 read(17, "", 2) = 0 epoll_ctl(11, EPOLL_CTL_ADD, 17, {EPOLLIN, {u32=17, u64=17}}) = 0 epoll_wait(11, {{EPOLLIN, {u32=17, u64=17}}}, 32, 5000) = 1 clock_gettime(CLOCK_MONOTONIC, {1124087, 364636631}) = 0 gettimeofday({1481540024, 38375}, NULL) = 0 epoll_ctl(11, EPOLL_CTL_DEL, 17, 7ffc22a147e0) = 0 read(17, "", 2) = 0 epoll_ctl(11, EPOLL_CTL_ADD, 17, {EPOLLIN, {u32=17, u64=17}}) = 0 epoll_wait(11, {{EPOLLIN, {u32=17, u64=17}}}, 32, 5000) = 1 clock_gettime(CLOCK_MONOTONIC, {1124087, 364717188}) = 0 gettimeofday({1481540024, 38456}, NULL) = 0 epoll_ctl(11, EPOLL_CTL_DEL, 17, 7ffc22a147e0) = 0 read(17, "", 2) = 0 epoll_ctl(11, EPOLL_CTL_ADD, 17, {EPOLLIN, {u32=17, u64=17}}) = 0 epoll_wait(11, {{EPOLLIN, {u32=17, u64=17}}}, 32, 5000) = 1 clock_gettime(CLOCK_MONOTONIC, {1124087, 364795786}) = 0 gettimeofday({1481540024, 38534}, NULL) = 0 epoll_ctl(11, EPOLL_CTL_DEL, 17, 7ffc22a147e0) = 0 read(17, "", 2) = 0 epoll_ctl(11, EPOLL_CTL_ADD, 17, {EPOLLIN, {u32=17, u64=17}}) = 0 epoll_wait(11, {{EPOLLIN, {u32=17, u64=17}}}, 32, 5000) = 1 clock_gettime(CLOCK_MONOTONIC, {1124087, 364874481}) = 0 gettimeofday({1481540024, 38613}, NULL) = 0 epoll_ctl(11, EPOLL_CTL_DEL, 17, 7ffc22a147e0) = 0 read(17, "", 2) = 0 epoll_ctl(11, EPOLL_CTL_ADD, 17, {EPOLLIN, {u32=17, u64=17}}) = 0 epoll_wait(11, {{EPOLLIN, {u32=17, u64=17}}}, 32, 5000) = 1 clock_gettime(CLOCK_MONOTONIC, {1124087, 364953765}) = 0 gettimeofday({1481540024, 38692}, NULL) = 0 epoll_ctl(11, EPOLL_CTL_DEL, 17, 7ffc22a147e0) = 0 read(17, "", 2)

And this happen for really too many lines. The only noticeable thing from the logs is something like this:

Dec 12 10:57:32 mx2 smtpd[4452]: 22a7651b2ef713d0 mta event=connecting address=smtp+tls://152.163.0.99:25 host=mtain-a-atc-b.mx.aol.com Dec 12 10:57:32 mx2 smtpd[4452]: 22a7651b2ef713d0 mta event=connected Dec 12 10:57:32 mx2 smtpd[4452]: 22a7651b2ef713d0 mta event=error reason=Bad response: non-printable character in reply Dec 12 10:57:32 mx2 smtpd[4452]: smtp-out: Disabling route [] <-> 152.163.0.99 (mtain-a-atc-b.mx.aol.com) for 15s Dec 12 10:57:33 mx2 smtpd[4452]: smtp-out: Enabling route [] <-> 152.163.0.67 (mailin-05.mail.aol.com) Dec 12 10:57:33 mx2 smtpd[4452]: smtp-out: Enabling route [] <-> 64.12.88.132 (mtain-b-mtc-a.mx.aol.com) Dec 12 10:57:34 mx2 smtpd[4452]: 22a7651c4b043cab mta event=connecting address=smtp+tls://152.163.0.67:25 host=mailin-05.mail.aol.com Dec 12 10:57:34 mx2 smtpd[4452]: 22a7651c4b043cab mta event=connected Dec 12 10:57:34 mx2 smtpd[4452]: 22a7651c4b043cab mta event=error reason=Bad response: non-printable character in reply Dec 12 10:57:34 mx2 smtpd[4452]: smtp-out: Disabling route [] <-> 152.163.0.67 (mailin-05.mail.aol.com) for 15s Dec 12 10:57:35 mx2 smtpd[4452]: smtp-out: Enabling route [] <-> 64.12.88.163 (mailin-05.mail.aol.com) Dec 12 10:57:36 mx2 smtpd[4452]: 22a7651d4dcd89a6 mta event=connecting address=smtp+tls://64.12.88.163:25 host=mailin-05.mail.aol.com Dec 12 10:57:36 mx2 smtpd[4452]: 22a7651d4dcd89a6 mta event=connected Dec 12 10:57:36 mx2 smtpd[4452]: 22a7651d4dcd89a6 mta event=error reason=Bad response: non-printable character in reply

And the lookup problem happen only when the AOL servers are started to being looked up by one of the machines.

Well, i already googled everything possible and the only thing is a potential problem with how the Linux network stack manage UDP sockets and was the cause of a similar problem in Libevent. Is this a known issue?

For more details, please contact me at andrea@tawk.to Regards

ghost commented 7 years ago

Side note: I noticed that there is a similarly closed issue in the bug tracker. I think that this is exactly the same

poolpOrg commented 7 years ago

strange.

Can you reproduce this easily or does it fall into 100% cpu every now and then ?

ghost commented 7 years ago

Probably i can reproduce it by trying to send an e-mail to an AOL address, but i'm not really sure.

BTW, this is the routes actually in OpenSMTPD:

[] <-> 64.12.91.195 (mailin-05.mail.aol.com) ND-- nconn=0 nerror=0 penalty=1 timeout=2s
[] <-> 64.12.88.163 (mailin-05.mail.aol.com) ND-- nconn=0 nerror=0 penalty=1 timeout=9s
[] <-> 152.163.0.67 (mailin-05.mail.aol.com) ND-- nconn=0 nerror=0 penalty=1 timeout=1s
[] <-> 64.12.88.164 (mailin-05.mail.aol.com) ND-- nconn=0 nerror=0 penalty=1 timeout=12s
[] <-> 196.43.129.11 (196.43.129.11) ---- nconn=2 nerror=0 penalty=0 timeout=-
[] <-> 152.163.0.100 (mtain-b-atc-b.mx.aol.com) ND-- nconn=0 nerror=0 penalty=1 timeout=6s
[] <-> 152.163.0.68 (mailin-05.mail.aol.com) ND-- nconn=0 nerror=0 penalty=1 timeout=2s
[] <-> 152.163.0.99 (mtain-a-atc-b.mx.aol.com) ND-- nconn=0 nerror=0 penalty=1 timeout=4s
[] <-> 64.12.88.132 (mtain-b-mtc-a.mx.aol.com) ND-- nconn=0 nerror=0 penalty=1 timeout=11s
[] <-> 64.12.88.131 (mtain-a-mtc-a.mx.aol.com) ND-- nconn=0 nerror=0 penalty=1 timeout=14s
[] <-> 96.45.82.196 (http-redirection-d.dnsmadeeasy.com) N--- nconn=1 nerror=0 penalty=0 timeout=-
[] <-> 96.45.83.38 (redirection.dnsmadeeasy.com) N--- nconn=1 nerror=0 penalty=0 timeout=-
[] <-> 64.4.6.100 N--- nconn=1 nerror=0 penalty=0 timeout=-
[] <-> 96.45.83.210 (redirection.dnsmadeeasy.com) ND-- nconn=0 nerror=0 penalty=1 timeout=5s
[] <-> 96.45.82.43 (redirection.dnsmadeeasy.com) N--- nconn=1 nerror=0 penalty=0 timeout=-
[] <-> 187.45.217.20 (mx.b.locaweb.com.br) ---- nconn=1 nerror=0 penalty=0 timeout=-
[] <-> 104.197.5.224 (224.5.197.104.bc.googleusercontent.com) ---- nconn=1 nerror=0 penalty=0 timeout=-
[] <-> 63.250.192.46 (mta-v6.mail.vip.gq1.yahoo.com) ---- nconn=1 nerror=0 penalty=0 timeout=-
[] <-> 108.168.206.41 (server05.tradeholding.com) ---- nconn=1 nerror=0 penalty=0 timeout=-
[] <-> 173.194.205.26 (qm-in-f26.1e100.net) ---- nconn=2 nerror=0 penalty=0 timeout=-
[] <-> 177.153.23.241 (mx.core.locaweb.com.br) ---- nconn=1 nerror=0 penalty=0 timeout=-
[] <-> 204.141.32.121 (mx.zoho.com) ---- nconn=1 nerror=0 penalty=0 timeout=-
[] <-> 168.144.155.48 (server.yofo.net) N--- nconn=1 nerror=0 penalty=0 timeout=-
[] <-> 74.125.22.26 (qh-in-f26.1e100.net) ---- nconn=1 nerror=0 penalty=0 timeout=-
[] <-> 74.63.218.50 (host.sh06.com) N--- nconn=1 nerror=0 penalty=0 timeout=-
[] <-> 54.232.92.235 (ec2-54-232-92-235.sa-east-1.compute.amazonaws.com) N--- nconn=1 nerror=0 penalty=0 timeout=-

Now this is in only one of the two MTA's. If the other MTA will start to lookup for AOL servers, it will reproduce the exact same behavior. The only thing i found is http://archives.seul.org/libevent/users/Jul-2013/msg00017.html that seems to trigger the same behavior.

ghost commented 7 years ago

In the first comment of this issue the strace you see is the one from the lookup process. Maybe i can provide it with more details? Suggestions?

poolpOrg commented 7 years ago

Yes sorry, I asked for the strace out of habit but I had it below my eyes, caffeine needed.

Eric and I are investigating and we'll have a diff (hopefully) or many (...) for you soon.

ghost commented 7 years ago

:+1: No problems, i'll wait for an answer

ghost commented 7 years ago

Addendum: Now also the other MTA started to show the same behavior. The cpu usage by the lookup process is stable between 90 and 100%

ghost commented 7 years ago

Hi.

Needed to change my account. I'm actually the one that reported this issue

poolpOrg commented 6 years ago

unable to reproduce and eric rewrote the resolver layer, closing this ticket, if anyone hits the issue again please reopen

ghost commented 6 years ago

Hi Gilles

To be fair, I never encountered this issue again, but I must say that I never had the possibility to test a newer opensmtpd-portable in the particular setup affected.

Sorry for not having the possibility to be more helpful on this.

poolpOrg commented 6 years ago

On Sat, Sep 01, 2018 at 07:11:42AM -0700, abiscuola wrote:

Hi Gilles

To be fair, I never encountered this issue again, but I must say that I never had the possibility to test a newer opensmtpd-portable in the particular setup affected.

Sorry for not having the possibility to be more helpful on this.

No worry, there is a fair possibility this got fixed in the many improvements we did in the last few months.

Worst case scenario, you experience this again and we'll fix :-)

-- Gilles Chehade

https://www.poolp.org @poolpOrg

ericfaurot commented 4 years ago

One year later... I just fixed a bug in libasr that may very well explain this issue. In fact I'm almost 100% sure it is the case.

OpenSMTPD / OpenSMTPD

Lookup process use 100% CPU in Debian #750