Closed ghost closed 6 years ago
Side note: I noticed that there is a similarly closed issue in the bug tracker. I think that this is exactly the same
strange.
Can you reproduce this easily or does it fall into 100% cpu every now and then ?
Probably i can reproduce it by trying to send an e-mail to an AOL address, but i'm not really sure.
BTW, this is the routes actually in OpenSMTPD:
Now this is in only one of the two MTA's. If the other MTA will start to lookup for AOL servers, it will reproduce the exact same behavior. The only thing i found is http://archives.seul.org/libevent/users/Jul-2013/msg00017.html that seems to trigger the same behavior.
In the first comment of this issue the strace you see is the one from the lookup process. Maybe i can provide it with more details? Suggestions?
Yes sorry, I asked for the strace out of habit but I had it below my eyes, caffeine needed.
Eric and I are investigating and we'll have a diff (hopefully) or many (...) for you soon.
:+1: No problems, i'll wait for an answer
Addendum: Now also the other MTA started to show the same behavior. The cpu usage by the lookup process is stable between 90 and 100%
Hi.
Needed to change my account. I'm actually the one that reported this issue
unable to reproduce and eric rewrote the resolver layer, closing this ticket, if anyone hits the issue again please reopen
Hi Gilles
To be fair, I never encountered this issue again, but I must say that I never had the possibility to test a newer opensmtpd-portable in the particular setup affected.
Sorry for not having the possibility to be more helpful on this.
On Sat, Sep 01, 2018 at 07:11:42AM -0700, abiscuola wrote:
Hi Gilles
To be fair, I never encountered this issue again, but I must say that I never had the possibility to test a newer opensmtpd-portable in the particular setup affected.
Sorry for not having the possibility to be more helpful on this.
No worry, there is a fair possibility this got fixed in the many improvements we did in the last few months.
Worst case scenario, you experience this again and we'll fix :-)
-- Gilles Chehade
https://www.poolp.org @poolpOrg
One year later... I just fixed a bug in libasr that may very well explain this issue. In fact I'm almost 100% sure it is the case.
Good morning.
We use OpenSMTPD portable for relaying the e-mails for our LiveChat service. We have a configuration with two servers running OpenSMTPD (6.0.2p1) in Debian 8 (jessie). For some motivations (and it seems to be related to some bad DNS answers from AOL), the lookup process start to eat 100% CPU. The machine remain responsive with a load average around 1, there is no service outage, but it's maybe a problem somewhere else.
Here what's happening from an strace of the process:
epoll_ctl(11, EPOLL_CTL_ADD, 17, {EPOLLIN, {u32=17, u64=17}}) = 0 epoll_wait(11, {{EPOLLIN, {u32=17, u64=17}}}, 32, 5000) = 1 clock_gettime(CLOCK_MONOTONIC, {1124087, 364557639}) = 0 gettimeofday({1481540024, 38296}, NULL) = 0 epoll_ctl(11, EPOLL_CTL_DEL, 17, 7ffc22a147e0) = 0 read(17, "", 2) = 0 epoll_ctl(11, EPOLL_CTL_ADD, 17, {EPOLLIN, {u32=17, u64=17}}) = 0 epoll_wait(11, {{EPOLLIN, {u32=17, u64=17}}}, 32, 5000) = 1 clock_gettime(CLOCK_MONOTONIC, {1124087, 364636631}) = 0 gettimeofday({1481540024, 38375}, NULL) = 0 epoll_ctl(11, EPOLL_CTL_DEL, 17, 7ffc22a147e0) = 0 read(17, "", 2) = 0 epoll_ctl(11, EPOLL_CTL_ADD, 17, {EPOLLIN, {u32=17, u64=17}}) = 0 epoll_wait(11, {{EPOLLIN, {u32=17, u64=17}}}, 32, 5000) = 1 clock_gettime(CLOCK_MONOTONIC, {1124087, 364717188}) = 0 gettimeofday({1481540024, 38456}, NULL) = 0 epoll_ctl(11, EPOLL_CTL_DEL, 17, 7ffc22a147e0) = 0 read(17, "", 2) = 0 epoll_ctl(11, EPOLL_CTL_ADD, 17, {EPOLLIN, {u32=17, u64=17}}) = 0 epoll_wait(11, {{EPOLLIN, {u32=17, u64=17}}}, 32, 5000) = 1 clock_gettime(CLOCK_MONOTONIC, {1124087, 364795786}) = 0 gettimeofday({1481540024, 38534}, NULL) = 0 epoll_ctl(11, EPOLL_CTL_DEL, 17, 7ffc22a147e0) = 0 read(17, "", 2) = 0 epoll_ctl(11, EPOLL_CTL_ADD, 17, {EPOLLIN, {u32=17, u64=17}}) = 0 epoll_wait(11, {{EPOLLIN, {u32=17, u64=17}}}, 32, 5000) = 1 clock_gettime(CLOCK_MONOTONIC, {1124087, 364874481}) = 0 gettimeofday({1481540024, 38613}, NULL) = 0 epoll_ctl(11, EPOLL_CTL_DEL, 17, 7ffc22a147e0) = 0 read(17, "", 2) = 0 epoll_ctl(11, EPOLL_CTL_ADD, 17, {EPOLLIN, {u32=17, u64=17}}) = 0 epoll_wait(11, {{EPOLLIN, {u32=17, u64=17}}}, 32, 5000) = 1 clock_gettime(CLOCK_MONOTONIC, {1124087, 364953765}) = 0 gettimeofday({1481540024, 38692}, NULL) = 0 epoll_ctl(11, EPOLL_CTL_DEL, 17, 7ffc22a147e0) = 0 read(17, "", 2)
And this happen for really too many lines. The only noticeable thing from the logs is something like this:
Dec 12 10:57:32 mx2 smtpd[4452]: 22a7651b2ef713d0 mta event=connecting address=smtp+tls://152.163.0.99:25 host=mtain-a-atc-b.mx.aol.com Dec 12 10:57:32 mx2 smtpd[4452]: 22a7651b2ef713d0 mta event=connected Dec 12 10:57:32 mx2 smtpd[4452]: 22a7651b2ef713d0 mta event=error reason=Bad response: non-printable character in reply Dec 12 10:57:32 mx2 smtpd[4452]: smtp-out: Disabling route [] <-> 152.163.0.99 (mtain-a-atc-b.mx.aol.com) for 15s Dec 12 10:57:33 mx2 smtpd[4452]: smtp-out: Enabling route [] <-> 152.163.0.67 (mailin-05.mail.aol.com) Dec 12 10:57:33 mx2 smtpd[4452]: smtp-out: Enabling route [] <-> 64.12.88.132 (mtain-b-mtc-a.mx.aol.com) Dec 12 10:57:34 mx2 smtpd[4452]: 22a7651c4b043cab mta event=connecting address=smtp+tls://152.163.0.67:25 host=mailin-05.mail.aol.com Dec 12 10:57:34 mx2 smtpd[4452]: 22a7651c4b043cab mta event=connected Dec 12 10:57:34 mx2 smtpd[4452]: 22a7651c4b043cab mta event=error reason=Bad response: non-printable character in reply Dec 12 10:57:34 mx2 smtpd[4452]: smtp-out: Disabling route [] <-> 152.163.0.67 (mailin-05.mail.aol.com) for 15s Dec 12 10:57:35 mx2 smtpd[4452]: smtp-out: Enabling route [] <-> 64.12.88.163 (mailin-05.mail.aol.com) Dec 12 10:57:36 mx2 smtpd[4452]: 22a7651d4dcd89a6 mta event=connecting address=smtp+tls://64.12.88.163:25 host=mailin-05.mail.aol.com Dec 12 10:57:36 mx2 smtpd[4452]: 22a7651d4dcd89a6 mta event=connected Dec 12 10:57:36 mx2 smtpd[4452]: 22a7651d4dcd89a6 mta event=error reason=Bad response: non-printable character in reply
And the lookup problem happen only when the AOL servers are started to being looked up by one of the machines.
Well, i already googled everything possible and the only thing is a potential problem with how the Linux network stack manage UDP sockets and was the cause of a similar problem in Libevent. Is this a known issue?
For more details, please contact me at andrea@tawk.to Regards