OpenSMTPD / OpenSMTPD

This is official OpenSMTPD Portable repository. Forks, pull requests and other contributions are welcome!
http://www.opensmtpd.org
Other
487 stars 88 forks source link

Random process crashes on the (almost) lastest portable #929

Closed jirutka closed 4 years ago

jirutka commented 4 years ago

I’m running OpenSMTPD built from 772da22936c8d80f7ad3284ea7e5bdbfdbee2efb (+ this patch for musl compatibility) with OpenSSL 1.1.1c* on Alpine Linux 3.10 (musl libc) with filter opensmtpd-filter-rewrite-from. It’s randomly crashes during email relaying.

This is all it logs when running with -v:

Aug 21 14:23:36 no-reply mail.info smtpd[3456]: 8bf482cee62b7ea0 mta connecting address=smtp://10.38.1.1:25 host=relay.in.*****.cz
Aug 21 14:23:36 no-reply mail.info smtpd[3456]: 8bf482cee62b7ea0 mta connected
Aug 21 14:23:36 no-reply mail.debug smtpd[3456]: debug: mta-routing: route [] <-> 10.38.1.1 (relay.in.*****.cz) is now valid.
Aug 21 14:23:36 no-reply mail.debug smtpd[3456]: debug: mta: connecting with [connector:[]->[relay:relay.in.*****.cz,port=25,smtp,mx,heloname=no-reply.in.*****.cz],0x20000]
Aug 21 14:23:36 no-reply mail.debug smtpd[3456]: debug: mta: cancelling connector timeout
Aug 21 14:23:36 no-reply mail.debug smtpd[3456]: debug: mta: enough connections already
Aug 21 14:23:36 no-reply mail.debug smtpd[3456]: debug: mta: 0x556c2071d0e0: handling next task for relay [relay:relay.in.*****.cz,port=25,smtp,mx,heloname=no-reply.in.*****.cz]
Aug 21 14:23:36 no-reply mail.debug smtpd[3454]: debug: control -> pony express: pipe closed
Aug 21 14:23:36 no-reply mail.debug smtpd[3454]: debug: control agent exiting
Aug 21 14:23:36 no-reply mail.debug smtpd[3453]: debug: ca -> pony express: pipe closed
Aug 21 14:23:36 no-reply mail.debug smtpd[3453]: debug: ca agent exiting
Aug 21 14:23:36 no-reply mail.debug smtpd[3446]: debug: parent -> control: pipe closed
Aug 21 14:23:36 no-reply mail.crit smtpd[3446]: smtpd: process control socket closed
Aug 21 14:23:36 no-reply mail.debug smtpd[3458]: debug: scheduler -> control: pipe closed
Aug 21 14:23:36 no-reply mail.debug smtpd[3458]: debug: scheduler agent exiting
Aug 21 14:23:36 no-reply mail.debug smtpd[3457]: debug: queue -> pony express: pipe closed
Aug 21 14:23:36 no-reply mail.debug smtpd[3457]: debug: queue agent exiting
Aug 21 14:23:36 no-reply mail.debug smtpd[3455]: debug: lka -> pony express: pipe closed
Aug 21 14:23:36 no-reply mail.debug smtpd[3455]: debug: lookup agent exiting

smtpd.conf:

# This is the smtpd server system-wide configuration file.
# See smtpd.conf(5) for more information.

# Full domain name of this mail server.
server_name = "no-reply.in.****.cz"

# Address of the relay server.
relay_host = "relay.in.****.cz"

# Table of hosts/networks that are allowed to relay mails.
table internal { "10.38.1.0/24", "127.0.0.1", "::1", "local" }

# Table of user credentials allowed to relay mails for any recipients.
table passwd "/etc/smtpd/passwd.table"

pki $server_name cert "/etc/ssl/smtpd/server.crt"
pki $server_name key "/etc/ssl/smtpd/server.key"

# Specify a maximum message size of n bytes.
# The argument may contain a multiplier, as documented in scan_scaled(3).
# The default maximum message size is 35MB if none is specified.
smtp max-message-size 1M

filter "rewrite-from" proc-exec "/etc/smtpd/filter-rewrite-from no-reply@****.cz"

listen on socket filter "rewrite-from"
listen on lo hostname $server_name filter "rewrite-from"

# Submission port for relaying mails from the internal network to the outside
# world. This port is accessible only from the internal network.
listen on eth0 port 587 tls \
    hostname $server_name \
    auth-optional <passwd> \
    received-auth \
    mask-src \
    filter "rewrite-from" \
    tag "internal"

action "discard" mda "/bin/true" virtual { "@" = "nobody" }
action "relay" relay host $relay_host helo $server_name

match from src <internal> \
    for domain { "drop.me", "example.org", "local.host" } \
    action "discard"

match from src <internal> \
    for domain "****.cz" \
    action "relay"

match auth action "relay"

* I wanted to build it against the last stable LibreSSL, but it doesn’t build due to some missing symbols…

/cc @poolpOrg

poolpOrg commented 4 years ago

Can you reproduce without your filter ?

jirutka commented 4 years ago

I don’t know how to reproduce it, it didn’t crash when I was trying to reproduce it. And I cannot let it run in production without that filter.

jirutka commented 4 years ago

At morning it crashed with a different error message:

Aug 21 08:59:32 no-reply mail.crit smtpd[2536]: smtpd: process ca socket closed
opensmtpd-github commented 4 years ago

On Wed, Aug 21, 2019 at 08:43:50AM -0700, Jakub Jirutka wrote:

At morning it crashed with a different error message:

Aug 21 08:59:32 no-reply mail.crit smtpd[2536]: smtpd: process ca socket closed

i'm without smtp out trafic at home, will check if i can setup an alpine on a vps with smtp-out somewhere and do some testing

-- Gilles Chehade @poolpOrg

https://www.poolp.org patreon: https://www.patreon.com/gilles

jirutka commented 4 years ago

I can provide you SSH access to some VM with Alpine installed if it would help you.

poolpOrg commented 4 years ago

That would be helpful yes if it can send mail to the outside, i'll setup something between that vm and my machines to try reproducing.

jirutka commented 4 years ago

gilles@185.8.164.31

jirutka commented 4 years ago

Just a note, I’ve created the VM in the playground environment, so it will be available only for 1 month.

jirutka commented 4 years ago

I’ve encountered a different crash, now from SSL library:

Aug 28 21:29:50 no-reply mail.info smtpd[4129]: 3d81d7318b47cbc7 smtp connected address=10.***** host=*****
Aug 28 21:29:50 no-reply mail.debug smtpd[4128]: debug: looking up pki "no-reply.in.*****.cz"
Aug 28 21:29:50 no-reply mail.debug smtpd[4129]: debug: session_start_ssl: switching to SSL
Aug 28 21:29:50 no-reply mail.debug smtpd[4129]: debug: pony: rsae_priv_enc
Aug 28 21:29:50 no-reply mail.debug smtpd[4129]: debug: SSL library error: io_dispatch_accept_tls:SSL_accept: error:140943F2:SSL routines:ssl3_read_bytes:sslv3 alert unexpected message
Aug 28 21:29:50 no-reply mail.info smtpd[4129]: 3d81d7318b47cbc7 smtp disconnected reason="io-error: error:140943F2:SSL routines:ssl3_read_bytes:sslv3 alert unexpected message"

This version of OpenSMTPD is very unreliable for me. :(

jirutka commented 4 years ago

@poolpOrg, I’ve updated to 72758123b7c2c83efeefb12623bde59fe5b810ac (the current latest portable), but no change at all, it’s still randomly crashing.

poolpOrg commented 4 years ago

Yes, I haven't had time to check yet but this is high on my list

jirutka commented 4 years ago

Any change…?

poolpOrg commented 4 years ago

not yet but i've just entered my "opensource" week so I'm on it

poolpOrg commented 4 years ago

I'll address the two issues separately:

Regarding the initial crash, I've tried but I'm unable to reproduce on an alpine setup similar to yours both in terms of using OpenSSL and in terms of setup using a relay host. I'll keep it running a while and taking real trafic but not knowing if your filter caused it is an issue for troubleshooting as we've committed four days ago a diff to fix a race condition that could crash in filter layer. Do you still experience the crash ?

Regarding the SSL library stuff, you mentioned a different crash but your sample logs doesn't show a crash, what it shows is a TLS negotiation error. Did OpenSMTPD crash after that negotiation ?

jirutka commented 4 years ago

…not knowing if your filter caused it is an issue for troubleshooting…

If the filter caused a crash, I would expect OpenSMTPD to log it (“lost processor: exited abnormally” or “misbehaving filter”), but I don’t see any message regarding the filter.

we've committed four days ago a diff to fix a race condition that could crash in filter layer.

I’ll update my instance to the latest portable and let you know if something changed.

what it shows is a TLS negotiation error. Did OpenSMTPD crash after that negotiation?

IIRC it crashed right after this message, but I’m not 100% sure now. I’ve already fixed the problem on the client side trying to use SSLv3.

poolpOrg commented 4 years ago

If the filter caused a crash, I would expect OpenSMTPD to log it (“lost processor: dynproc:00000001 exited abnormally” or “misbehaving filter”), but I don’t see any message regarding the filter.

Not necessarily, when I said "if your filter caused a crash" I didn't mean it necessarily as your filter crashing, it could also be that your filter does something which causes OpenSMTPD itself to crash.

This is what happened in the race condition we fixed this week, where if a session was disconnected for some reason while a filter was working, upon response of the filter, the session would no longer exist and OpenSMTPD would fatal() assuming a corruption.

This was a crash in the daemon, so not logged as a lost processor, but which would only ever happen with a filter processing a specific phase.

I’ll update my instance to the latest portable and let you know if something changed.

good, I have committed your diff to includes.h so you don't need it anymore

what it shows is a TLS negotiation error. Did OpenSMTPD crash after that negotiation?

IIRC it crashed right after this message, but I’m not 100% sure now. I’ve already fixed the problem on the client side trying to use SSLv3.

OK, let me know how it goes, as far as I'm concerned, I have produced multiple TLS errors ranging from no TLS on the TLS listener, random data on the TLS listener, wrong ciphers on the TLS listener, and all I get is a disconnect with no crash.

jirutka commented 4 years ago

Today it crashed four times and every time with the following message (with different token and sid ofc):

lookup: Invalid filter message type: filter-dataline|b6d5dbf02113cec4|a8cfd22834b47a7f|Received:

I didn’t see this message before, so it’s new after the update.

I don’t understand what’s wrong with this particular message. I’m not modifying it in any way, just printing what I read on stdin.

Relevant lines from the filter:

"filter" == $1 {
        if (NF < 7) {                                               
                die("invalid filter command: expected >6 fields!")
        }
        sid = $6
        token = $7
        line = substr($0, length($1$2$3$4$5$6$7) + 8)

        # continue with next rule...
} 
"filter|smtp-in|data-line" == $1_$4_$5 {
        ...                       
        print("filter-dataline", token, sid, line)
}
poolpOrg commented 4 years ago

I think you're still running the code from 7275812 and not the latest code from the portable branch.

The error message you've shown can only trigger for that line if you're running the code from 7275812 as it contains a typo which I fixed in the very next commit:

https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.sbin/smtpd/lka_proc.c.diff?r1=1.9&r2=1.10&f=h

Can you update and confirm ?

jirutka commented 4 years ago

Ah, sorry, you’re right! I really did update OpenSMTPD, but to the package built from 7275812, not the latest one – I forgot to install the update after updating the package some time ago and didn’t notice the version number now.

Anyway, now I’m really running on the latest portable. I will let you know tomorrow if it crashed.

poolpOrg commented 4 years ago

cool

poolpOrg commented 4 years ago

Hi,

Any news ?

jirutka commented 4 years ago
Sep 18 11:35:08 no-reply mail.info smtpd[5449]: 9aeb49dd6a265771 smtp connected address=10.xxxx host=xxxxx
Sep 18 11:35:08 no-reply mail.debug smtpd[5448]: debug: looking up pki "no-reply.in.xxxx.cz"
Sep 18 11:35:08 no-reply mail.debug smtpd[5449]: debug: session_start_ssl: switching to SSL
Sep 18 11:35:08 no-reply mail.debug smtpd[5449]: debug: pony: rsae_priv_enc
Sep 18 11:35:08 no-reply mail.debug smtpd[5449]: debug: SSL library error: io_dispatch_accept_tls:SSL_accept: error:14094416:SSL routines:ssl3_read_bytes:sslv3 alert certificate unknown
Sep 18 11:35:08 no-reply mail.info smtpd[5449]: 9aeb49dd6a265771 smtp disconnected reason="io-error: error:14094416:SSL routines:ssl3_read_bytes:sslv3 alert certificate unknown"
Sep 18 11:35:09 no-reply mail.debug smtpd[5450]: debug: queue -> pony express: pipe closed
Sep 18 11:35:09 no-reply mail.debug smtpd[5450]: debug: queue agent exiting
Sep 18 11:35:09 no-reply mail.debug smtpd[5451]: debug: scheduler -> queue: pipe closed
Sep 18 11:35:09 no-reply mail.debug smtpd[5447]: debug: control -> queue: pipe closed
Sep 18 11:35:09 no-reply mail.debug smtpd[5447]: debug: control agent exiting
Sep 18 11:35:09 no-reply mail.debug smtpd[5451]: debug: scheduler agent exiting
Sep 18 11:35:09 no-reply mail.debug smtpd[5446]: debug: ca -> control: pipe closed
Sep 18 11:35:09 no-reply mail.debug smtpd[5446]: debug: ca agent exiting
Sep 18 11:35:09 no-reply mail.debug smtpd[5441]: debug: parent -> queue: pipe closed
Sep 18 11:35:09 no-reply mail.crit smtpd[5441]: smtpd: process queue socket closed
Sep 18 11:35:09 no-reply mail.debug smtpd[5448]: debug: lka -> queue: pipe closed
Sep 18 11:35:09 no-reply mail.debug smtpd[5448]: debug: lookup agent exiting
Sep 18 11:35:09 no-reply daemon.warn supervise-daemon[5439]: /usr/sbin/smtpd, pid 5441, exited with return code 1

This one came out of nowhere:

Sep 18 15:37:04 no-reply mail.debug smtpd[5837]: debug: queue -> pony express: pipe closed
Sep 18 15:37:04 no-reply mail.debug smtpd[5837]: debug: queue agent exiting
Sep 18 15:37:04 no-reply mail.debug smtpd[5834]: debug: control -> pony express: pipe closed
Sep 18 15:37:04 no-reply mail.debug smtpd[5834]: debug: control agent exiting
Sep 18 15:37:04 no-reply mail.debug smtpd[5832]: debug: parent -> queue: pipe closed
Sep 18 15:37:04 no-reply mail.crit smtpd[5832]: smtpd: process queue socket closed
Sep 18 15:37:04 no-reply mail.debug smtpd[5838]: debug: scheduler -> queue: pipe closed
Sep 18 15:37:04 no-reply mail.debug smtpd[5838]: debug: scheduler agent exiting
Sep 18 15:37:04 no-reply mail.debug smtpd[5835]: debug: lka -> pony express: pipe closed
Sep 18 15:37:04 no-reply mail.debug smtpd[5835]: debug: lookup agent exiting
Sep 18 15:37:04 no-reply mail.debug smtpd[5833]: debug: ca -> pony express: pipe closed
Sep 18 15:37:04 no-reply mail.debug smtpd[5833]: debug: ca agent exiting
Sep 18 15:37:04 no-reply daemon.warn supervise-daemon[5439]: /usr/sbin/smtpd, pid 5832, exited with return code 1
Sep 19 14:39:23 no-reply mail.info smtpd[5913]: 17be4dd4d0fd84bb smtp connected address=10.38.3.182 host=xxxx.cz
Sep 19 14:39:23 no-reply mail.debug smtpd[5912]: debug: looking up pki "no-reply.in.fit.xxxxx.cz"
Sep 19 14:39:23 no-reply mail.debug smtpd[5913]: debug: session_start_ssl: switching to SSL
Sep 19 14:39:23 no-reply mail.debug smtpd[5914]: debug: queue -> pony express: pipe closed
Sep 19 14:39:23 no-reply mail.debug smtpd[5914]: debug: queue agent exiting
Sep 19 14:39:23 no-reply mail.debug smtpd[5912]: debug: lka -> queue: pipe closed
Sep 19 14:39:23 no-reply mail.debug smtpd[5910]: debug: ca -> pony express: pipe closed
Sep 19 14:39:23 no-reply mail.debug smtpd[5910]: debug: ca agent exiting
Sep 19 14:39:23 no-reply mail.debug smtpd[5911]: debug: control -> queue: pipe closed
Sep 19 14:39:23 no-reply mail.debug smtpd[5915]: debug: scheduler -> queue: pipe closed
Sep 19 14:39:23 no-reply mail.debug smtpd[5911]: debug: control agent exiting
Sep 19 14:39:23 no-reply mail.debug smtpd[5912]: debug: lookup agent exiting
Sep 19 14:39:23 no-reply mail.debug smtpd[5915]: debug: scheduler agent exiting
Sep 19 14:39:23 no-reply mail.debug smtpd[5909]: debug: parent -> queue: pipe closed
Sep 19 14:39:23 no-reply mail.crit smtpd[5909]: smtpd: process queue socket closed
Sep 19 14:39:23 no-reply daemon.warn supervise-daemon[5439]: /usr/sbin/smtpd, pid 5909, exited with return code 1
jirutka commented 4 years ago

@poolpOrg, any news?

poolpOrg commented 4 years ago

nope, got context switched but this is still being investigated, whole october is on stabilization

poolpOrg commented 4 years ago

I have two possible candidates for this bug which would explain why I don't observe them.

Investigation still in progress.

poolpOrg commented 4 years ago

OK, I'm out of ideas.

I've setup a brand new alpine machine, built OpenSMTPD with OpenSSL 1.1.1c, I tested incoming and outgoing TLS, I tested DNS-resolved MX and relay host MX over TLS, received and sent thousands of mails and did not observe a single crash.

I need to be able to replicate your setup otherwise we're out of luck fixing this, I've seen your config but this is not enough for me to reproduce:

jdelic commented 4 years ago

@jirutka

any chance of running this with debug symbols compiled in and core dumps enabled?

jirutka commented 4 years ago

I’m using Alpine v3.10, there’s currently OpenSSL 1.1.1d. You can install the exact OpenSMTPD binary that I’m running from my packages:

cd /etc/apk/keys
wget https://raw.githubusercontent.com/jirutka/user-aports/v3.10/.keys/jakub@jirutka.cz-56d0d9fd.rsa.pub

echo "@jirutka https://repo.jirutka.cz/alpine/v3.10/user" > /etc/apk/repositories

apk add opensmtpd@jirutka opensmtpd-dbg@jirutka

It’s built from this APKBUILD.

does the crash happen when contacting specific hosts ?

It works only as a relay for clients on the local network and relays all the mails via single relay server. I don’t know if it happens only with some clients, I may check it, but there are only about 2-3 active clients anyway…

any chance of running this with debug symbols compiled in and core dumps enabled?

I thought that core dumps are generated only when the process crash, e.g. with segfault, isn’t it? This doesn’t seem to be the case, according to the logs OpenSMTPD just decides to quit…

(I’ve installed debug symbols and enabled core dumps now.)

jdelic commented 4 years ago

@jirutka

I thought that core dumps are generated only when the process crash, e.g. with segfault, isn’t it? This doesn’t seem to be the case, according to the logs OpenSMTPD just decides to quit…

Sorry, I was under the impression that the filter crashes and that takes OpenSMTPD with it. You're right of course.

jirutka commented 4 years ago

Any news?

jirutka commented 4 years ago

Now I found process smtpd: pony express consuming 100% of CPU and OpenSMTPD not responding on the submission port. This version is really unreliable as hell. :(

poolpOrg commented 4 years ago

musl-based distros have never been very well supported, hopefully the Docker work being done by @ngortheone will help us with this.

jirutka commented 4 years ago

musl is not some strange libc, it’s strictly standard-compliant libc implementation. If some software crahes or even hangs on musl, it’s more a litmus paper that there’s someting wrong in it…

I’ll try to build OpenSMTPD against LibreSSL, maybe the problem is in compatibility with OpenSSL. I hope that it will work now because the last time I was not able to build it even against the latest stable LibreSSL (missing symbols)…

whataboutpereira commented 4 years ago

FYI I'm running OpenSMTPD with OpenSSL 1.1.1d without problems.

jirutka commented 4 years ago

BTW, I’m running OpenSMTPD 6.0.3p1 on Alpine Linux (with LibreSSL) for a long time without a single crash. The problems started just with the latest version.

poolpOrg commented 4 years ago

There are differences between the libc in the different BSD systems, there's a difference between the OpenBSD libc and the glibc, none of them are strange, they are just not the same and sometimes you hit bugs in libc or bugs in an application that doesn't happen on any but one libc due to some internal behaviour. This is not as simple and unified as you think, even if we didn't support Linux at all but only FreeBSD, we'd still have to account for libc differences (did you know that closefrom() system call doesn't even have the same prototype between OpenBSD and FreeBSD ? that strnvis() had swapped arguments between FreeBSD or NetBSD at some point ?).

If software crashes or hangs on musl, it could be due to musl (wouldn't be the first time), it could be due to OpenSMTPD (wouldn't be the first time), could be due to either one of the dependency (wouldn't be the first time) and given that a minority of people use musl distros as far as we see, it's clearly not given the same amount of testing that any BSD libc or glibc. We'll make it work but we don't have many testers and if we can't reproduce, unless some people really want to help, it's going to take more time than bugs that trigger on every distro out there.

As for the LibreSSL symbols, as explained on the mailing list and blog, to bring back code to a state that allows it to build for both OpenSSL and LibreSSL, I had to submit code to LibreSSL and refactor code in OpenSMTPD to work for both. This means there was a transitional time where the code would only work for a development version of LibreSSL that had new functions. The LibreSSL 3.0.2 stable release is the first that has what's necessary for us to build on both LibreSSL and OpenSSL.

poolpOrg commented 4 years ago

BTW, I’m running OpenSMTPD 6.0.3p1 on Alpine Linux (with LibreSSL) for a long time without a single crash. The problems started just with the latest version.

There's been quite a few commits over the course of a year, which is why I keep calling for test regularly because if a regression was introduced on musl-based distros between 6.0.3p1 and today, there's now >150 commits and 71 changed files to investigate, a year worth of commits to bisect:

https://github.com/OpenSMTPD/OpenSMTPD/compare/branch_opensmtpd-6.0.3...portable

And that's assuming the issue is actually in OpenSMTPD and not in libasr (there's been a new release recently), libressl (there's been a new release since 6.0.3p1) or even musl.

I want to support Alpine and other musl-based distros, but I can't be expected to run and test all distros all of the time, so unless someone commits to test development code regularly and let us know when there's a regression, the quality for these distros is not going to improve. I can't bisect a year worth of commits because the last known working version is a year old on such an active project, it just means that no one really cared for a year and now I'm expected to. I really focus on the systems and distros for which I have frequent feedback.

For the moment, it looks to me that OpenSMTPD 6.6.0p1 will not support Alpine because I've had enough reports that it works on other distros (Arch, CentOS, Ubuntu, Gentoo, ...) and systems (FreeBSD, NetBSD, OSX, ...) that I'm confident the release will be stable for them. I'll continue tracking this issue post release and will publish a minor release if/when the Alpine situation is solved, but I won't delay the release for more than a week.

I really want to stress it again for future readers:

Do not JUST install stable releases, test development code otherwise there's NO stable release.

ngortheone commented 4 years ago

@jirutka there were multiple fixes lately to portable branch. There is a good chance that the problem you were experiencing is fixed.

Btw. are you maintaining Alpine Linux opensmtpd package?

poolpOrg commented 4 years ago

There's a possibility the crash was related to the fixes I applied, I'll wait for @jirutka to let us know if he still experiences a crash before rolling the new release.

jirutka commented 4 years ago

There is a good chance that the problem you were experiencing is fixed.

No, I’ve just upgraded and it’s still killing itself. >_<

Btw. are you maintaining Alpine Linux opensmtpd package?

I’m not its official maintainer, but I’m Alpine dev and I’m contributing to the opensmtpd package.

poolpOrg commented 4 years ago

Are you around and can you join IRC ?

I have an hour to spare, if you are able to reproduce the crash and generate a core dump, then I can work from that to debug and produce a fix asap.

jirutka commented 4 years ago

Are you around and can you join IRC ?

Yes.

if you are able to reproduce the crash and generate a core dump

There’s no crash! That’s the whole point all the time, OpenSMTPD is not crashing, it’s quitting itself without logging the reason.

ngortheone commented 4 years ago

Check dmesg. You might see segfault there

On November 4, 2019 8:31:31 AM EST, Jakub Jirutka notifications@github.com wrote:

Are you around and can you join IRC ?>

Yes.>

if you are able to reproduce the crash and generate a core dump>

There’s no crash! That’s the whole point all the time, OpenSMTPD is not crashing, it’s quitting itself without logging the reason.>

-- > You are receiving this because you were mentioned.> Reply to this email directly or view it on GitHub:> https://github.com/OpenSMTPD/OpenSMTPD/issues/929#issuecomment-549353973

-- Sent from my Android device with K-9 Mail. Please excuse my brevity.

poolpOrg commented 4 years ago

nah this is more subtle, there is a child process crashing, the exit 01 message is because the parent process exits with status 1 if it detects a child has disappeared.

I'm on #OpenSMTPD @ freenode, hit me up and we'll figure this out fast.

jirutka commented 4 years ago

Check dmesg. You might see segfault there

Uh, you’re right! There really are segfaults. So I have somehow broken logging on this system and haven’t noticed it until now (always looked only into /var/log/dmesg). But then I don’t understand why core dumps are not generated; I’ve already verified that core dumps work on this system. o.O

[796819.036664] smtpd[16067]: segfault at 55d26ea3a0ba ip 00007f41c4120527 sp 00007ffe09466600 error 6 in ld-musl-x86_64.so.1[7f41c40ff000+46000]
[796819.036675] Code: 47 28 48 85 c0 74 08 48 ff c5 48 89 c7 eb ef 48 0f bf 47 50 48 89 fb 48 6b c0 58 48 29 c3 4c 8d 63 4c 4c 89 e7 e8 15 94 01 00 <66> 29 6b 52 75 0c 48 89 df 5b 5d 41 5c e9 17 eb fd ff 5b 4c 89 e7
[796819.895016] smtpd[16077]: segfault at 55667af30dba ip 00007f401f774527 sp 00007fffb0579e20 error 6 in ld-musl-x86_64.so.1[7f401f753000+46000]
[796819.895030] Code: 47 28 48 85 c0 74 08 48 ff c5 48 89 c7 eb ef 48 0f bf 47 50 48 89 fb 48 6b c0 58 48 29 c3 4c 8d 63 4c 4c 89 e7 e8 15 94 01 00 <66> 29 6b 52 75 0c 48 89 df 5b 5d 41 5c e9 17 eb fd ff 5b 4c 89 e7
[797052.980022] smtpd[16087]: segfault at 555c3f46231a ip 00007f6688ccf527 sp 00007ffd60eaa1b0 error 6 in ld-musl-x86_64.so.1[7f6688cae000+46000]
[797052.980036] Code: 47 28 48 85 c0 74 08 48 ff c5 48 89 c7 eb ef 48 0f bf 47 50 48 89 fb 48 6b c0 58 48 29 c3 4c 8d 63 4c 4c 89 e7 e8 15 94 01 00 <66> 29 6b 52 75 0c 48 89 df 5b 5d 41 5c e9 17 eb fd ff 5b 4c 89 e7
[798451.917013] smtpd[16097]: segfault at ffff0000 ip 00000000ffff0000 sp 00007fffe0b02008 error 14 in smtpd[5628e277e000+9000]
[798451.917061] Code: Bad RIP value.
poolpOrg commented 4 years ago

you may have them disabled by default, can you try to ulimit -c unlimited to increase the core size ?

jirutka commented 4 years ago

you may have them disabled by default, can you try to ulimit -c unlimited to increase the core size ?

I’ve already set this for opensmtpd. I suspect OpenRC’s supervise-daemon that it maybe doesn’t propagate it correctly. :/

poolpOrg commented 4 years ago

With help from @jirutka the issue seems to be properly understood:

A recent change in musl's freeaddrinfo() causes it to no longer be NULL-safe and segv when passed a NULL pointer. This explains why the crashes happen at various places, it also explains why I couldn't reproduce on an alpine myself (with a musl prior to the change) and on other systems (NULL-safe in glibc, BSD, ...).

Right now, @jirutka is running with a diff to confirm the issue but given that I read the implementations for freeaddrinfo() on different libc and that we have been able to reproduce the crash with surrounding logs, I'm 100% confident this is the issue.

For the next release, I think the only solution for Alpine is to ship a patch in their package to NULL-protect all freeaddrinfo() calls until we sort out if this is going to stay or be reverted, or at the very least until the next major release if we decide to NULL-protect upstream.

ngortheone commented 4 years ago

Awesome diagnosis @poolpOrg

Currently OpenSMTPD package in alpine is unmaintained (Jonathan Curran told me that he is no longer maintaining the package)

It was on my to-do list to update Alpine's package, so I can add the patch too.

jirutka commented 4 years ago

Thanks a lot to @poolpOrg for diagnosis!

Currently OpenSMTPD package in alpine is unmaintained (Jonathan Curran told me that he is no longer maintaining the package)

Okay, I’m gonna take over maintainership.