Luzilla / dnsbl_exporter

Prometheus compatible exporter to query DNSBLs/RBLs.
https://www.luzilla-capital.com/

Recurring SIGSEGV #64

Closed andbuitra closed 3 years ago

andbuitra commented 3 years ago

Hello,

We deployed dnsbl_exporter on a CentOS 7 machine as a systemd service. It currently goes offline fairly often with memory-related errors (either OOM or SIGSEGV). This is the error:

feb 13 04:31:34 monitor2-co dnsbl_exporter[685]: panic: runtime error: invalid memory address or nil pointer dereference
feb 13 04:31:34 monitor2-co dnsbl_exporter[685]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x4640e7]
feb 13 04:31:34 monitor2-co dnsbl_exporter[685]: goroutine 316739 [running]:
feb 13 04:31:34 monitor2-co dnsbl_exporter[685]: github.com/luzilla/dnsbl_exporter/collector.(*Rbl).lookup(0xc000330510, 0xc00019a120, 0x12, 0xc00019a1e0, 0x1d, 0x1, 0x1, 0x0)
feb 13 04:31:34 monitor2-co dnsbl_exporter[685]: /home/runner/work/dnsbl_exporter/dnsbl_exporter/collector/rbl.go:147 +0x3bd
feb 13 04:31:34 monitor2-co dnsbl_exporter[685]: github.com/luzilla/dnsbl_exporter/collector.(*Rbl).Update.func1(0xc0002da1b0, 0xc000330510, 0xc00019a120, 0x12, 0xc00019a1e0, 0x1d)
feb 13 04:31:34 monitor2-co dnsbl_exporter[685]: /home/runner/work/dnsbl_exporter/dnsbl_exporter/collector/rbl.go:166 +0x113
feb 13 04:31:34 monitor2-co dnsbl_exporter[685]: created by github.com/luzilla/dnsbl_exporter/collector.(*Rbl).Update
feb 13 04:31:34 monitor2-co dnsbl_exporter[685]: /home/runner/work/dnsbl_exporter/dnsbl_exporter/collector/rbl.go:161 +0xf0
feb 13 04:31:34 monitor2-co systemd[1]: dnsbl_exporter.service: main process exited, code=exited, status=2/INVALIDARGUMENT
feb 13 04:31:34 monitor2-co systemd[1]: Unit dnsbl_exporter.service entered failed state.
feb 13 04:31:34 monitor2-co systemd[1]: dnsbl_exporter.service failed.

There's plenty of memory available (more than 6 GB), so this shouldn't be an issue. So far I've resorted to configuring auto restart for the systemd unit. If relevant, the log also shows plenty of these:

feb 13 04:17:35 monitor2-co dnsbl_exporter[685]: time="2021-02-13T04:17:35-05:00" level=error
feb 13 04:17:39 monitor2-co dnsbl_exporter[685]: time="2021-02-13T04:17:39-05:00" level=error
feb 13 04:19:35 monitor2-co dnsbl_exporter[685]: time="2021-02-13T04:19:35-05:00" level=error
feb 13 04:19:35 monitor2-co dnsbl_exporter[685]: time="2021-02-13T04:19:35-05:00" level=error
feb 13 04:25:36 monitor2-co dnsbl_exporter[685]: time="2021-02-13T04:25:36-05:00" level=error
feb 13 04:25:36 monitor2-co dnsbl_exporter[685]: time="2021-02-13T04:25:36-05:00" level=error
feb 13 04:29:36 monitor2-co dnsbl_exporter[685]: time="2021-02-13T04:29:36-05:00" level=error
feb 13 04:29:36 monitor2-co dnsbl_exporter[685]: time="2021-02-13T04:29:36-05:00" level=error

There's nothing too special about our config. The only thing is that we load the RBLs and targets (using the proper args with absolute paths) from a folder that is linked to a git repo.

till commented 3 years ago

@andbuitra Thanks for reporting, can you share config and version? I'll try to reproduce. Sounds like something is missing.

andbuitra commented 3 years ago

The configuration is pretty straightforward

[Unit]
Description=DNSBL Exporter
StartLimitBurst=5

[Service]
User=root
ExecStart=/root/prometheus-monitoring/dnsbl_exporter/dnsbl_exporter --config.dns-resolver [REDACTED] --config.rbls /root/prometheus-monitoring/config-files/dnsbl_exporter/rbls.ini --config.targets /root/prometheus-monitoring/config-files/dnsbl_exporter/targets.ini
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=default.target

The version used is the latest release

./dnsbl_exporter --version
dnsbl-exporter version 0.4.3

till commented 3 years ago

@andbuitra Sorry, I meant rbls.ini and possibly targets.ini. I am assuming something is missing, and I don't handle input correctly.

andbuitra commented 3 years ago

Hello

The rbls.ini is as follows

[rbl]
server=cbl.abuseat.org
server=bl.deadbeef.com
server=spamtrap.drbl.drand.net
server=spamsources.fabel.dk
server=0spam.fusionzero.com
server=mail-abuse.blacklist.jippg.org
server=dyna.spamrats.com
server=noptr.spamrats.com
server=spam.spamrats.com
server=dnsbl.sorbs.net
server=spam.dnsbl.sorbs.net
server=bl.spamcop.net
server=pbl.spamhaus.org
server=sbl.spamhaus.org
server=xbl.spamhaus.org
server=ubl.unsubscore.com
server=dnsbl-1.uceprotect.net
server=dnsbl-2.uceprotect.net
server=dnsbl-3.uceprotect.net
server=db.wpbl.info
server=access.redhawk.org
server=sbl-xbl.spamhaus.org
server=b.barracudacentral.org
server=dul.dnsbl.sorbs.net
server=http.dnsbl.sorbs.net
server=l1.spews.dnsbl.sorbs.net
server=l2.spews.dnsbl.sorbs.net
server=misc.dnsbl.sorbs.net
server=postmaster.rfc-ignorant.org
server=rbl.spamlab.com
server=rbl.suresupport.com
server=relays.bl.kunden.de
server=smtp.dnsbl.sorbs.net
server=socks.dnsbl.sorbs.net
server=zen.spamhaus.org
server=zombie.dnsbl.sorbs.net
server=truncate.gbudb.net

Targets follows this pattern

[targets]
server=smtp.example1.com
server=smtp.example2.com

till commented 3 years ago

@andbuitra I'll check it on the weekend 🙏🏼

till commented 3 years ago

@andbuitra I haven't made much progress. Can you add --log.debug to your systemd unit and see if it uncovers anything? It's a bit noisy, but it would help.

My guess is that it's something inside the RBL request and response parsing. Or maybe even in a dependency.

till commented 3 years ago

@andbuitra release is here: https://github.com/Luzilla/dnsbl_exporter/releases/tag/0.4.4

andbuitra commented 3 years ago

@till I completely forgot about this. I will test it in the next couple of days. Thank you!

till commented 3 years ago

Yeah, let me know how it goes. I think I'll wait a bit until I merge the updated dependency again. Trying to think what else can be done to track this.

till commented 3 years ago

Btw, if you happen to narrow it down to a host/RBL combo, I can write a test confirming it against the upstream dependency and see about fixing it there.

till commented 3 years ago

@andbuitra friendly ping. Did you have a chance to take a look?

till commented 3 years ago

@andbuitra Do you see this happening still? I am currently prepping for a 0.5.0 release.

Btw, I'd like to include service files. Do you feel like contributing yours? Noting the install location, a la man, would be preferred.

andbuitra commented 3 years ago

@till Apologies, I was on vacation. I haven't been able to test the package yet but I will now. My systemd unit is simple and it loads the config file from a local git repo; the unit is located at /etc/systemd/system/dnsbl_exporter.service, but I have seen other apps like MariaDB putting theirs in /usr/lib/... and then referencing them. Maybe there's a standard for it from freedesktop. The restart clause was added to mitigate the original issue.

I will test the release 0.4.4 and let you know if the issue happens again.

till commented 3 years ago

Here is a 0.4.4-next: dnsbl-exporter-linux-amd64-0.4.4-next.zip

If you want to build it yourself, you'll need goreleaser and a clone of this repo: make build.

till commented 3 years ago

I kinda just spotted something else.

Sometimes parsing IPs seems to fail. I'm not sure why, but if the result is nil, the code panics.

till commented 3 years ago

So, I can't figure out why this happens to begin with, but the exporter should no longer panic. Instead, it gives you a log message with the "string" that it can't identify as an IP (v4 or v6).
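The general pattern described here can be sketched as follows. In Go, net.ParseIP returns nil rather than an error when the input is not a valid IPv4 or IPv6 address, so the result has to be checked before use. This is a minimal sketch under that assumption, not the actual change made in dnsbl_exporter; `parseIP` is a hypothetical wrapper.

```go
package main

import (
	"fmt"
	"net"
)

// parseIP wraps net.ParseIP, which signals invalid input by returning nil
// instead of an error. Hypothetical helper, not the exporter's code.
func parseIP(s string) (net.IP, error) {
	ip := net.ParseIP(s)
	if ip == nil {
		return nil, fmt.Errorf("cannot parse %q as an IPv4 or IPv6 address", s)
	}
	return ip, nil
}

func main() {
	for _, s := range []string{"192.0.2.10", "2001:db8::1", "not-an-ip"} {
		ip, err := parseIP(s)
		if err != nil {
			// Log and skip the entry instead of panicking later on a nil value.
			fmt.Println("skipping:", err)
			continue
		}
		fmt.Println("parsed:", ip)
	}
}
```

Returning an error that includes the offending string makes the bad input visible in the logs, which is what the message above is meant to provide.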

till commented 3 years ago

Latest main branch:

dnsbl_exporter_0.4.4-next_Darwin_arm64.tar.gz dnsbl_exporter_0.4.4-next_Darwin_x86_64.tar.gz

I think this contains an actual fix. So it was not in a dependency, but in my use of Go. If I don't hear back from you, I'll release 0.5.0 towards the weekend.

andbuitra commented 3 years ago

@till I was deploying this, but I see this is the Darwin binary and we run Linux on the server. Could you build that latest version for Linux x86? I will remove the restarts from my systemd unit so it won't fix itself automatically.

till commented 3 years ago

Sorry, here is Linux:

dnsbl_exporter_0.4.4-next_Linux_arm64.tar.gz dnsbl_exporter_0.4.4-next_Linux_i386.tar.gz dnsbl_exporter_0.4.4-next_Linux_x86_64.tar.gz

andbuitra commented 3 years ago

I have installed it now and so far so good. I will report back if the issue shows up again.

andbuitra commented 3 years ago

@till No crashes as of now. I believe that panic was causing the binary to stop. It's been working normally for more than 12 hours without needing a restart.

till commented 3 years ago

@andbuitra Thanks for letting me know.

You catch anything in the logs? I am curious what kind of "ip" caused this.

andbuitra commented 3 years ago

Nothing special shows up. The only error is level=error msg="read udp 127.0.0.1:37474->:0: read: connection refused", which shows up multiple times every minute. I don't really know what it's about, since the monitor works fine (as in, metrics show up correctly).

till commented 3 years ago

Maybe you filter UDP? DNS uses both (TCP and UDP). If you have a local resolver, it should respond to both.

andbuitra commented 3 years ago

It could be filtered by an upstream firewall. The resolver is in a public network and it's used throughout the infrastructure. dnsbl_exporter runs on a server behind a firewall, using it as a gateway, so that could be the reason. However, it's a non-issue since the exporter is working just fine.

The exporter has been running for more than three days now with no issues. I believe this issue can be closed.

till commented 3 years ago

Ok, good to know! I'll close when I cut a release. I am trying to finish #84 first! :) Thanks again for your time and patience.

till commented 3 years ago

@andbuitra I finally released 0.5.0, thanks again for your help and patience. I put your unit into #86. If you have time to contrib a more general unit file, let me know.