certtools / intelmq

IntelMQ is a solution for IT security teams for collecting and processing security feeds using a message queuing protocol.
https://docs.intelmq.org/latest/
GNU Affero General Public License v3.0

gethostbyname expert: domain resolution failure behaviour #1553

Closed chorsley closed 4 years ago

chorsley commented 4 years ago

On a new IntelMQ instance, we're currently processing the Phishtank feed. The feed data includes a lot of very old URLs dating back to 2017, and, as you'd expect, many of the hostnames in there no longer resolve (i.e. NXDOMAIN).

In the gethostbyname expert, socket.gethostbyname() fails with error code -3 for these. Since -3 is not included in the expected error codes at https://github.com/certtools/intelmq/blob/develop/intelmq/bots/experts/gethostbyname/expert.py#L41, the bot raises an error, exits, waits 15 seconds, then restarts three times per non-resolving hostname. That creates a large pipeline bottleneck at the gethostbyname expert step when we're trying to process a large backlog of possibly non-resolving hostnames.

The easy fix would be to add -3 to the list on line 41, i.e. if exc.args[0] in [-2, -3, -4, -5, -8, -11]: so that the bot could just move on smoothly without raising an error.
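To make the proposal concrete, here is a minimal sketch of that check, not the bot's actual code. The function name `resolve_or_none`, the `resolver` argument, and the `IGNORED_CODES` constant are hypothetical names introduced for illustration. The numeric codes are the ones quoted above, which match the glibc values.

```python
import socket

# Error codes from the proposed line 41, with -3 (EAI_AGAIN on glibc) added.
# The constant name is hypothetical; the values come from the issue text.
IGNORED_CODES = frozenset({-2, -3, -4, -5, -8, -11})


def resolve_or_none(hostname, resolver=socket.gethostbyname):
    """Return the IPv4 address for hostname, or None for an ignored error.

    `resolver` is injectable only so the sketch can be exercised without
    real DNS lookups; it defaults to socket.gethostbyname.
    """
    try:
        return resolver(hostname)
    except socket.gaierror as exc:
        # socket.gaierror carries the numeric code as its first argument.
        if exc.args[0] in IGNORED_CODES:
            return None
        # Anything unexpected still propagates, so the bot's normal
        # error handling takes over.
        raise
```

With this shape, a hostname that fails with -3 yields no address and processing moves on, while codes outside the list still raise.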

This seems to have been discussed before (e.g. https://github.com/certtools/intelmq/issues/1216) without resolution. I'm tempted to just make this change myself since it's proving a large drag on processing performance, but I'm wondering if there are any reasons identified NOT to do it?

ghost commented 4 years ago

As @vr0al already correctly stated in #1216, EAI_AGAIN / -3 means "Temporary failure in name resolution", which is not a permanent error. So in this case you probably have some domains which permanently trigger temporary errors. As temporary failures are usually temporary and need to be fixed, just ignoring them is IMHO not the right way.

Detecting this kind of permanent temporary failure is a lot of work, so the realistic option would be to ignore this kind of error and log a warning.

ghost commented 4 years ago

> the bot raises an error, exits, waits 15 seconds, then restarts three times per non-resolving hostname. That creates a large pipeline bottleneck at the gethostbyname expert step when we're trying to process a large backlog of possibly non-resolving hostnames.

You can change this behavior: https://github.com/certtools/intelmq/blob/develop/docs/User-Guide.md#error-handling

ghost commented 4 years ago

Any opinions on what I wrote, @chorsley?

I suggest optionally ignoring this temporary error (by parameter, opt-in for backwards compatibility).
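A rough sketch of what such an opt-in could look like, assuming a boolean flag. The parameter name `ignore_temporary_failure`, the function `lookup`, and the logger name are hypothetical and are not taken from the IntelMQ code base. The permanent codes are the ones already listed in the expert, and -3 is the glibc value for EAI_AGAIN.

```python
import logging
import socket

logger = logging.getLogger("gethostbyname-sketch")  # hypothetical logger name

# Codes the expert already treats as "no result" (from the issue text).
PERMANENT_CODES = frozenset({-2, -4, -5, -8, -11})
# EAI_AGAIN on glibc: "Temporary failure in name resolution".
TEMPORARY_CODE = -3


def lookup(hostname, ignore_temporary_failure=False,
           resolver=socket.gethostbyname):
    """Resolve hostname; optionally skip temporary failures with a warning.

    The default (False) keeps the existing behaviour, so the option is
    opt-in and backwards compatible.
    """
    try:
        return resolver(hostname)
    except socket.gaierror as exc:
        code = exc.args[0]
        if code in PERMANENT_CODES:
            return None
        if code == TEMPORARY_CODE and ignore_temporary_failure:
            # Log instead of failing, as suggested earlier in the thread.
            logger.warning("Ignoring temporary resolution failure for %r.",
                           hostname)
            return None
        raise
```

Leaving the flag off preserves the retry behaviour for genuinely transient resolver problems. Turning it on trades that safety for throughput on backlogs full of dead hostnames.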

ghost commented 4 years ago

As I got no feedback, I have now implemented it that way.