NICMx / FORT-validator

RPKI cache validator
MIT License

failure scenarios, monitoring and glibc recommendations #40

Open lukastribus opened 3 years ago

lukastribus commented 3 years ago

Hello,

I'm currently evaluating the FORT validator and have a few questions.

I'm concerned about bugs, misconfigurations, or other issues (in all RP/RTR setups, not specific to FORT) that would cause obsolete VRPs to linger on the production routers, because I believe this is the worst case in RPKI ROV deployments.

I worry about how issues like:

  • crash bugs in the validation code
  • hangs during RPKI validation (even in rsync) that block the entire validation
  • memory allocation failures (failed malloc)

impact the RTR service.

The best-case scenario for me is that the RTR server goes down completely and all RTR sessions die, so that the production routers are aware there is a problem with that RTR endpoint and stop using it (failing over to other RTR servers, if available).

Is that the expected behavior in FORT? It is a single process with multiple threads, so a crash would achieve this, correct?

I'm also thinking about monitoring (maybe without regex'ing logfile):

  • how to best monitor for periodic successful validation runs
  • how to monitor validation run time

Other than parsing strings from logfiles, how could we best achieve this? Is there some stat socket that we could query to check for things like last validation start time and last validation completion time?

Regarding glibc memory allocation: when using glibc, should we just use MALLOC_ARENA_MAX=2 always or only in environments with limited memory? If this is a good middle ground, I'd prefer to just use it always in the glibc world and have systemd unit files set this.

pcarana commented 3 years ago

Hi Lukas! I hope this answer helps with your analysis.

Is that the expected behavior in FORT? It is a single process with multiple threads, so a crash would achieve this, correct?

Correct. A crash will cause FORT validator to stop, which means the RTR server will go down as well. The key point here is: when will FORT validator crash? Definitely due to bugs (hopefully there shouldn't be many of these, but nobody's perfect) or a programming error (logged at crit level, see Logging#level in our docs).

Regarding the other issues:

I'm also thinking about monitoring (maybe without regex'ing logfile)

  • how to best monitor for periodic successful validation runs
  • how to monitor validation run time

Oops! As of today this data is logged at the info level in the operation logs (you'll have to set that level before running FORT, using --log.level=info, since the default level is warning).

Is there some stat socket that we could query to check for things like last validation start time and last validation completion time?

This will definitely be on our TODO list, so for now "regex'ing logfile" (using the info level) is the way to go.
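
For anyone scripting this in the meantime, here is a minimal sketch of what such a log check could look like. The log path and the "completed validation" pattern are placeholders, since the exact wording of Fort's info-level log lines isn't documented here; verify them against your own logs first.

```python
#!/usr/bin/env python3
"""Cron-style check: alert if Fort has not logged a completed validation recently.

Assumptions (placeholders, adjust to your setup): Fort runs with
--log.level=info, its operation log ends up in LOG_PATH, and a finished
validation run produces a line matching END_PATTERN. Neither the path nor
the pattern is Fort's guaranteed log format.
"""
import json
import re
import sys
import time
from pathlib import Path

LOG_PATH = Path("/var/log/fort.log")                          # placeholder
STATE_PATH = Path("/var/tmp/fort-validation-check.json")      # where this check keeps state
END_PATTERN = re.compile(r"[Vv]alidation.*(finish|complet)")  # placeholder pattern
MAX_AGE = 3600  # seconds without a newly completed run before alerting

def count_completed_runs() -> int:
    """Count log lines that look like a finished validation run."""
    with LOG_PATH.open(errors="replace") as fh:
        return sum(1 for line in fh if END_PATTERN.search(line))

def main() -> int:
    now = time.time()
    count = count_completed_runs()
    state = {"count": -1, "changed_at": now}
    if STATE_PATH.exists():
        state = json.loads(STATE_PATH.read_text())
    if count != state["count"]:
        # A new run completed since the last check; remember when we noticed.
        state = {"count": count, "changed_at": now}
    STATE_PATH.write_text(json.dumps(state))
    stalled_for = int(now - state["changed_at"])
    if stalled_for > MAX_AGE:
        print(f"ALERT: no completed validation logged in the last {stalled_for}s")
        return 2  # Nagios-style CRITICAL
    print("OK: validation runs are being logged")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```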

Regarding glibc memory allocation: when using glibc, should we just use MALLOC_ARENA_MAX=2 always or only in environments with limited memory? If this is a good middle ground, I'd prefer to just use it always in the glibc world and have systemd unit files set this.

Yes, I would recommend using it in environments with limited memory. Of course, there's no problem with using it always, since its main goal is to help.
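
For reference, this could be wired in with a systemd drop-in along these lines; the unit name fort.service and the drop-in path are assumptions and may differ depending on how Fort was installed:

```ini
# /etc/systemd/system/fort.service.d/malloc.conf  (hypothetical unit name and path)
[Service]
Environment=MALLOC_ARENA_MAX=2
```

After creating the drop-in, run systemctl daemon-reload and restart the service so the environment variable takes effect.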

lukastribus commented 3 years ago

Thanks for the feedback.

I think a stat socket, or better yet an HTTP endpoint with a REST interface, that returns general health and validation metrics (especially the last validation run's start and stop times) would indeed be important for active monitoring.

pcarana commented 3 years ago

I agree, this will be a nice (and likely needed) feature.

pcarana commented 3 years ago

The upcoming version v1.5.0 will try to address some of the points raised in this issue.

lukastribus commented 3 years ago

Regarding monitoring, I will build an rtrdump-based tool to check for stalled RTR endpoints (same RTR serial and data output after X amount of time = monitoring alert). I believe this is a better way to monitor validator/RTR server health than relying on validation timestamps from an HTTP API.
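
For illustration, a rough sketch of such a check: shell out to rtrdump, hash its output, and alert if the dump has not changed for too long. The endpoint and the rtrdump flags ("-connect host:port") are placeholders and may not match your rtrdump version.

```python
#!/usr/bin/env python3
"""Alert if an RTR endpoint's dump has not changed for too long (stalled cache).

Assumptions: an rtrdump binary is on PATH and prints the current VRP set to
stdout; the flags and endpoint below are placeholders.
"""
import hashlib
import json
import subprocess
import sys
import time
from pathlib import Path

RTR_ENDPOINT = "rtr.example.net:323"                # placeholder
STATE_PATH = Path("/var/tmp/rtr-stall-check.json")  # where this check keeps state
MAX_STALL = 3 * 3600                                # alert after 3h without any change

def dump_digest() -> str:
    """Fetch the current VRP dump and return a digest of it."""
    out = subprocess.run(
        ["rtrdump", "-connect", RTR_ENDPOINT],      # placeholder flags
        capture_output=True, check=True, timeout=120,
    ).stdout
    return hashlib.sha256(out).hexdigest()

def main() -> int:
    now = time.time()
    digest = dump_digest()
    state = {"digest": None, "changed_at": now}
    if STATE_PATH.exists():
        state = json.loads(STATE_PATH.read_text())
    if digest != state["digest"]:
        state = {"digest": digest, "changed_at": now}
    STATE_PATH.write_text(json.dumps(state))
    stalled_for = int(now - state["changed_at"])
    if stalled_for > MAX_STALL:
        print(f"ALERT: RTR data from {RTR_ENDPOINT} unchanged for {stalled_for}s")
        return 2
    print(f"OK: RTR data last changed {stalled_for}s ago")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```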

Slightly off-topic: do you consider doing garbage collection based on the files not referenced in currently valid manifests (as opposed to rsync --delete/RRDP withdraw)?

Thanks

pcarana commented 3 years ago

I believe this is a better way to monitor validator/RTR server health than relying on validation timestamps from an HTTP API.

I agree, that's a good approach to monitor RTR server health.

Slightly off-topic: do you consider doing garbage collection based on the files not referenced in currently valid manifests (as opposed to rsync --delete/RRDP withdraw)?

Well, I read it a moment ago. It seems like a good suggestion, but I haven't discussed it with the team yet, so we need to analyze it before deciding what to do.
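
To make the suggestion concrete, here is a conceptual sketch of that kind of manifest-driven cleanup (not Fort's actual code; the cache path is a placeholder and the referenced set would come from the validator's manifest processing):

```python
#!/usr/bin/env python3
"""Conceptual sketch of manifest-driven garbage collection for a local RPKI cache.

Given the set of files referenced by currently valid manifests, everything
else under the cache directory is a candidate for deletion. Paths are
placeholders; this is not Fort's implementation.
"""
from pathlib import Path

CACHE_ROOT = Path("/var/cache/fort")  # placeholder

def collect_garbage(referenced: set[Path], dry_run: bool = True) -> list[Path]:
    """Return (and optionally delete) cached files no valid manifest references."""
    victims = []
    for path in CACHE_ROOT.rglob("*"):
        if path.is_file() and path.resolve() not in referenced:
            victims.append(path)
            if not dry_run:
                path.unlink()
    return victims

if __name__ == "__main__":
    # The referenced set would be produced by manifest validation; empty here.
    print(f"{len(collect_garbage(set()))} unreferenced files would be deleted")
```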

ydahhrk commented 9 months ago

Status:

crash bugs in the validation code

As previously mentioned, Fort panics when it detects programming errors. Because the validator and RTR server are part of the same binary, validator errors bring down the RTR server as well.

This has worked this way since the inception of the project.

hangs during RPKI validation (even in rsync) that block the entire validation

There are a few timeouts in place (1, 2, 3, 4, 5), but I still believe the implementation to be naive.

My main concerns right now are adding a timeout to rsync invocations, as well as a timeout to the overall validation. After that, I would like to research whether it's possible to assign timeouts to I/O operations in the cache.

Additional ideas are welcome.
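
For illustration only, here is the general shape of the first idea, sketched in Python rather than Fort's actual C code (the repository URL, destination directory, and timeout value are placeholders): run rsync as a child process with a hard wall-clock bound, and treat a timeout like any other failed fetch.

```python
#!/usr/bin/env python3
"""Sketch of bounding an rsync invocation with a wall-clock timeout.

Illustration of the general pattern only; Fort itself is written in C,
and the URL, destination, and timeout below are placeholders.
"""
import subprocess

RSYNC_TIMEOUT = 300  # seconds; placeholder value

def fetch(url: str, dst: str) -> bool:
    """Run rsync; return False on error or timeout so the caller can skip the repository."""
    try:
        subprocess.run(
            ["rsync", "-rtz", "--delete", url, dst],
            check=True, timeout=RSYNC_TIMEOUT,
        )
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False

if __name__ == "__main__":
    ok = fetch("rsync://rpki.example.net/repository/", "/tmp/fort-cache-example/")
    print("fetch succeeded" if ok else "fetch failed or timed out")
```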

memory allocation failures (failed malloc)

As of 1.6.0, Fort generally panics on memory allocation failures. As you proposed, this is intended to prevent Fort from advertising incomplete information, regardless of what the environment thinks is an adequate response to a failed allocation. All mallocs outside of the asn1 code have already been wrapped.

I still consider this an ongoing effort, however, because of the still pending asn1 review, and also because some of Fort's dependencies sometimes obfuscate error causes. I don't know if there's a solution for the latter, other than ditching the dependency entirely.

I'm also thinking about monitoring (maybe without regex'ing logfile)

Embarrassingly, this is still meant to be addressed through the logs.

A Prometheus endpoint has branched off into issue #50, and I believe it is the problem I will address next. The missing stats server is crippling not only production monitoring, but also profiling during development and testing.


So, in summary... not a whole lot of progress yet. But this is rapidly becoming my leading worry.