NICMx / FORT-validator

RPKI cache validator
MIT License

Output ROA file is not generated if rsync hangs on one of the repositories #21

Closed kronby closed 4 years ago

kronby commented 4 years ago

The output ROA file is not generated if rsync hangs on one of the repositories. See the attachments for details: fort.conf.txt, rsync-hangs.txt

pcarana commented 4 years ago

True, the output file is currently generated only once the validation cycle is done. In this scenario, where rsync hangs (which, by the way, we hadn't noticed until now), the validation cycle never finishes, so there's no output file because FORT still doesn't "know" all the valid prefixes.

Apparently this is an rsync issue, but we should be prepared for it, so we'll keep monitoring it in order to avoid this incorrect behavior.

Just as additional data: I've looked at the repo that was causing the issue, and (as of today) it seems to be misconfigured. That's probably why rsync hung during the process.

pcarana commented 4 years ago

We've been trying to reproduce the issue for the past few days, but no luck so far. Does the issue keep happening? Reading the rsync known issues, the first one might be related to this issue. Just in case, and of course only if you're willing to, could you follow the recommendations listed there and share the results? We'll keep trying to reproduce the error as well.

Our best approach (and for now, our "best guess" too) to avoid the issue is to set an rsync timeout by adding the --timeout argument to the rsync command. This will be set as the default in the upcoming release.

If you wish to give it a try, just add the argument to your /opt/fort/etc/fort.conf file, in the elements rsync.arguments-recursive and rsync.arguments-flat. The rsync element should end up looking something like this:

{
...
  "rsync": {
    "program": "rsync",
    "arguments-recursive": [
      "--recursive",
      "--delete",
      "--times",
      "--contimeout=20",
      "--timeout=20",
      "$REMOTE",
      "$LOCAL"
    ],
    "arguments-flat": [
      "--times",
      "--contimeout=20",
      "--timeout=20",
      "--dirs",
      "$REMOTE",
      "$LOCAL"
    ]
  },
...
}
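
For anyone who would rather script this change than edit the file by hand, here is a minimal Python sketch of the same idea. It is not part of FORT: the helper name `add_rsync_timeouts` is hypothetical, and it assumes the configuration has already been loaded into a dict shaped like the rsync element above. It inserts `--contimeout` and `--timeout` just before `$REMOTE` in both argument lists and skips any flag that is already present.

```python
import copy
import json


def add_rsync_timeouts(config, seconds=20):
    """Return a copy of a FORT-style config dict with rsync timeout flags added.

    Hypothetical helper for illustration; the input dict is left untouched.
    """
    patched = copy.deepcopy(config)
    rsync = patched.get("rsync", {})
    flags = [f"--contimeout={seconds}", f"--timeout={seconds}"]
    for key in ("arguments-recursive", "arguments-flat"):
        args = rsync.get(key)
        if args is None:
            continue  # this argument list is not configured; leave it alone
        for flag in flags:
            name = flag.split("=")[0]
            if any(arg.split("=")[0] == name for arg in args):
                continue  # a value for this flag is already set; keep it
            # Place the flag before "$REMOTE" if present, otherwise append.
            pos = args.index("$REMOTE") if "$REMOTE" in args else len(args)
            args.insert(pos, flag)
    return patched


if __name__ == "__main__":
    example = {
        "rsync": {
            "program": "rsync",
            "arguments-recursive": ["--recursive", "--delete", "--times", "$REMOTE", "$LOCAL"],
            "arguments-flat": ["--times", "--dirs", "$REMOTE", "$LOCAL"],
        }
    }
    print(json.dumps(add_rsync_timeouts(example), indent=2))
```

In the flat list the flags land after `--dirs` rather than before it as in the snippet above; both lists contain the same flags either way.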

kronby commented 4 years ago

The problem disappeared. It seems the remote repository on host rpki.qs.nu has been restored. The output ROA file is now generated correctly. I didn't do anything, I just waited. Sure, if I see the process hanging again, I will capture the process states and share the results. I added the '--timeout' parameter as you described.

bg1#show bgp rpki server summary
Tue Feb 11 14:36:01.506 MSK
Hostname/Address  Transport  State  Time      ROAs (IPv4/IPv6)
x.x.x.162         TCP:8323   ESTAB  00:32:09  110179/18535      <--- RIPE NCC Validator
y.y.y.166         TCP:8323   ESTAB  00:03:23  110165/18530      <--- FORT

pcarana commented 4 years ago

It's good to know that the problem hasn't shown up again and that you have FORT validator doing its job.

Thanks for reporting this issue and for your willingness to help.