NICMx / FORT-validator

RPKI cache validator
MIT License
47 stars 23 forks

FORT 1.5.3 Crashing - ERR: Unknown protocol: 114 #83

InsaneSplash closed 7 months ago

InsaneSplash commented 2 years ago

Hello,

I have noticed that the latest version of FORT, 1.5.3, keeps crashing on a regular basis. We have paired FORT with FRRouting, which is also running the latest version, on Oracle Linux 8.

fort-1.5.3-1.el8.x86_64
frr-8.2.2-02.el8.x86_64

Below is the extract from the log showing the crashed process.

May 16 12:37:03 fort[5745]: ERR: Unknown protocol: 114
May 16 12:37:03 fort[5745]: Stack trace:
May 16 12:37:03 fort[5745]:  /usr/bin/fort(print_stack_trace+0x1f) [0x417e5f]
May 16 12:37:03 fort[5745]:  /usr/bin/fort(pr_crit+0x81) [0x4194e1]
May 16 12:37:03 fort[5745]:  /usr/bin/fort() [0x433d95]
May 16 12:37:03 fort[5745]:  /usr/bin/fort() [0x43168d]
May 16 12:37:03 fort[5745]:  /usr/bin/fort(compute_deltas+0x46) [0x4336c6]
May 16 12:37:03 fort[5745]:  /usr/bin/fort() [0x43440d]
May 16 12:37:03 fort[5745]:  /usr/bin/fort(vrps_update+0x110) [0x434b80]
May 16 12:37:03 fort[5745]:  /usr/bin/fort(validation_run_cycle+0x29) [0x41d729]
May 16 12:37:03 fort[5745]:  /usr/bin/fort(main+0x16c) [0x413e6c]
May 16 12:37:03 fort[5745]: Expand failed !
May 16 12:37:03 fort[5745]:  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f337ec3e493]
May 16 12:37:03 fort[5745]:  /usr/bin/fort(_start+0x2e) [0x413e9e]
May 16 12:37:03 fort[5745]: (End of stack trace)
May 16 12:37:03 systemd[1]: fort.service: Main process exited, code=exited, status=255/n/a
May 16 12:37:03 systemd[1]: fort.service: Failed with result 'exit-code'.

ydahhrk commented 2 years ago

I uploaded a small patch. I don't think it's going to solve the problem, but you might as well try it.

Are you using --output.roa?

If you enable it, do you get a slightly different error message?

Can you please post your fort command, with flags (and configuration file, if applicable) included?

InsaneSplash commented 2 years ago

Hey, sorry for the late reply..... another instance just crashed.

Command Line: /usr/bin/fort --configuration-file /etc/fort/config.json

Config file:

{
        "tal": "/etc/fort/tal",
        "local-repository": "/var/lib/fort/repository",
        "slurm": "/etc/fort/slurm",
        "server": {
                "port": "3323",
                "interval": {
                        "validation": 3600,
                        "refresh": 3600,
                        "retry": 600,
                        "expire": 7200
                }
        },
        "log": {
                "output": "syslog"
        }
}
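
The `refresh`, `retry` and `expire` values above match the recommended RTR timing parameters in RFC 8210, section 6. As a rough sketch only (not part of Fort), the check below compares a config's intervals against the bounds given in that RFC. It assumes that Fort's `server.interval` keys map directly onto the RTR Refresh, Retry and Expire intervals, and it reads the RFC's "larger than either" rule as "larger than both". `check_intervals` is a hypothetical helper name, and the Fort-specific `validation` key is not checked.

```python
import json
from typing import List

# Bounds in seconds, taken from RFC 8210 section 6 (RTR timing parameters).
# Assumption: Fort's "server.interval" keys correspond to these RTR intervals.
LIMITS = {
    "refresh": (1, 86400),
    "retry": (1, 7200),
    "expire": (600, 172800),
}


def check_intervals(config_text: str) -> List[str]:
    """Return a list of problems found; an empty list means none."""
    interval = json.loads(config_text)["server"]["interval"]
    problems = []
    for name, (low, high) in LIMITS.items():
        value = interval[name]
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}..{high}")
    # Assumption: "larger than either" is read here as "larger than both".
    if interval["expire"] <= max(interval["refresh"], interval["retry"]):
        problems.append("expire must exceed both refresh and retry")
    return problems


# The interval block from the config posted above.
posted = """
{
    "server": {
        "port": "3323",
        "interval": {
            "validation": 3600,
            "refresh": 3600,
            "retry": 600,
            "expire": 7200
        }
    }
}
"""
print(check_intervals(posted))  # prints []
```

Under those assumptions the posted intervals pass, which suggests the timing settings are not an obvious cause of the crash.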

InsaneSplash commented 2 years ago

May 27 07:59:16 fort[98190]: /usr/bin/fort[0x417d97]
May 27 07:59:16 fort[98190]: /lib64/libpthread.so.0(+0x12c30)[0x7f6d27f1cc30]
May 27 07:59:16 fort[98190]: /usr/bin/fort(x509_name_put+0x0)[0x427dc0]
May 27 07:59:16 fort[98190]: /usr/bin/fort[0x4143cc]
May 27 07:59:16 fort[98190]: /usr/bin/fort[0x4144ac]
May 27 07:59:16 fort[98190]: /usr/bin/fort(deferstack_pop+0x3b)[0x4146eb]
May 27 07:59:16 fort[98190]: /usr/bin/fort[0x428cc4]
May 27 07:59:16 fort[98190]: /usr/bin/fort[0x4296c9]
May 27 07:59:16 fort[98190]: /usr/bin/fort[0x437307]
May 27 07:59:16 fort[98190]: /lib64/libpthread.so.0(+0x818a)[0x7f6d27f1218a]
May 27 07:59:16 fort[98190]: /lib64/libc.so.6(clone+0x43)[0x7f6d27c41dd3]

InsaneSplash commented 2 years ago

Interestingly, the process prints a stack trace if you give it an unknown option.

May 31 10:16:17 fort[916765]: ERR: Unrecognized option: 63
May 31 10:16:17 fort[916765]: Stack trace:
May 31 10:16:17 fort[916765]:  fort(print_stack_trace+0x1f) [0x417e5f]
May 31 10:16:17 fort[916765]:  fort(__pr_op_err+0x84) [0x418424]
May 31 10:16:17 fort[916765]:  fort(handle_flags_config+0x315) [0x416145]
May 31 10:16:17 fort[916765]:  fort(main+0x66) [0x413d66]
May 31 10:16:17 fort[916765]:  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f59ef759493]
May 31 10:16:17 fort[916765]:  fort(_start+0x2e) [0x413e9e]
May 31 10:16:17 fort[916765]: (End of stack trace)
May 31 10:16:17 fort[916765]: ERR: Try 'fort --usage' or 'fort --help' for more information.
May 31 10:16:17 fort[916765]: Stack trace:
May 31 10:16:17 fort[916765]:  fort(print_stack_trace+0x1f) [0x417e5f]
May 31 10:16:17 fort[916765]:  fort(__pr_op_err+0x84) [0x418424]
May 31 10:16:17 fort[916765]:  fort(handle_flags_config+0x33b) [0x41616b]
May 31 10:16:17 fort[916765]:  fort(main+0x66) [0x413d66]
May 31 10:16:17 fort[916765]:  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f59ef759493]
May 31 10:16:17 fort[916765]:  fort(_start+0x2e) [0x413e9e]
May 31 10:16:17 fort[916765]: (End of stack trace)

kmisak commented 2 years ago

I am also getting this crash regularly, but with "Unknown protocol: 0".

InsaneSplash commented 2 years ago

I've left the service running with no BGP services using it and lost 2 instances this weekend.

Note: don't update librtr to version 8.

ydahhrk commented 2 years ago

Do you have files in the SLURM directory? (/etc/fort/slurm) If so, can I have them? (It's fine if you want to censor IPs)

ydahhrk commented 2 years ago

Ok, it looks like this is going to be a difficult bug.

Is either of you willing to run a custom debug-heavy Fort binary?

kmisak commented 2 years ago

I will do that, no problem

InsaneSplash commented 2 years ago

This is all I have in that file

{
  "slurmVersion": 1,
  "validationOutputFilters": {
    "prefixFilters": [],
    "bgpsecFilters": []
  },
  "locallyAddedAssertions": {
    "prefixAssertions": [],
    "bgpsecAssertions": []
  }
}

ydahhrk commented 2 years ago

Sorry it's taken so long. Debug commit is at branch issue83.

I need the first logging line that contains the string "VRP corrupted!":

Jul 21 21:21:10 ERR [V]: After standalone: VRP corrupted!
Jul 21 21:21:10 ERR [V]: After SLURM: VRP corrupted!

It shouldn't crash anymore, but I'm not entirely sure what side effects the bogus VRP might induce.
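
For anyone collecting the requested line, here is a minimal sketch of how to pull the first matching entry out of saved log lines. It is not part of Fort. `first_corrupted_line` is a hypothetical helper name, and the match is case-insensitive only as a precaution. In practice the lines would come from wherever your syslog output is stored.

```python
import re
from typing import Iterable, Optional

# Case-insensitive as a precaution against capitalization differences.
MARKER = re.compile(r"VRP corrupted!", re.IGNORECASE)


def first_corrupted_line(lines: Iterable[str]) -> Optional[str]:
    """Return the first line containing the marker, or None if there is none."""
    for line in lines:
        if MARKER.search(line):
            return line.rstrip("\n")
    return None


# Sample input: an unrelated line followed by the two example lines from this comment.
sample = [
    "Jul 21 21:21:09 some unrelated log line\n",
    "Jul 21 21:21:10 ERR [V]: After standalone: VRP corrupted!\n",
    "Jul 21 21:21:10 ERR [V]: After SLURM: VRP corrupted!\n",
]
print(first_corrupted_line(sample))
# prints: Jul 21 21:21:10 ERR [V]: After standalone: VRP corrupted!
```

Passing an open file object instead of the `sample` list works the same way, since the function only iterates over lines.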

> This is all I have in that file

Ok thank you. Probably not the problem either.

ydahhrk commented 1 year ago

Have you gotten any "VRP corrupted!" messages yet?

Just to clarify: The issue83 branch contains a patch that prevents Fort from crashing, but does not, in fact, fix the bug.

ydahhrk commented 1 year ago

Didn't mean to close this.

Jhoanor commented 1 year ago

For us, it sometimes crashes after 1 day, sometimes after more than 6 weeks...

(We cannot deploy 1.5.4, though, because that would require an RPM package. But if I read correctly, #83 is not yet resolved in 1.5.4 anyway.)

ydahhrk commented 1 year ago

Ok, I managed to generate the RPMs for 1.5.4 (apparently successfully) and uploaded them here.

(I say "apparently" because CentOS 8's death forced me to migrate to Rocky Linux 8, and I'm not sure if packages generated there will be compatible with other RHELs. Please send feedback.)

In other news, I have so far discovered and fixed at least one undefined behavior during the development of 1.5.5, so the bug might already be fixed in the main branch. For your convenience, I packaged this as rpm-1.5.4.1.tar.gz.

Please install either 1.5.4 or 1.5.4.1, and provide the crash output once it happens. If it never happens, I would also like to know.

rfc1036 commented 1 year ago

Do you mind tagging 1.5.4 (and 1.5.4.1?) in the repository? This way I will be able to update the Debian package.

ydahhrk commented 1 year ago

> Do you mind tagging 1.5.4

What do you mean? It's been tagged since release.

rfc1036 commented 1 year ago

Never mind: I thought that you had released a new version with the more recent changes. I will wait for the next one, unless you think that I should package a snapshot right now.

Jhoanor commented 1 year ago

The RPM 1.5.4-1 package installs fine on RHEL. Thank you. It has now been running for one day and is still up. I'll let you know in a week if it is still running (or earlier in case of a crash).

Jhoanor commented 11 months ago

Well, it looks like it did the trick. No crashes in more than a month. Chapeau and thanks! :)