NICMx / FORT-validator

RPKI cache validator
MIT License
50 stars 24 forks source link

Fort crashed 1.5.4 (master branch) #89

Closed avermeer-tc closed 10 months ago

avermeer-tc commented 1 year ago

We are running the latest fort from the master branch. Today fort crashed with the following stack trace:

Jan 26 04:49:13 rpki01 fort[898617]: ERR: Unknown protocol: 114
Jan 26 04:49:13 rpki01 fort[898617]: Stack trace:
Jan 26 04:49:13 rpki01 fort[898617]:   /usr/bin/fort(print_stack_trace+0x22) [0x55751aaaa0d2]
Jan 26 04:49:13 rpki01 fort[898617]:   /usr/bin/fort(pr_crit+0x82) [0x55751aaab9d2]
Jan 26 04:49:13 rpki01 fort[898617]:   /usr/bin/fort(deltas_add_roa+0x1f3) [0x55751aac6db3]
Jan 26 04:49:13 rpki01 fort[898617]:   /usr/bin/fort(+0x3be9a) [0x55751aac4e9a]
Jan 26 04:49:13 rpki01 fort[898617]:   /usr/bin/fort(compute_deltas+0x46) [0x55751aac66e6]
Jan 26 04:49:13 rpki01 fort[898617]:   /usr/bin/fort(+0x3e44f) [0x55751aac744f]
Jan 26 04:49:13 rpki01 fort[898617]:   /usr/bin/fort(validation_run_cycle+0x2f) [0x55751aaafbaf]
Jan 26 04:49:13 rpki01 fort[898617]:   /usr/bin/fort(main+0x18f) [0x55751aaa642f]
Jan 26 04:49:13 rpki01 fort[898617]:   /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea) [0x7f60a6240d0a]
Jan 26 04:49:13 rpki01 fort[898617]:   /usr/bin/fort(_start+0x2a) [0x55751aaa646a]
Jan 26 04:49:13 rpki01 fort[898617]: (End of stack trace)
Jan 26 04:49:13 rpki01 systemd[1]: fort.service: Main process exited, code=exited, status=255/EXCEPTION
Jan 26 04:49:13 rpki01 systemd[1]: fort.service: Failed with result 'exit-code'.
Jan 26 04:49:13 rpki01 systemd[1]: fort.service: Consumed 1d 14h 14min 24.014s CPU time.

A couple hours later on our different node:

Jan 26 10:26:08 rpki01-la fort[910328]: WRN: Location ‘/etc/fort/slurm/’ doesn’t have files with extension ‘.slurm’
Jan 26 10:26:08 rpki01-la fort[910328]: ERR: Unknown protocol: 0
Jan 26 10:26:08 rpki01-la fort[910328]: Stack trace:
Jan 26 10:26:08 rpki01-la fort[910328]:   /usr/bin/fort(print_stack_trace+0x22) [0x55a40fc200d2]
Jan 26 10:26:08 rpki01-la fort[910328]:   /usr/bin/fort(pr_crit+0x82) [0x55a40fc219d2]
Jan 26 10:26:08 rpki01-la fort[910328]:   /usr/bin/fort(deltas_add_roa+0x1f3) [0x55a40fc3cdb3]
Jan 26 10:26:08 rpki01-la fort[910328]:   /usr/bin/fort(+0x3be9a) [0x55a40fc3ae9a]
Jan 26 10:26:08 rpki01-la fort[910328]:   /usr/bin/fort(compute_deltas+0x46) [0x55a40fc3c6e6]
Jan 26 10:26:08 rpki01-la fort[910328]:   /usr/bin/fort(+0x3e44f) [0x55a40fc3d44f]
Jan 26 10:26:08 rpki01-la fort[910328]:   /usr/bin/fort(validation_run_cycle+0x2f) [0x55a40fc25baf]
Jan 26 10:26:08 rpki01-la fort[910328]:   /usr/bin/fort(main+0x18f) [0x55a40fc1c42f]
Jan 26 10:26:08 rpki01-la fort[910328]:   /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea) [0x7fef1e207d0a]
Jan 26 10:26:08 rpki01-la fort[910328]:   /usr/bin/fort(_start+0x2a) [0x55a40fc1c46a]
Jan 26 10:26:08 rpki01-la fort[910328]: (End of stack trace)
Jan 26 10:26:08 rpki01-la systemd[1]: fort.service: Main process exited, code=exited, status=255/EXCEPTION
Jan 26 10:26:08 rpki01-la systemd[1]: fort.service: Failed with result ‘exit-code’.
Jan 26 10:26:08 rpki01-la systemd[1]: fort.service: Consumed 1d 15h 1min 47.393s CPU time.
ydahhrk commented 1 year ago

Are you, by any chance, using the argument --output.roa?

avermeer-tc commented 1 year ago

I don't believe so:

fort     1027306 11.0  8.5 2253252 343856 ?      Ssl  Jan26 623:54 /usr/bin/fort --configuration-file /etc/fort/config.json
{
    "tal": "/etc/fort/tal",
    "local-repository": "/var/lib/fort",
    "slurm": "/etc/fort/slurm/",
    "server": {
        "port": "323"
    },
    "log": {
        "output": "syslog"
    },
    "thread-pool": {
        "server": {
            "max": 45
        },
        "validation": {
            "max": 5
        }
    }
}
ydahhrk commented 1 year ago

Does this happen reliably?

And if so, may I have access to an environment where it happens? (I know this is a stretch)

avermeer-tc commented 1 year ago

Hi @ydahhrk. It happens sporadically. Do you want me to compile a version and run it from the debug branch issue83? I can't give you access to our production machines, but I am not opposed to running with the extra debug patches.

ydahhrk commented 1 year ago

Do you want me to compile a version and run it from the debug branch issue83?

Sure, it would help. The only reason why I haven't asked is because I'm still trying to come up with ways to improve the output. But even as it currently is, there's a chance it'll improve the situation.

Though at this point I'm desperate enough to merge the debugging messages into the main branch. Put them on a release, even. Considering this, my failed attempt to consummate version 1.5.4 back in December might have turned out to be a blessing in disguise.

Another thing: Is your OS generating core dumps? If you have one from a crash, may I have it?

avermeer-tc commented 1 year ago

I didn't have core dumps enabled at the time of the crashes. I have enabled them now.

ydahhrk commented 1 year ago

Thank you.

FYI, I finally managed to consummate version 1.5.4, 12 hours ago. In case you want to update.

(It contains the debugging messages.)

ydahhrk commented 10 months ago

As stated in the 1.6.0 release notes, I have found and patched several instances of undefined behavior during the reviews. There is no way to prove that these caused this particular crash (particularly considering that I never managed to reproduce it, and the OP already probably left), but the code has changed so much, at this point I expect the bug to manifest in a completely different way, if at all.

Please upgrade to the latest version. If it crashes again, please open a new bug.

avermeer-tc commented 10 months ago

I've been running the patched version since Feb. I have not encountered the issue since then.