NICMx / FORT-validator

RPKI cache validator
MIT License
50 stars 24 forks source link

Fort's validation produces no router keys #58

Open moviedo-sc opened 2 years ago

moviedo-sc commented 2 years ago

Hola, repentinamente el proceso de fort falló en todas las instancias que instalamos:

Oct 11 19:04:11 srv-fort fort[2950289]: CRT: /etc/fort/tal/ripe.tal: Attempted to pop empty X509 stack Oct 11 19:04:11 srv-fort fort[2950289]: Stack trace: Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(print_stack_trace+0x32) [0x55660962b4f2] Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(pr_crit+0x8f) [0x55660962d8af] Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(+0x1b297) [0x556609628297] Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(deferstack_pop+0x3b) [0x55660962847b] Oct 11 19:04:11 srv-fort fort[2950289]: /usr/bin/fort(+0x2da28) [0x55660963aa28] Oct 11 19:04:11 srv-fort fort[2950289]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7ff03fea2609] Oct 11 19:04:11 srv-fort fort[2950289]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff03fdc9293] Oct 11 19:04:11 srv-fort fort[2950289]: (Stack size was 7.) Oct 11 19:04:11 srv-fort fort[2950289]: ERR: rsync://rpki.apnic.net/member_repository/A91DE10F/60EA5C2CB5D311E7B6A2DD5DC4F9AE02/NDyCcTdhxY6CRQ2UqleWffm0bxU.mft: Unknown message digest sha256 Oct 11 19:04:11 srv-fort systemd[1]: fort.service: Main process exited, code=exited, status=255/EXCEPTION Oct 11 19:04:12 srv-fort systemd[1]: fort.service: Failed with result 'exit-code'.

Reiniciamos el servicio y todo funcionó correctamente.

¿Podrían darnos apoyo para identificar el problema que lo causó?

Saludos, Mauricio

juliencorsini commented 2 years ago

+1, getting the same error message since 1 hour ago.

moviedo-sc commented 2 years ago

Como update, el servicio vuelve a caer al poco tiempo de reiniciarlo.

ydahhrk commented 2 years ago

¿Cuál es la versión de Fort?

bilbomx commented 2 years ago

Yo tengo la 1.4.2 y me ocurre lo mismo La instale via rpm en Centos

moviedo-sc commented 2 years ago

Tenemos un mix de versiones. Confirmo que nos pasa con 1.5.1-1 que es el que tenemos más actual, en Ubuntu instalado desde los .deb.

moviedo-sc commented 2 years ago

Aquí el stack trace de la versión 1.5.1-1:

root@srv-fort:/usr/local/bin# systemctl status fort ● fort.service - FORT RPKI validator Loaded: loaded (/lib/systemd/system/fort.service; enabled; vendor preset: enabled) Active: failed (Result: exit-code) since Mon 2021-10-11 20:19:31 UTC; 2min 27s ago Docs: man:fort(8) https://nicmx.github.io/FORT-validator/ Process: 3001527 ExecStart=/usr/bin/fort --configuration-file /etc/fort/config.json (code=exited, status=255/EXCEPTION) Main PID: 3001527 (code=exited, status=255/EXCEPTION)

Oct 11 20:19:31 srv-fort fort[3001527]: /usr/bin/fort(+0x1e745) [0x55a052dda745] Oct 11 20:19:31 srv-fort fort[3001527]: /usr/bin/fort(deferstack_pop+0x37) [0x55a052dda9a7] Oct 11 20:19:31 srv-fort fort[3001527]: /usr/bin/fort(+0x3457a) [0x55a052df057a] Oct 11 20:19:31 srv-fort fort[3001527]: /usr/bin/fort(+0x34f68) [0x55a052df0f68] Oct 11 20:19:31 srv-fort fort[3001527]: /usr/bin/fort(+0x44ea8) [0x55a052e00ea8] Oct 11 20:19:31 srv-fort fort[3001527]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f437cab3609] Oct 11 20:19:31 srv-fort fort[3001527]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f437c9da293] Oct 11 20:19:31 srv-fort fort[3001527]: (End of stack trace) Oct 11 20:19:31 srv-fort systemd[1]: fort.service: Main process exited, code=exited, status=255/EXCEPTION Oct 11 20:19:31 srv-fort systemd[1]: fort.service: Failed with result 'exit-code'.

job commented 2 years ago

I observed the following:

pcx$ doas ./fort --tal /etc/rpki/ripe.tal
Oct 11 20:21:48 INF: Disabling validation logging on syslog.
Oct 11 20:21:48 INF: Disabling validation logging on standard streams.
Oct 11 20:21:48 INF: Console log output configured; disabling operation logging on syslog.
Oct 11 20:21:48 INF: (Operation Logs will be sent to the standard streams only.)
Oct 11 20:21:48 WRN: First validation cycle has begun, wait until the next notification to connect your router(s)
Oct 11 20:25:54 ERR: /etc/rpki/ripe.tal: Attempted to pop empty X509 stack
Stack trace:
  0x3c6ea399078 <print_stack_trace+56> at ./fort
  0x3c6ea39a79c <pr_crit+236> at ./fort
  0x3c6ea3948b1 <deferstack_pop+337> at ./fort
  0x3c6ea3947be <deferstack_pop+94> at ./fort
  0x3c6ea3acb81 <perform_standalone_validation+1985> at ./fort
  0x3c6ea3ac7e6 <perform_standalone_validation+1062> at ./fort
  0x3c6ea3bdc5b <thread_pool_wait+619> at ./fort
  0x3c907d07281 <pthread_create+657> at /usr/lib/libpthread.so.26.1
(End of stack trace)
moviedo-sc commented 2 years ago

CRT: /etc/fort/tal/ripe.tal: Attempted to pop empty X509 stack Sin el archivo TAL de Ripe no se ha caído en 1.5.1-1.

ydahhrk commented 2 years ago

(Español abajo)

This seems to have a simple fix, but it concerns a feature I'm not completely familiar with. The patch I just uploaded seems to yield a successful RIPE traversal, but I'm still investigating whether it's fully correct.

Feedback would be appreciated.


Esto parece tener una corrección muy simple, pero no estoy completamente familiarizado con el código relevante. El parche que acabo de subir parece corregir el problema, pero todavía estoy investigando si es completamente correcto.

¿Podrían bajarlo y ver si se comporta bien?

job commented 2 years ago

@ydahhrk the changeset proposal seems to introduce a memory leak triggering OOM. I'll reach out to you via email to share additional information.

forkwhilefork commented 2 years ago

FWIW I also got this error

Oct 11 14:21:35 core1 fort[7387]: CRT: /etc/fort/tal/ripe.tal: Attempted to pop empty X509 stack Oct 11 14:21:35 core1 fort[7387]: Stack trace: Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(print_stack_trace+0x32) [0x558a924a3e52] Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(pr_crit+0x99) [0x558a924a73d9] Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(+0x1e563) [0x558a924a0563] Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(deferstack_pop+0x2f) [0x558a924a077f] Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(+0x34fda) [0x558a924b6fda] Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(+0x35a11) [0x558a924b7a11] Oct 11 14:21:35 core1 fort[7387]: /usr/bin/fort(+0x452b4) [0x558a924c72b4] Oct 11 14:21:35 core1 fort[7387]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f1b0f98cfa3] Oct 11 14:21:35 core1 fort[7387]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f1b0f8bd4cf] Oct 11 14:21:35 core1 fort[7387]: (Stack size was 9.) Oct 11 14:21:35 core1 fort[7387]: Expand failed ! Oct 11 14:21:35 core1 systemd[1]: fort.service: Main process exited, code=exited, status=255/EXCEPTION Oct 11 14:21:35 core1 systemd[1]: fort.service: Failed with result 'exit-code'.

I'd prefer to wait for a stable bugfix release before updating though.

moviedo-sc commented 2 years ago

Update: the RIPE file with issues was apparently fixed. It works now without modifying any config files. However, it will be important to determine why it made FORT crash.

ydahhrk commented 2 years ago

The crash is already fixed, but as @job pointed out, the patch introduced a memory leak, and I ran into more problems after that. I'll have a proper refactor tomorrow, hopefully.

Sorry for the inconveniences.

ties commented 2 years ago

Update: the RIPE file with issues was apparently fixed. It works now without modifying any config files. However, it will be important to determine why it made FORT crash.

If there was a ripe file with issues - please list which one. We would be happy to investigate if there really was an unexpected object published in one of the repos we control (and follow up if it was a non-hosted repo).

ydahhrk commented 2 years ago

Way I see it, don't worry about it. The "unexpected object" contains a normal feature that was implemented two years ago in Fort, but adoption has apparently been slow (to the point 12 hours ago was seemingly the first time it has appeared in the wild), so the code either broke in the meantime or was never tested with real data.

It's hard to tell, because the developer didn't leave relevant unit tests. Which, of course, also needs to be fixed urgently.

I'm hoping to finish patching the code tomorrow.

ties commented 2 years ago

Way I see it, don't worry about it. The "unexpected object" contains a normal feature that was implemented two years ago in Fort, but adoption has apparently been slow (to the point 12 hours ago was seemingly the first time it has appeared in the wild), so the code either broke in the meantime or was never tested with real data.

There were no bgpsec object out in the wild and real test data was/is not available. My understanding is that only Dragon Research Labs rpkid supports creating these objects. I guess this is also a painful reminder that there is an inherent risk to testing new objects in the wild instead under a live tal instead of on a testbed.

job commented 2 years ago

@ties the problem is not “new objects”. The problem is that software crashes unexpectedly when it shouldn’t be crashing.

You might also recall a very recent issue with lacking input validation in “old objects”, that affected a lot of validators.

ties commented 2 years ago

@job true - it should not crash. RP software crashing in this way because of (valid/non-abusive) inputs is a very scary failure mode and both examples motivate the existence of a diverse ecosystem of RP implementations.

benmaddison commented 2 years ago

@ties the problem is not “new objects”. The problem is that software crashes unexpectedly when it shouldn’t be crashing.

You might also recall a very recent issue with lacking input validation in “old objects”, that affected a lot of validators.

... and that's not to mention the universe of potential garbage that an RP could encounter when processing data retrieved from an untrusted (and therefore potentially hostile) publication point.

Surely it is implicit in the architecture of the RPKI that RPs should not explode if they encounter unknown or invalid objects, let alone objects that conform to the specifications of a standards track RFC?

lukastribus commented 2 years ago

It's a good thing that it does crash; a hung state machine with the RTR sessions up and running would be a lot worse, possibly remaining undetected for a while by a lot of people. The RTR protocol does not have any expire or TTL field for the VRP, routers could make wrong ROV decisions for weeks and months if it wouldn't crash.

Also: a second validator/RTR server with a different software stack can workaround this only if the instance in question crashes. If it just hangs with the RTR sessions up, routers would still use obsolete VRP's until manual intervention.

Manual intervention requires awareness, and this requires active monitoring: https://labs.ripe.net/author/lukas_tribus/rpki-rov-about-stale-rtr-servers-and-how-to-monitor-them/

I hope we continue to see hard crashes when encountering bugs in validators and RTR servers, as opposed to the more nasty possibilities.

ydahhrk commented 2 years ago

Two new branches:

  1. issue58-simple (matches master)
  2. issue58-proper

Get the first one if you just want to be immune to the bug. It cuts out the whole feature, but then again, adoption is pretty much nonexistent, so you're not missing anything.

The second one is the proper fix, but it still needs more testing.

ydahhrk commented 2 years ago

The issue58-simple version of the patch has been released as part of 1.5.2.

issue58-proper is still in the testing phase.