NICMx / FORT-validator

RPKI cache validator
MIT License

fort 1.4.2 and 1.2.1 crashes regularly, after 3-5 days uptime. #46

Closed sbr2004 closed 7 months ago

sbr2004 commented 3 years ago

Segmentation Fault. Stack trace:
/usr/local/bin/fort(print_stack_trace+0x23) [0x55d196279de3]
/usr/local/bin/fort(+0x1ee96) [0x55d196279e96]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x110e0) [0x7f46201f20e0]
/usr/local/bin/fort(+0x34e99) [0x55d19628fe99]
/usr/local/bin/fort(rtrhandler_handle_roa_v4+0x52) [0x55d196290fa2]
/usr/local/bin/fort(handle_roa_v4+0x32) [0x55d1962936f2]
/usr/local/bin/fort(vhandler_handle_roa_v4+0x39) [0x55d19627f2d9]
/usr/local/bin/fort(roa_traverse+0x64f) [0x55d1962876ef]
/usr/local/bin/fort(rpp_traverse+0x38) [0x55d19627daf8]
/usr/local/bin/fort(certificate_traverse+0x9ac) [0x55d196285fec]
/usr/local/bin/fort(+0x2d5e3) [0x55d1962885e3]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x74a4) [0x7f46201e84a4]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f461ff2ad0f]
(Stack size was 13.)

pcarana commented 3 years ago

Hi @sbr2004, thanks for reporting the issue. On which OS is this happening? Also, could you please share the arguments that are set on each execution?

sbr2004 commented 3 years ago

uname -a
Linux fort 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1 (2020-01-20) x86_64 GNU/Linux

/usr/local/bin/fort --tal /var/tal --local-repository /var/fort_cache --server.address X.X.X.X --server.port 323 --log.output=syslog --log.level=info --mode=server

pcarana commented 3 years ago

Thanks for the data, we'll be working on this.

This is just out of curiosity, but the gap between the two versions caught my attention: have you also tested the versions in between 1.2.1 and 1.4.2 (i.e., v1.3.0, v1.4.0, and v1.4.1)? I'm asking mainly to know where we can focus our efforts.

sbr2004 commented 3 years ago

No, I did not test other versions. Which of them in between shall I try?

pcarana commented 3 years ago

Thanks for clarifying that. Well, none in particular; my question was merely to find out whether the issue is also present in those versions.

pcarana commented 3 years ago

Hi again @sbr2004, we've been reviewing this and "luckily" we could replicate the issue a couple of times.

Currently we have a hypothesis related to the stable versions of libcurl and libssl in Debian 9 (the specific dependencies that we recommend installing are libcurl4-openssl-dev and libssl-dev).

FORT validator depends on an OpenSSL version greater than 1.0 (so far this isn't a problem in Debian 9, since it supports libssl-dev 1.1), and in the case of libcurl any reasonably recent version should be enough. In this particular case, libcurl4-openssl-dev depends on libcurl3, which depends on libssl-dev < 1.1.

The Debian package page may help illustrate what's described in the previous paragraph: Debian - Package libcurl4-openssl-dev

Also, we get this warning on Debian 9 after compiling (make):

/usr/bin/ld: warning: libcrypto.so.1.0.2, needed by /usr/lib/gcc/x86_64-linux-gnu/6/../../../x86_64-linux-gnu/libcurl.so, may conflict with libcrypto.so.1.1

Now, here comes the part where we're trying to relate all of this. There's a documented behavior related to using OpenSSL 1.1 and 1.0 at the same time: https://wiki.debian.org/OpenSSL-1.1

So, we're using libssl-dev 1.1 and a libcurl-dev that's linked to libssl-dev 1.0, and we can't yet rule out that this causes a problem. Since yesterday we've been running a couple of instances using an updated version of libcurl4-openssl-dev, and I would like to ask for your help to follow the same procedure to verify whether this solves the issue:

#Add the following line to `/etc/apt/sources.list`:
deb http://deb.debian.org/debian testing main

#Update:
sudo apt-get update

#Install libcurl version from the testing repo:
sudo apt-get -t testing install libcurl4-openssl-dev

#Recompile FORT validator
./configure
make
sudo make install

_NOTE: The warning /usr/bin/ld: warning: libcrypto.so.1.0.2, needed by /usr/lib/gcc/x86_64-linux-gnu/6/../../../x86_64-linux-gnu/libcurl.so, may conflict with libcrypto.so.1.1 shouldn't show up again._

Basically we're installing the libcurl4-openssl-dev dependency from the Debian buster repository, which is more up to date than the version in Debian stretch.

So far, our instances on Debian 9 are still alive, and we'll leave them running to verify whether this "recipe" solves the problem. Please let me know if you can help us with this.
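One possible way to check whether a rebuilt binary still pulls in two OpenSSL versions is to count the distinct libcrypto sonames in its `ldd` output. This is only a sketch: `count_libcrypto_versions` is a hypothetical helper name (not part of FORT), and the binary path is the one from the report above.

```shell
# Count the distinct libcrypto versions named in `ldd` output read from stdin.
# count_libcrypto_versions is a hypothetical helper, not something FORT ships.
count_libcrypto_versions() {
    grep -o 'libcrypto\.so\.[0-9][0-9.]*' | sort -u | wc -l | tr -d ' '
}

# Usage on a real system (binary path assumed from the report above):
#   ldd /usr/local/bin/fort | count_libcrypto_versions
```

A result greater than 1 would suggest that both libcrypto.so.1.0.2 and libcrypto.so.1.1 are still being loaded, which is the situation the linker warning describes.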

sbr2004 commented 3 years ago

Hi,

Thank you for the answer! My systems are Devuan ASCII (Debian without systemd). I'll try to figure out how to compile as you suggested on Devuan.

sbr2004 commented 3 years ago

Unfortunately this solution did not help. fort 1.4.2, compiled as you advised, crashes from time to time.

ydahhrk commented 3 years ago

Ok. I'm working on this now.

ydahhrk commented 3 years ago

Do you have the 1.2.1 stack trace? Has the stack trace been consistent in 1.4.2?

sbr2004 commented 3 years ago

Hi,

No, I don't have a trace for 1.2.1 and can't say whether the traces were consistent. I will watch for them now.

ydahhrk commented 3 years ago

Can you run the new commit?

Assuming we get the same stack trace, it should reveal more information. (Some function names are obscured in the original one.)
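For anyone wanting to resolve the unnamed frames (the `fort(+0x...)` entries) locally, one possible approach is to pull their offsets out of the trace and hand them to `addr2line`. This is a sketch under assumptions: it presumes a position-independent binary built with debug info that exactly matches the crashed build, and `unnamed_frame_offsets` and `trace.txt` are hypothetical names.

```shell
# Print the offsets of fort frames that carry no symbol name, e.g. "fort(+0x34e99)".
# Frames from other libraries (libpthread, libc) are skipped on purpose, since
# their offsets are relative to a different file.
# unnamed_frame_offsets is a hypothetical helper name.
unnamed_frame_offsets() {
    grep -o '/fort(+0x[0-9a-f]*)' | sed 's/.*(+\(0x[0-9a-f]*\))/\1/'
}

# Usage (assumes the trace was saved to trace.txt and the binary has debug info):
#   unnamed_frame_offsets < trace.txt | xargs addr2line -f -e /usr/local/bin/fort
```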

sbr2004 commented 3 years ago

Compiled and launched on one of the servers; I will keep an eye on it.

ydahhrk commented 3 years ago

Thank you.

sbr2004 commented 3 years ago

/usr/local/bin/fort(print_stack_trace+0x1f) [0x5564913dabcf]
/usr/local/bin/fort(pr_enomem+0x18) [0x5564913dd918]
/usr/local/bin/fort(+0x24c89) [0x5564913ddc89]
/usr/local/bin/fort(+0x3d596) [0x5564913f6596]
/usr/local/bin/fort(rtrhandler_handle_roa_v4+0x3c) [0x5564913f729c]
/usr/local/bin/fort(handle_roa_v4+0x32) [0x5564913f98f2]
/usr/local/bin/fort(vhandler_handle_roa_v4+0x39) [0x5564913e2779]
/usr/local/bin/fort(roa_traverse+0x4c4) [0x5564913ec914]
/usr/local/bin/fort(rpp_traverse+0x38) [0x5564913e0e88]
/usr/local/bin/fort(certificate_traverse+0xc65) [0x5564913eb375]
/usr/local/bin/fort(+0x340fb) [0x5564913ed0fb]
/usr/local/bin/fort(+0x34b01) [0x5564913edb01]
/usr/local/bin/fort(+0x438c4) [0x5564913fc8c4]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7fcc9b487fa3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fcc9b3b84cf]
(Stack size was 15.)

ydahhrk commented 3 years ago

Been reviewing all day, but I'm still empty-handed.

I noticed that this last stack trace doesn't include the "Segmentation Fault. Stack trace:" header. Did it really crash?

Also, it seems to point to a memory leak this time. How much RAM does this server have?
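To tell a gradual leak apart from a sudden spike, it might help to log the process's resident memory at regular intervals and see whether it climbs over the 3-5 days before a crash. A minimal Linux-only sketch follows; `rss_kib` and the log file name are hypothetical, and the sampling interval is arbitrary.

```shell
# Print the resident set size (in KiB) of the process with the given PID,
# as reported by the VmRSS field of /proc/<pid>/status (Linux only).
# rss_kib is a hypothetical helper name.
rss_kib() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Usage: sample every 5 minutes (assumes a single fort process and that pidof is installed):
#   while sleep 300; do echo "$(date -u +%FT%TZ) $(rss_kib "$(pidof fort)")"; done >> fort_rss.log
```

A steadily growing series would be consistent with a leak, whereas a flat series followed by a crash would point elsewhere.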

sbr2004 commented 3 years ago

Yes, it crashed (I got a notification from monitoring). It is a VM with 8G RAM.

Tasks: 114 total, 1 running, 113 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.3 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 7978.7 total, 5250.6 free, 225.2 used, 2502.9 buff/cache
MiB Swap: 2109.0 total, 2109.0 free, 0.0 used. 7464.5 avail Mem

ydahhrk commented 3 years ago

Some more questions:

sbr2004 commented 3 years ago

Yes, servers are handling RTR requests from routers (total 3 servers). No new stack traces, servers are running fine at the moment.

sbr2004 commented 2 years ago

2 fort servers crashed today within 2 minutes. Restarted them and they crashed again:

server #1: fort 1.5.0

Stack trace:
/usr/local/bin/fort(print_stack_trace+0x1f) [0x56476fb0fbcf]
/usr/local/bin/fort(pr_crit+0x89) [0x56476fb12d19]
/usr/local/bin/fort(+0x1e433) [0x56476fb0c433]
/usr/local/bin/fort(deferstack_pop+0x2f) [0x56476fb0c64f]
/usr/local/bin/fort(+0x3411a) [0x56476fb2211a]
/usr/local/bin/fort(+0x34b01) [0x56476fb22b01]
/usr/local/bin/fort(+0x438c4) [0x56476fb318c4]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7ff8a7de6fa3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7ff8a7d174cf]
(Stack size was 9.)

server #2: fort 1.2.1

Stack trace:
/usr/local/bin/fort(print_stack_trace+0x1a) [0x415c6a]
/usr/local/bin/fort(pr_crit+0x7f) [0x417b7f]
/usr/local/bin/fort() [0x412c25]
/usr/local/bin/fort(deferstack_pop+0x3b) [0x412e0b]
/usr/local/bin/fort() [0x423b70]
/lib64/libpthread.so.0(+0x7ea5) [0x7f080dca1ea5]
/lib64/libc.so.6(clone+0x6d) [0x7f080d9ca8dd]
(Stack size was 7.)

ydahhrk commented 2 years ago

It's unrelated. See #58 and #59

sbr2004 commented 2 years ago

Today:

1.5.0

Stack trace:
/usr/local/bin/fort(print_stack_trace+0x1f) [0x559142d8ebcf]
/usr/local/bin/fort(pr_enomem+0x18) [0x559142d91918]
/usr/local/bin/fort(+0x24c89) [0x559142d91c89]
/usr/local/bin/fort(+0x3d596) [0x559142daa596]
/usr/local/bin/fort(rtrhandler_handle_roa_v4+0x3c) [0x559142dab29c]
/usr/local/bin/fort(handle_roa_v4+0x32) [0x559142dad8f2]
/usr/local/bin/fort(vhandler_handle_roa_v4+0x39) [0x559142d96779]
/usr/local/bin/fort(roa_traverse+0x4c4) [0x559142da0914]
/usr/local/bin/fort(rpp_traverse+0x38) [0x559142d94e88]
/usr/local/bin/fort(certificate_traverse+0xc65) [0x559142d9f375]
/usr/local/bin/fort(+0x340fb) [0x559142da10fb]
/usr/local/bin/fort(+0x34b01) [0x559142da1b01]
/usr/local/bin/fort(+0x438c4) [0x559142db08c4]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f607cd81fa3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f607ccb24cf]
(Stack size was 15.)

ydahhrk commented 2 years ago

It really is just an out of memory error. But whether there's a memory leak...

Ok, before I throw myself into a bunch of lengthy tests, I need to put this out there:

Can you please upgrade to the latest master? As in, not even to version 1.5.1. Please upgrade to the absolute latest commit.

There have been several critical bugfixes since 1.5.0, to the point I wouldn't even consider it stable anymore.

ydahhrk commented 7 months ago

As stated in the 1.6.0 release notes, I have found and patched several instances of undefined behavior during the reviews. There is no way to prove that these caused this particular crash (particularly considering that I never managed to reproduce it, and the OP has probably already left), but the code has changed so much that, at this point, I expect the bug to manifest in a completely different way, if at all.

If you're still there, please upgrade to the latest version. If it crashes again, please open a new bug.