florincoras / envoy-vpp

Apache License 2.0

Envoy crash #5

Open ghost opened 3 years ago

ghost commented 3 years ago

Hi, I've got a problem with the latest version of envoy-vpp. The Envoy instance crashes after a few requests. I built everything in debug mode to check what's going on, and with that build I get an assertion failure in the vpp library after the first request from a client:

/home/vagrant/envoy-vpp/vpp/src/vcl/vppcom.c:2046 (vppcom_session_free_segments) assertion 's->rx_bytes_pending < n_bytes' fails

Full logs from the Envoy instance are attached: envoy.log

florincoras commented 3 years ago

I suspect this might be related to the fact that sessions are duplicated. Are you running with port reuse enabled?

On a separate note, I recently found out that the listener code in upstream envoy has been considerably refactored and the new approach is not compatible with envoy-vpp. I'm working on a fix.

ghost commented 3 years ago

Do you mean reuse_port in the Envoy listener settings? As far as I can tell, this option is supposed to be true by default. Unfortunately, setting it manually doesn't change anything.

florincoras commented 3 years ago

Yes, that's the config I was thinking about. I believe it's enabled by default on main now, but it might not have been some months ago.

Either way, I've rebased the code onto latest envoy and have pulled in a more recent version of vpp. Could you try the latest patch and see if you're still hitting the issue? Everything seems to be working fine in my simple "proxy to backend" test (HttpConnectionManager). But that's not to say the zero-copy rx code has been tested / is compatible with all filters.

ghost commented 3 years ago

Yeah, I've only tried it with a simple HTTP server. Everything seems to be working fine now if I compile it as a release build (-c opt). In a debug build, I still hit the same assertion. envoy.log

florincoras commented 3 years ago

Thanks for checking! For whatever reason, I'm not able to reproduce the error. Maybe it's because I'm using wrk as a client, but I did try both release and debug images for vcl. So we can do two things:

  1. To fix your problem, can you try the latest patch? It makes rx zero copy configurable and defaults it to off.
  2. If you have the time, run envoy from gdb and check what s->rx_bytes_pending and n_bytes are when vcl asserts. I'd like to default rx zero copy to on because it does seem to yield some benefit (maybe around 4%), but these types of issues should be ironed out first.
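
Step 2 could look roughly like the sketch below. It is not taken from the envoy-vpp repo: the helper name `vcl_assert_values` and the `./envoy -c envoy.yaml` invocation are placeholders. It also assumes that the failed assertion aborts the process so that gdb stops at that point, and that the installed gdb is recent enough to support `frame function`.

```shell
# Hypothetical helper: run a program under gdb in batch mode and, once it
# stops (e.g. on the abort raised by a failed assertion), print a backtrace
# and the two values compared in the vppcom_session_free_segments assertion.
vcl_assert_values() {
  gdb -batch \
    -ex 'run' \
    -ex 'bt' \
    -ex 'frame function vppcom_session_free_segments' \
    -ex 'print s->rx_bytes_pending' \
    -ex 'print n_bytes' \
    --args "$@"
}
```

Usage would be something like `vcl_assert_values ./envoy -c envoy.yaml`, with the binary path and config file replaced by your own. If `frame function` is not available, running gdb interactively and picking the frame by number from the `bt` output should give the same values.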

ghost commented 3 years ago

You are right. I think I hadn't cleaned up my environment. I've completely rebuilt VPP and Envoy, and everything seems to be working fine now. Thanks for your help. One more question: do you know where I could find good resources on using the VCL interface?

florincoras commented 3 years ago

Did you pull in the latest patch? If not, then I can switch rx zero copy back on :-)

As for VCL docs, the best resource at this point is this. If you hit any issues send an email to vpp-dev@fd.io (or to me).

ghost commented 3 years ago

I pulled the latest patch and built after that. I can check with a clean build what happens if I set VCL_RX_ZC to 1.