aws / aws-app-mesh-roadmap

AWS App Mesh is a service mesh that you can use with your microservices to manage service-to-service communication
Apache License 2.0

Bug: wasm log aws.appmesh.ingress_http_stats and aws.appmesh.egress_http_stats panicked #484

Closed tnsardesai closed 7 months ago

tnsardesai commented 8 months ago

Summary: What are you observing that doesn't seem right?

Seeing a panic:

[2024-01-03 23:46:27.247][92][critical][wasm] [source/extensions/common/wasm/context.cc:1157] wasm log aws.appmesh.ingress_http_stats: panicked at /codebuild/output/src3353/src/s3/00/wasm/cargo-project/vendor/proxy-wasm/src/hostcalls.rs:1192:42:
called `Result::unwrap()` on an `Err` value: FromUtf8Error { bytes: [49, 32, 192, 167, 192, 162, 37, 50, 53, 50, 55, 37, 50, 53, 50, 50, 44, 32, 49, 53, 52, 46, 51, 56, 46, 49, 55, 50, 46, 50, 52, 51, 44, 32, 49, 53, 46, 49, 53, 56, 46, 52, 55, 46, 49, 52, 49], error: Utf8Error { valid_up_to: 2, error_len: Some(1) } }
[2024-01-03 23:46:27.247][92][error][wasm] [source/extensions/common/wasm/wasm_vm.cc:38] Function: proxy_on_request_headers failed: Uncaught RuntimeError: unreachable
Proxy-Wasm plugin in-VM backtrace:
0: 0x385ad - rust_panic
1: 0x37842 - std::panicking::rust_panic_with_hook::h3aa054d35a0817d7
2: 0x38a56 - std::panicking::begin_panic_handler::_$u7b$$u7b$closure$u7d$$u7d$::h2f73e4cf6cd6319a
3: 0x3898f - std::sys_common::backtrace::__rust_end_short_backtrace::h98ac61a6abbff7e9
4: 0x24033 - rust_begin_unwind
5: 0x24d1 - core::panicking::panic_fmt::h3e1dd3d08288569e
6: 0x425f - core::result::unwrap_failed::h8b3db0f11171b57b
7: 0x103f3 - proxy_wasm::hostcalls::get_map::hfc84c18de12c4fd4
8: 0xff0e - _$LT$amzn_appmesh_wasm..reader..host..Host$u20$as$u20$amzn_appmesh_wasm..reader..HttpMapReader$GT$::get_request_headers::h60ed0437583308a0
9: 0xd219 - _$LT$amzn_appmesh_wasm..plugin..context..http..HttpFilterContext$u20$as$u20$proxy_wasm..traits..HttpContext$GT$::on_http_request_headers::h706d8f144b9ef225

I also see a similar log for [93][critical][wasm] [source/extensions/common/wasm/context.cc:1157] wasm log aws.appmesh.egress_http_stats: panicked at /codebuild/output/src3353/src/s3/00/wasm/cargo-project/vendor/proxy-wasm/src/hostcalls.rs:1192:42:

Steps to Reproduce: What are the steps you can take to reproduce this issue?

No idea, but I'm pretty sure this happens only under high load. Looking at our logs, I am seeing this error only on services that receive a high amount of load.

Are you currently working around this issue? How are you currently solving this problem?

Even after seeing this log, the container keeps functioning and responding to requests, so I am not sure what the impact of this panic is. We are also seeing that Envoy metrics are not being reported, which might be the impact.

Additional context: Anything else we should know?

This started after we updated our sidecar to 1.27.2.0.
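
A rough illustration of the failure mode, based on the backtrace above: the stats filter reads the request headers through the SDK's get_map, which converts the raw header bytes to Strings and unwraps the result. The sketch below is simplified (it is not the actual App Mesh filter or SDK source, and the header name is hypothetical), but it panics the same way:

fn on_request_headers(raw_headers: Vec<(Vec<u8>, Vec<u8>)>) {
  for (name, value) in raw_headers {
    // An unchecked unwrap like this is what turns a single non-UTF-8
    // header value into a panic that aborts the whole Wasm callback.
    let _name = String::from_utf8(name).unwrap();
    let _value = String::from_utf8(value).unwrap();
  }
}

fn main() {
  // Header value taken from the panic message above: "1 " followed by 0xC0 0xA7 0xC0 0xA2.
  // The header name "x-forwarded-for" is only a guess for illustration.
  let headers = vec![(b"x-forwarded-for".to_vec(), vec![49, 32, 192, 167, 192, 162])];
  on_request_headers(headers); // panics with FromUtf8Error { valid_up_to: 2, error_len: Some(1) }
}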

tnsardesai commented 8 months ago

Saw a slightly different error today:

[2024-01-09 16:07:37.590][91][critical][wasm] [source/extensions/common/wasm/context.cc:1157] wasm log aws.appmesh.egress_http_stats: panicked at amzn-appmesh-aggregate-stats/src/http/egress.rs:302:29:
called `Option::unwrap()` on a `None` value
[2024-01-09 16:07:37.591][91][error][wasm] [source/extensions/common/wasm/wasm_vm.cc:38] Function: proxy_on_done failed: Uncaught RuntimeError: unreachable
Proxy-Wasm plugin in-VM backtrace:
0:  0x385ad - rust_panic
1:  0x37842 - std::panicking::rust_panic_with_hook::h3aa054d35a0817d7
2:  0x38a24 - std::panicking::begin_panic_handler::_$u7b$$u7b$closure$u7d$$u7d$::h2f73e4cf6cd6319a
3:  0x3898f - std::sys_common::backtrace::__rust_end_short_backtrace::h98ac61a6abbff7e9
4:  0x24033 - rust_begin_unwind
5:  0x24d1 - core::panicking::panic_fmt::h3e1dd3d08288569e
6:  0xced1 - core::panicking::panic::h0f0c05b20da93dd7
7:  0x1179d - _$LT$aws_appmesh_aggregate_stats..http..egress..EgressHttpStatsFilter$u20$as$u20$amzn_appmesh_wasm..plugin..Filter$GT$::on_done::h8b7afc1a5cddb214
8:  0xcfa5 - _$LT$amzn_appmesh_wasm..plugin..context..http..HttpFilterContext$u20$as$u20$proxy_wasm..traits..Context$GT$::on_done::h61f448a25b9c1222
9:  0x2be71 - proxy_on_done

I rolled back a service and a gateway to v1.25.1.0. This is not happening on that version of Envoy.

ytsssun commented 8 months ago

Hi @tnsardesai,

Thanks for the bug report. We are looking into it. To help us better understand the issue, could you share a bit more information about your mesh setup?

You mentioned this is only observed under high load; do you by any chance have a snapshot of the Envoy stats (since Envoy crashed, you might not be able to retrieve it)?

Have you tried any other versions of the Envoy image, for example v1.25.3.0 and onward?

Lastly, if you have Premium Support, feel free to engage them to cut us a ticket.

ytsssun commented 8 months ago

The crash is from within proxy-wasm-rust-sdk. The returned bytes are not valid UTF-8:

bytes: [49, 32, 192, 167, 192, 162, 37, 50, 53, 50, 55, 37, 50, 53, 50, 50, 44, 32, 49, 53, 52, 46, 51, 56, 46, 49, 55, 50, 46, 50, 52, 51, 44, 32, 49, 53, 46, 49, 53, 56, 46, 52, 55, 46, 49, 52, 49]

This is easily reproducible by running the following code; see the playground:

#![allow(unused)]
fn main() {
  use std::str;

  // The problematic header bytes from the panic message above.
  let problematic_bytes = vec![
    49, 32, 192, 167, 192, 162, 37, 50, 53, 50, 55, 37, 50, 53, 50, 50, 44, 32,
    49, 53, 52, 46, 51, 56, 46, 49, 55, 50, 46, 50, 52, 51, 44, 32, 49, 53,
    46, 49, 53, 56, 46, 52, 55, 46, 49, 52, 49,
  ];

  // Panics with Utf8Error { valid_up_to: 2, error_len: Some(1) }, matching the log.
  let result = str::from_utf8(&problematic_bytes).unwrap();
}
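
That unwrap is what aborts the Wasm VM. A minimal sketch of a non-panicking alternative, assuming the goal is simply to tolerate invalid UTF-8 header bytes (not necessarily how the SDK will fix it), is to decode lossily:

fn main() {
  // A short prefix of the problematic bytes above: "1 " followed by 0xC0 0xA7 0xC0 0xA2 and "%2527".
  let problematic_bytes: Vec<u8> = vec![49, 32, 192, 167, 192, 162, 37, 50, 53, 50, 55];

  // Invalid sequences become U+FFFD instead of panicking; prints "1 ????%2527".
  let value = String::from_utf8_lossy(&problematic_bytes);
  println!("{}", value);
}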

I submitted a bug report to the proxy-wasm Rust SDK: https://github.com/proxy-wasm/proxy-wasm-rust-sdk/issues/217.

In the meantime, I wonder whether a breaking change was introduced in Envoy that would produce non-UTF-8 bytes.

tnsardesai commented 8 months ago

Thanks for your response. I previously thought it was due to high load, but looking at the data over 7 days, it is happening at all times of the day. I believe this is happening on specific bot or attacker requests with bad headers. All of these errors are on our externally exposed virtual gateways and applications; we are not seeing this in any internal applications.

I have rolled back all of our applications and gateways that see this error to v1.25.1.0. I could try upgrading some of our applications next week to help pinpoint which version broke this, if that will aid your investigation.

I have already engaged AWS Support; the case ID is 170303481101460 for your reference.

Can you also take a look at the error in https://github.com/aws/aws-app-mesh-roadmap/issues/484#issuecomment-1883590049? It has a different stack trace.

I also saw this other log today, which is different from the other two:

[2024-01-11 22:24:28.980][91][critical][wasm] [source/extensions/common/wasm/context.cc:1157] wasm log aws.appmesh.egress_http_stats: panicked at 'called `Option::unwrap()` on a `None` value', amzn-appmesh-aggregate-stats/src/http/egress.rs:172:9
[2024-01-11 22:24:28.980][91][error][wasm] [source/extensions/common/wasm/wasm_vm.cc:38] Function: proxy_on_done failed: Uncaught RuntimeError: unreachable
Proxy-Wasm plugin in-VM backtrace:
0: 0x2db69 - rust_panic
1: 0x2c9a3 - std::panicking::rust_panic_with_hook::hbafe3e603d331223
2: 0x2e0f1 - std::panicking::begin_panic_handler::_$u7b$$u7b$closure$u7d$$u7d$::h8ab6ee68d5b4c391
3: 0x2e059 - std::sys_common::backtrace::__rust_end_short_backtrace::h008f69666d134159
4: 0x1f5cc - rust_begin_unwind
5: 0x40ed - core::panicking::panic_fmt::h1d17fc068f528130
6: 0xc34f - core::panicking::panic::h27b5eefa3e4ff738
7: 0x1802e - _$LT$aws_appmesh_aggregate_stats..http..egress..EgressHttpStatsFilter$u20$as$u20$amzn_appmesh_wasm..plugin..Filter$GT$::on_done::h90cbcd72c8e2cb58
8: 0xc3f3 - _$LT$amzn_appmesh_wasm..plugin..context..http..HttpFilterContext$u20$as$u20$proxy_wasm..traits..Context$GT$::on_done::h30819f4e590de618
9: 0x26138 - proxy_on_done

> The crash is from within proxy-wasm-rust-sdk. The returned bytes are not valid UTF-8. In the meantime, I wonder whether a breaking change was introduced in Envoy that would produce non-UTF-8 bytes.

We have noticed an increase in non-UTF-8 bytes in our access logs. For example, right around the time of this stack trace I see this access log:

[2024-01-11T22:24:28.968Z] "GET /oauth/lookup-ip HTTP/1.1" 400 - 0 0 17 17 "����%2527%2522\'\", 45.115.27.18, 70.132.38.113,127.0.0.1" "Mozilla/5.0 (Linux; U; Android 4.0.3; en-us; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30" "09c51df3-6b4d-9e15-8bf6-472c96e8b355" "production--oauth--833ed5a3.prod.mesh" "127.0.0.1:80"
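
For comparison, lossily decoding the byte list from the FromUtf8Error in the first panic produces a forwarded-for-style value with the same garbled prefix as this access log (a quick sketch; linking the two is an assumption, since the panic and this log line come from different requests):

fn main() {
  // Byte list copied verbatim from the FromUtf8Error in the original panic message.
  let bytes: Vec<u8> = vec![
    49, 32, 192, 167, 192, 162, 37, 50, 53, 50, 55, 37, 50, 53, 50, 50, 44, 32,
    49, 53, 52, 46, 51, 56, 46, 49, 55, 50, 46, 50, 52, 51, 44, 32, 49, 53,
    46, 49, 53, 56, 46, 52, 55, 46, 49, 52, 49,
  ];

  // Prints "1 ????%2527%2522, 154.38.172.243, 15.158.47.141".
  println!("{}", String::from_utf8_lossy(&bytes));
}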

PiotrSikora commented 8 months ago

> [2024-01-09 16:07:37.590][91][critical][wasm] [source/extensions/common/wasm/context.cc:1157] wasm log aws.appmesh.egress_http_stats: panicked at amzn-appmesh-aggregate-stats/src/http/egress.rs:302:29:
> called `Option::unwrap()` on a `None` value
> [2024-01-09 16:07:37.591][91][error][wasm] [source/extensions/common/wasm/wasm_vm.cc:38] Function: proxy_on_done failed: Uncaught RuntimeError: unreachable
> [...]

@tnsardesai this is different from the UTF-8 bug in the initial comment, and it looks like a bug in amzn-appmesh-aggregate-stats, so it probably makes sense to open a separate issue for that.

tnsardesai commented 8 months ago

I created #485 for the other bug! Over the weekend I saw that bug even on v1.25.1.0, so it is definitely different, since I haven't seen the bug in this issue on v1.25.1.0 over the same period.

This week I plan on upgrading Envoy versions to help pinpoint the exact version of Envoy where things broke!

tnsardesai commented 8 months ago

Hey, I upgraded a service from public.ecr.aws/appmesh/aws-appmesh-envoy:v1.25.1.0-prod to public.ecr.aws/appmesh/aws-appmesh-envoy:v1.25.3.0-prod and got this error in that service today. Other services on v1.25.1.0 are not seeing this error, so v1.25.3.0 must be the first version with this error:

[2024-01-17 01:35:05.995][77][critical][wasm] [source/extensions/common/wasm/context.cc:1157] wasm log aws.appmesh.ingress_http_stats: panicked at 'called `Result::unwrap()` on an `Err` value: FromUtf8Error { bytes: [97, 116, 116, 97, 99, 104, 109, 101, 110, 116, 59, 32, 102, 105, 108, 101, 110, 97, 109, 101, 61, 34, 83, 237, 108, 97, 98, 97, 115, 32, 84, 243, 110, 105, 99, 97, 115, 32, 121, 32, 97, 116, 243, 110, 97, 115, 32, 49, 54, 32, 101, 110, 101, 114, 111, 32, 50, 48, 50, 52, 46, 112, 110, 103, 34], error: Utf8Error { valid_up_to: 23, error_len: Some(1) } }', /codebuild/output/src573/src/s3/00/wasm/cargo-project/vendor/proxy-wasm/src/hostcalls.rs:1192:42
[2024-01-17 01:35:05.995][77][error][wasm] [source/extensions/common/wasm/wasm_vm.cc:38] Function: proxy_on_response_headers failed: Uncaught RuntimeError: unreachable
Proxy-Wasm plugin in-VM backtrace:
0: 0x37468 - rust_panic
1: 0x3652d - std::panicking::rust_panic_with_hook::ha23c8ba67d558e34
2: 0x3794b - std::panicking::begin_panic_handler::_$u7b$$u7b$closure$u7d$$u7d$::h17240ff11865f8cf
3: 0x37881 - std::sys_common::backtrace::__rust_end_short_backtrace::h4c1464e549336429
4: 0x22abb - rust_begin_unwind
5: 0x24b3 - core::panicking::panic_fmt::h573ffd15dee74109
6: 0x419f - core::result::unwrap_failed::h26e3bef5a756b770
7: 0x11077 - proxy_wasm::hostcalls::get_map::h042a2451be115839
8: 0x11111 - _$LT$amzn_appmesh_wasm..reader..host..Host$u20$as$u20$amzn_appmesh_wasm..reader..HttpMapReader$GT$::get_response_headers::h7c3e1959a20473cb
9: 0xcb7f - _$LT$amzn_appmesh_wasm..plugin..context..http..HttpFilterContext$u20$as$u20$proxy_wasm..traits..HttpContext$GT$::on_http_response_headers::hb63fc0cb769db28f
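
For what it's worth, the bytes in this FromUtf8Error spell out a Content-Disposition-style response header whose filename appears to be Latin-1 encoded rather than UTF-8 (reading it as Latin-1 is an assumption based on the byte values); a quick sketch to inspect them:

fn main() {
  // Byte list copied verbatim from the FromUtf8Error above.
  let bytes: Vec<u8> = vec![
    97, 116, 116, 97, 99, 104, 109, 101, 110, 116, 59, 32, 102, 105, 108, 101,
    110, 97, 109, 101, 61, 34, 83, 237, 108, 97, 98, 97, 115, 32, 84, 243,
    110, 105, 99, 97, 115, 32, 121, 32, 97, 116, 243, 110, 97, 115, 32, 49,
    54, 32, 101, 110, 101, 114, 111, 32, 50, 48, 50, 52, 46, 112, 110, 103, 34,
  ];

  // Lossy UTF-8 decode keeps the ASCII and replaces the invalid bytes:
  // attachment; filename="S?labas T?nicas y at?nas 16 enero 2024.png"
  println!("{}", String::from_utf8_lossy(&bytes));

  // Treating each byte as a Latin-1 code point recovers the accented characters:
  // attachment; filename="Sílabas Tónicas y atónas 16 enero 2024.png"
  let latin1: String = bytes.iter().map(|&b| b as char).collect();
  println!("{}", latin1);
}
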
ytsssun commented 7 months ago

Hi @tnsardesai, just so you know, we are working on an Envoy patch release. It will contain the fix for this issue and https://github.com/aws/aws-app-mesh-roadmap/issues/485. I cannot provide a specific date, but it will be soon.

liubnu commented 7 months ago

This issue is fixed by Envoy release v1.27.3.0 (#486).