Closed axot closed 7 months ago
I have built a custom image with this change for testing:
354290006986.dkr.ecr.us-west-2.amazonaws.com/aws-appmesh-envoy:v1.27.3.0-raise-nofile-soft-limit
Let me know if you're able to test using this. I'll do some testing on my end as well.
Thanks for building the image. I've tested it in my environment and confirmed that the issue is resolved. Details are below.
original image: 840364872350.dkr.ecr.us-west-2.amazonaws.com/aws-appmesh-envoy:v1.27.3.0-prod
[2024-03-28 01:19:11.251][73][critical][assert] [source/common/network/socket_interface_impl.cc:72] assert failure: SOCKET_VALID(result.return_value_). Details: socket(2) failed, got error: Too many open files
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:104] Caught Aborted, suspect faulting address 0x53900000028
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:92] Envoy version: ea4876d073af5a66f4ec971a64a154a8bf79ad1c/1.27.3-appmesh.0/Modified/RELEASE/BoringSSL
[symbolize_elf.inc : 1010] RAW: /proc/self/task/40/maps: errno=24
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #0: [0x7f1e637e48e0]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #1: [0x56185abadc62]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #2: [0x56185aa122ab]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #3: [0x56185aa0f06b]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #4: [0x56185aa0786f]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #5: [0x56185aa00670]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #6: [0x56185a6fa679]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #7: [0x56185a6fa115]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #8: [0x56185a6fa90d]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #9: [0x56185a6de69d]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #10: [0x56185a6de537]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #11: [0x56185a6df396]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #12: [0x56185a6df6d7]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #13: [0x56185a6ef615]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #14: [0x56185a6f1cc9]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #15: [0x56185a6e3cfe]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #16: [0x56185a8bbea9]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #17: [0x56185a8d3849]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #18: [0x56185a8c2331]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #19: [0x56185a92b873]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #20: [0x56185a83bd44]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #21: [0x56185a862cc4]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #22: [0x56185a8600ed]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #23: [0x56185a85fca9]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #24: [0x56185aebc139]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #25: [0x56185a85dafc]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #26: [0x56185a85d4a8]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #27: [0x56185a862aa4]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #28: [0x56185a836a31]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #29: [0x56185aa4efc6]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #30: [0x56185aa0dcff]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #31: [0x56185aa0bdfc]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #32: [0x56185aa02685]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #33: [0x56185aa03676]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #34: [0x56185aec587b]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #35: [0x56185aec45c0]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #36: [0x56185a427d4a]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #37: [0x56185aecb9e9]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #38: [0x7f1e637da44b]
ActiveStream 0x167e7d28c700, stream_id_: 11189730723575927948&filter_manager_:
FilterManager 0x167e7d28c7a8, state_.has_1xx_headers_: 0
filter_manager_callbacks_.requestHeaders():
':authority', 'a88c9a4bcbb8a4eda8e22bfc38dcc718-755162509.ap-northeast-1.elb.amazonaws.com'
':path', '/'
':method', 'GET'
':scheme', 'http'
'x-forwarded-proto', 'http'
'x-request-id', 'e65f3505-6285-924d-81a6-38ec11c23588'
'x-envoy-expected-rq-timeout-ms', '15000'
'x-amzn-trace-id', 'Root=1-6604c58f-7392989366a9460da03a84d2;Parent=f1d3f3c0aea707c4;Sampled=0'
filter_manager_callbacks_.requestTrailers(): null
filter_manager_callbacks_.responseHeaders(): null
filter_manager_callbacks_.responseTrailers(): null
&streamInfo():
StreamInfoImpl 0x167e7d28c8d8, protocol_: 1, response_code_: null, response_code_details_: null, attempt_count_: 1, health_check_request_: 0, route_name_: upstream_info_:
UpstreamInfoImpl 0x167e7d28a160, upstream_connection_id_: null
OverridableRemoteConnectionInfoSetterStreamInfo 0x167e7d28c8d8, remoteAddress(): 10.0.10.144:50521, directRemoteAddress(): 10.0.10.144:50521, localAddress(): 10.0.10.231:80
Http1::ConnectionImpl 0x167e7d28c008, dispatching_: 1, dispatching_slice_already_drained_: 0, reset_stream_called_: 0, handling_upgrade_: 0, deferred_end_stream_headers_: 1, processing_trailers_: 0, buffered_body_.length(): 0, header_parsing_state_: Done, current_header_field_: , current_header_value_:
active_request_:
, request_url_: null, response_encoder_.local_end_stream_: 0
absl::get<RequestHeaderMapPtr>(headers_or_trailers_): null
current_dispatching_buffer_ front_slice length: 101 contents: "GET [2024-03-28 01:19:11.251][102][critical][assert] [source/common/network/socket_interface_impl.cc:72] assert failure: SOCKET_VALID(result.return_value_). Details: socket(2) failed, got error: Too many open files
/ HTTP/1.1\r\nHost: a88c9a4bcbb8a4eda8e22bfc38dcc718-755162509.ap-northeast-1.elb.amazonaws.com\r\n\r\n"
ConnectionImpl 0x167e7d6ad900, connecting_: 0, bind_error_: 0, state(): Open, read_buffer_limit_: 1048576
socket_:
ListenSocketImpl 0x167e7daad400, transport_protocol_: raw_buffer
connection_info_provider_:
ConnectionInfoSetterImpl 0x167e7e8a6c40[2024-03-28 01:19:11.336][1][warning] [AppNet Agent] [Envoy process 40] Exited with code [-1]
[2024-03-28 01:19:11.336][1][warning] [AppNet Agent] [Envoy process 40] Additional Exit data: [Core Dump: false][Normal Exit: false][Process Signalled: true]
with this image: 354290006986.dkr.ecr.us-west-2.amazonaws.com/aws-appmesh-envoy:v1.27.3.0-raise-nofile-soft-limit
# no error log
# pid 27 is envoy
$ keti yelb-ui-766d495c5c-fhkpg -c envoy -- cat /proc/27/cmdline
/usr/bin/envoy-c/tmp/envoy-config-990344385.yaml-linfo--drain-time-s20--disable-hot-restart
$ keti yelb-ui-766d495c5c-fhkpg -c envoy -- sh -c 'grep "Max open files" /proc/*/limits'
/proc/1/limits:Max open files 65535 65535 files
/proc/27/limits:Max open files 65535 65535 files
/proc/355/limits:Max open files 1024 65535 files
/proc/self/limits:Max open files 1024 65535 files
/proc/thread-self/limits:Max open files 1024 65535 files
$ keti yelb-ui-766d495c5c-fhkpg -c envoy -- sh -c 'echo /proc/27/fd/*' | sed 's/ /\n/g' | sort -t/ -k 5 -rn | wc -l
2124
$ keti yelb-ui-766d495c5c-fhkpg -c envoy -- sh -c 'echo /proc/27/fd/*' | sed 's/ /\n/g' | sort -t/ -k 5 -rn | head
/proc/27/fd/2137
/proc/27/fd/2136
/proc/27/fd/2135
/proc/27/fd/2134
/proc/27/fd/2133
/proc/27/fd/2132
/proc/27/fd/2131
/proc/27/fd/2130
/proc/27/fd/2129
/proc/27/fd/2128
test command
$ wrk -c 1000 -t 100 -d 12h http://xxx.ap-northeast-1.elb.amazonaws.com/
Summary
This PR fixes #71, https://github.com/aws/aws-app-mesh-roadmap/issues/489
Implementation details
As mentioned in the discussion on the issue https://github.com/aws/aws-app-mesh-roadmap/issues/489#issuecomment-2014345823, there has been a change in how Go 1.21 handles the restoration of the NOFILE resource rlimit in child processes.
To address this change and ensure that the NOFILE rlimit is properly set in the Envoy process, we verify and configure the NOFILE rlimit in the parent process before forking the Envoy process. This ensures that the child process inherits the same resource limits.
Testing
I've improved TestStartCommand to check the NOFILE limits of the forked process. It ensures that the soft and hard limits are equal and greater than 65535, providing better test coverage and confirming the proper inheritance of resource limits from the parent process.
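One possible way to check a process's limits, sketched here for illustration only and not taken from the actual `TestStartCommand`, is to parse the `Max open files` line of `/proc/<pid>/limits`. The helper names `parseLimit` and `parseMaxOpenFiles` are hypothetical, and the sketch assumes the Linux `/proc` limits layout.

```go
package main

import (
	"fmt"
	"math"
	"os"
	"strconv"
	"strings"
)

const nofileLabel = "Max open files"

// parseLimit converts one limit column to a number, mapping the
// literal "unlimited" to the largest uint64 value.
func parseLimit(s string) (uint64, error) {
	if s == "unlimited" {
		return math.MaxUint64, nil
	}
	return strconv.ParseUint(s, 10, 64)
}

// parseMaxOpenFiles (hypothetical helper) extracts the soft and hard
// NOFILE limits from the text of a /proc/<pid>/limits file. It also
// accepts grep-style lines that carry a "path:" prefix.
func parseMaxOpenFiles(limits string) (soft, hard uint64, err error) {
	for _, line := range strings.Split(limits, "\n") {
		i := strings.Index(line, nofileLabel)
		if i < 0 {
			continue
		}
		fields := strings.Fields(line[i+len(nofileLabel):])
		if len(fields) < 2 {
			return 0, 0, fmt.Errorf("malformed limits line: %q", line)
		}
		if soft, err = parseLimit(fields[0]); err != nil {
			return 0, 0, err
		}
		if hard, err = parseLimit(fields[1]); err != nil {
			return 0, 0, err
		}
		return soft, hard, nil
	}
	return 0, 0, fmt.Errorf("%q line not found", nofileLabel)
}

func main() {
	// Reads the limits of the current process. A test could read
	// /proc/<child pid>/limits for a forked process instead.
	data, err := os.ReadFile("/proc/self/limits")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	soft, hard, err := parseMaxOpenFiles(string(data))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("soft=%d hard=%d equal=%t\n", soft, hard, soft == hard)
}
```

A test built on this could start the child, read its limits file, and compare the two values against the expected threshold.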
New tests cover the changes: yes
Description for the changelog
Fix NOFILE limit in forked child process.
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.