aws / amazon-ecs-service-connect-agent

Amazon ECS Service Connect Agent
Apache License 2.0

setup process open files limit #73

Closed axot closed 5 months ago

axot commented 5 months ago

Summary

This PR fixes #71 and https://github.com/aws/aws-app-mesh-roadmap/issues/489.

Implementation details

As mentioned in the discussion on the issue https://github.com/aws/aws-app-mesh-roadmap/issues/489#issuecomment-2014345823, there has been a change in how Go 1.21 handles the restoration of the NOFILE resource rlimit in child processes.

To address this change and ensure that the NOFILE rlimit is properly set in the Envoy process, we verify and configure the NOFILE rlimit in the parent process before forking the Envoy process. This ensures that the child process inherits the same resource limits.

Testing

I've improved TestStartCommand to check the NOFILE limits of the forked process. It ensures that the soft and hard limits are equal and greater than 65535, providing better test coverage and confirming the proper inheritance of resource limits from the parent process.
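As a rough idea of how such a check could read the limits of another process, the sketch below parses the "Max open files" line in the format of /proc/<pid>/limits on Linux. The function name `parseNofileLimits` is hypothetical and this is not the code in TestStartCommand; the sample lines are copied from the output posted later in this thread.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseNofileLimits extracts the soft and hard "Max open files" values from
// text in the format of /proc/<pid>/limits. Hypothetical helper for
// illustration. A value of "unlimited" is reported as an error here.
func parseNofileLimits(limits string) (soft, hard uint64, err error) {
	const prefix = "Max open files"
	sc := bufio.NewScanner(strings.NewReader(limits))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, prefix) {
			continue
		}
		fields := strings.Fields(strings.TrimPrefix(line, prefix))
		if len(fields) < 2 {
			return 0, 0, fmt.Errorf("unexpected limits line: %q", line)
		}
		if soft, err = strconv.ParseUint(fields[0], 10, 64); err != nil {
			return 0, 0, fmt.Errorf("soft limit: %w", err)
		}
		if hard, err = strconv.ParseUint(fields[1], 10, 64); err != nil {
			return 0, 0, fmt.Errorf("hard limit: %w", err)
		}
		return soft, hard, nil
	}
	if scanErr := sc.Err(); scanErr != nil {
		return 0, 0, scanErr
	}
	return 0, 0, fmt.Errorf("no %q line found", prefix)
}

func main() {
	// Sample lines taken from the /proc/*/limits output in this thread.
	samples := []string{
		"Max open files            65535                65535                files",
		"Max open files            1024                 65535                files",
	}
	for _, s := range samples {
		soft, hard, err := parseNofileLimits(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("soft=%d hard=%d equal=%t\n", soft, hard, soft == hard)
	}
}
```

A test could feed the contents of /proc/<pid>/limits for the forked process into this kind of parser and assert on the two values.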

New tests cover the changes: yes

Description for the changelog

Fix NOFILE limit in forked child process.

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

karanvasnani commented 5 months ago

I have built a custom image with this change for testing: 354290006986.dkr.ecr.us-west-2.amazonaws.com/aws-appmesh-envoy:v1.27.3.0-raise-nofile-soft-limit. Let me know if you're able to test using this. I'll do some testing on my end as well.

axot commented 5 months ago

Thanks for building the image. I've tested it in my environment and confirmed that the issue is resolved. Details are below.

original image: 840364872350.dkr.ecr.us-west-2.amazonaws.com/aws-appmesh-envoy:v1.27.3.0-prod

[2024-03-28 01:19:11.251][73][critical][assert] [source/common/network/socket_interface_impl.cc:72] assert failure: SOCKET_VALID(result.return_value_). Details: socket(2) failed, got error: Too many open files
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:104] Caught Aborted, suspect faulting address 0x53900000028
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:92] Envoy version: ea4876d073af5a66f4ec971a64a154a8bf79ad1c/1.27.3-appmesh.0/Modified/RELEASE/BoringSSL
[symbolize_elf.inc : 1010] RAW: /proc/self/task/40/maps: errno=24
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #0: [0x7f1e637e48e0]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #1: [0x56185abadc62]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #2: [0x56185aa122ab]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #3: [0x56185aa0f06b]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #4: [0x56185aa0786f]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #5: [0x56185aa00670]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #6: [0x56185a6fa679]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #7: [0x56185a6fa115]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #8: [0x56185a6fa90d]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #9: [0x56185a6de69d]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #10: [0x56185a6de537]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #11: [0x56185a6df396]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #12: [0x56185a6df6d7]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #13: [0x56185a6ef615]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #14: [0x56185a6f1cc9]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #15: [0x56185a6e3cfe]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #16: [0x56185a8bbea9]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #17: [0x56185a8d3849]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #18: [0x56185a8c2331]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #19: [0x56185a92b873]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #20: [0x56185a83bd44]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #21: [0x56185a862cc4]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #22: [0x56185a8600ed]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #23: [0x56185a85fca9]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #24: [0x56185aebc139]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #25: [0x56185a85dafc]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #26: [0x56185a85d4a8]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #27: [0x56185a862aa4]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #28: [0x56185a836a31]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #29: [0x56185aa4efc6]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #30: [0x56185aa0dcff]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #31: [0x56185aa0bdfc]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #32: [0x56185aa02685]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #33: [0x56185aa03676]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #34: [0x56185aec587b]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #35: [0x56185aec45c0]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #36: [0x56185a427d4a]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #37: [0x56185aecb9e9]
[2024-03-28 01:19:11.251][73][critical][backtrace] [./source/server/backtrace.h:98] #38: [0x7f1e637da44b]
ActiveStream 0x167e7d28c700, stream_id_: 11189730723575927948&filter_manager_:
  FilterManager 0x167e7d28c7a8, state_.has_1xx_headers_: 0
  filter_manager_callbacks_.requestHeaders():
    ':authority', 'a88c9a4bcbb8a4eda8e22bfc38dcc718-755162509.ap-northeast-1.elb.amazonaws.com'
    ':path', '/'
    ':method', 'GET'
    ':scheme', 'http'
    'x-forwarded-proto', 'http'
    'x-request-id', 'e65f3505-6285-924d-81a6-38ec11c23588'
    'x-envoy-expected-rq-timeout-ms', '15000'
    'x-amzn-trace-id', 'Root=1-6604c58f-7392989366a9460da03a84d2;Parent=f1d3f3c0aea707c4;Sampled=0'
  filter_manager_callbacks_.requestTrailers():   null
  filter_manager_callbacks_.responseHeaders():   null
  filter_manager_callbacks_.responseTrailers():   null
  &streamInfo():
    StreamInfoImpl 0x167e7d28c8d8, protocol_: 1, response_code_: null, response_code_details_: null, attempt_count_: 1, health_check_request_: 0, route_name_:     upstream_info_:
      UpstreamInfoImpl 0x167e7d28a160, upstream_connection_id_: null
    OverridableRemoteConnectionInfoSetterStreamInfo 0x167e7d28c8d8, remoteAddress(): 10.0.10.144:50521, directRemoteAddress(): 10.0.10.144:50521, localAddress(): 10.0.10.231:80
Http1::ConnectionImpl 0x167e7d28c008, dispatching_: 1, dispatching_slice_already_drained_: 0, reset_stream_called_: 0, handling_upgrade_: 0, deferred_end_stream_headers_: 1, processing_trailers_: 0, buffered_body_.length(): 0, header_parsing_state_: Done, current_header_field_: , current_header_value_:
active_request_:
, request_url_: null, response_encoder_.local_end_stream_: 0
absl::get<RequestHeaderMapPtr>(headers_or_trailers_): null
current_dispatching_buffer_ front_slice length: 101 contents: "GET [2024-03-28 01:19:11.251][102][critical][assert] [source/common/network/socket_interface_impl.cc:72] assert failure: SOCKET_VALID(result.return_value_). Details: socket(2) failed, got error: Too many open files
/ HTTP/1.1\r\nHost: a88c9a4bcbb8a4eda8e22bfc38dcc718-755162509.ap-northeast-1.elb.amazonaws.com\r\n\r\n"
ConnectionImpl 0x167e7d6ad900, connecting_: 0, bind_error_: 0, state(): Open, read_buffer_limit_: 1048576
socket_:
  ListenSocketImpl 0x167e7daad400, transport_protocol_: raw_buffer
  connection_info_provider_:
    ConnectionInfoSetterImpl 0x167e7e8a6c40[2024-03-28 01:19:11.336][1][warning] [AppNet Agent] [Envoy process 40] Exited with code [-1]
[2024-03-28 01:19:11.336][1][warning] [AppNet Agent] [Envoy process 40] Additional Exit data: [Core Dump: false][Normal Exit: false][Process Signalled: true]

With this image: 354290006986.dkr.ecr.us-west-2.amazonaws.com/aws-appmesh-envoy:v1.27.3.0-raise-nofile-soft-limit

# no error log

# pid 27 is envoy
$ keti yelb-ui-766d495c5c-fhkpg -c envoy -- cat /proc/27/cmdline
/usr/bin/envoy-c/tmp/envoy-config-990344385.yaml-linfo--drain-time-s20--disable-hot-restart

$ keti yelb-ui-766d495c5c-fhkpg -c envoy -- sh -c 'grep "Max open files" /proc/*/limits'
/proc/1/limits:Max open files            65535                65535                files
/proc/27/limits:Max open files            65535                65535                files
/proc/355/limits:Max open files            1024                 65535                files
/proc/self/limits:Max open files            1024                 65535                files
/proc/thread-self/limits:Max open files            1024                 65535                files

$ keti yelb-ui-766d495c5c-fhkpg -c envoy -- sh -c 'echo /proc/27/fd/*' | sed 's/ /\n/g' | sort -t/ -k 5 -rn | wc -l
2124

$ keti yelb-ui-766d495c5c-fhkpg -c envoy -- sh -c 'echo /proc/27/fd/*' | sed 's/ /\n/g' | sort -t/ -k 5 -rn | head
/proc/27/fd/2137
/proc/27/fd/2136
/proc/27/fd/2135
/proc/27/fd/2134
/proc/27/fd/2133
/proc/27/fd/2132
/proc/27/fd/2131
/proc/27/fd/2130
/proc/27/fd/2129
/proc/27/fd/2128

Test command:
$ wrk -c 1000 -t 100 -d 12h http://xxx.ap-northeast-1.elb.amazonaws.com/