armenr closed this issue 1 year ago.
As an example --> I just killed the exact Kubernetes pod that had the issue. A new one came up in its place, and it just magically worked. No rhyme or reason as to why...nothing changed in the configs or on the system side.
qa/qa-monometa-59f764cd5b-wh775[app]: 2023-05-17 15:46:05 +00:00: Client connected: 127.0.0.1:37118
qa/qa-monometa-59f764cd5b-wh775[app]: 2023-05-17 15:46:05 +00:00: Client connected: 127.0.0.1:37124
qa/qa-monometa-59f764cd5b-wh775[app]: 2023-05-17 15:46:05 +00:00: Client connected: 127.0.0.1:37126
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:05.890+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:05.901+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:05.916+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:05.922+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:05.929+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:05.935+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:05.941+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:05.947+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:05.953+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:05.959+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:05.966+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:05.975+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:05.981+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.042+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.049+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.063+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.070+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.076+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.082+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.088+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.094+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.100+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.108+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.115+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.121+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.128+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.134+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.187+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.194+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.201+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.208+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.214+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.221+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.228+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.235+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.242+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.249+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.256+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.263+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.272+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.279+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.286+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: [2023-05-17T15:46:06.292+0000][DEBUG][InstallEvents#471] Install event logic for trigger: SomeRedactedTrigger
qa/qa-monometa-59f764cd5b-wh775[app]: 2023-05-17 15:46:06 +00:00: Client disconnected: 127.0.0.1:
qa/qa-monometa-59f764cd5b-wh775[app]: 2023-05-17 15:46:06 +00:00: Client disconnected: 127.0.0.1:37124
qa/qa-monometa-59f764cd5b-wh775[app]: 2023-05-17 15:46:06 +00:00: Client disconnected: 127.0.0.1:37126
qa/qa-monometa-59f764cd5b-wh775[app]: [RUN 03] Indexing users
What exact CPU are you running this on?
@carlhoerberg - This is running inside of Kubernetes, on x86_64 architecture.
We build/compile the binary inside a Docker container.
The builder instance where we compile the code runs in 64-bit Docker on this AWS instance type:
c6i.4xlarge

| vCPUs | 16 |
| -- | -- |
| Memory (GiB) | 32.0 |
| Memory per vCPU (GiB) | 2.0 |
| Physical Processor | Intel Xeon 8375C (Ice Lake) |
| Clock Speed (GHz) | 3.5 |
| CPU Architecture | x86_64 |
As for the exact CPU where amqproxy runs...that's tough to explain. We autoscale on AWS, and we use a scaler that looks for the cheapest Spot instances within a range of instance types.
Those are typically t3, t3a, r5, r6, m5, m5a, m6, m6a, c5, c6i, c6, c6a, etc.
So, a mix of Intel and AMD x86_64 hosts. I am not sure which instance type these particular pods were running on when we hit these errors. Testing will be time-consuming, but I can try...
I do have one failed process/pod right now (where we hit the error condition and core dump) -- it's on a c6i.8xlarge --> Intel Xeon 8375C (Ice Lake).
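For what it's worth, a quick way to pin down the exact CPU a given pod lands on (just a convenience sketch; /proc/cpuinfo inside the container reflects the host CPU, and the metadata call assumes IMDSv1 is reachable):

# Host CPU model, as seen from inside the pod or on the EC2 node itself
grep -m1 'model name' /proc/cpuinfo

# EC2 instance type from the instance metadata service (IMDSv2 needs a session token first)
curl -s http://169.254.169.254/latest/meta-data/instance-type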
@carlhoerberg - I thought this might also be worth mentioning -->
We're not using the Dockerfile or the Alpine-based Docker image you ship; we compile and run the amqproxy binary on Debian bullseye.
That said, I still see the same error/issue when I isolate amqproxy into its own set of pods/services and run your Alpine image as a standalone container.
#syntax=docker/dockerfile:1.4
ARG BULLSEYE_VERSION="bullseye-20230109-slim"
################################
# Build AMQProxy
################################
FROM debian:$BULLSEYE_VERSION as amqproxy-builder
ENV DEBIAN_FRONTEND=noninteractive
ENV DEBCONF_FRONTEND=noninteractive
ARG AMQPROXY_VERSION="v0.8.8"
# Install deps
RUN <<EOF
set -eux
apt-get update
apt-get install -y --no-install-recommends \
build-essential \
ca-certificates \
curl \
git \
libssl-dev \
pkg-config \
wget
EOF
# Setup and build amqproxy
RUN <<EOF
set -eux
curl -fsSL https://crystal-lang.org/install.sh | bash
git clone https://github.com/cloudamqp/amqproxy.git
cd amqproxy
git checkout $AMQPROXY_VERSION
shards build --release --production
cp bin/amqproxy /usr/bin
EOF
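For completeness, this is roughly how an image from the Dockerfile above could be built and smoke-tested; the tag, broker host, and ports are placeholders, not values from our setup:

# Build the builder stage (the amqproxy binary ends up in /usr/bin inside the image)
docker build --build-arg AMQPROXY_VERSION=v0.8.8 -t amqproxy-debian:test .

# Run it against a test broker; --ulimit core=-1 removes the core dump size cap for debugging
docker run --rm --ulimit core=-1 amqproxy-debian:test \
  amqproxy --listen=0.0.0.0 --port=5673 --debug amqps://test-broker.example.com:5671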
I'm just gonna keep posting findings as they come -- I figure oversharing info is better than not having enough.
I've now ripped out the manual Docker builds and the official Alpine-based Docker build, and I'm installing AMQProxy through the deb packages you guys provide in PackageCloud.
I'll report back with findings/results. So far, no core dumps.
Uhh...this is new. I wonder if this helps to debug/indicate the problem?
2023-05-19 18:27:26 +00:00: Error reading from upstream: End of file reached (IO::EOFError)
from /usr/share/crystal/src/io.cr:523:27 in 'read_fully'
from /tmp/amqproxy/lib/amq-protocol/src/amq/protocol/frames.cr:26:9 in 'read_loop'
from /tmp/amqproxy/src/amqproxy/upstream.cr:34:7 in '->'
from /usr/share/crystal/src/fiber.cr:146:11 in 'run'
from /usr/share/crystal/src/fiber.cr:98:34 in '->'
from ???
2023-05-19 18:27:26 +00:00: Error reading from upstream: Error reading socket: Connection reset by peer (IO::Error)
from /usr/share/crystal/src/io/evented.cr:61:9 in 'unbuffered_read'
from /usr/share/crystal/src/io/buffered.cr:80:16 in 'read'
from /usr/share/crystal/src/openssl/bio.cr:46:13 in '->'
from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 in '??'
from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 in 'BIO_read'
from /usr/lib/x86_64-linux-gnu/libssl.so.1.1 in '??'
from /usr/lib/x86_64-linux-gnu/libssl.so.1.1 in '??'
from /usr/lib/x86_64-linux-gnu/libssl.so.1.1 in '??'
from /usr/lib/x86_64-linux-gnu/libssl.so.1.1 in '??'
from /usr/lib/x86_64-linux-gnu/libssl.so.1.1 in 'SSL_read'
from /usr/share/crystal/src/openssl/ssl/socket.cr:129:5 in 'unbuffered_read'
from /usr/share/crystal/src/io/buffered.cr:261:5 in 'fill_buffer'
from /usr/share/crystal/src/io/buffered.cr:83:9 in 'read'
from /usr/share/crystal/src/io.cr:540:20 in 'read_fully?'
from /usr/share/crystal/src/io.cr:523:5 in 'read_fully'
from /tmp/amqproxy/lib/amq-protocol/src/amq/protocol/frames.cr:26:9 in 'read_loop'
from /tmp/amqproxy/src/amqproxy/upstream.cr:34:7 in '->'
from /usr/share/crystal/src/fiber.cr:146:11 in 'run'
from /usr/share/crystal/src/fiber.cr:98:34 in '->'
from ???
2023-05-19 18:27:26 +00:00: Error reading from upstream: End of file reached (IO::EOFError)
from /usr/share/crystal/src/io.cr:523:27 in 'read_fully'
from /tmp/amqproxy/lib/amq-protocol/src/amq/protocol/frames.cr:26:9 in 'read_loop'
from /tmp/amqproxy/src/amqproxy/upstream.cr:34:7 in '->'
from /usr/share/crystal/src/fiber.cr:146:11 in 'run'
from /usr/share/crystal/src/fiber.cr:98:34 in '->'
from ???
I wondered if maybe there were too many open sockets on the pod, so I ran this - might this also be useful/helpful?
root@qa-monometa-68785f7f66-dmkph:/code# cat /proc/sys/fs/file-nr
6112 0 26099287
root@qa-monometa-68785f7f66-dmkph:/code# sysctl fs.file-max
fs.file-max = 26099287
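Those are system-wide numbers; a per-process check against the proxy's own descriptor limit would look something like this (the pgrep pattern is an assumption about the process name):

# Count open file descriptors held by the amqproxy process and compare against its limit
pid=$(pgrep -o amqproxy)
ls /proc/"$pid"/fd | wc -l
grep 'Max open files' /proc/"$pid"/limits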
What's really weird is that I have two different types of application pods/containers that rely on a local amqproxy...both of them threw these same errors together, at the same time. None of my other pods did. The weird part is that those two unrelated application pods happen to reside on the same EC2 host, and they're the only two on it.
These seem to be problematic too:
Intel Xeon Platinum 8175
At this point, I might just be speaking to myself...but:
I constrained our autoscaling systems to use only Intel Xeon 8375C (Ice Lake) based instance families. I haven't seen the behavior manifest since making that change.
It looks like AMQProxy (or some underlying dependency of it) only likes to run on c6, m6, and r6 instance types.
This is painful, since we'll be missing out on huge operational and cost-related savings by not being able to use Spot instances from the c5, m5, and r5 instance families, which can save up to 90% compared to on-demand pricing.
On a hunch, I'm going to attempt to experiment with this a bit further by setting the following environment variable in all containers:
OPENSSL_ia32cap=:~0x20000000
This experiment/hypothesis is based on pretty much the only lead I've been able to find on this --> https://www.intel.com/content/www/us/en/developer/articles/troubleshooting/openssl-sha-crash-bug-requires-application-update.html
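In Kubernetes we'd set that through the pod spec's env, but for a quick local test the same mask can be passed straight to a container; a minimal sketch (image name and broker are placeholders), following the Intel article's suggestion of masking the SHA extension capability bit:

# Mask the SHA capability bit so OpenSSL skips its SHA extension code paths
docker run --rm -e OPENSSL_ia32cap=':~0x20000000' amqproxy-debian:test \
  amqproxy --listen=0.0.0.0 --port=5673 amqps://test-broker.example.com:5671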
Why are you compiling it yourself? Are you still seeing issues when running the "official" amqproxy image and/or the official deb packages?
The EOFError is just a normal network disruption, not related to the "Illegal instruction".
You should analyze the core dump: gdb <executable> <core-file>, then issue bt to get the backtrace.
@carlhoerberg - Thanks for making the time to reply to this thread. It's greatly appreciated. 🙌
To confirm: I see the issue even when using the official amqproxy image or the official deb packages from PackageCloud.
From an earlier post to this thread:
I've now ripped out the manual Docker builds and the official Alpine-based Docker build, and I'm installing AMQProxy through the deb packages you guys provide in PackageCloud.
Even with the official images and/or packages, I continued to see the proxy crash and core-dump behavior.
Out of curiosity, where would I find the actual location of the core dump (bullseye-11 docker image)?
The core dump is written on the host machine; see /proc/sys/kernel/core_pattern for where. You just need to increase the core dump size limit for the container, with something like docker run --ulimit core=-1 ..., or the Kubernetes equivalent.
more info: https://stackoverflow.com/questions/28335614/how-to-generate-core-file-in-docker-container
The tricky thing when running in a container is that you have to give gdb the path to the binary, which is in the image, but should be possible somehow from the host machine.
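Tying those pieces together, a rough sketch of the whole workflow (host paths, image name, and PID are placeholders; this mirrors what ended up being done later in this thread):

# On the host: write core files to a predictable location
echo '/cores/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern

# Run the container with an unlimited core size and the dump directory mounted in
docker run --ulimit core=-1 -v /cores:/cores <amqproxy-image>

# From inside the container (so /usr/bin/amqproxy and its libraries resolve), print the backtrace
gdb /usr/bin/amqproxy /cores/core.amqproxy.<pid> -batch -ex bt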
@carlhoerberg
I'm having a hard time getting the dump stuff going...but I am seeing a ton of these in the dmesg for the EC2 host where amqproxy runs inside the container:
[220027.916550] uprobe: amqproxy:27836 failed to handle uretprobe, sending SIGILL.
[220028.996404] uprobe: amqproxy:28142 failed to handle uretprobe, sending SIGILL.
[220030.070610] uprobe: amqproxy:28420 failed to handle uretprobe, sending SIGILL.
[220031.117746] uprobe: amqproxy:28684 failed to handle uretprobe, sending SIGILL.
[220032.177014] uprobe: amqproxy:28911 failed to handle uretprobe, sending SIGILL.
[220033.264837] uprobe: amqproxy:29173 failed to handle uretprobe, sending SIGILL.
[220034.347359] uprobe: amqproxy:29488 failed to handle uretprobe, sending SIGILL.
[220035.446718] uprobe: amqproxy:29843 failed to handle uretprobe, sending SIGILL.
[220036.522884] uprobe: amqproxy:30161 failed to handle uretprobe, sending SIGILL.
[220037.607774] uprobe: amqproxy:30442 failed to handle uretprobe, sending SIGILL.
[220038.658550] uprobe: amqproxy:30721 failed to handle uretprobe, sending SIGILL.
[220039.744033] uprobe: amqproxy:31002 failed to handle uretprobe, sending SIGILL.
[220040.838408] uprobe: amqproxy:31297 failed to handle uretprobe, sending SIGILL.
[220041.932248] uprobe: amqproxy:31648 failed to handle uretprobe, sending SIGILL.
[220043.029891] uprobe: amqproxy:31950 failed to handle uretprobe, sending SIGILL.
[220044.117301] uprobe: amqproxy:32261 failed to handle uretprobe, sending SIGILL.
[220045.170641] uprobe: amqproxy:32595 failed to handle uretprobe, sending SIGILL.
[220046.261596] uprobe: amqproxy:337 failed to handle uretprobe, sending SIGILL.
[220047.339998] uprobe: amqproxy:639 failed to handle uretprobe, sending SIGILL.
[220048.428183] uprobe: amqproxy:891 failed to handle uretprobe, sending SIGILL.
Kernel:
[root@ip-172-17-20-251 ~]# uname -r
5.10.178-162.673.amzn2.x86_64
Each time I kill a faulty pod (one that's throwing the RabbitMQ errors from PHP and the core dumps from amqproxy) --> this is what's in dmesg:
[222213.088596] IPv6: ADDRCONF(NETDEV_CHANGE): eni6875695a92c: link becomes ready
[222213.094548] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[222213.316574] xfs filesystem being remounted at /var/lib/kubelet/pods/2a6e419e-3b72-4573-a5d7-d5807deb2fed/volume-subpaths/dotenv/app/0 supports timestamps until 2038 (0x7fffffff)
[222213.329146] xfs filesystem being remounted at /var/lib/kubelet/pods/2a6e419e-3b72-4573-a5d7-d5807deb2fed/volume-subpaths/dotenv/app/1 supports timestamps until 2038 (0x7fffffff)
[222213.341627] xfs filesystem being remounted at /var/lib/kubelet/pods/2a6e419e-3b72-4573-a5d7-d5807deb2fed/volume-subpaths/app-configs/app/6 supports timestamps until 2038 (0x7fffffff)
[222213.354194] xfs filesystem being remounted at /var/lib/kubelet/pods/2a6e419e-3b72-4573-a5d7-d5807deb2fed/volume-subpaths/millicast-cert/app/7 supports timestamps until 2038 (0x7fffffff)
[222213.366945] xfs filesystem being remounted at /var/lib/kubelet/pods/2a6e419e-3b72-4573-a5d7-d5807deb2fed/volume-subpaths/millicast-cert/app/8 supports timestamps until 2038 (0x7fffffff)
[222213.379611] xfs filesystem being remounted at /var/lib/kubelet/pods/2a6e419e-3b72-4573-a5d7-d5807deb2fed/volume-subpaths/user-bundle/app/10 supports timestamps until 2038 (0x7fffffff)
[222213.392558] xfs filesystem being remounted at /var/lib/kubelet/pods/2a6e419e-3b72-4573-a5d7-d5807deb2fed/volume-subpaths/user-bundle/app/12 supports timestamps until 2038 (0x7fffffff)
[222213.405215] xfs filesystem being remounted at /var/lib/kubelet/pods/2a6e419e-3b72-4573-a5d7-d5807deb2fed/volume-subpaths/user-bundle/app/14 supports timestamps until 2038 (0x7fffffff)
[222213.417979] xfs filesystem being remounted at /var/lib/kubelet/pods/2a6e419e-3b72-4573-a5d7-d5807deb2fed/volume-subpaths/user-bundle/app/16 supports timestamps until 2038 (0x7fffffff)
[222213.520975] xfs filesystem being remounted at /var/lib/kubelet/pods/2a6e419e-3b72-4573-a5d7-d5807deb2fed/volume-subpaths/user-bundle/app/18 supports timestamps until 2038 (0x7fffffff)
[222213.533701] xfs filesystem being remounted at /var/lib/kubelet/pods/2a6e419e-3b72-4573-a5d7-d5807deb2fed/volume-subpaths/user-bundle/app/20 supports timestamps until 2038 (0x7fffffff)
[222220.376781] uprobe: amqproxy:20278 failed to handle uretprobe, sending SIGILL.
[222477.354222] IPv6: ADDRCONF(NETDEV_CHANGE): eni4a8f9942555: link becomes ready
[222477.360268] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[222882.348690] IPv6: ADDRCONF(NETDEV_CHANGE): enidb76c2cff86: link becomes ready
[222882.354709] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[222882.580981] xfs filesystem being remounted at /var/lib/kubelet/pods/d50a36a3-d6bc-4645-8f63-75af66be1e35/volume-subpaths/dotenv/app/0 supports timestamps until 2038 (0x7fffffff)
[222882.593362] xfs filesystem being remounted at /var/lib/kubelet/pods/d50a36a3-d6bc-4645-8f63-75af66be1e35/volume-subpaths/dotenv/app/1 supports timestamps until 2038 (0x7fffffff)
[222882.605858] xfs filesystem being remounted at /var/lib/kubelet/pods/d50a36a3-d6bc-4645-8f63-75af66be1e35/volume-subpaths/app-configs/app/6 supports timestamps until 2038 (0x7fffffff)
[222882.620086] xfs filesystem being remounted at /var/lib/kubelet/pods/d50a36a3-d6bc-4645-8f63-75af66be1e35/volume-subpaths/millicast-cert/app/7 supports timestamps until 2038 (0x7fffffff)
[222882.633008] xfs filesystem being remounted at /var/lib/kubelet/pods/d50a36a3-d6bc-4645-8f63-75af66be1e35/volume-subpaths/millicast-cert/app/8 supports timestamps until 2038 (0x7fffffff)
[222882.645838] xfs filesystem being remounted at /var/lib/kubelet/pods/d50a36a3-d6bc-4645-8f63-75af66be1e35/volume-subpaths/user-bundle/app/10 supports timestamps until 2038 (0x7fffffff)
[222882.658491] xfs filesystem being remounted at /var/lib/kubelet/pods/d50a36a3-d6bc-4645-8f63-75af66be1e35/volume-subpaths/user-bundle/app/12 supports timestamps until 2038 (0x7fffffff)
[222882.671080] xfs filesystem being remounted at /var/lib/kubelet/pods/d50a36a3-d6bc-4645-8f63-75af66be1e35/volume-subpaths/user-bundle/app/14 supports timestamps until 2038 (0x7fffffff)
[222882.683587] xfs filesystem being remounted at /var/lib/kubelet/pods/d50a36a3-d6bc-4645-8f63-75af66be1e35/volume-subpaths/user-bundle/app/16 supports timestamps until 2038 (0x7fffffff)
[222882.696278] xfs filesystem being remounted at /var/lib/kubelet/pods/d50a36a3-d6bc-4645-8f63-75af66be1e35/volume-subpaths/user-bundle/app/18 supports timestamps until 2038 (0x7fffffff)
[222882.708908] xfs filesystem being remounted at /var/lib/kubelet/pods/d50a36a3-d6bc-4645-8f63-75af66be1e35/volume-subpaths/user-bundle/app/20 supports timestamps until 2038 (0x7fffffff)
[222889.379459] uprobe: amqproxy:10850 failed to handle uretprobe, sending SIGILL.
On the EC2 host, I ran:
echo '/cores/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
Then I re-launched a pod on the host that's experiencing the issue, and saw it dump out.
I installed GDB and ran the following:
root@qa-meta-d998c97f5-jt22x:/cores# gdb /usr/bin/amqproxy core.amqproxy.60
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/amqproxy...
[New LWP 60]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `amqproxy --listen=127.0.0.1 --port=5673 --idle-connection-timeout=86400 amqps:/'.
Program terminated with signal SIGILL, Illegal instruction.
#0 0x00007fffffffe001 in ?? ()
Attempting to load the dump and then using backtrace:
root@qa-meta-d998c97f5-jpnk6:/cores# gdb /usr/bin/amqproxy core.amqproxy.60
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/amqproxy...
[New LWP 60]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `amqproxy --listen=127.0.0.1 --port=5673 --idle-connection-timeout=86400 amqps:/'.
Program terminated with signal SIGILL, Illegal instruction.
#0 0x00007fffffffe001 in ?? ()
(gdb) bt
#0 0x00007fffffffe001 in ?? ()
#1 0x00007f03a332d000 in ?? ()
#2 0x00007f03a331acf9 in ?? ()
#3 0x0000000000000000 in ?? ()
Uprobes are "user-level dynamic tracing" probes, from perf, eBPF, or similar. Maybe it's a kernel bug -- which kernel version are you using?
amqproxy is not doing any tracing itself; it's something else that tries to trace it.
Can you type bt in gdb so that we get the stack trace?
Thanks for the quick and kind reply, @carlhoerberg.
From my last reply, that section was the result of me typing bt (backtrace). You can see the bt part two lines down from where it says Program terminated with signal SIGILL, Illegal instruction.
This is on AL2 (Amazon Linux 2)'s default kernel, which I believe is 5.10.178-162.673.amzn2 (they compile their own).
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/amqproxy...
[New LWP 60]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `amqproxy --listen=127.0.0.1 --port=5673 --idle-connection-timeout=86400 amqps:/'.
Program terminated with signal SIGILL, Illegal instruction.
#0 0x00007fffffffe001 in ?? ()
(gdb) bt
#0 0x00007fffffffe001 in ?? ()
#1 0x00007f03a332d000 in ?? ()
#2 0x00007f03a331acf9 in ?? ()
#3 0x0000000000000000 in ?? ()
The isolated bt part:
(gdb) bt
#0 0x00007fffffffe001 in ?? ()
#1 0x00007f03a332d000 in ?? ()
#2 0x00007f03a331acf9 in ?? ()
#3 0x0000000000000000 in ?? ()
I'm not sure what the recommended way would be to get the debug symbols we need...I think those are missing?
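One option, if rebuilding is acceptable, might be to compile a binary that keeps DWARF debug info and reproduce the crash with that; a sketch assuming the Crystal toolchain from the Dockerfile earlier in the thread (not something the maintainers prescribed here):

# Build a release binary that still carries debug info so gdb can resolve frames
cd amqproxy
shards build --release --debug

# The core file must come from this exact binary for the backtrace to be meaningful
gdb bin/amqproxy /cores/core.amqproxy.<pid> -batch -ex bt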
@carlhoerberg --> I've found the cause.
https://docs.px.dev/about-pixie <--- it's this thing. This thing is the culprit.
I was able to run my cluster long enough, and expose it to enough traffic that I was able to narrow it down and then reproduce the problem.
Here's how it goes:
1. We install New Relic's nri-bundle chart, and in there, we enable pixie for tracing, etc.
2. At some point, we see amqproxy suddenly crash, restart, and then go into a crashloop --> It looks like this:
# amqproxy dies via s6-overlay restarted
./run: line 11: 29182 Illegal instruction (core dumped) amqproxy --listen=127.0.0.1 --port="${RABBITMQ_PROXY_PORT}" --idle-connection-timeout="${AMQPROXY_IDLE_TIMEOUT}" amqps://"${RABBITMQ_PROXY_HOST}":"${RABBITMQ_PORT}" --debug
# s6-overlay restarts the proxy
2023-05-26 13:59:24 +00:00: Proxy upstream: REDACTED_AMQ_ENDPOINT.mq.us-west-2.amazonaws.com:5671 TLS
2023-05-26 13:59:24 +00:00: Proxy listening on 127.0.0.1:5673
2023-05-26 13:59:24 +00:00: Client connected: 127.0.0.1:39160
2023-05-26 13:59:24 +00:00: Client connected: 127.0.0.1:39162
2023-05-26 13:59:24 +00:00: Client connected: 127.0.0.1:39184
# code attempts to connect
qa: Socket error: could not connect to host. FROM: /code/vendorCustom/src/Queue/Rabbitmq.php
qa: Socket error: could not connect to host. FROM: /code/vendorCustom/src/Queue/Rabbitmq.php
qa: Socket error: could not connect to host. FROM: /code/vendorCustom/src/Queue/Rabbitmq.php
# code barfs up errors
In Rabbitmq.php line 278:In Rabbitmq.php line 278:
Socket error: could not connect to host. Socket error: could not connect to host.
In Rabbitmq.php line 276:In Rabbitmq.php line 276:In Rabbitmq.php line 278:
Socket error: could not connect to host. Socket error: could not connect to host.
Socket error: could not connect to host.
worker:download:user-listworker:conversation-notification
In Rabbitmq.php line 276:
Socket error: could not connect to host.
worker:upload:group-cancelled
# hell-loop repeats
2023-05-26 13:59:25 +00:00: Proxy upstream: REDACTED_AMQ_ENDPOINT.mq.us-west-2.amazonaws.com:5671 TLS
2023-05-26 13:59:25 +00:00: Proxy listening on 127.0.0.1:5673
2023-05-26 13:59:25 +00:00: Client connected: 127.0.0.1:39992
2023-05-26 13:59:25 +00:00: Client connected: 127.0.0.1:40006
2023-05-26 13:59:25 +00:00: Client connected: 127.0.0.1:40022
qa: Socket error: could not connect to host. FROM: /code/vendorCustom/src/Queue/Rabbitmq.php
qa: Socket error: could not connect to host. FROM: /code/vendorCustom/src/Queue/Rabbitmq.php
qa: Socket error: could not connect to host. FROM: /code/vendorCustom/src/Queue/Rabbitmq.php
In Rabbitmq.php line 278:In Rabbitmq.php line 278:
Socket error: could not connect to host. Socket error: could not connect to host.
In Rabbitmq.php line 278:
Socket error: could not connect to host.
In Rabbitmq.php line 276:In Rabbitmq.php line 276:
Socket error: could not connect to host.
Socket error: could not connect to host.
worker:upload:group-cancelledworker:download:user-list
In Rabbitmq.php line 276:
Socket error: could not connect to host.
worker:conversation-notification
2023-05-26 13:59:26 +00:00: Proxy upstream: REDACTED_AMQ_ENDPOINT.mq.us-west-2.amazonaws.com:5671 TLS
2023-05-26 13:59:26 +00:00: Proxy listening on 127.0.0.1:5673
2023-05-26 13:59:26 +00:00: Client connected: 127.0.0.1:40858
2023-05-26 13:59:26 +00:00: Client connected: 127.0.0.1:40868
2023-05-26 13:59:26 +00:00: Client connected: 127.0.0.1:40876
qa: Socket error: could not connect to host. FROM: /code/vendorCustom/src/Queue/Rabbitmq.php
qa: Socket error: could not connect to host. FROM: /code/vendorCustom/src/Queue/Rabbitmq.php
qa: Socket error: could not connect to host. FROM: /code/vendorCustom/src/Queue/Rabbitmq.php
# ...and so on, and so on...until I uninstalled NewRelic's monitoring chart + included pixie chart and tracers
# and suddenly...
2023-05-26 13:59:30 +00:00: Proxy upstream: REDACTED_AMQ_ENDPOINT.mq.us-west-2.amazonaws.com:5671 TLS
2023-05-26 13:59:30 +00:00: Proxy listening on 127.0.0.1:5673
2023-05-26 13:59:30 +00:00: Client connected: 127.0.0.1:59872
2023-05-26 13:59:30 +00:00: Client connected: 127.0.0.1:59884
2023-05-26 13:59:30 +00:00: Client connected: 127.0.0.1:59886
# Armen removes newrelic-pixie from the cluster...
#
# All services go back to functioning normally,
#
# and all "[27961.497665] uprobe: amqproxy:27986 failed to handle uretprobe, sending SIGILL." messages disappear from dmesg
So, something in the NewRelic chart -- I'm guessing pixie -- is triggering this. Oddly, when you reinstall pixie again, everything continues to function for a while...and then, without warning, it will strike at random on any given host, at some random point in time.
Up to this point in debugging this issue, I was having to delete "faulty" EC2 worker nodes that were manifesting the problem...while the NewRelic bundle and its included pixie were installed.
Then it would strike again, without warning, and at random.
Important to note: a "faulty" EC2 node with the problem suddenly went back to working/functioning perfectly fine (amqproxy working without issue inside the container in the Kubernetes pod) the very moment I removed pixie from the cluster.
Does this help to make anything any more clear?
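For anyone trying to confirm the same correlation on a suspect node, watching kernel messages while toggling the Pixie install is probably the quickest check (just a sketch; the grep pattern comes from the dmesg lines above):

# Follow kernel messages and flag the uretprobe failures as they happen
dmesg -wT | grep --line-buffered 'failed to handle uretprobe'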
Just throwing some things out there:
@armenr which version of Kubernetes are you running? If I understand it correctly, you need at least 1.27 for eBPF support: https://github.com/awslabs/amazon-eks-ami/issues/728, https://github.com/awslabs/amazon-eks-ami/pull/1223
Or maybe something like https://github.com/iovisor/bcc/issues/1320 is going on?
More links: https://github.com/golang/go/issues/22008, https://github.com/iovisor/bcc/issues/3034, https://github.com/golang/go/issues/27077
Reading more about this, it seems clear that we are in the same boat as Go here, from https://github.com/golang/go/issues/27077#issuecomment-415141461:
The only difference is that Go depends on its ability to unwind stacks for GC and stack growth. I assert that uretprobes would break stack unwinding in any language, and regardless of calling convention. I'm actually really curious how it interacts with C++ exception handling; I suspect uretprobes breaks it.
uretprobes clobbers critical unwinding information and, as far as I can tell, doesn't provide a way to get it back. However, I would love to be proved wrong, since I know how powerful uprobes can be.
Crystal also does stack unwinding, so disabling Pixie was the right call to avoid these crashes (and nice work @armenr on eventually finding that). It is unfortunate, as everything on https://docs.px.dev/about-pixie/pixie-ebpf sounds pretty cool, but that's the situation.
I'm going to close this issue as we can't address it in this project.
Hey all, Pixie core maintainer here. Sorry to hear that these uprobes were causing amqproxy issues.
I know you mentioned that you uninstalled Pixie from your cluster, but we (the maintainers) care about addressing issues like this. I've created https://github.com/pixie-io/pixie/issues/1970 to provide a mechanism for opting applications out of uprobe instrumentation. Wanted to mention that in case anyone else runs into this issue or you are interested in trying Pixie again.
I'm having one heck of a time debugging some intermittent issues I'm seeing with AMQProxy.
I have a pod in Kubernetes that runs AMQProxy inside. Most of the time, AMQProxy works great. No issue. We've used it a bunch, run our automated QA against the servers and applications that are now using AMQProxy...most of the time, no issues! Just awesomeness!
But, I'm seeing an intermittent issue with the proxy.
We have some scripts that need to run before our PHP-FPM app starts up. Those scripts do a bunch of PHP-related things, and then, after that's done, we start php-fpm.
Sometimes, our code will throw an error when starting. Right before our code throws the error, it looks like the proxy throws a fault and crashes.
At the start of the container startup logs, I clearly see:
So, I know for sure that the proxy is up and listening/working.
Half a second later, I see these:
So, at this point, I know for a fact that the proxy is accepting and serving connections, and everything's working.
Then, a half-second or so later:
This happens intermittently. There's no trigger or condition I've been able to find or use, in order to reproduce this consistently. It just happens at random.
As you can see at the end of the last log snippet, the proxy crashes, which causes the app to crash, and then s6-overlay restarts the proxy instantly, and the proxy comes right back...but our initialization script breaks, and the app never starts.
I'm wondering two things:
1. Whether it would help to have --debug set on amqproxy
2. What 87 Illegal instruction (core dumped) means
If it helps, this is how we run AMQProxy with s6-overlay (as an s6-rc service):
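The run script itself didn't make it into this post, but based on the crash line quoted earlier in the thread (./run: line 11: ... amqproxy --listen=127.0.0.1 ...), a minimal sketch of what such an s6-rc run script might look like follows; the environment variable names come from that log line, while the shebang and everything else are assumptions:

#!/command/with-contenv bash
# s6-rc "run" script: exec amqproxy in the foreground so s6-overlay supervises and restarts it
exec amqproxy --listen=127.0.0.1 --port="${RABBITMQ_PROXY_PORT}" \
  --idle-connection-timeout="${AMQPROXY_IDLE_TIMEOUT}" \
  amqps://"${RABBITMQ_PROXY_HOST}":"${RABBITMQ_PORT}" --debug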