SkyAPM / SkyAPM-php-sdk

Replaced by https://github.com/apache/skywalking-php
https://skywalking.apache.org/
Apache License 2.0
421 stars 104 forks

High-concurrency Redis requests cause SkyWalking SDK error: sky_request_flush message_queue exNo such file or directory #468

Closed lvxiao1 closed 2 years ago

lvxiao1 commented 2 years ago

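System information:

When load testing with ab, data is reported normally at first, but after a while SkyWalking stops receiving data. The SDK log shows the error `sky_request_flush message_queue exNo such file or directory`. After restarting FPM, the problem no longer reproduces.

Restart command: `ps -ef|grep php-fpm|grep master|grep -v grep|awk '{print $2}'|xargs kill -USR2`

Load test command: `ab -n 10000 -c 50 http://127.0.0.1:18880/redis.php`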

ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib:/usr/local/lib64
ENV LD_RUN_PATH=$LD_RUN_PATH:/usr/local/lib:/usr/local/lib64
ENV WORKDIR /webser/www
ENV GROUP_NAME test
ENV SERVICE_NAME skywalking
WORKDIR /webser/www

RUN sed -i "s/deb.debian.org/mirrors.163.com/g" /etc/apt/sources.list \
    && sed -i "s/security.debian.org/mirrors.163.com/g" /etc/apt/sources.list \
    && apt-get clean \
    && apt-get update --fix-missing \
    && apt-get install -y build-essential autoconf automake libtool curl make g++ unzip pkg-config cmake libboost-all-dev libcurl4-openssl-dev zlib1g-dev nginx git \
    && docker-php-ext-install zip \
    && pecl channel-update pecl.php.net \
    && pecl install redis \
    && docker-php-ext-enable redis

RUN git clone --depth 1 -b v1.34.x https://github.com/grpc/grpc.git /var/local/git/grpc \
    && cd /var/local/git/grpc \
    && git submodule update --init --recursive \
    && mkdir -p cmake/build \
    && cd cmake/build \
    && cmake ../.. \
    && make -j$(nproc) \
    && echo "--- INSTALL skywalking php ---" \
    && cd /var/local/git \
    && curl -Lo v4.2.0.tar.gz https://github.com/SkyAPM/SkyAPM-php-sdk/archive/v4.2.0.tar.gz \
    && tar zxvf v4.2.0.tar.gz \
    && cd SkyAPM-php-sdk-4.2.0 \
    && phpize && ./configure --with-grpc=/var/local/git/grpc && make && make install \
    && rm -fr /var/local/git \
    && ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime \
    && mkdir -pv /data/log /var/log/php /webser/www /var/tmp/nginx

COPY php.ini /usr/local/etc/php/conf.d/
COPY www.conf /usr/local/etc/php-fpm.d/
COPY nginx.conf /etc/nginx/
COPY app.conf /etc/nginx/conf.d/app.conf

COPY run.sh /tmp/
COPY reload-php-ini.sh /tmp/

RUN chmod +x /tmp/run.sh \
    && chmod +x /tmp/reload-php-ini.sh

EXPOSE 18880
ENTRYPOINT ["/tmp/run.sh"]

- php.ini
```ini
extension=skywalking.so
skywalking.app_code = ${GROUP_NAME}::${SERVICE_NAME}
skywalking.enable = 1
skywalking.version = 8
skywalking.log_enable = 1
skywalking.grpc = ${SKYWALKING_SERVER_ADDR}
skywalking.error_handler_enable = 0
skywalking.log_path = /data/log/skywalking.log
skywalking.mq_max_message_length = 1048500
```

- php-fpm configuration

pm = static
pm.max_children = 5
slowlog = /data/php-slow.log
request_slowlog_timeout = 5s
request_terminate_timeout = 20s

clear_env = no
catch_workers_output = yes
php_admin_flag[expose_php] = off

[global]
error_log = /data/php-error.log

heyanlong commented 2 years ago

This happens when resources are insufficient.

heyanlong commented 2 years ago

What is the ulimit?

lvxiao1 commented 2 years ago

root@e125c400e345:/data/log# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 39834
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

lvxiao1 commented 2 years ago

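@heyanlong While debugging, I found that `sky_request_flush message_queue exNo such file or directory` is only reported once the boost::interprocess::message_queue has been removed, yet sky_module_cleanup was never called. Under what circumstances would this queue be deleted? Also, after reloading FPM I ran the load test at several times the pressure and the problem did not occur.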

heyanlong commented 2 years ago

Normally, the queue should not be deleted for no reason.

lvxiao1 commented 2 years ago

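I added logging around creating, removing, and opening the message_queue. After two successful reports the error appears, and checking /dev/shm shows that skywalking_queue_12 has indeed been deleted.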
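A quick way to confirm this kind of disappearance is to list the queue's backing files under /dev/shm before and after the load test. The sketch below is only an illustration: the function name `find_queue_files` is hypothetical, and the `skywalking_queue_` prefix is an assumption inferred from the `skywalking_queue_12` name seen above.

```python
import os


def find_queue_files(shm_dir="/dev/shm", prefix="skywalking_queue_"):
    """Return the sorted names of queue backing files found in shm_dir.

    The prefix is an assumption based on the file name reported in this
    thread. An empty list means no matching file exists (or the directory
    itself is missing).
    """
    try:
        return sorted(n for n in os.listdir(shm_dir) if n.startswith(prefix))
    except FileNotFoundError:
        return []


if __name__ == "__main__":
    # Print whatever queue files currently exist on this machine.
    print(find_queue_files())
```

Running it periodically during the load test would show the moment the file vanishes, which can then be correlated with other activity on the host.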

heyanlong commented 2 years ago

It looks like we need to add a re-creation mechanism. Would you like to submit a PR?

lvxiao1 commented 2 years ago

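@heyanlong Sure, but the queue is being deleted for no apparent reason. If we add a re-creation mechanism, won't the queue keep getting deleted and rebuilt, hurting performance through repeated allocation and release of shared memory?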

lvxiao1 commented 2 years ago

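The root cause has been found: nginx was configured with `fastcgi_cache_path /dev/shm`, so when nginx reclaimed its cache it wiped all shared memory under /dev/shm. Changing the setting to `fastcgi_cache_path /dev/shm/cache` solved the problem.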
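The effect of pointing a cache at the shared root versus a dedicated subdirectory can be modelled in a few lines. This is a toy model, not nginx's actual cache manager logic: `purge_cache` and `queue_survives` are hypothetical names, and a temporary directory stands in for /dev/shm.

```python
import os
import tempfile


def purge_cache(cache_path):
    """Delete every regular file directly under cache_path.

    This mimics a cleaner that assumes it owns the whole directory.
    """
    for name in os.listdir(cache_path):
        full = os.path.join(cache_path, name)
        if os.path.isfile(full):
            os.remove(full)


def queue_survives(cache_subdir):
    """Return True if the queue file still exists after the cache purge.

    cache_subdir == "" models a cache path of /dev/shm itself;
    cache_subdir == "cache" models a cache path of /dev/shm/cache.
    """
    with tempfile.TemporaryDirectory() as shm:  # stand-in for /dev/shm
        queue_file = os.path.join(shm, "skywalking_queue_12")
        open(queue_file, "w").close()
        cache_path = os.path.join(shm, cache_subdir) if cache_subdir else shm
        os.makedirs(cache_path, exist_ok=True)
        purge_cache(cache_path)
        return os.path.exists(queue_file)


if __name__ == "__main__":
    print("cache at shm root:", queue_survives(""))
    print("cache in subdirectory:", queue_survives("cache"))
```

In this model the queue file is removed only when the cache shares the root directory, which matches the behaviour reported above.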

lvxiao1 commented 2 years ago

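@heyanlong Regarding your earlier comment ("It looks like we need to add a re-creation mechanism. Would you like to submit a PR?"):

The consumer's receive call is blocking, but message_queue does not watch for the shm being reclaimed, and message_queue.remove does not notify waiters either, so the consumer ends up waiting indefinitely. Could we switch to timed_receive with a one-second timeout? If it returns true, keep reading; if it returns false, reopen the message_queue, and if a not_such_file_or_directory exception is caught, recreate the queue.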
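The control flow proposed here can be sketched as a small Python model. It is not the SDK's C++ code and does not use Boost: `ConsumerModel` is a hypothetical class, `queue.Queue.get(timeout=...)` stands in for timed_receive, and a plain file stands in for the queue's name under /dev/shm.

```python
import os
import queue


class ConsumerModel:
    """Toy consumer: self.q plays the mapped queue, self.path its shm name."""

    def __init__(self, path):
        self.path = path
        self.recreated = 0
        self._create()

    def _create(self):
        # Stand-in for creating the named queue.
        open(self.path, "w").close()
        self.q = queue.Queue()

    def poll(self, timeout=1.0):
        """Run one loop iteration: timed receive, then check the name."""
        try:
            # Stand-in for timed_receive returning true.
            return self.q.get(timeout=timeout)
        except queue.Empty:
            # Stand-in for timed_receive returning false. If the name is
            # gone, reopening would fail, so recreate the queue instead.
            if not os.path.exists(self.path):
                self._create()
                self.recreated += 1
            return None
```

In this model the existence check runs only after a timeout, so recreation happens at most once per timeout interval and only when the name is actually missing. That keeps the cost bounded even if something keeps deleting the file.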

heyanlong commented 2 years ago

Go ahead and submit a PR.