Hi @Linoy339
That was a good catch! For some reason our YAML file and the actual container were mismatched - the YAML file said 2021, but the container was using the latest image! Aditya and I have fixed this, so we should now be using the 2021 image for production.
Also, I have tried the staging API and I am able to make some basic requests successfully.
Okay. Let me update the staging stack once again and check. I will get back to you here soon.
@lukeoftheshire It worked after I removed some labels from the staging-worker stack:
traefik.enable: 'true'
traefik.docker.network: '****'
traefik.http.routers.lamp_worker_staging.entryPoints: '*****'
traefik.http.routers.lamp_worker_staging.tls.certresolver: '****'
traefik.http.services.lamp_worker_staging.loadbalancer.server.port: ****
@lukeoftheshire Can you please restart LAMP_server and LAMP_worker, as NATS appears to be disconnected?
Hi @Linoy339 - I have restarted the server and worker services!
Thanks @lukeoftheshire
@lukeoftheshire The lamp_worker staging instance is not updating with the latest code changes. Can you please check on this?
Hi @Linoy339 - after doing an image update for lamp_worker, I also did a little digging, and it looks like the lamp_worker service is using an image from ~29 hours ago, which appears to match up timewise with the latest release found at https://github.com/BIDMCDigitalPsychiatry/LAMP-worker/pkgs/container/lamp-worker/versions.
Is there a later version that should be in use?
@lukeoftheshire Apologies. It seems to be updated now. I am not sure what happened; maybe the update just took a little while.
Thank you
@lukeoftheshire Can you please restart LAMP_server and LAMP_worker for the production environment?
@Linoy339 - Done!
Thank you
@lukeoftheshire It worked after I removed some labels from the staging-worker stack:
traefik.enable: 'true'
traefik.docker.network: '****'
traefik.http.routers.lamp_worker_staging.entryPoints: '*****'
traefik.http.routers.lamp_worker_staging.tls.certresolver: '****'
traefik.http.services.lamp_worker_staging.loadbalancer.server.port: ****
This is a good catch, thanks for letting us know!
@lukeoftheshire Can you please restart LAMP_server and LAMP_worker for the production environment?
@Linoy339 Done! Apologies for the delay.
@lukeoftheshire We were troubleshooting this issue. Can you please confirm: 1) What is the hostname of LAMP_nats? 2) Is there any permission issue preventing Lamp_server from connecting to Lamp_nats at the moment?
It seems like Lamp_server and Lamp_worker crashed and restarted, and they have been unable to connect to NATS ever since.
@Linoy339 You should be able to see all these details in the console on the Stack page. There's no public hostname for NATS, so it's nats:4222.
As it's been a few weeks, I need to ask -- are we getting closer to resolving this issue at all?
@avaidyam We have tried many approaches. We were testing staging with fixes, and we checked memory usage, which looked fine. We observed that production, along with staging, is getting restarted at some point. This is where we are stuck: why are both stacks getting restarted?
Hi @avaidyam We will be pushing a possible fix ASAP and will let you know. First, we are trying to have the worker reconnect successfully to NATS even if NATS is restarted; a fix for that will be pushed ASAP. Once that is done, we will have time to analyze why the whole stack is getting restarted and figure out the root cause. We will keep you updated on this.
@Linoy339
LAMP-server:
[ioredis] Unhandled error event: Error: connect ECONNREFUSED 10.0.6.7:6379
LAMP-redis:
1:signal-handler (1638439813) Received SIGTERM scheduling shutdown...
1:M 02 Dec 2021 10:10:13.793 # User requested shutdown...
1:M 02 Dec 2021 10:10:13.793 # Redis is now ready to exit, bye bye...
LAMP-nats:
[1] 2021/12/02 08:10:24.346357 [ERR] 10.0.6.4:60046 - cid:137 - processPub Parse Error: 'lamp.dashboard.experimental.Anxiety Short Form_graph 176'
[1] 2021/12/02 08:10:24.487041 [ERR] 10.0.6.4:60050 - cid:138 - processPub Parse Error: 'lamp.dashboard.experimental.Mood Short Form_graph 170'
[1] 2021/12/02 08:10:29.687110 [ERR] 10.0.6.4:60052 - cid:139 - processPub Parse Error: 'lamp.dashboard.experimental.Anxiety Short Form_graph 176'
I'm not sure why Redis is being shut down by Docker, but it's not being killed for out-of-memory. There may be a debug log mode to enable for it.
NATS might be crashing due to this error. It looks like LAMP-server is allowing the string lamp.dashboard.experimental.Anxiety Short Form_graph 176 to be used as a key, which is an illegal reverse DNS identifier. The proper regular expression for a reverse DNS identifier is:
^[A-Za-z]{2,6}((?!-)\.[A-Za-z0-9-]{1,63}(?<!-))+$
//cc @lukeoftheshire Can you try to reproduce a crash by poking a new value for one of these keys that has a space in it?
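For illustration, a minimal TypeScript sketch of how such keys could be rejected before they reach NATS; the regular expression is the one above, while the helper name and example calls are hypothetical:

// Reject tag/subject names that are not valid reverse-DNS identifiers (regex from above).
const REVERSE_DNS = /^[A-Za-z]{2,6}((?!-)\.[A-Za-z0-9-]{1,63}(?<!-))+$/;

function isValidReverseDnsKey(key: string): boolean {
  return REVERSE_DNS.test(key);
}

// The offending key contains spaces, so it is rejected:
isValidReverseDnsKey("lamp.dashboard.experimental.Anxiety Short Form_graph 176"); // false
isValidReverseDnsKey("lamp.dashboard.experimental.anxiety-short-form");           // true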
The Redis server is being stopped by Docker when the stack is restarted: Docker sends a SIGTERM signal to the Redis process. So it is not a Redis issue; we need to figure out why the stack is getting restarted.
@Linoy please confirm that in production all the NATS tokens are custom tokens, so the illegal-token exception should not cause this issue. In staging we have automation where you can publish on different tokens, but that is not yet pushed to production, and we have this issue in production as well. Linoy has already added a check for illegal tokens in his local code, and this will be included with the NATS reconnect fix we are trying.
@lukeoftheshire Also, can you restart Lamp-server and Lamp-worker, as NATS is disconnected now?
@avaidyam lamp.dashboard.experimental.Anxiety Short Form_graph 176 is used in tags, right?
Also, we need to put the depends_on key in the docker-compose file for the worker service, so that NATS/Redis is started before the worker. We will test this in staging along with the other fix.
depends_on:
@michaelmenon depends_on is not supported by Docker Swarm:
The depends_on option is ignored when deploying a stack in swarm mode with a version 3 Compose file.
OK, thanks. Then there must be some other way to define the start order in swarm mode; I will check what we need to use.
Also, we should use the flags --log-opt max-size=10m --log-opt max-file=2 with docker run for each container to rotate the log files Docker produces. I think the log files produced by each container are growing very large and taking up a lot of space.
I think instead of depending on Docker flags, we already have the reconnection logic in the worker code; we just need to make sure it actually waits until NATS is up. But we do need the flags to prune the Docker logs of each container so that they don't grow too big and consume all the disk space, causing a stack restart.
We need to put this in each Docker service in the compose file:
logging:
  driver: "json-file"
  options:
    max-size: "50m"
Also we should use this flag --log-opt max-size=10m --log-opt max-file=2 with docker
These were already set quite some time ago by @lukeoftheshire for LAMP-server. NATS and Redis don't take up much logging space, so @lukeoftheshire, could you copy/paste the same logging options for LAMP-dashboard and LAMP-worker? It'll require a restart.
I think instead of depending on Docker flags, we already have the reconnection logic in the worker code; we just need to make sure it actually waits until NATS is up.
Agreed - the waiting logic must be in both LAMP-server and LAMP-worker.
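For reference, a minimal sketch of that wait-until-NATS-is-up behavior, assuming the server/worker use the nats npm client (v2 API); the option values shown are illustrative, and nats:4222 is the address mentioned earlier in this thread:

import { connect } from "nats";

// Keep retrying until NATS is reachable, and reconnect indefinitely if the connection drops.
async function connectToNats() {
  const nc = await connect({
    servers: "nats:4222",       // internal hostname from earlier in this thread
    waitOnFirstConnect: true,   // don't fail if NATS isn't up yet when the container starts
    maxReconnectAttempts: -1,   // retry forever
    reconnectTimeWait: 2000,    // wait 2s between attempts (illustrative value)
  });
  // Log connection lifecycle events (disconnect, reconnecting, reconnect) for debugging.
  (async () => {
    for await (const status of nc.status()) {
      console.log(`[nats] ${status.type}`);
    }
  })();
  return nc;
}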
@avaidyam I have updated the logging options for dashboard and worker - they have restarted, so I assume those restrictions now apply.
On a related note, I set the 'lamp.dashboard.experimental.Anxiety Short Form_graph 176' tag on one of my example participants, and nothing significant seems to have happened on the server as of a few minutes after setting that tag.
@lukeoftheshire Can you restart Lamp-server and Lamp-worker in production? Also, please pull the latest image for LAMP-worker in production.
@Linoy339 both done!
Lamp-worker started properly, but it seems LAMP-server didn't. Can you check, @lukeoftheshire?
I restarted it again - it looks to be running now!
@avaidyam @lukeoftheshire We have made some fixes in the staging environment. @lukeoftheshire Can you please restart LAMP-server for production?
@Linoy339 was this intentionally closed? It is still in testing status and being worked on, no?
Apologies. It was by mistake.
@lukeoftheshire Please make sure that LAMP-server for production is restarted.
Hi @Linoy339 Apologies for the delay! LAMP-server is restarted in prod.
Hi @avaidyam @lukeoftheshire, can you let me know if Docker was updated recently, i.e. 3-4 weeks back?
When I checked, the LAMP server and worker exit code is 137, which usually means an out-of-memory (OOM) error. However, if you check the container inspect output, it clearly says OOMKilled is false, so it is not out of memory. While researching, I found this thread in the Docker forum:
Other users have also reported similar issues where containers are stopped after 2-3 days with exit code 137 and no OOMKilled flag. Please let us know the Docker version running in production; we may try to update the Docker version and check.
node-01 was upgraded about 4 months ago due to a version of Amazon Linux 1 being deprecated. It's running Docker v20, and node-02 is still running Docker v18. I would also note that it's possible to have an exit code 137 and OOMKilled = false if the application (i.e. Java or NodeJS) killed itself before Docker killed it for a memory violation. This could be the underlying issue, but I wonder if there's something happening to both LAMP-server and LAMP-worker that causes an OOM error for NodeJS? (The issue is we haven't seen the NodeJS OOM exception log messages either.)
OK, yes, I'm not seeing any OOM exception for Node.js either. Can you run the "dmesg" command in the node-01 machine's terminal? This will show the signals sent by the kernel to processes. Please see if there is a "Killed process" message.
OOMKilled is only set if Docker itself kills the process; it will not be set if the kernel kills a process for out-of-memory. We can see the "Killed process" message sent by the kernel by running "dmesg" in the terminal of the host machine itself. Can you please check on that?
Hi @michaelmenon @avaidyam - running dmesg -T | grep -i Killed in node-01 reveals two instances when the redis-server process was killed - however both timestamps were in October (the 15th and the 23rd, for the record). There doesn't appear to be anything more recent than that.
Hope this helps - it seems to me like these specific instances are not relevant to this issue so I won't clog the page by including logs or anything, but let me know if I can give more info about either of these issues or if a screenshot/text log would be helpful.
@lukeoftheshire It seems like the production version of LAMP-server got deployed using lamp-server:latest instead of lamp-server:2021. Please check.
Also, the crash was reported today. You can see that the production and staging stacks went down at the same time. As we said, it looks strange, as these are different stacks.
Please share your thoughts. If you see any spike in your server logs, please let us know.
Thanks @lukeoftheshire. It does not seem to be from low memory, as I was watching the memory of all the processes on node-01 last night and all of them were using very little memory. The stack went down last night, so I don't think it's because of low RAM. Also, I'm not sure why staging and production would go down together. Is there any auto-update set up for the server?
We checked, and it seems CloudWatch doesn't have memory metrics, so it would be good if we could get Grafana access.
Hi @lukeoftheshire, after you update production to the correct image (lamp-server:2021), we will push some fixes to production by tomorrow so that the server and worker connect to NATS/Redis correctly after a restart.
Also, can you check if some backup script or auto-update runs on the server at any specific time?
Hi @Linoy339 @michaelmenon . This is very strange - the YAML in portainer definitely uses the 2021 tag but the image was still 'latest'. I updated the service and it appears to be back to the 2021 release. I will look into this and see if I can figure out why this has happened twice now.
There are several scripts that run on node-02 at specific times, but none that run on node-01 to my knowledge.
A docker image can have multiple tags. If a new production release was created, it will be tagged with both :2021.x.y and :latest. This doesn't seem to be the case here, so comparing the image SHAs might be helpful.
Thanks @lukeoftheshire. Production is now properly tagged. It seems the fixes in staging are working fine, as Redis and NATS are connecting properly to the lamp server as well as the lamp worker. We shall deploy to production by tomorrow morning (IST) and check there.
@lukeoftheshire @avaidyam We suspect the auto-deployment process.
Today, we deployed the dashboard to production around 2 pm (IST), and we saw that all stack components (production and staging) got restarted. It seems like the dashboard deployment is causing all containers to restart, and in most cases the container start order is:
1)Lamp-server
2)Lamp-worker
3)Lamp-message_queue
4)Lamp-nats
This order leads to a scenario where the server/worker cannot connect to Redis and NATS on the first attempt, which forced us to restart the server/worker containers manually so that they connect to Redis and NATS properly. This issue is resolved now, as the hosts keep trying to connect until NATS/message_queue is alive; this fix has also been pushed to production.
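As an illustration of that retry-until-alive behavior on the Redis side, a minimal sketch assuming the ioredis client (which the "[ioredis]" log line earlier in this thread suggests LAMP-server uses); the host name and backoff values are placeholders:

import Redis from "ioredis";

// Keep retrying with capped backoff instead of crashing when Redis is briefly unavailable.
const redis = new Redis({
  host: "redis",                // placeholder; use the stack's actual Redis service name
  port: 6379,
  retryStrategy: (attempt) => Math.min(attempt * 500, 5000), // ms to wait before the next attempt
  maxRetriesPerRequest: null,   // queue commands until the connection comes back
});

// Without an error listener, ioredis emits the "Unhandled error event" seen in the logs above.
redis.on("error", (err) => console.error("[ioredis]", err.message));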
But it seems the auto-restart of containers is caused by the dashboard deployment. Also, we did a LAMP-server deployment in production around 3:01 pm (IST), and this led to a restart of the dashboard container too. Please see the timestamps for both containers (LAMP-server and LAMP-dashboard). Can you please go through the scripts, YAML, or anything related to the CI/CD workflow?
Also, FYI: we can see that the image Lamp-worker:latest is getting pulled by both the staging and production stacks after a merge is completed. This should happen only in staging; production should only be updated when we forcefully pull the image.
As mentioned in the title, the message queue, NATS, worker, and server services appear to be stopping and starting fairly frequently in Portainer. @avaidyam and I are worried that there is a possibility of data loss due to these frequent stop-starts, so we are making this issue to keep track of it and log any information we find. Our first thought is to wonder if this could be related to automation with LAMP_worker in some way?
Please see the attached screenshot for an example of the frequently restarting services.
Relevant logs were attached for Redis, the server, and the worker.