This may also be linked to the recent logout issues we have been seeing. Automations are a recently developed addition, and as a group we are still trying to pinpoint the source of these issues. There may be a case where the app or dashboard code fails to reach the server, concludes that the credentials are incorrect, and deletes them. Let's look into whether this is related to the series of automatic logouts.
@lukeoftheshire We used to restart the stack when we saw the message "NATS server is disconnected" or a similar Redis disconnection error. Can you please keep an eye on memory whenever any of the containers gets restarted, and also note how frequently the restarts are happening?
@SuruPa00 This doesn't seem to be the cause of the automatic logout issues.
From the logs, nats-server and redis-server are getting disconnected for some reason. We will add more logging to the codebase and check.
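For reference, a minimal sketch of the kind of connection-event logging being proposed, assuming the classic node-nats (EventEmitter) client and ioredis; the URLs and service names are illustrative, not the actual LAMP-server configuration:

```ts
import * as NATS from "nats"; // classic node-nats client with the EventEmitter API
import Redis from "ioredis";

// Log every NATS connection state change so disconnects are visible in the container logs.
const nc = NATS.connect({ url: process.env.NATS_URL || "nats://message_queue:4222" });
nc.on("disconnect", () => console.warn("[nats] disconnected"));
nc.on("reconnecting", () => console.warn("[nats] reconnecting..."));
nc.on("reconnect", () => console.info("[nats] reconnected"));
nc.on("close", () => console.error("[nats] connection closed permanently"));
nc.on("error", (err) => console.error("[nats] error:", err.message));

// Same idea for Redis: surface disconnects and reconnect attempts.
const redis = new Redis(process.env.REDIS_URL || "redis://redis:6379");
redis.on("close", () => console.warn("[redis] connection closed"));
redis.on("reconnecting", () => console.warn("[redis] reconnecting..."));
redis.on("ready", () => console.info("[redis] ready"));
redis.on("error", (err) => console.error("[redis] error:", err.message));
```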
@Linoy339 Going forward, please do not manually restart anything - no changes to containers or services should ever be made. (Only the stack file should be updated, by myself, once a change is made to components, which there should not be anymore.)
Please resolve the actual issue. I still suspect this is linked to the automatic logouts until we have further evidence to the contrary.
Thanks Aditya, I will check with the team and update.
@lukeoftheshire @avaidyam @SuruPa00 It seems that the NATS connection is getting disconnected and then closed after the client's default retry attempts (currently 10). As a solution, we have changed the retry value to infinite, so that the NATS client will keep looking for nats-server until it is available. This has been done in staging, and we can move it to production after monitoring it for 2 or 3 days.
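For context, in the classic node-nats client this change amounts to raising maxReconnectAttempts from its default of 10 to -1 (infinite). A hedged sketch; the URL and wait time are assumptions, not the actual LAMP configuration:

```ts
import * as NATS from "nats";

// By default the client gives up (and emits "close") after maxReconnectAttempts: 10.
// Setting it to -1 makes the client keep retrying until nats-server is reachable again.
const nc = NATS.connect({
  url: process.env.NATS_URL || "nats://message_queue:4222",
  reconnect: true,
  maxReconnectAttempts: -1, // retry forever
  reconnectTimeWait: 2000,  // wait 2 seconds between attempts
});
```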
@Linoy339 Can we diagnose why NATS is being disconnected? Both Redis and NATS should never be going down as a result of our code. We aren't fixing the actual issue with the patch you described.
@avaidyam Yes, we are diagnosing it by adding more logging. We already tackled the issue related to the NATS max payload earlier, and we also made sure that Redis reconnects if it gets disconnected for some reason. We still need to introduce the solution above, so that the NATS connection comes back shortly after being disconnected.
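A hedged sketch of what "making sure Redis reconnects" typically looks like with ioredis; the backoff values and the READONLY check are illustrative assumptions rather than the actual LAMP-server settings:

```ts
import Redis from "ioredis";

const redis = new Redis({
  host: process.env.REDIS_HOST || "redis",
  port: 6379,
  // Keep retrying forever with a capped backoff instead of giving up after a few attempts.
  retryStrategy: (attempt) => Math.min(attempt * 200, 5000),
  // Force a reconnect if the server ends up in a read-only state (e.g. after a failover).
  reconnectOnError: (err) => err.message.includes("READONLY"),
});
```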
@lukeoftheshire We have seen that both LAMP_server and LAMP_worker, in staging as well as production, got restarted. Can you please restart the stack so that it works correctly?
As a diagnostic step, can you please provide today's full logs for LAMP_server, LAMP_worker, LAMP_staging_server, LAMP_staging_worker, LAMP_message_queue, and LAMP_staging_message_queue?
@lukeoftheshire Can you please restart Lamp-server and Lamp-worker (staging and production)? They are not connected to message_queue, which blocks the notification process in LAMP-worker.
@Linoy339 You should be able to access the logs from the Portainer console still. Just please don't update any services, containers, or stacks. If you need to update any stacks, @lukeoftheshire and I can help do that.
@avaidyam If you could provide the full logs as a txt file, that would be helpful. Also, we won't update any services anymore; instead, we will ask @lukeoftheshire to do that.
@Linoy339 We don't have access to anything more than you would through the Portainer console. The logs attached above were copied out of Portainer and manually redacted to hide IDs and sensitive info.
Okay, I will try that. Also, please restart the Lamp_staging stack as well as the Lamp stack, as it still looks like NATS is disconnected.
@avaidyam @lukeoftheshire Can you please check the Lamp stack (lamp-worker and lamp-server)? It does not seem to have restarted properly. Please restart the stack (lamp-worker and lamp-server only).
@Linoy339 We are working on it! Apologies for the delay.
Hi @Linoy339 thank you for your patience - sorry about the delay. I have just restarted both lamp-worker and lamp-server. Please let me know if this fixes the issues and if you need anything else done.
@avaidyam @lukeoftheshire No issues. It can be seen that the Bull queue is not getting re-initialized after Redis reconnects, which appears to be the primary and root issue here and needs to be resolved. However, we couldn't reproduce this in our own Portainer environment.
The stack has now been restarted properly.
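Since the Bull queue apparently does not come back after Redis reconnects, here is a hedged sketch of one way to guard against that: re-checking the queue once Bull's underlying ioredis client reports ready again. The queue name and processor are placeholders, not the actual LAMP-worker code:

```ts
import Queue from "bull";

// Placeholder queue; the real worker has its own queue name and job processor.
const notifications = new Queue("notifications", {
  redis: { host: process.env.REDIS_HOST || "redis", port: 6379 },
});

notifications.process(async (job) => {
  // ... send the notification described by job.data ...
});

// Queue-level errors include Redis connection failures; log them instead of dropping them.
notifications.on("error", (err) => console.error("[bull] queue error:", err.message));

// Bull exposes its ioredis client; when Redis comes back, confirm the queue is usable
// again rather than letting it silently stay idle.
notifications.client.on("ready", async () => {
  await notifications.isReady();
  console.info("[bull] redis ready - queue re-initialized");
});
```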
@avaidyam In order to troubleshoot these problems, we might need to restart containers ourselves. Can you please grant us access to manage the staging stack on our own?
The staging stack should be fine - NOT the production stack, though.
Sure. Thank you
@avaidyam @lukeoftheshire It seems that the LAMP server API is not reachable, apart from LAMP-dashboard. Have you changed anything in the inbound rules or any other network settings? Can you please check on this?
This sounds like it could be on your end? AWS sends us a system status notification if api.lamp.digital is unreachable from the internet.
Apologies - it's api-staging.lamp.digital, not api.lamp.digital.
The error is: Error: connect ECONNREFUSED 3.130.237.50:443
Works on my end?
Now it works from Postman.
But it's still not resolved in lamp_staging_worker. It says: error while fetching activities - FetchError: request to https://api-staging.lamp.digital/study/sdp35cpgddmcbp3f1qak/activity?ignore_binary=true failed, reason: connect ECONNREFUSED 3.130.237.50:443
I think the connection is getting refused somehow, and I got the same error in Postman too. lamp_staging_worker is also unable to make its API calls, with the following error: FetchError: request to https://api-staging.lamp.digital/researcher failed, reason: connect ECONNREFUSED 3.130.237.50:443
@lukeoftheshire Can you check?
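While the root cause is tracked down, one way to make the worker's API calls more tolerant of these transient ECONNREFUSED failures is a small retry wrapper around node-fetch. A hedged sketch with illustrative names (auth headers omitted), not the actual LAMP_worker code:

```ts
import fetch from "node-fetch";

// Retry a GET a few times with linear backoff before giving up,
// so a brief server restart does not fail the whole job.
async function fetchWithRetry(url: string, attempts = 5, delayMs = 2000): Promise<any> {
  let lastError: unknown;
  for (let i = 1; i <= attempts; i++) {
    try {
      const res = await fetch(url); // real calls would also pass Authorization headers
      if (res.ok) return await res.json();
      lastError = new Error(`HTTP ${res.status} from ${url}`);
    } catch (err) {
      lastError = err; // e.g. FetchError: connect ECONNREFUSED
    }
    if (i < attempts) {
      console.warn(`[worker] attempt ${i}/${attempts} failed for ${url}; retrying in ${delayMs * i}ms`);
      await new Promise((resolve) => setTimeout(resolve, delayMs * i));
    }
  }
  throw lastError;
}

// Example against the failing endpoint from the logs above:
// const activities = await fetchWithRetry(
//   "https://api-staging.lamp.digital/study/sdp35cpgddmcbp3f1qak/activity?ignore_binary=true"
// );
```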
Seems to work for me as well - I can connect to the staging API.
Can you please keep trying for a while? We are still getting errors when fetching, and I have tried from another machine as well.
Please see the screenshot:
@Linoy339 You should be able to see the logs of what the server is receiving and denying here. If no request appears for your IP address, then it's likely on your end or a spam protection feature we're not aware of on AWS's end.
Okay. We can see that Lamp-staging-server is refusing connections from Lamp-staging-worker.
Please find the logs here : https://console.lamp.digital/#!/4/docker/containers/5fdd1c28bcfc97cc1b07c759debced4e09dc30d6e8cb714a9532f0e4b57f0ad8/logs?nodeName=node-02.lamp.digital
Can you please try to identify the issue, if any exists?
@avaidyam @lukeoftheshire
Please find the error message I am getting in the lamp_staging_worker container logs:
"FetchError: request to https://api-staging.lamp.digital/participant/U2271608715/sensor_event?origin=lamp.analytics&limit=1000 failed, reason: connect ECONNREFUSED 3.130.237.50:443"
Also, when I tried to restart the staging instance, it said "cannot restart container....no space left on device".
Is there a space-related issue on the host, node-02?
It seems to be a space issue. Can you please check on volumes and related data?
Hi @Linoy339 Thank you for bringing this to our attention. This should be fixed for now although we will look into the root causes.
@lukeoftheshire We still face an issue with communication between lamp_staging_worker and lamp_staging_server, which blocks our testing. We suspect the space issue has not been resolved.
This is the error in the lamp_staging_worker logs:
request to https://api-staging.lamp.digital/type/6jf470r7z7peb6grgfr7/attachment/lamp.automation failed, reason: connect ECONNREFUSED 3.130.237.50:443
@lukeoftheshire Please find the screenshot showing the space issue:
@lukeoftheshire Also, we can see that both the staging and production dashboards are on the host node-02.
Can you confirm this, as we expected one of them to be on node-01?
@Linoy339 Only the production server components are in node-01. We are in the process of moving both the production and staging dashboards off of the infrastructure and onto GitHub Pages, which will allow for more rapid development and less latency when accessing pages.
Thank you @avaidyam - this means both dashboards are currently on node-02.
Hi @Linoy339 - thanks for bringing this to our attention. I have cleared out a small amount of space on the node - hopefully that is enough for now - over the course of the day I will clear out more space to stop this from becoming a recurring issue. I'll update here once that's done.
I can also confirm Aditya's statement - both dashboards are in node 2.
Thank you @avaidyam and @lukeoftheshire
And do you know, by any chance, how node-02 ran out of space?
We are in the process of moving both the production and staging dashboards off of the infrastructure and onto GitHub Pages, which will allow for more rapid development and less latency when accessing pages.
So the dashboards will be served as static sites from GitHub Pages (https://pages.github.com/)?
@Linoy339 Yes, that's correct. Node-02 is also used for internal development and data science, so a lot of cache files get generated there. Additionally, every time a new staging build is deployed, the used volumes and images pile up, adding 6+ GB per build. We have been manually removing them thus far.
Okay, now we understand. Thank you for the clarification.
@Linoy339 @avaidyam I have removed a good deal of data from the caching directory. Disk usage is back below 50%. I will continue to monitor this, of course, but this problem should no longer occur.
Thanks @lukeoftheshire
@lukeoftheshire We can see that the space has been restored. But it seems the issue is still there, as can be seen in the staging-worker logs: request to https://api-staging.lamp.digital/participant/U8315566314/sensor_event?origin=lamp.analytics&limit=1000 failed, reason: connect ECONNREFUSED 3.130.237.50:443
Do we need to restart anything after clearing the space?
Hi @Linoy339 I have just restarted the lamp staging server and worker services.
I will try the staging api on my end and update this post with the results.
Okay.
Also, LAMP-server is running the staging image, ghcr.io/bidmcdigitalpsychiatry/lamp-server:latest, instead of ghcr.io/bidmcdigitalpsychiatry/lamp-server:2021.
We had a release today and didn't notice this earlier. Can you check this too?
@Linoy339 :latest is always overwritten by master branch releases (staging). This is the correct tag to use.
Okay. My understanding was that production would use the lamp-server:2021 tag and staging would use lamp-server:latest.
Here is the screenshot:
It can be seen that the production server uses lamp-server:latest - this is something new to us.
If that is intended, then this is fine.
As mentioned in the title, the message queue (NATS), worker, and server services appear to be stopping and starting fairly frequently in Portainer. @avaidyam and I are worried that there is a possibility of data loss due to these frequent stop-starts, so we are opening this issue to keep track of it and log any information we find. Our first thought is to wonder whether this could be related to the LAMP_worker automations in some way.
Please see the following image for an example of the frequently restarting services.
Relevant logs:
- Redis
- Server
- Worker