Open techcouncilbox opened 4 years ago
It looks like 2 threads are triying to perform operations and obtain unconsistent info (highlighted in orange when second thread is disconnecting and after that error arises).
Maybe concurreny in these variables in KurentoParticipant need to be reviewed :
And Autowired variables marked as volatile in RPCNotificationService:
And SessionManager:
But we don't understand enough the source code to propose a valid solution
We're facing constant session crashes for users with this error on the latest release of openvidu. After a restart everything seems to work ok for a day or so and then users meetings start failing with the above error being thrown on the server. We're restarting daily at the moment. Any pointers on how to debug this would be appreciated.
@viktara can you describe your problem in more detail? Can you share an entire OpenVidu server log?
Hi @micaelgallego , Here is the link to more detail logs https://cloud.techblue.co.uk/s/8PLdi1bP1WllqGb/download Kindly let us know if you need any further information
@micaelgallego we're still experiencing multiple user session crashes each day. The only work around so far is to restart the containers and the problem goes away for a while. any pointers on where to debug would be appreciated. we're digging deeper and profiling at the moment, but are surprised that this issue is not effecting anyone else or in fact your pro customers. are you aware of any other occurrences of this issue or can you give us any further pointers here?
Hi,
We are not experiencig this kind of issues. I can't think of anybody with this same problem. How is your CPU load behavior? If your system is overloaded this kind of issues may appear.
Everything on that front seems fine, no cpu spikes, tons of unused memory. The instance is big and we have a small number of users, so I don't think it can be related to that.
A long shot, but we're going to disable the session garbage collector just in case that is causing an issue.
We have set up a new instance exactly as per the deploymet guide and started load testing using KITE. We get the same issue with the new instance, the session disconnects all users and throws the same error on the server. There is very little load on the system and it seems to happen even with very few users in a conference. We're creating a screencast in case you want to have a look. Could you help us with some pointers of where to look next?
@syam-tbl confirmed that we are not hitting 200 threads, so this issue does not seem linked to #476
@micaelgallego @pabloFuente we upgraded to the latest release and ran load tests again, but the error still occurs
Thank you for reporting. We will take a look to see what happens
Let's sort out a bit the problems described here.
This issue was started by @techcouncilbox refering a post message titled "RPCNotificationService exception with closed ws".
The suggestion made by this user was some kind of concurrency problem, but this doesn't really make sense. This message/exception appears when some client unexpectedly disconnects its websocket from OpenVidu Server. If this happens and OpenVidu is in the middle of notifying users about some event of the session, then it will throw this error. So the error is not a problem per se, it simply informs about the impossibility of sending a message to a closed client.
Then @viktara posted a new message talking about this same error. We discovered a possible bug that was affecting the amount of threads available for Tomcat in OpenVidu Server, but this seems not to be the real problem here according to https://github.com/OpenVidu/openvidu/issues/459#issuecomment-654159627.
So, from this point we need a way of replicating @viktara issue. Please, can you tell us how is your exact deployment? Did you deployed in AWS or On Premises? If it is the last one, release 2.15.0 brings a beautiful command that you can execute in your OpenVidu installation folder (default to /opt/openvidu
) to get that information:
./openvidu version
Also, if you are using a particular Cloud Provider to deploy OpenVidu On Premises, tell us which.
After having all of your deployment information, we need a way of replicating the problem using an application. WHen do you get the error? Connecting 5,6,7.... users at a time all of them publishing and subscribing to each other? What is th architecture of your sessions?
Regards.
@pabloFuente Yeah - it seems that the error is as you say and doesn't give us much information about why the video sessions die. As for the environment, we first deployed on aws and then tried another set up on digital ocean.
We have run load tests against both 2.14 and 2.15 and see the same behaviour in both and are using the openvidu-call application as our front end. I guess the problem could arise there. @syam-tbl is running the load tests and has more information so I'll leave him to pitch in here too.
@pabloFuente as @viktara mentioned we tested
2.14.0 on AWS OS : Ubuntu 18.04.3 LTS Docker version: 19.03.8 docker-compose version 1.25.5,
both 2.14.0 and 2.15.0 on Digital Ocean. OS : Ubuntu 18.04.3 LTS Docker version: 19.03.1 docker-compose version 1.26.0
We were using KITE for running the automated test. https://github.com/CoSMoSoftware/KITE-Sample-Tests/tree/master/KITE-OpenVidu-Test & yes during the test all users are publishing both voice and video
@syam-tbl can you execute ./openvidu version as suggested by @pabloFuente too please?
Sure, here is the output of that one
$ ./openvidu version
Openvidu Information:
Installation Type: On Premises
Openvidu Version: 2.15.0
Openvidu Call Version: 2.15.0
System Information:
Linux Version:
- Description: Ubuntu 18.04.3 LTS
Docker Version: 19.03.1
Docker Compose Version: 1.26.0
Installation Information:
Installation Folder: /opt/openvidu
Installation Folder Tree:
|-docker-compose.override.yml
|-openvidu
|-docker-compose.yml
|-.env
|-readme.md
|-owncert
| |-certificate.key
| |-certificate.cert
Docker Running Services:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
fc9e99d30dc5 openvidu/openvidu-coturn:1.0.0 "/bin/sh -c /usr/loc_" 47 hours ago Up 47 hours openvidu_coturn_1
c711e9417269 redis:5.0.7 "docker-entrypoint.s_" 47 hours ago Up 47 hours openvidu_redis_1
a6aac117e87b openvidu/openvidu-call:2.15.0 "docker-entrypoint.s_" 47 hours ago Up 47 hours 0.0.0.0:5442->5000/tcp openvidu_app_1
04b07d9aefc3 openvidu/openvidu-server:2.15.0 "/bin/bash -c 'expor_" 47 hours ago Up 47 hours openvidu_openvidu-server_1
ba91866204e2 kurento/kurento-media-server:6.14.0 "/entrypoint.sh" 47 hours ago Up 47 hours (healthy) openvidu_kms_1
83f26c02a47f openvidu/openvidu-proxy:2.0.0 "/bin/sh -c /usr/loc_" 47 hours ago Up 47 hours openvidu_nginx_1
Describe the bug Possible Bug is described in https://openvidu.discourse.group/t/rpcnotificationservice-exception-with-closed-ws/623/6
Logs are attached in thread
Sorry for the spam but maybe it is better place to check that.
Expected behavior All rpc calls should be called with opened rpc websocket.
Wrong current behavior RPCNotificationService exception with closed ws when sending messages.
OpenVidu tutorial where to replicate the error
Client device info (if applicable) Chrome 81 with PC (windows, Mac, Unix)
Screenshots
Additional context