Closed: merih-sakarya closed this issue 9 months ago
@merih-sakarya, your issue sounds like #663. What do the logs show on the provider server side and on the main server side?
@denis99999 thanks for your feedback. I contacted the network team to get more information.
For now, here are the WebSocket logs for both cases:
WebSocket Logs (Not Working)
- 0{"sid":"ib-zM-zNB9ZSpiH-AAAG","upgrades":[],"pingInterval":25000,"pingTimeout":20000}
- 40
+ 42["group.invite","yWNYYJk9EWwMQn593oCfzrv1v/4=","tx.668ce94d-3cb5-4fa8-986b-336b5d2fcf5f",{"requirements":{"serial":{"value":"1b8beee7","match":"exact"}}}]
- 42["socket.ip","::ffff:10.1.0.10"]
- 42["device.change",{"important":false,"data":{"serial":"ce10160aad263e1e03","battery":{"status":"full","health":"good","source":"usb","level":100,"scale":100,"temp":23.3,"voltage":3.943}}}]
- 42["device.change",{"important":false,"data":{"serial":"ce10160aad263e1e03","battery":{"status":"full","health":"good","source":"usb","level":100,"scale":100,"temp":23.4,"voltage":3.795}}}]
WebSocket Logs (Working)
- 0{"sid":"slS5P4ZqhQj8EForAAAN","upgrades":[],"pingInterval":25000,"pingTimeout":20000}
- 40
+ 42["group.invite","KgHlbSjaXDERTL7IMdnazMTO6kM=","tx.b11eafbb-325b-4b32-a925-c4d2de40b790",{"requirements":{"serial":{"value":"a4b2185f","match":"exact"}}}]
- 42["socket.ip","::ffff:10.1.0.7"]
- 42["device.log",{"serial":"a4b2185f","timestamp":1685035260.988,"priority":4,"tag":"device:plugins:group","pid":87,"message":"Now owned by \"test_user\"","identifier":"a4b2185f"}]
- 42["device.log",{"serial":"a4b2185f","timestamp":1685035260.989,"priority":3,"tag":"device:plugins:group","pid":87,"message":"Subscribing to group channel \"+ou+z8eSSH+dVaixt0Dh7w==\"","identifier":"a4b2185f"}]
- 42["device.change",{"important":true,"data":{"serial":"a4b2185f","owner":{"email":"test_user","name":"test_user","group":"+ou+z8eSSH+dVaixt0Dh7w=="},"likelyLeaveReason":"owner_change","usage":null,"using":true}}]
- 42["tx.done","tx.b11eafbb-325b-4b32-a925-c4d2de40b790",{"source":"a4b2185f","seq":0,"success":true,"data":"success","body":null}]
+ 42["tx.cleanup","tx.b11eafbb-325b-4b32-a925-c4d2de40b790"]
+ 42["user.settings.update",{"lastUsedDevice":"a4b2185f"}]
@denis99999, I've added new logs to lib/units/websocket/index.js. In the case where it doesn't work, I noticed that the following part isn't triggered:
.on(wire.TransactionDoneMessage, function(channel, message) {
  log.info(`DEBUG ::: lib.units.websocket ::: on TransactionDoneMessage ::: tx.done ::: channel : ${channel.toString()}, message : ${JSON.stringify(message)}`)
  socket.emit('tx.done', channel.toString(), message)
})
Do you know where exactly 'TransactionDoneMessage' is being triggered? I have a meeting with the network team on Monday, but before that, I would like to have a clear understanding of the network and data flow.
@merih-sakarya, as said before, your problem looks like a network problem, not one related to STF itself, which is working fine. That's why I advised you to take a look at the STF logs everywhere on the network path, and also to take network traces (tcpdump or Wireshark) to see what happens when the network connection is interrupted. I have no idea what your STF deployment is, but I recommend following the full deployment described in the DEPLOYMENT.md file.
@denis99999 Thank you for your previous advice. We have thoroughly investigated our network configuration, including using network trace tools such as tcpdump and Wireshark. So far, we haven't found any unusual behaviour or errors that could be causing the issue.
We have multiple mobile devices connected and initially, all of them work perfectly fine. However, after some time, a portion of these devices starts to encounter the issue while others continue to function normally. This inconsistent behavior makes it difficult to pinpoint a network issue as the root cause.
Upon comparing the logs from when the issue arises and when it doesn't, we noticed a discrepancy. Specifically, the log entry from the malfunctioning instance indicates the device owner as null:
0.f9b6ef9ef5d285a41a39.chunk.js:90943 DEBUG ::: GroupService ::: invite ::: device : {abi: 'arm64-v8a', airplaneMode: false, battery: {…}, browser: {…}, channel: 'EEidwqTUpEh8fpLvQw59Wel8LBE=', owner: null, …}
In contrast, the logs from the working instances show an actual owner for the device. Furthermore, the working instances are successfully updating the device owner in the database.
Could the issue potentially stem from the 'group.invite' function not properly assigning an owner to the device? Any further advice or recommendations based on this information would be greatly appreciated.
@merih-sakarya No, as I said before, to my knowledge this is not a functional problem with STF, but probably a problem in your STF deployment (of which I don't know the details). The problem is either network related, system related, or hardware related (whatever you connect your devices to). That probably also explains your trace showing "owner" is not up to date; it is not the update function that is faulty, otherwise it would probably be faulty all the time. To convince yourself of this, deploy all the STF service units locally on a single machine and you will see that this problem never happens. I'm sorry, but that's all I can do for you based on the information you have given me.
I have run into this same issue, with all of the services running on an EC2 instance except for the adb/provider side. Everything works fine all day; the next day the devices are still listed in STF, but clicking on one to control it gives a gray screen and it doesn't actually connect.
If I restart just the provider portion, the devices are reaped and then re-established, and it works fine for a while again.
While it could be a networking or other "not code" issue, strictly speaking, ideally STF would be explicit about failures and perhaps re-attempt to connect. As it is, because of the complexity of the infrastructure and the project itself, there is a large surface area to troubleshoot, so any information about non-happy-path handling is critical.
It would be great if the software would explicitly convey a failure to establish a connection instead of failing silently... and if there are situations where it could "self heal", that would be even better!
Even after reading the documentation and looking through what logging is available, on the networking/hosting side I don't see any evident cause in my case either. Given that the devices all work again after I restart the provider each day, I suspect a connection "implicitly/silently" times out somewhere, but I am not sure where to look next.
@merih-sakarya, @spinningD20, assuming it's a network issue on the network path as I suspect, for example a proxy or firewall closing the TCP connection when a timer expires between your provider's machine and the main machine, you can apply these settings on each machine:
# set kernel TCP keepalive properties to maintain connections to devices 24h/24h
sysctl -w net.ipv4.tcp_keepalive_time=60
sysctl -w net.ipv4.tcp_keepalive_intvl=10
# set ZMQ TCP keepalive properties to maintain connections to devices 24h/24h
ZMQ_TCP_KEEPALIVE=1
ZMQ_TCP_KEEPALIVE_IDLE=120
Let me know if this fixes your issue.
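A side note on the kernel settings: `sysctl -w` changes do not survive a reboot. To persist them, they can also be written to a sysctl configuration file; the filename below is a hypothetical example, following the usual `/etc/sysctl.d/` convention:

```shell
# /etc/sysctl.d/99-stf-keepalive.conf (hypothetical filename)
# Send the first keepalive probe after 60s of idleness, then every 10s.
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
```

The file can be loaded without rebooting via `sysctl --system`.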
Hi @denis99999,
I wanted to let you know that your suggestion worked for me. I applied the changes to the ZeroMQ settings as you recommended. Here's the updated code snippet I used:
var zmq = require('zeromq')
var log = require('./logger').createLogger('util:zmqutil')

module.exports.socket = function() {
  var sock = zmq.socket.apply(zmq, arguments)

  try {
    // These parameters are important for maintaining open, idle TCP connections
    // and ensuring that the other end of the connection is still reachable and active.
    // They are particularly useful in scenarios where connections might go idle
    // but still need to be kept open for future communication.
    // For example, in a client-server application, if a client goes silent
    // (no data exchange), these keepalive settings help the server determine
    // if the client is still connected or has potentially lost network connectivity.
    // If the client doesn't respond to a certain number of keepalive probes,
    // the server may assume the connection is dead and close it.
    //
    // @see https://github.com/DeviceFarmer/stf/issues/662
    // @see https://stackoverflow.com/questions/58143675/zmq-losing-subscribe-connection
    // @see https://github.com/openstf/stf/issues/100
    // @see https://libzmq.readthedocs.io/en/latest/zmq_setsockopt.html
    sock.setsockopt(zmq.ZMQ_TCP_KEEPALIVE, 1) // Enables TCP keepalive. The default is 0 (off).
    sock.setsockopt(zmq.ZMQ_TCP_KEEPALIVE_IDLE, 300) // Sets the idle time before the first keepalive probe is sent to 300 seconds (5 minutes).
    sock.setsockopt(zmq.ZMQ_TCP_KEEPALIVE_CNT, 10) // Specifies the number of keepalive probes to send before considering the connection dead if no response is received.
    sock.setsockopt(zmq.ZMQ_TCP_KEEPALIVE_INTVL, 300) // Sets the interval between individual keepalive probes to 300 seconds.
  }
  catch (err) {
    log.warn('ZeroMQ library too old, no support for some options: ' + err.message)
  }

  return sock
}
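One caveat with the single try block above: if one setsockopt call throws, the options after it are silently skipped. A standalone sketch of a variant that applies each option independently, so one unsupported option does not prevent the rest from taking effect. The fake socket and option names here are purely illustrative, not zeromq's API:

```javascript
// Illustrative stand-in for a socket whose setsockopt may reject
// options the underlying library does not know about.
function makeFakeSock(supported) {
  return {
    opts: {},
    setsockopt: function(name, value) {
      if (supported.indexOf(name) === -1) {
        throw new Error('Invalid argument: ' + name)
      }
      this.opts[name] = value
    }
  }
}

// Apply each option in its own try/catch and report which ones failed,
// instead of aborting on the first unsupported option.
function applyOpts(sock, opts) {
  var failed = []
  Object.keys(opts).forEach(function(name) {
    try {
      sock.setsockopt(name, opts[name])
    }
    catch (err) {
      failed.push(name)
    }
  })
  return failed
}
```

With an old libzmq this would still enable keepalive itself even if, say, the probe-count option is unknown, whereas the all-or-nothing try block may leave keepalive partially configured.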
This fix was crucial for maintaining stable connections in my setup. The detailed configuration for TCP keepalive options really made the difference.
Thanks for sharing this solution!
Best, Merih Sakarya
Description: After upgrading device-farmer from version 3.6.1 to 3.6.5, I've been encountering an issue where devices stop being displayed after 3-4 hours of operation.
Steps to reproduce:
Expected result: Devices should continue to display and function normally. I expect to get the following log:
Actual result: Devices stop being displayed after 3-4 hours. The log above does not appear.
Debugging: In an effort to diagnose the problem, I added some debug logs and found that the socket connection couldn't be established during the "group.invite" request.
Below are the relevant logs:
And these are the logs of a working example:
And a snippet of the code (group-service.js) where I believe the issue resides:
Additionally, I want to mention that the function getDevice in control-panes-controller.js cannot proceed to the second step, ControlService.create, because groupService.invite is not completed. Here is the getDevice function:
Update:
In response to this issue, I have implemented additional logging in lib/units/websocket/index.js to assist with diagnosis. Here's the updated section of the code: