Open ConfusedMerlin opened 1 year ago
You either have a misconfiguration on the node server (where only adb and the provider run), or you have an issue with the nginx configuration.
Well, this is what the node server's provider processes look like (including an attached phone), where you can see what is used for what. node-06 is the main server, node-07 the node (I just swapped out the internal-only parts of the URLs):
```
USER PID %CPU %MEM    VSZ   RSS TTY   STAT START TIME COMMAND
stf   42  0.3  0.0   4248  3464 pts/0 Ss   09:21 0:00 /bin/bash
stf   54  0.0  0.0   5900  2768 pts/0 R+   09:21 0:00 \_ ps faux
stf   55  0.0  0.0   2660   508 pts/0 S+   09:21 0:00 \_ cat
stf    1  0.1  1.4 793108 59524 ?     Ssl  09:13 0:00 node /app/bin/stf provider --name node-07 --connect-sub tcp://node-06.devicefarm.de:7250 --connect-push tcp://node-06.devicefarm-mobile-net.db.de:7270 --storage-url https://node-06.devicefarm.de/ --public-ip node-07.devicefarm.de --heartbeat-interval 10000 --screen-ws-url-pattern wss://node-06.devicefarm.de/d/node-07/<%= serial %>/<%= publicPort %>/ --adb-host adb --min-port 7400 --max-port 7700 --allow-remote
stf   18  0.5  2.1 1036020 86844 ?    Sl   09:13 0:03 /usr/local/bin/node /app/lib/cli device --serial S62Pro202038002643 --provider node-07 --screen-port 7400 --connect-port 7401 --vnc-port 7402 --public-ip node-07.devicefarm.de --group-timeout 900 --storage-url https://node-06.devicefarm.de/ --adb-host adb --adb-port 5037 --screen-jpeg-quality 80 --screen-ping-interval 30000 --screen-ws-url-pattern wss://node-06.devicefarm.de/d/node-07/<%= serial %>/<%= publicPort %>/ --connect-url-pattern ${publicIp}:${publicPort} --heartbeat-interval 10000 --boot-complete-timeout 60000 --vnc-initial-size 600x800 --mute-master never --connect-sub tcp://node-06.devicefarm.de:7250 --connect-push tcp://node-06.devicefarm.de:7270
```
And this is the nginx.conf of the main server (the node no longer has its own nginx):
```nginx
worker_processes auto;

events {
    worker_connections 4096;
}

http {
    include /etc/nginx/conf.d/resolver.conf;
    keepalive_timeout 65;
    types_hash_max_size 2048;

    log_format upstreamlog '[$time_local] $remote_addr - $remote_user - $server_name to: $upstream_addr: $request upstream_response_time $upstream_response_time msec $msec request_time $request_time';

    error_log /var/log/nginx/error.log debug;
    default_type application/octet-stream;

    upstream stf_app {
        server app:3000 max_fails=0;
    }
    upstream stf_auth {
        server auth:3000 max_fails=0;
    }
    upstream stf_storage_apk {
        server storage-plugin-apk:3000 max_fails=0;
    }
    upstream stf_storage_image {
        server storage-plugin-image:3000 max_fails=0;
    }
    upstream stf_storage {
        server storage-temp:3000 max_fails=0;
    }
    upstream stf_websocket {
        server websocket:3000 max_fails=0;
    }
    upstream stf_api {
        server api:3000 max_fails=0;
    }

    types {
        application/javascript js;
        image/gif gif;
        image/jpeg jpg;
        text/css css;
        text/html html;
    }

    map $http_upgrade $connection_upgrade {
        default upgrade;
        '' close;
    }

    server {
        listen 80;
        listen [::]:80;
        return 301 https://$server_name$request_uri;
        server_tokens off;
        server_name node-06.devicefarm.de;
    }

    server {
        listen 443 ssl;
        listen [::]:443 ssl;
        keepalive_timeout 70;
        server_name node-06.devicefarm.de;
        root /dev/null;
        server_tokens off;
        ssl_certificate /etc/nginx/ssl/$NODE_NAME.crt;
        ssl_certificate_key /etc/nginx/ssl/$NODE_NAME.key;

        location ~ "^/d/node-06/([^/]+)/(?<port>[0-9]{3,5})/$" {
            #proxy_pass http://provider:$port/;
            proxy_pass http://10.0.2.4:$port/;
            #proxy_pass http://10.35.142.30:$port/;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;
            proxy_set_header X-Forwarded-For $remote_addr;
            proxy_set_header X-Real-IP $remote_addr;
        }

        location ~ "^/d/node-07/([^/]+)/(?<port>[0-9]{3,5})/$" {
            proxy_pass http://10.0.2.5:$port/;
            #proxy_pass http://192.168.240.$ippart:$port/;
            #proxy_pass http://10.35.142.$ippart:$port/;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;
            proxy_set_header X-Forwarded-For $remote_addr;
            proxy_set_header X-Real-IP $remote_addr;
        }

        location /pss/ {
            proxy_pass http://ldappss/;
        }
        location /ldapadmin/ {
            proxy_pass https://ldapadminservice/;
        }
        location /auth/ {
            proxy_pass http://stf_auth/auth/;
        }
        location /api/ {
            proxy_pass http://stf_api/api/;
        }
        location /s/image/ {
            proxy_pass http://stf_storage_image;
        }
        location /s/apk/ {
            proxy_pass http://stf_storage_apk;
        }
        location /s/ {
            client_max_body_size 1024m;
            client_body_buffer_size 128k;
            proxy_pass http://stf_storage;
        }
        location /socket.io/ {
            proxy_pass http://stf_websocket;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Real-IP $http_x_real_ip;
        }
        location / {
            proxy_pass http://stf_app;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Real-IP $http_x_real_ip;
        }
    }
}
```
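As a quick sanity check of those screen-WebSocket `location` blocks, the same pattern can be exercised in plain JavaScript to confirm which paths nginx would match and which port it would extract into `$port`. This is just an illustrative sketch of the regex above, not STF code:

```javascript
// The nginx location regex for node-07, translated to JavaScript.
// The named group "port" corresponds to nginx's $port variable.
const screenWsPattern = /^\/d\/node-07\/([^/]+)\/(?<port>[0-9]{3,5})\/$/;

function matchScreenWsPath(path) {
  const m = screenWsPattern.exec(path);
  if (!m) return null;
  return { serial: m[1], port: Number(m.groups.port) };
}

// A path as generated by --screen-ws-url-pattern matches:
console.log(matchScreenWsPath("/d/node-07/S62Pro202038002643/7400/"));
// → { serial: 'S62Pro202038002643', port: 7400 }

// A missing trailing slash (or a port outside 3-5 digits) does not match:
console.log(matchScreenWsPath("/d/node-07/S62Pro202038002643/7400"));
// → null
```

If a path the provider generates fails to match here, nginx falls through to the `/` location and the WebSocket upgrade never reaches the provider's screen port.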
I am tempted to say "they do look correct", but since I keep failing to spot a cause for that error... maybe they don't?
And again, it works well UNTIL I briefly disconnect the node... a proper MQ setup should be able to handle this, I would assume. So why does this one seem to fail? Speaking of MQ, I made a custom client to listen to the messages transmitted when it works (open and close) and when it doesn't (open, but no close):
```
b"+8WzZS0hSFCNL6S1o3qd2tCM/ss=\x08\n\x12s\nQ\n&devicefarm@domain.com\x12\radministrator\x1a\x18d4UKTg5FQl+hfHt9Ptmx8g==\x1a\x1e\n\x06serial\x12\x12S62Pro202038002643\x18\x03\x1a'tx.b9c7baef-7332-4cf3-9a57-1a45df19b47d"
b"+8WzZS0hSFCNL6S1o3qd2tCM/ss=\x085\x12\x00\x1a'tx.a191daa4-f5d0-4c8b-96ff-0d2cc15f1fc1"
b"+8WzZS0hSFCNL6S1o3qd2tCM/ss=\x08\x1b\x12 \x12\x1e\n\x06serial\x12\x12S62Pro202038002643\x18\x03\x1a'tx.6cceeae6-d884-4f32-a720-66edf67eb5f1"
b"+8WzZS0hSFCNL6S1o3qd2tCM/ss=\x08\n\x12s\nQ\n&devicefarm@domain.com\x12\radministrator\x1a\x18d4UKTg5FQl+hfHt9Ptmx8g==\x1a\x1e\n\x06serial\x12\x12S62Pro202038002643\x18\x03\x1a'tx.39efdc60-7dae-447e-9a4d-1cec791bedd7"
```
To be honest, I cannot see anything in there, besides the lack of messages after the opening one.
The new super AI (chat.openai.com) actually had an interesting answer to my question, which I did not find by classical googling:
Subscribers are considered unreachable after their network connection has an outage, because publisher and subscriber maintain their own state and are kind of "out of sync". Both sides have their own view of the system, which may cause them to drift apart, and that can result in messages being lost.
Though, it didn't offer a solution that fits this very setup.
Okay, I have a new piece of the puzzle: you need Docker to get this error. I do not know why as of now...
How to reproduce it? Build a Docker image containing a fitting Node.js version (16 makes trouble, 17 is out of support, but 14 works well) and the zeromq version used in STF (5.2.8). Then add micro scripts to run a broker and a subscriber, looking like this (I used a tutorial from dev.to):
```js
// @/server.js
const Fastify = require("fastify");
const zmq = require("zeromq");

const app = Fastify();
const sock = zmq.socket("pub"); // const sock = new zmq.Publisher(); // is for zeromq 6

app.post("/", async (request, reply) => {
  await sock.send(["topic", JSON.stringify({ ...request.body })]);
  return reply.send("Sent to the subscriber/worker.");
});

const main = async () => {
  try {
    await sock.bind("tcp://*:7890");
    await app.listen(3000, "0.0.0.0");
  } catch (err) {
    console.error(err);
    process.exit(1);
  }
};
main();
```
```js
// @/worker.js
const zmq = require("zeromq");

const sock = zmq.socket("sub"); // const sock = new zmq.Subscriber(); // is for zeromq 6

const main = async () => {
  try {
    sock.connect("tcp://10.0.2.4:7890");
    sock.subscribe("topic");
    // zeromq 6 style:
    // for await (const [topic, msg] of sock) {
    //   console.log("Received message from " + topic + " channel:");
    //   console.log(JSON.parse(msg));
    // }
    sock.on("message", function (topic, message) {
      console.log("message:", message, "topic", topic);
    });
  } catch (err) {
    console.error(err);
    process.exit(1);
  }
};
main();
```
Put them in one image (two will do too, but it's unnecessary) and start each of them on a different VM (the VMs should be able to talk to each other).
If you now do
curl -X POST ip.of.your.vm:3000 -d "heyho"
the worker should print that out... well, as a byte string. But that is not the point. Now disconnect that VM from the network and reattach it right afterwards, and... tadaaa, it has stopped receiving messages.
If you try that setup without the Docker containers, you will notice that the worker is still able to receive messages after a network reconnect. So... is this a zeromq, a Node.js, or a Docker problem?
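One thing that might be worth experimenting with against this symptom: newer libzmq versions support ZMTP-level heartbeats, which let a socket notice that its TCP connection silently died (e.g. when Docker's bridge/NAT drops the connection state during the outage) and trigger a reconnect. The zeromq 5.2.8 used above may not expose these options, so this is a hedged sketch against the zeromq 6 API (option names as documented there), an experiment rather than a verified fix:

```javascript
// Hedged sketch (zeromq 6 API, not the 5.2.8 used in the repro above):
// enable ZMTP heartbeats so a dead TCP connection is detected and
// re-established instead of hanging forever.
const zmq = require("zeromq");

const sock = new zmq.Subscriber();
sock.heartbeatInterval = 2000;   // send a PING every 2 s
sock.heartbeatTimeout = 6000;    // treat the peer as dead after 6 s of silence
sock.heartbeatTimeToLive = 6000; // ask the peer to drop us if we go silent

sock.connect("tcp://10.0.2.4:7890"); // publisher address from the repro above
sock.subscribe("topic");

(async () => {
  for await (const [topic, msg] of sock) {
    console.log("topic:", topic.toString(), "message:", msg.toString());
  }
})();
```

If this makes the containerized worker survive the unplug/replug cycle, that would point at silently dropped TCP state rather than a zeromq bug.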
Hi @ConfusedMerlin, we are also facing a similar issue. After any network disconnection the device list still shows the device, but on trying to view the device it shows up as in the attached image. Is this the same issue you are facing, and have you identified any solution for it?
I had a similar issue. It usually happens due to an adb device disconnection. We fixed it by creating our own adb Docker image. That image has several cron jobs: one resets the USB driver if a device is disconnected, and another performs an adb reset. That solved the issue for us.
This is where I got the inspiration https://github.com/agoda-com/adb-butler
We still haven't found a useful solution, besides cron jobs to restart things. I think I even escalated this into the bug-reporting section of the people who maintain the messaging queue, but... no reaction so far.
Right, but I think you would prefer any solution over no solution :)
What is the issue or idea you have?
A server/node setup, where the server runs all the processes necessary for the farm to work, while the nodes only run a provider and adb. Affected devices are connected to a node, but NEVER to the server.
The device is listed as ready-for-use in the device farm's web GUI. You can open its details page, but the screen is grey and all the extra information (like the connection URL) is void and empty. Also, you can no longer leave the details page via the little red x above the (grey) screen.
All the information required for this page should get pulled from the RethinkDB at the moment the user opens the details page. If the device was not used before the error appeared, it stays empty. But if it was open just prior to the network outage, the required content is available... and seems valid. Still, the error happens, no matter whether the required fields are empty or not.
The wss string and all information necessary for filling the database fields (and later, the details page) should be pulled from the device's provider connection process inside the Docker container where the provider is running. Said process exists, and all the necessary information is not only available but also valid (tested with the adb connection string).
So... a device which is attached properly and has all connection information available fails to show up in the web GUI if the node it is attached to experiences as little as four seconds (not measured, but the average time in my reproduction environment, see reproduction) of network downtime.
By itself, it will never recover; sometimes a USB reset will do the trick, but most times I have to restart the provider on that node. Interestingly, if you pull the device out when this happens (either by pulling the USB cable or removing it from the VM), the entry in the web GUI switches back to Disconnected, as it should.
For a long time I thought the USB hubs were responsible for this, but it seems they do work fine.
Also... if we have the node run just like the server (with all stf processes up and even its own RethinkDB, although none of it is actually used except for adb and the provider), the error also appears.
Finally, I must admit I still somehow doubt that this is the only way to create this kind of issue. I think so because the steps to reproduce would suggest that every device at a briefly disconnected node should encounter the same error. But as said already, in our production environment usually multiple devices of a node are affected (so that restarting their provider is way faster than trying to USB-reset every single one), but not all. Roughly one third of them (a wild guess, based on observations) remain functional, while the rest get a... Grey Screen of Despair.
Does it only happen on a specific device? Please run
adb devices -l
and paste the corresponding row.

Every device not connected to the main server can be affected; usually multiple are affected at once (but not all, strangely). Of course, devices connected to the main server are never affected (because they cannot experience a network disconnect from the perspective of their provider).
Besides this, I did not notice any differences between Android versions or phone vendors. If there are any, they remain unknown.
Also, we have a local stf branch with an updated Node.js (18.something?) and AngularJS at 1.8.3 that has the same issue. Apart from that, it works quite fine (but we don't have all the sub-packages done, nor have we even touched the other frameworks hidden inside).
Please provide the steps to reproduce the issue.
First, you need three servers in some form to reproduce this in the most realistic manner. It may work with two (omit the client).
All of them should be able to reach the others on the network. This can be done with VMs (I recommend VirtualBox 7.0.4 with NAT networks; 7.0.2 seems to be faulty) or physical machines. Our Docker setup contains one container per stf process; I never tested the issue with the... normal all-STF-in-one setup used here. It's worth a try, but not now.
Have the server start all processes, then have the node start up, attach the device to the node, and finally access the device from the client. This should work as expected.
Then go back to the device list and disconnect the node from your network. In my VM setup, I simply untick the device's network connection entry and re-tick it right afterwards. This usually takes about four seconds. Pulling and re-plugging a network cable on physical servers should do the same (tested and bug-verified).
Now open the device in your client. The device's screen should be grey, and things like the connection string should be lost. Nothing you can do to this device, short of restarting the provider process on the node, can bring it back into a usable state. Sometimes a usbreset will do too (which made me initially suspect a faulty USB hub).
What is the expected behavior?
It should either go offline/disconnected, or (which would be best) simply remain usable or be renewed. I wish I could tell which part in the "GUI -> DB -> provider process" chain of getting the information is broken, but I could not find out.
Do you see errors or warnings in the
stf local
output? If so, please paste them or the full log here.

No errors at the Docker level. Neither node nor server seems to be aware of the broken connection. Not even the browser debugger shows a real error message; it just does not start the wss, because the db entry for the wss is empty.
EDIT: Wait, there are errors in the client browser. If you try to use the white-blue buttons below the (now grey) screen area, it says something like: Uncaught TypeError: scope.control is null. And if you hit the little red cross, you get a "device is null" error in the browser debugger, originating from DeviceControlCtrl.js... I think it's from that, because the whole 60k+ line js is provided as source.
Please run
stf doctor
and paste the output here.

... we have one container for each stf process; which one is required? From the server's provider, it looks like this:
Well... of course the RethinkDB and adb are not in its container. And where to find the ProtoBuf?
docker setup
Because it's different, I guess you need at least the server's docker-compose.yml here. The node version is... not the same, but a reuse of it: simply remove every service except adb and the provider, and remove their now-unmatched dependencies. ... of course, I cannot attach .yaml files here. WHYYYYY?