Open ConfusedMerlin opened 1 year ago
You either have a misconfiguration on the node server (where only adb and the provider run), or you have an issue with the nginx configuration.
Well, this is what the node server's provider processes look like (including an attached phone), where you can see what is used for what. node-06 is the main server, node-07 the node (I just swapped out the internal-only parts of the URLs):
```
USER PID %CPU %MEM    VSZ   RSS TTY   STAT START TIME COMMAND
stf   42  0.3  0.0   4248  3464 pts/0 Ss   09:21 0:00 /bin/bash
stf   54  0.0  0.0   5900  2768 pts/0 R+   09:21 0:00 \_ ps faux
stf   55  0.0  0.0   2660   508 pts/0 S+   09:21 0:00 \_ cat
stf    1  0.1  1.4 793108 59524 ?     Ssl  09:13 0:00 node /app/bin/stf provider --name node-07 --connect-sub tcp://node-06.devicefarm.de:7250 --connect-push tcp://node-06.devicefarm-mobile-net.db.de:7270 --storage-url https://node-06.devicefarm.de/ --public-ip node-07.devicefarm.de --heartbeat-interval 10000 --screen-ws-url-pattern wss://node-06.devicefarm.de/d/node-07/<%= serial %>/<%= publicPort %>/ --adb-host adb --min-port 7400 --max-port 7700 --allow-remote
stf   18  0.5  2.1 1036020 86844 ?    Sl   09:13 0:03 /usr/local/bin/node /app/lib/cli device --serial S62Pro202038002643 --provider node-07 --screen-port 7400 --connect-port 7401 --vnc-port 7402 --public-ip node-07.devicefarm.de --group-timeout 900 --storage-url https://node-06.devicefarm.de/ --adb-host adb --adb-port 5037 --screen-jpeg-quality 80 --screen-ping-interval 30000 --screen-ws-url-pattern wss://node-06.devicefarm.de/d/node-07/<%= serial %>/<%= publicPort %>/ --connect-url-pattern ${publicIp}:${publicPort} --heartbeat-interval 10000 --boot-complete-timeout 60000 --vnc-initial-size 600x800 --mute-master never --connect-sub tcp://node-06.devicefarm.de:7250 --connect-push tcp://node-06.devicefarm.de:7270
```
And this is the nginx.conf of the main server (the node no longer has its own nginx):
```nginx
worker_processes auto;

events {
    worker_connections 4096;
}

http {
    include /etc/nginx/conf.d/resolver.conf;
    keepalive_timeout 65;
    types_hash_max_size 2048;

    log_format upstreamlog '[$time_local] $remote_addr - $remote_user - $server_name to: $upstream_addr: $request upstream_response_time $upstream_response_time msec $msec request_time $request_time';

    error_log /var/log/nginx/error.log debug;
    default_type application/octet-stream;

    upstream stf_app {
        server app:3000 max_fails=0;
    }
    upstream stf_auth {
        server auth:3000 max_fails=0;
    }
    upstream stf_storage_apk {
        server storage-plugin-apk:3000 max_fails=0;
    }
    upstream stf_storage_image {
        server storage-plugin-image:3000 max_fails=0;
    }
    upstream stf_storage {
        server storage-temp:3000 max_fails=0;
    }
    upstream stf_websocket {
        server websocket:3000 max_fails=0;
    }
    upstream stf_api {
        server api:3000 max_fails=0;
    }

    types {
        application/javascript js;
        image/gif gif;
        image/jpeg jpg;
        text/css css;
        text/html html;
    }

    map $http_upgrade $connection_upgrade {
        default upgrade;
        '' close;
    }

    server {
        listen 80;
        listen [::]:80;
        return 301 https://$server_name$request_uri;
        server_tokens off;
        server_name node-06.devicefarm.de;
    }

    server {
        listen 443 ssl;
        listen [::]:443 ssl;
        keepalive_timeout 70;
        server_name node-06.devicefarm.de;
        root /dev/null;
        server_tokens off;
        ssl_certificate /etc/nginx/ssl/$NODE_NAME.crt;
        ssl_certificate_key /etc/nginx/ssl/$NODE_NAME.key;

        location ~ "^/d/node-06/([^/]+)/(?<port>[0-9]{3,5})/$" {
            #proxy_pass http://provider:$port/;
            proxy_pass http://10.0.2.4:$port/;
            #proxy_pass http://10.35.142.30:$port/;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;
            proxy_set_header X-Forwarded-For $remote_addr;
            proxy_set_header X-Real-IP $remote_addr;
        }

        location ~ "^/d/node-07/([^/]+)/(?<port>[0-9]{3,5})/$" {
            proxy_pass http://10.0.2.5:$port/;
            #proxy_pass http://192.168.240.$ippart:$port/;
            #proxy_pass http://10.35.142.$ippart:$port/;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;
            proxy_set_header X-Forwarded-For $remote_addr;
            proxy_set_header X-Real-IP $remote_addr;
        }

        location /pss/ {
            proxy_pass http://ldappss/;
        }
        location /ldapadmin/ {
            proxy_pass https://ldapadminservice/;
        }
        location /auth/ {
            proxy_pass http://stf_auth/auth/;
        }
        location /api/ {
            proxy_pass http://stf_api/api/;
        }
        location /s/image/ {
            proxy_pass http://stf_storage_image;
        }
        location /s/apk/ {
            proxy_pass http://stf_storage_apk;
        }
        location /s/ {
            client_max_body_size 1024m;
            client_body_buffer_size 128k;
            proxy_pass http://stf_storage;
        }
        location /socket.io/ {
            proxy_pass http://stf_websocket;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Real-IP $http_x_real_ip;
        }
        location / {
            proxy_pass http://stf_app;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Real-IP $http_x_real_ip;
        }
    }
}
```
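As a quick sanity check of those screen-WebSocket `location` blocks, the same pattern can be exercised in plain JavaScript to confirm which paths nginx would match and which port it would extract into `$port`. This is just an illustrative sketch of the regex above, not STF code:

```javascript
// The nginx location regex for node-07, translated to JavaScript.
// The named group "port" corresponds to nginx's $port variable.
const screenWsPattern = /^\/d\/node-07\/([^/]+)\/(?<port>[0-9]{3,5})\/$/;

function matchScreenWsPath(path) {
  const m = screenWsPattern.exec(path);
  if (!m) return null;
  return { serial: m[1], port: Number(m.groups.port) };
}

// A path as generated by --screen-ws-url-pattern matches:
console.log(matchScreenWsPath("/d/node-07/S62Pro202038002643/7400/"));
// → { serial: 'S62Pro202038002643', port: 7400 }

// A missing trailing slash (or a port outside 3-5 digits) does not match:
console.log(matchScreenWsPath("/d/node-07/S62Pro202038002643/7400"));
// → null
```

If a path the provider generates fails to match here, nginx falls through to the `/` location and the WebSocket upgrade never reaches the provider's screen port.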
I am tempted to say "they do look correct", but since I keep failing to spot a cause for that error... maybe they don't?
And again, it works well UNTIL I briefly disconnect the node... a proper MQ setup should be able to handle this, I would assume. So why does this one seem to fail? Speaking of MQ, I made a custom client to listen to the messages transmitted when it works (open and close) and when it doesn't (open, but no close):
```
b"+8WzZS0hSFCNL6S1o3qd2tCM/ss=\x08\n\x12s\nQ\n&devicefarm@domain.com\x12\radministrator\x1a\x18d4UKTg5FQl+hfHt9Ptmx8g==\x1a\x1e\n\x06serial\x12\x12S62Pro202038002643\x18\x03\x1a'tx.b9c7baef-7332-4cf3-9a57-1a45df19b47d"
b"+8WzZS0hSFCNL6S1o3qd2tCM/ss=\x085\x12\x00\x1a'tx.a191daa4-f5d0-4c8b-96ff-0d2cc15f1fc1"
b"+8WzZS0hSFCNL6S1o3qd2tCM/ss=\x08\x1b\x12 \x12\x1e\n\x06serial\x12\x12S62Pro202038002643\x18\x03\x1a'tx.6cceeae6-d884-4f32-a720-66edf67eb5f1"
b"+8WzZS0hSFCNL6S1o3qd2tCM/ss=\x08\n\x12s\nQ\n&devicefarm@domain.com\x12\radministrator\x1a\x18d4UKTg5FQl+hfHt9Ptmx8g==\x1a\x1e\n\x06serial\x12\x12S62Pro202038002643\x18\x03\x1a'tx.39efdc60-7dae-447e-9a4d-1cec791bedd7"
```
To be honest, I cannot see anything in there, besides the lack of messages after the opening one.
The new super AI (chat.openai.com) actually had an interesting answer to my question, which I did not find by classical googling:
Subscribers are considered unreachable after their network connection has an outage, because publisher and subscriber maintain their own state and are kind of "out of sync". Both sides have their own view of the system, which may cause them to drift apart, and that can result in messages being lost.
Though, it didn't offer a solution that fits this very setup.
Okay, I have a new piece of the puzzle: you need Docker to get this error. I do not know why as of now...
How to reproduce it? Build a Docker image containing a fitting Node.js version (16 makes trouble, 17 is out of support, but 14 works well) and the zeromq version used in STF (5.2.8). Then add micro scripts to run a broker and a subscriber, looking like this (I used a tutorial from dev.to):
```js
// @/server.js
const Fastify = require("fastify");
const zmq = require("zeromq");

const app = Fastify();
const sock = zmq.socket("pub"); // const sock = new zmq.Publisher(); // is for zeromq 6

app.post("/", async (request, reply) => {
  await sock.send(["topic", JSON.stringify({ ...request.body })]);
  return reply.send("Sent to the subscriber/worker.");
});

const main = async () => {
  try {
    await sock.bind("tcp://*:7890");
    await app.listen(3000, "0.0.0.0");
  } catch (err) {
    console.error(err);
    process.exit(1);
  }
};
main();
```
```js
// @/worker.js
const zmq = require("zeromq");

const sock = zmq.socket("sub"); // const sock = new zmq.Subscriber(); // is for zeromq 6

const main = async () => {
  try {
    sock.connect("tcp://10.0.2.4:7890");
    sock.subscribe("topic");
    // zeromq 6 style:
    // for await (const [topic, msg] of sock) {
    //   console.log("Received message from " + topic + " channel:");
    //   console.log(JSON.parse(msg));
    // }
    sock.on("message", function (topic, message) {
      console.log("message:", message, "topic", topic);
    });
  } catch (err) {
    console.error(err);
    process.exit(1);
  }
};
main();
```
Put them in one image (two will do too, but it's unnecessary) and start each of them on a different VM (the VMs should be able to talk to each other).
If you now do
curl -X POST ip.of.your.vm:3000 -d "heyho"
the worker should print that out... well, as a byte string. But that is not the point. Now disconnect that VM from the network and reattach it right afterwards, and... tadaaa, it has stopped receiving messages.
If you try that setup without the Docker containers, you will notice that the worker is still able to receive messages after a network reconnect. So... is this a zeromq, a Node.js, or a Docker problem?
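One thing that might be worth experimenting with against this symptom: newer libzmq versions support ZMTP-level heartbeats, which let a socket notice that its TCP connection silently died (e.g. when Docker's bridge/NAT drops the connection state during the outage) and trigger a reconnect. The zeromq 5.2.8 used above may not expose these options, so this is a hedged sketch against the zeromq 6 API (option names as documented there), an experiment rather than a verified fix:

```javascript
// Hedged sketch (zeromq 6 API, not the 5.2.8 used in the repro above):
// enable ZMTP heartbeats so a dead TCP connection is detected and
// re-established instead of hanging forever.
const zmq = require("zeromq");

const sock = new zmq.Subscriber();
sock.heartbeatInterval = 2000;   // send a PING every 2 s
sock.heartbeatTimeout = 6000;    // treat the peer as dead after 6 s of silence
sock.heartbeatTimeToLive = 6000; // ask the peer to drop us if we go silent

sock.connect("tcp://10.0.2.4:7890"); // publisher address from the repro above
sock.subscribe("topic");

(async () => {
  for await (const [topic, msg] of sock) {
    console.log("topic:", topic.toString(), "message:", msg.toString());
  }
})();
```

If this makes the containerized worker survive the unplug/replug cycle, that would point at silently dropped TCP state rather than a zeromq bug.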
Hi @ConfusedMerlin, we are also facing a similar issue. After any network disconnection the device list still shows the device, but on trying to view the device it shows up as in the attached image. Is this the same issue you are facing, and have you identified any solution for it?
I had a similar issue. It usually happens due to an adb device disconnection. We fixed it by creating our own adb Docker image. That image has several cron jobs: one resets the USB driver if a device is disconnected, and another performs an adb reset. That solved the issue for us.
This is where I got the inspiration https://github.com/agoda-com/adb-butler
We still haven't found a useful solution, besides cron jobs to restart things. I think I even escalated this into the bug-reporting section of the people who maintain the messaging queue, but... no reaction so far.
Right, but I think you would prefer any solution over no solution :)
What is the issue or idea you have?
A server/node setup, where the server runs all the processes necessary for the farm to work, while the nodes only run a provider and adb. Affected devices are connected to a node, but NEVER to the server.
The device is listed as ready-for-use in the device farm's web GUI. You can open its details page, but the screen is grey and all the extra information (like the connection URL) is void and empty. Also, you can no longer leave the details page via the little red x above the (grey) screen.
All the information required for this page should get pulled from the RethinkDB at the moment the user opens the details page. If the device was not used before the error appeared, it stays empty. But if it was open just prior to the network outage, the required content is available... and seems valid. Still, the error happens, no matter whether the required fields are empty or not.
The wss string and all information necessary for filling the database fields (and later, the details page) should be pulled from the device's provider connection process inside the Docker container where the provider is running. Said process exists, and all the necessary information is not only available but also valid (tested with the adb connection string).
So... a device which is attached properly and has all connection information available fails to show up in the web GUI if the node it is attached to experiences as little as four seconds (not measured, but the average time in my reproduction environment, see reproduction) of network downtime.
By itself, it will never recover; sometimes a USB reset will do the trick, but most times I have to restart the provider on that node. Interestingly, if you pull the device out when this happens (either by pulling the USB cable or removing it from the VM), the entry in the web GUI switches back to Disconnected, as it should.
For a long time I thought the USB hubs were responsible for this, but it seems they do work fine.
Also... if we have the node run just like the server (with all stf processes up and even its own RethinkDB, although none of it is actually used except for adb and the provider), the error also appears.
Finally, I must admit I still somehow doubt that this is the only way to create this kind of issue. I think so because the steps to reproduce would suggest that every device at a briefly disconnected node should encounter the same error. But as said already, in our production environment usually multiple devices of a node are affected (so that restarting their provider is way faster than trying to USB-reset every single one), but not all. Roughly one third of them (a wild guess, based on observations) remain functional, while the rest get a... Grey Screen of Despair.
Does it only happen on a specific device? Please run
adb devices -l
and paste the corresponding row.

Every device not connected to the main server can be affected; usually multiple are affected at once (but not all, strangely). Of course, devices connected to the main server are never affected (because they cannot experience a network disconnect from the perspective of their provider).
Besides this, I did not notice any differences between Android versions or phone vendors. If there are any, they remain unknown.
Also, we have a local stf branch with an updated Node.js (18.something?) and AngularJS at 1.8.3 that has the same issue. Apart from that, it works quite fine (but we don't have all the sub-packages done, nor have we even touched the other frameworks hidden inside).
Please provide the steps to reproduce the issue.
First, you need three servers in some form to reproduce this in the most realistic manner. It may work with two (omit the client).
All of them should be able to reach the others on the network. This can be done with VMs (I recommend VirtualBox 7.0.4 with NAT networks; 7.0.2 seems to be faulty) or physical machines. Our Docker setup contains one container per stf process; I never tested the issue with the... normal all-STF-in-one setup used here. It's worth a try, but not now.
Have the server start all processes, then have the node start up, attach the device to the node, and finally access the device from the client. This should work as expected.
Then go back to the device list and disconnect the node from your network. In my VM setup, I simply untick the device's network connection entry and re-tick it right afterwards. This usually takes about four seconds. Pulling and re-plugging a network cable on physical servers should do the same (tested and bug-verified).
Now open the device in your client. The device's screen should be grey, and things like the connection string should be lost. Nothing you can do to this device, short of restarting the provider process on the node, can bring it back into a usable state. Sometimes a usbreset will do too (which made me initially suspect a faulty USB hub).
What is the expected behavior?
It should either go offline/disconnected, or (which would be best) simply remain usable or be renewed. I wish I could tell which part in the "GUI -> DB -> provider process" chain of getting the information is broken, but I could not find out.
Do you see errors or warnings in the
stf local
output? If so, please paste them or the full log here.

No errors at the Docker level. Neither node nor server seems to be aware of the broken connection. Not even the browser debugger shows a real error message; it just does not start the wss, because the db entry for the wss is empty.
EDIT: Wait, there are errors in the client browser. If you try to use the white-blue buttons below the (now grey) screen area, it says something like: Uncaught TypeError: scope.control is null. And if you hit the little red cross, you get a "device is null" error in the browser debugger, originating from DeviceControlCtrl.js... I think it's from that, because the whole 60k+ line js is provided as source.
Please run
stf doctor
and paste the output here.

... we have one container for each stf process; which one is required? From the server's provider, it looks like this:
Well... of course the RethinkDB and adb are not in its container. And where to find the ProtoBuf?
docker setup
Because it's different, I guess you need at least the server's docker-compose.yml here. The node version is... not the same, but a reuse of it: simply remove every service except adb and the provider, and remove their now-unmatched dependencies. ... of course, I cannot attach .yaml files here. WHYYYYY?