CGRU / cgru

CGRU - AFANASY
http://cgru.info/
GNU Lesser General Public License v3.0

Render connection problems on last msg send cause render task to freeze #585

Open eberrippe opened 11 months ago

eberrippe commented 11 months ago

Hi Timur, do you remember our chat in the CGRU Telegram yesterday about our connection problems?

The problem is that the render does not notify the server that it finished the task. The task therefore stays at no progress, but the server also cannot eject it via the "task no change timeout". We suppose this happens because the socket on the render somehow freezes due to a connection issue. Our only way of fixing it is to exit the render client or restart the afserver.

We went through the code and think we found the spot that causes the issue we are running into.

What happens if the last output from the task was not sent successfully? We think there should be some kind of sleep and reconnection retry going on here:

https://github.com/CGRU/cgru/blob/11e78f28060de14516d577c7a22f5ab6426f85cd/afanasy/src/render/renderhost.cpp#L274-L287

We imagine a solution could be something like:

    // Rough idea: retry the send a few times before giving up.
    af::Msg * server_answer = NULL;
    bool ok = false;
    int retries = 0;
    for (;;)
    {
        server_answer = af::sendToServer( msg, ok,
            msg->type() == af::Msg::TRenderRegister ? af::VerboseOff : af::VerboseOn);
        if (ok)
        {
            connectionEstablished();
            break;
        }
        serverUpdateFailed();
        if (++retries > 5)
            break;
        sleep(1);
    }
    delete msg;
    m_up.clear();
    return server_answer;

Or would this cause the whole render to lag?

Thanks a lot!

timurhai commented 11 months ago

Hello, Jan! This function already runs in a cycle:

https://github.com/CGRU/cgru/blob/11e78f28060de14516d577c7a22f5ab6426f85cd/afanasy/src/render/main.cpp#L166-L176

A cycle inside a cycle is not needed.
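
For illustration, the idea can be sketched like this (hypothetical names only, not the actual main.cpp code): the server update already sits inside the render's main cycle, so a failed send is simply retried on the next iteration.

    // Illustrative sketch only (hypothetical names, not the actual main.cpp code).
    #include <unistd.h>

    bool AFRunning = true;    // cleared on an exit request in the real program
    bool updateServer();      // hypothetical: builds and sends one update, true on success

    void mainCycle()
    {
        while (AFRunning)
        {
            updateServer();   // on failure nothing special happens here:
                              // the next cycle simply tries again
            sleep(1);         // one heartbeat between cycles
        }
    }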

ultra-sonic commented 9 months ago

Hi Timur,

here are the most common log entries that we found on frozen afrender clients:

Mon 25 Sep 09:34.06: INFO    Failed to connect to server. Last success connect time: Mon 25 Sep 09:33.23
Mon 25 Sep 09:34.06: INFO    Last connect was: 43 seconds ago, Zombie time: 180s
AFERROR: msgsendtoaddress: connect failure for msgType 'TRenderUpdate':
10.10.8.150:51000: Connection refused

and

D=19970: Exit Code=0 Status=0
AFERROR: readdata: read: Resource temporarily unavailable

AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Tue 26 Sep 10:15.31: INFO    Failed to connect to server. Last success connect time: Tue 26 Sep 10:15.18
Tue 26 Sep 10:15.31: INFO    Last connect was: 13 seconds ago, Zombie time: 180s
Tue 26 Sep 10:45.21: ERROR   RenderHost::stopTask: No such task: TP: j171 b2 t422 (now running 0 tasks).
Tue 26 Sep 10:45.21: ERROR   RenderHost::stopTask: No such task: TP: j963 b3 t17 (now running 0 tasks).

Is there any way to get more verbose logging, especially on what is going on with our sockets?

To me it would be interesting to get a log entry every time RCVTIMEO or SNDTIMEO is reached (on both ends, afserver and afrender). I suspect this to be the cause of the "Resource temporarily unavailable" error.
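
For reference, "Resource temporarily unavailable" is the strerror() text for EAGAIN/EWOULDBLOCK, which is what read()/recv() return on a blocking socket once SO_RCVTIMEO expires, so the timeout and that error are most likely the same event. A minimal sketch (not existing afanasy code) of logging that case explicitly:

    // Sketch (not existing afanasy code): log a receive timeout explicitly.
    #include <cerrno>
    #include <cstdio>
    #include <cstring>
    #include <sys/socket.h>

    ssize_t recvWithTimeoutLog(int fd, void * buf, size_t len)
    {
        ssize_t bytes = recv(fd, buf, len, 0);
        if ((bytes < 0) && (errno == EAGAIN || errno == EWOULDBLOCK))
            fprintf(stderr, "recv: SO_RCVTIMEO reached on fd %d: %s\n", fd, strerror(errno));
        else if (bytes < 0)
            fprintf(stderr, "recv: fd %d failed: %s\n", fd, strerror(errno));
        return bytes;
    }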

To debug AFERROR: msgsendtoaddress: connect failure for msgType 'TRenderUpdate': 10.10.8.150:51000: Connection refused, it would be great to know whether the socket was in a TIME_WAIT state and therefore refused our connection attempt because the previous connection was still not properly terminated. Is there a way to log more verbose info on connect failure as well?

I have a feeling that in our network we should increase RCVTIMEO and SNDTIMEO, but I am not sure by how much. Will this setting interfere with other timeouts like the zombie time?

Also, please remind me where we can configure the delay that afrender waits between its connection attempts to the server. I think this should be longer than 120 seconds, because it may be that:

> connections will hold the TCP port in the TIME_WAIT state for 30-120 seconds

(from https://stackoverflow.com/questions/3229860/what-is-the-meaning-of-so-reuseaddr-setsockopt-option-linux)

Cheers Oli

@sebastianelsner

timurhai commented 9 months ago

Hi, Oliver!

Afanasy uses macros to output system call errors: https://github.com/CGRU/cgru/blob/master/afanasy/src/libafanasy/name_afnet.cpp#L247

The macros are defined here: https://github.com/CGRU/cgru/blob/master/afanasy/src/include/macrooutput.h

They just use perror to print a system error message: https://man7.org/linux/man-pages/man3/perror.3.html

For now I don't know a way to get a more detailed error explanation, but maybe someone who is more familiar with Linux system calls can help.
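
For a bit more context than perror() alone, a sketch (not the existing macrooutput.h macro) could also print the errno value and the call site; the kernel does not expose much more than that for these calls:

    // Sketch of a slightly more verbose error macro (not the existing macrooutput.h code).
    #include <cerrno>
    #include <cstdio>
    #include <cstring>

    #define AF_SYSERR(context) \
        fprintf(stderr, "AFERROR: %s (%s:%d): errno=%d: %s\n", \
                context, __FILE__, __LINE__, errno, strerror(errno))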

timurhai commented 9 months ago

You can monitor socket states with commands such as netstat, ss and so on. TIME-WAIT can be a problem: https://cgru.readthedocs.io/en/latest/afanasy/server.html#time-wait If so, maybe you can reduce the Maximum Segment Lifetime?

ultra-sonic commented 9 months ago

Hi Timur, we are still trying to figure out what is going on in our network, but there is something where you may be able to help us.

Today we had this behaviour:

  1. task finished
  2. wants to inform the server that it is done
  3. message time-out
  4. client takes on a new job after it has reconnected a bit later
  5. server still "thinks" the client is rendering the task that has already been completed and there is no way to get rid of the "ghost" task

See this log:

Thu 07 Dec 14:40.12: INFO    Finished PID=20050: Exit Code=256 Status=1
AFERROR: readdata: read: Resource temporarily unavailable

AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:40.52: INFO    Failed to connect to server. Last success connect time: Thu 07 Dec 14:40.38
Thu 07 Dec 14:40.52: INFO    Last connect was: 14 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable

AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:41.18: INFO    Failed to connect to server. Last success connect time: Thu 07 Dec 14:41.04
Thu 07 Dec 14:41.18: INFO    Last connect was: 14 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable

AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:43.24: INFO    Failed to connect to server. Last success connect time: Thu 07 Dec 14:43.10
Thu 07 Dec 14:43.24: INFO    Last connect was: 14 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable

AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:45.11: INFO    Failed to connect to server. Last success connect time: Thu 07 Dec 14:44.57
Thu 07 Dec 14:45.11: INFO    Last connect was: 14 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable

AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:46.26: INFO    Failed to connect to server. Last success connect time: Thu 07 Dec 14:46.12
Thu 07 Dec 14:46.26: INFO    Last connect was: 14 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable

AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:50.31: INFO    Failed to connect to server. Last success connect time: Thu 07 Dec 14:50.16
Thu 07 Dec 14:50.31: INFO    Last connect was: 15 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable

AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:59.38: INFO    Failed to connect to server. Last success connect time: Thu 07 Dec 14:59.24
Thu 07 Dec 14:59.38: INFO    Last connect was: 14 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable

AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.

Can you add a mechanism that detects such ghost tasks? Or is there already something in place that just does not work in this case?

cheers Oli

timurhai commented 9 months ago

Hello, Oli! Do you mean that the new task generates a message that the server can read and answer, while the old task still can't connect to the server? This can't happen, as the render connects to the server once per cycle and sends & receives one portion of data. Different tasks fill that data and then read their own portion. So if the render can connect to the server, every task can connect to the server. Or are you saying that this is not the case?

lithorus commented 9 months ago

I just wanted to add that we are seeing exactly the same thing. It tends to happen more often when the server is under heavy load and tasks are fast to exit. My theory is that the task starts and stops before the server acknowledges the start of the task, but I have not yet been able to reproduce the issue on a non-production system.

eberrippe commented 7 months ago

Hey guys,

I am writing to update you (@timurhai @lithorus) with some statistics on how many of these "frozen tasks" we have.

@lithorus We made a little cronjob that checks how long a task has been running on a render and exits the render if the task progress stays at 0% for 30 minutes or the configured "task no progress" time. We also add an entry to a DB for every render we exit, to see how much render time (30 minutes or more per exit) we waste each day. We have roughly 1000 renders connected at the moment.

In the following charts you can see the "amount of renders exited". A 14 means that we exited 14 renders at that exact moment.

Here are the exits of the last 90 days:

[image: renders exited per day, last 90 days]

Here is the same chart scaled to the last 7 days:

[image: renders exited per day, last 7 days]

This little extra task-timeout-watcher script "works", since almost no extra wrangling is needed, but the underlying issue really is the biggest problem we are facing at the moment.

The problem is evenly spread among all our renders and parsers.

> You can monitor socket states with commands such as netstat, ss and so on. TIME-WAIT can be a problem: https://cgru.readthedocs.io/en/latest/afanasy/server.html#time-wait If so, maybe you can reduce the Maximum Segment Lifetime?

@timurhai, our IT monitored the socket states and could not find anything suspicious.

I checked the code myself (I am not a C++ pro), and judging by the typical error message, the problem may need to be solved somewhere here: https://github.com/CGRU/cgru/blob/1fd2bc4d198a848deaa509d5d502a15750b8fe29/afanasy/src/libafanasy/name_afnet.cpp#L486-L493

Since this returns false in our case, the render should exit itself anyway, but it seems like that does not work:

https://github.com/CGRU/cgru/blob/1fd2bc4d198a848deaa509d5d502a15750b8fe29/afanasy/src/libafanasy/name_afnet.cpp#L308-L314

If any of you (@lithorus @timurhai) has an idea how we could detect "frozen tasks" earlier (maybe directly in the Afanasy code ourselves), so we don't have to wait 30 minutes or more, it would be very helpful.

Thanks a lot,

Best Jan

lithorus commented 7 months ago

@eberrippe Good idea about the cron job, although 30 minutes is a bit long for many of our tasks. Still, it is good to have as a fallback; we might implement the same here as a stopgap solution.

My theory is that it happens when there is packet loss and/or the server fails to process the "I'm done" request from the client, so the server still thinks the client is working on the task. One way to detect it as well: if you try to pull the output log of the task, it will return something along the lines of "Render has no task".

One way to solve that might be some kind of keep-alive check of the tasks between server and render node.
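
A rough sketch of such a keep-alive / task-confirmation check (hypothetical types and names, not afanasy code): the server periodically asks a render which tasks it is actually running and releases any task the render does not know about.

    // Sketch of a keep-alive / task-confirmation check (hypothetical, not afanasy code).
    #include <vector>

    struct TaskPos { int job, block, task; };

    // Hypothetical: asks the render which tasks it is actually running.
    std::vector<TaskPos> queryRenderRunningTasks(int renderId);

    void confirmRunningTasks(int renderId,
                             const std::vector<TaskPos> & serverThinksRunning,
                             void (*releaseGhostTask)(const TaskPos &))
    {
        std::vector<TaskPos> onRender = queryRenderRunningTasks(renderId);
        for (const TaskPos & t : serverThinksRunning)
        {
            bool found = false;
            for (const TaskPos & r : onRender)
                if (r.job == t.job && r.block == t.block && r.task == t.task)
                    { found = true; break; }
            if (!found)
                releaseGhostTask(t);   // a "ghost": the server thinks it runs, the render does not
        }
    }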

eberrippe commented 7 months ago

@lithorus Uff... I like the idea of the "keep-alive" check you mentioned. It could be worth trying out, so we don't waste the 30 minutes / taskNoProgressTimeout of render time, but instead crawl each render's task output every X seconds or minutes to see if the output matches, followed by an exit.

/sign on what you've mentioned. We think exactly the same: sometimes the files are created correctly, but the socket just won't be released completely.

Hoping that Timur maybe has a solution to make the connection more resilient to failures and more error-proof.

eberrippe commented 7 months ago

@lithorus And here is a very brief outline of how our script works at the moment:

# Check every 30 seconds, for each render host:
if render.isRunning():
    renderTask = getTaskFromRender(render)
    if renderTask.progress == 0:
        stamp = getRenderTaskFromDB(render.id)
        if stamp exists:
            if stamp + (task.getTaskNoProgressTimeout() or 30min) < datetime.now():
                if "WRN" in jobstate or "DON" in jobstate:
                    exitRender(render)
        else:
            # render task not in DB yet, so store it; this should append,
            # because a render can run multiple tasks
            writeToDB(RenderID: {job_id, blocknum, tasknum, percentage, datetime.now(), jobname})

(Sure, you would also need some kind of "if progress was made: delete the entry from the DB" logic, but you get our basic approach.)

Best Jan

timurhai commented 7 months ago

Hello! So, when a render gets stuck, it starts to repeat this log:

AFERROR: af::msgread: can't read message header, bytes = -1 (< Msg::SizeHeader).
AFERROR: msgsendtoaddress: Reading binary answer failed.
Thu 07 Dec 14:45.11: INFO    Failed to connect to server. Last success connect time: Thu 07 Dec 14:44.57
Thu 07 Dec 14:45.11: INFO    Last connect was: 14 seconds ago, Zombie time: 180s
AFERROR: readdata: read: Resource temporarily unavailable

This means that this system call returns < 0: https://github.com/CGRU/cgru/blob/1fd2bc4d198a848deaa509d5d502a15750b8fe29/afanasy/src/libafanasy/name_afnet.cpp#L42 and produces this system error message: "Resource temporarily unavailable".

What does the server log say at that time? What does the server "think" about this render and its tasks? It does not become a "zombie" and still runs the latest task? And the task progress timeout does not work?

You want the "stuck" render to be exited, so it becomes a zombie on the server and can later be started again?

eberrippe commented 7 months ago

> What does the server log say at that time? What does the server "think" about this render and its tasks? It does not become a "zombie" and still runs the latest task? And the task progress timeout does not work?

@timurhai The server does not recognize that the render has problems and shows both the render and the task as RUN. If we want to stop the job, pause the job, eject the task or delete the job, nothing works. That's probably why the original task progress timeout does not work either. Only if we exit the render client is the job "released".

Example:

The render is RUN, the task is RUN. I see the runtime is 40 minutes, but the task progress is 0%. I click "delete job": nothing happens. I exit the render: suddenly the job deletes itself.

It seems almost as if the commands are queued and cannot take effect until the job's task is released by exiting the render client. Sometimes the stuck 0% task output on the afserver is: frame 1153 Render has no task:TP: j542 b3 t15.

> You want the "stuck" render to be exited, so it becomes a zombie on the server and can later be started again?

This is our only option to make everything work again at the moment. That's why we have this script, which checks whether a job has no progress and then exits the render, so the render reconnects and can start again.

That's why I am asking: is there any option to fix this issue?

Thanks a lot Jan

timurhai commented 7 months ago

So, does the server log show errors or anything unusual when a render gets stuck?

Also, you can try to use recv instead of read in this line: https://github.com/CGRU/cgru/blob/1fd2bc4d198a848deaa509d5d502a15750b8fe29/afanasy/src/libafanasy/name_afnet.cpp#L42 Maybe the system error message will differ.
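
For anyone trying this: on a connected socket, recv(fd, buf, len, 0) behaves like read(fd, buf, len), but it also accepts flags and may report socket-specific errors. A minimal sketch of the swap (variable names are assumptions, not the exact name_afnet.cpp code):

    // Sketch: swapping read() for recv() in a socket read helper
    // (names are assumptions, not the exact name_afnet.cpp code).
    #include <sys/socket.h>
    #include <unistd.h>

    ssize_t readChunk(int fd, char * buffer, size_t maxlen)
    {
        // return read(fd, buffer, maxlen);   // previous call
        return recv(fd, buffer, maxlen, 0);   // equivalent on a connected socket
    }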

lithorus commented 7 months ago

We have the same problem, and our settings are:

af_server_linux_epoll: 1
af_server_sockets_processing_threads_num: 30
af_server_sockets_readwrite_threads_num: 30

However, we also have the issue on another server with default options.

timurhai commented 7 months ago

Can you provide server logs from when a render gets stuck?

You can increase the pool Heart Beat seconds for renders on a big farm (pool). By default it is 1 second, so 1000 renders connect to the server every second. On a remote pool we set it to 5 seconds.

eberrippe commented 7 months ago

Here are our settings:

We have 1000 render clients connecting, ~200 monitors (heartbeat 2.5 sec), a number of NIMBY tray setters (heartbeat 5 sec) and ~5 get-cronjobs (heartbeat 5 sec).

cgru/3.3.0/config_default.json

"include":["afanasy/config_default.json","config.json"],

cgru/3.3.0/config.json

{
    "cgru_config": {
        "af_servername": "xxxxx",
        "OS_unix": {
            "af_rendercmds": ["vncviewer @ARG@"],
            "af_store_folder":"/var/tmp/afanasy",
            "af_rendercmds_admin": ["terminal -x \"ssh @ARG@\"",
                                    "terminal -x \"ssh root@@ARG@\"",
                                    "ssh root@@ARG@ \"service afrender stop;service afrender start\"",
                                    "ssh root@@ARG@ \"service afrender start\"",
                                    "ssh root@@ARG@ \"pgrep mantra | xargs kill -SIGSTOP\"",
                                    "ssh root@@ARG@ \"pgrep mantra | xargs kill -SIGCONT\"",
                                    "ssh root@@ARG@ \"pgrep hython | xargs kill -SIGSTOP\"",
                                    "ssh root@@ARG@ \"pgrep hython | xargs kill -SIGCONT\""
                                   ]
        },
        "OS_windows":{
            "pathsmap":[["Z:/","/mutter/"],
                        ["C:/Program Files/","/usr/local/"]]
        },
        "af_thumbnail_cmd": "oiiotool \"%(image)s\" -ch R,G,B --resize 0x100 -o  \"%(thumbnail)s\"",
        "af_task_update_timeout": 240,
        "af_monitor_zombietime": 59,
        "af_render_zombietime": 60,
        "af_render_heartbeat_sec": 6,
        "af_render_up_resources_period":3,
        "af_render_cmd_shutdown": "$CGRU_LOCATION/utilities/render_shutdown_guard '/sbin/poweroff'",
        "af_render_cmd_reboot": "$CGRU_LOCATION/utilities/render_shutdown_guard '/sbin/reboot'",
        "af_errors_forgivetime":86400,
        "af_server_linux_epoll":1,
        "af_server_sockets_processing_threads_num":20,
        "af_server_sockets_readwrite_threads_num":20,
        "previewcmds": ["rv \"@ARG@\"", "iv \"@ARG@\""],
        "pswd_visor":"9c50c09bd37e8adddc3e0362e14c2695",
        "pswd_god":"a5a75adbdf0e510fd26d7c9a27ef384e",
        "af_solving_tasks_speed":-1,
        "af_task_default_capacity":6
    }
}

cgru/3.3.0/afanasy/config_default.json:

"":"Network:",
    "":"Number of threads to process clients connections.",
    "af_server_sockets_processing_threads_num":10,
        "":"Ten threads to process should be enough for any situation.",

    "":"Number of threads to read/write clients connections.",
    "af_server_sockets_readwrite_threads_num":10,
        "":"Ten threads does not mean that server can handle only about 10 clients.",
        "":"This only means that it can read/write to 10 clients at the same time.",
        "":"Other clients will just wait in IO queue.",
        "":"For a fast networks designed for rendering 10 IO threads should be enough for any farm size.",

    "":"Server sockets processing stack size.",
    "-af_server_sockets_processing_threads_stack":32768,
    "-af_server_sockets_readwrite_threads_stack":32768,
        "":"Zero, negative value or no parameter means to use system default",

    "":"You can use non-blocking IO on Linux server, based on Linux epoll facility",
    "af_server_linux_epoll":0,
        "":"If it is disabled (by default), Linux server will use blocking IO based on threads, like other platforms",

    "":"Server waits client close socket first. Web browsers do it, only if we ask it in HTTP header by special header",
    "af_server_http_wait_close":1,
        "":"If you browser ignores 'Connection: close' header, you can make server not to wait it.",

    "":"Socket options that can be set to play with:",
    "af_so_server_RCVTIMEO_sec":24,
    "af_so_server_SNDTIMEO_sec":24,
    "af_so_server_TCP_NODELAY":-1,
    "af_so_server_REUSEADDR":-1,
    "af_so_server_TCP_CORK":-1,
    "af_so_server_LINGER":-1,

    "af_so_client_RCVTIMEO_sec":24,
    "af_so_client_SNDTIMEO_sec":24,
    "af_so_client_TCP_NODELAY":-1,
    "af_so_client_REUSEADDR":-1,
    "af_so_client_TCP_CORK":-1,
    "af_so_client_LINGER":-1,
        "":"Use -1 value not to set socket option at all",
        "":"See man socket for details.",

    "af_server_profiling_sec":1024,
        "":"Server will output some network statistics by this period",

    "af_wolwake_interval":10,
        "":"Number of cycles (seconds) between waking each render",

    "":""

cgru/3.3.0/afanasy/config.json: <--- (Since this is not included anywhere, I suppose it is unused. 🤔)

{
    "cgru_config": {
        "af_servername": "xxxxx",
        "OS_unix": {
            "af_rendercmds": ["vncviewer @ARG@"],
            "af_store_folder":"/var/tmp/afanasy",
            "af_rendercmds_admin": ["terminal -x \"ssh @ARG@\"",
                                    "terminal -x \"ssh root@@ARG@\"",
                                    "ssh root@@ARG@ \"service afrender stop;service afrender start\"",
                                    "ssh root@@ARG@ \"service afrender start\"",
                                    "ssh root@@ARG@ \"pgrep mantra | xargs kill -SIGSTOP\"",
                                    "ssh root@@ARG@ \"pgrep mantra | xargs kill -SIGCONT\"",
                                    "ssh root@@ARG@ \"pgrep hython | xargs kill -SIGSTOP\"",
                                    "ssh root@@ARG@ \"pgrep hython | xargs kill -SIGCONT\""
                                   ]
        },
        "OS_windows":{
            "pathsmap":[["Z:/","/mutter/"],
                        ["C:/Program Files/","/usr/local/"]]
        },
        "af_thumbnail_cmd": "oiiotool \"%(image)s\" -ch R,G,B --resize 0x100 -o  \"%(thumbnail)s\"",
        "af_task_update_timeout": 240,
        "af_monitor_zombietime": 59,
        "af_render_zombietime": 60,
        "af_render_heartbeat_sec": 6,
        "af_render_up_resources_period":3,
        "af_render_cmd_shutdown": "$CGRU_LOCATION/utilities/render_shutdown_guard '/sbin/poweroff'",
        "af_render_cmd_reboot": "$CGRU_LOCATION/utilities/render_shutdown_guard '/sbin/reboot'",
        "af_errors_forgivetime":86400,
        "af_server_linux_epoll":1,
        "af_server_sockets_processing_threads_num":20,
        "af_server_sockets_readwrite_threads_num":20,
        "previewcmds": ["rv \"@ARG@\"", "iv \"@ARG@\""],
        "pswd_visor":"9c50c09bd37e8adddc3e0362e14c2695",
        "pswd_god":"a5a75adbdf0e510fd26d7c9a27ef384e",
        "af_solving_tasks_speed":-1,
        "af_task_default_capacity":6
    }
}

Our server output data is a little sensitive, because the project names are also written into the job names, but here is a slightly anonymized excerpt.

It is hard to correlate the errors here with the non-replying renders, because we can't find the exact time a render was causing issues:

Mon 12 Feb 19:41.08: WARNING Client has NOT closed socket first: 10.102312.8.1464:7821 SFD:10 S:SWaiting REQ: TJSON[37]: 10.10.841.164:36126 ANS: TJSON[21388936]: Empty address
Mon 12 Feb 19:41.08: Server load profiling:
Clients per second: 756.62, Now: 480 (processed 819200 connections in last 1082.72 seconds).
Prep: 344.93, Proc: 9.95, Post: 68.78, Total: 423.66 ms.

Mon 12 Feb 19:41.31: Render:  xxxxx08058@xxxxx[138] unix linux 64 3.3.0 10.10.10.216 ON
Mon 12 Feb 19:41.31: Render Offline:  host07125xxx@xxxx[354] unix linux 64 3.3.0 10.10.11.160 off N
Mon 12 Feb 19:41.38: Render:  xxx04023xxxx@xxxx[788] unix linux 64 3.3.0 10.40.10.191 ON
Mon 12 Feb 19:41.46: WARNING Client has NOT closed socket first: 10.1232323:125 SFD:780 S:SWaiting REQ: TJSON[40]: 10.1232323:12345 ANS: TJSON[1186126]: Empty address
Mon 12 Feb 19:42.08: WARNING Client has NOT closed socket first: 10.123232323:219 SFD:764 S:SWaiting REQ: TJSON[37]: 10.1232323:12425 ANS: TJSON[21390240]: Empty address
Mon 12 Feb 19:42.08: WARNING Client has NOT closed socket first: 10.123232323:2123 SFD:11 S:SWaiting REQ: TJSON[37]: 10.1232323:12523 ANS: TJSON[21390240]: Empty address
Mon 12 Feb 19:42.38: Deleting a job: "xxxxxx"[3407]: xxxxx@host08805[950] - 9072 bytes.
Mon 12 Feb 19:42.38: Deleting a job: "xxxxxx"[2324]: xxxxx@host08805[950] - 9072 bytes.
Mon 12 Feb 19:42.40: Monitor deregister: xxxx@hostxxxx[5] e'monitor/24.02.08 af 3.3.0' 10.50.10.185
Mon 12 Feb 19:42.41: Render Offline:  xxxxxxx@xxxxx[685] unix linux 64 32.3.0 110.240.10.2428 off
Mon 12 Feb 19:43.07: WARNING Client has NOT closed socket first: 2310.10.8.164:226483 SFD:9 S:SWaiting REQ: TJSON[37]: 10.10.8.164:349768 ANS: TJSON[21386678]: Empty address
Mon 12 Feb 19:43.26: Job registered: "xxxxxxx"[290]: xxxxxx@hostxxx[27] - 40660 bytes.
Mon 12 Feb 19:43.26: Job registered: "xxxxxxx"[293]: xxxxxx@hostxxx[27] - 40660 bytes.
Mon 12 Feb 19:44.05: Render Offline:  xxxxxxx@xxxxx[685] unix linux 64 32.3.0 110.240.10.2428 off N
Mon 12 Feb 19:44.07: WARNING Client has NOT closed socket first: 140.10.18.1364:248291 SFD:9 S:SWaiting REQ: TJSON[37]: 10.10.8.164:33646 ANS: TJSON[21393017]: Empty address
Mon 12 Feb 19:44.10: Render:  host07123235[354] unix linux 64 344.333.0 101.10.411.2160 ON

If I understand you correctly, you think that the renders are kind of DDoS'ing our server. By increasing the heartbeat interval (https://cgru.readthedocs.io/en/latest/afanasy/pools.html#heartbeat-sec) of the root pool to 2 or 3 seconds, we would slow down the total number of connection attempts and reduce our network load?

Thanks a lot Jan

timurhai commented 7 months ago

Since pools appeared (v >= 3.0.0), render parameters such as af_render_zombietime, af_render_up_resources_period and af_render_heartbeat_sec moved from the config to pool parameters. This way we can configure them per pool. So in your config these parameters have no effect. Have you set them on the pool? If not, you have a 1 second heartbeat on the entire farm. Yes, we can compare it with DDoS'ing. But I still can't understand why a render can connect and write to the server but can't read. So it does not fall to zombie, but gets stuck and wastes afserver (farm) time.

eberrippe commented 7 months ago

I will try to set the heartbeat to 2-3 seconds at our different sites and see if it changes the number of stuck renders my script catches. Nothing was set, so the default of 1 second was in effect.

But right, this might reduce the symptoms, but it does not solve the cause.

I will keep you posted.

ultra-sonic commented 7 months ago

Hey Timur,

I don't know how familiar you are with the MQTT protocol and how it implements a so-called Quality of Service (QoS). This page describes it pretty well: https://www.hivemq.com/blog/mqtt-essentials-part-6-mqtt-quality-of-service-levels/

I am pretty sure that if we could implement at least "QoS 1" (the server acknowledging reception of "task finished" to the client, and the client re-sending "task finished" if the server does not acknowledge it after a certain time), it would solve this very issue.

What do you think?
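
For illustration, a QoS-1 style exchange boils down to this: the render keeps unacknowledged messages in an outbox, re-sends them every cycle, and only drops them once the server's answer acknowledges them (the server then has to de-duplicate). A rough sketch with hypothetical types (not afanasy code):

    // Sketch of QoS-1 style delivery: keep re-sending until acknowledged
    // (hypothetical types and names, not afanasy code).
    #include <list>
    #include <string>

    struct PendingMsg { int taskId; std::string payload; };

    std::list<PendingMsg> g_outbox;

    // Called every render cycle: re-send everything still unacknowledged.
    void resendPending(bool (*sendToServer)(const PendingMsg &))
    {
        for (const PendingMsg & m : g_outbox)
            sendToServer(m);   // duplicates are fine as long as the server de-duplicates
    }

    // Called when the server's answer acknowledges a task id.
    void acknowledge(int taskId)
    {
        g_outbox.remove_if([taskId](const PendingMsg & m){ return m.taskId == taskId; });
    }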

timurhai commented 7 months ago

Right now the render updates the server in a cycle. It will keep sending that the task is finished (or anything else) until the server says that it got it.

But as far as I understand the problem, the render can't read any server answer (the stuck situation). So the render can't receive any information from the server, and the server can't do anything with the render either.

You say that only an afrender restart solves the problem. So afrender should "decide" to exit on its own, am I right? We could count such socket read failures, for example.
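
A rough sketch of that idea (hypothetical helper, not existing afanasy code): count consecutive failed server updates and let afrender exit once a threshold is reached, so the service manager restarts it and the server eventually marks the render as a zombie.

    // Sketch: exit the render after N consecutive failed server updates
    // (hypothetical helper, not existing afanasy code).
    #include <cstdio>
    #include <cstdlib>

    static int s_failed_updates = 0;
    static const int s_max_failed_updates = 10;   // threshold to tune, e.g. ~10 cycles

    void onServerUpdateResult(bool ok)
    {
        if (ok)
        {
            s_failed_updates = 0;
            return;
        }
        if (++s_failed_updates >= s_max_failed_updates)
        {
            fprintf(stderr, "Too many failed server updates (%d), exiting render.\n",
                    s_failed_updates);
            exit(1);   // the service manager is expected to restart afrender
        }
    }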

eberrippe commented 7 months ago

I suppose an immediate render service restart would be the best solution if we can't resolve the problem at its root.

Please keep in mind that sometimes other tasks start in parallel on the stuck render and render perfectly fine. These would be killed as well.

ultra-sonic commented 7 months ago

No restarts please... we must fix the root cause and not just patch the issue.

lithorus commented 7 months ago

I think the first step is to be able to reproduce it in a reliable way.

timurhai commented 6 months ago

Hello! I want to point out one important detail in this issue. When a render gets stuck, it is still online and can run tasks that do not get stuck? So not the entire render is stuck, just one task, while the rest of the render works fine as before? Is that task DONE on the server, or at 99%, or at any percentage and in any state?

ultra-sonic commented 6 months ago

Correct! The task is stuck, and the client happily accepts new tasks and renders them as if nothing ever happened. The stuck task is not DON on the server - it is still RUN.

lithorus commented 6 months ago

Most of the time they are not even at 0%. That is how I usually detect them: they have started, but do not have a percentage yet.