LogicReinc / LogicReinc.BlendFarm

A stand-alone Blender Network Renderer
GNU General Public License v3.0
441 stars 38 forks

During multibox rendering, boxes (one or more) sometimes disconnect with "null" #34

Closed: kkaveny closed this issue 2 years ago

kkaveny commented 2 years ago

I have created a test project, and all of my boxes connect just fine. However, during the rendering process, one or more boxes in my render farm sometimes fail the render and disconnect from the cluster with a vague "null" error. I have attached a screenshot of the issue:

[Screenshot: renderfarm_bug_screenshot]

All boxes are connected through wifi within my home network. Ignore the other errors; I had to stop the rendering to capture the "null" error highlighted in my screenshot.

LogicReinc commented 2 years ago

Odd that it doesn't give a proper error, as that implies no error happened yet an exception was thrown.

My best guess is that the nodes have an unstable connection, seeing how they connect over wifi. BlendFarm expects a stable connection for rendering; reconnecting currently isn't implemented.

Could you check whether this also happens to devices that are connected by cable? If it does, something else is going on. Also, do the machines that fail all run a particular operating system? Are there any common traits between the failing machines?

Reconnecting is planned in the future, but seeing how most powerful machines are connected by cable it hasn't had any priority.

EDIT: After checking the source, I do see a scenario where you would receive "null" as the error: when no image is returned from awaiting a subtask to complete. That indeed suggests a disconnect.
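To illustrate the scenario described in the EDIT above, here is a minimal Python sketch. It is not BlendFarm's actual code; `render_subtask`, `collect`, `describe`, and `SubtaskFailed` are hypothetical names. It only shows how an awaited subtask that yields no image can be turned into a descriptive error instead of an empty "null" message.

```python
import asyncio
from typing import Optional


class SubtaskFailed(Exception):
    """Raised when a render subtask finishes without producing an image."""


async def render_subtask(node_alive: bool) -> Optional[bytes]:
    # Hypothetical stand-in for a node rendering one piece of work:
    # a node that dropped off the network yields no image at all (None).
    await asyncio.sleep(0)
    return b"image-bytes" if node_alive else None


async def collect(node_name: str, node_alive: bool) -> bytes:
    image = await render_subtask(node_alive)
    if image is None:
        # Name the likely cause instead of surfacing an empty message.
        raise SubtaskFailed(
            f"{node_name}: no image returned, node probably disconnected"
        )
    return image


def describe(node_name: str, node_alive: bool) -> str:
    """Return 'ok' or a readable error message for one subtask."""
    try:
        asyncio.run(collect(node_name, node_alive))
        return "ok"
    except SubtaskFailed as exc:
        return str(exc)
```

Checking for the missing result at the point where it is awaited keeps the cause (a node that went away) attached to the message the user eventually sees.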

kkaveny commented 2 years ago

All of these node machines run Pop!_OS 21.04 with a Ryzen 5600X + RTX 3080Ti.

You might be right as far as connection being unstable and not re-connecting. Internet where I'm located is good, but not great at times.

Anyway, it was a good experiment using wifi, and I wanted to see how this reacts before I invest in a network switch and wire them all together, which I will probably end up doing anyway. Still, I would advocate adding a reconnection feature in the near future. To me, running without reconnection seems risky because all sorts of other unknown things can happen.

LogicReinc commented 2 years ago

Since it has been a while I assume no further testing happened. Reconnection is planned at some point, but in most networks this would not be an issue so other issues have priority.

kkaveny commented 2 years ago

I know it has been a little while, but I want to contribute an idea that I recently had. DDNS might be a thing to explore that will likely solve this issue.

https://en.wikipedia.org/wiki/Dynamic_DNS https://www.cloudns.net/blog/what-is-dynamic-dns/ https://www.noip.com/

I've been struggling to figure this out for months now; however, these articles seem to explain exactly my situation. It appears that you're "borrowing" an IP for a short period of time before it changes, especially over wi-fi.

No-IP is what I'm testing now and I find it much more stable. Right now, I'm using a host name, such as somehostname.ddns.net, somehostname.hopto.org, etc. Or if I'm accessing a certain port it would be somehostname.ddns.net:3000. It would be a great enhancement if, in addition to IP addresses, BlendFarm could also use DDNS host names.
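As a rough illustration of the request above, the following Python sketch accepts either an IP address or a host name, with an optional port. `parse_endpoint` and its `default_port` parameter are hypothetical and not part of BlendFarm; the sketch handles IPv4 literals and host names only. It covers name handling alone and does not make an existing connection survive a drop.

```python
import ipaddress
from typing import Tuple


def parse_endpoint(text: str, default_port: int) -> Tuple[str, int, bool]:
    """Split 'host' or 'host:port' into (host, port, is_ip_literal)."""
    cleaned = text.strip()
    host, sep, port_text = cleaned.rpartition(":")
    if not sep:
        # No colon at all: the whole string is the host.
        host, port_text = cleaned, ""
    port = int(port_text) if port_text else default_port
    if not host or not 0 < port < 65536:
        raise ValueError(f"invalid endpoint: {text!r}")
    try:
        ipaddress.IPv4Address(host)
        is_ip = True
    except ValueError:
        # Not an IPv4 literal, so treat it as a host name to resolve later.
        is_ip = False
    return host, port, is_ip
```

A host name returned here would still need to be resolved when connecting; Python's `socket.create_connection((host, port))` accepts host names as well as IP addresses.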

LogicReinc commented 2 years ago

@kkaveny Hey Kelly, these two are not related, and DDNS would not solve the issue. No-IP-style solutions are commonly used when you host things over the internet, not on a local network. Within a wifi network a machine generally keeps the same IP (all IPs are borrowed by default unless you use a static IP, which you can do on wifi as well btw). And if the IP were to change, that would still interrupt the traffic and thus not solve the problem. You could configure the device with a static IP and get the same effect without DDNS. (Also, DDNS is generally an external tool, not part of the application.)

From what I've read here so far the problem is simply an unstable connection which can only really be solved by having the client be resistant to such failures. I don't know when I can solve this as I'm currently busy with other projects. I'd say maybe this month but I can't make those promises atm.

Note that even if this is solved, you will still run into issues like slow sync rates if you're using it over wifi. It's generally assumed you run BlendFarm over ethernet for this reason.

On another side note, another possible cause for an "unstable" connection might be if the machine is in between two different wifi points of the same network, thus causing the device to keep switching between the two. Though this is only an issue if you use such mesh setups.

LogicReinc commented 2 years ago

@kkaveny V1.1.3 adds basic reconnection logic, so that during a render it will attempt to reconnect and restart that piece of work. This may help in your situation. However, if the connection is so unstable that it can't even finish a single piece of work, it may not be worth it.
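The general shape of "reconnect and restart that piece of work" can be sketched in Python as below. This is an assumption-laden illustration, not the V1.1.3 implementation: `run_with_reconnect`, `flaky_tile`, `fake_reconnect`, the attempt limit, and the delay are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_reconnect(
    work: Callable[[], T],
    reconnect: Callable[[], None],
    max_attempts: int = 3,
    delay_seconds: float = 1.0,
) -> T:
    """Run one piece of work; on a connection error, reconnect and restart it."""
    for attempt in range(1, max_attempts + 1):
        try:
            return work()
        except ConnectionError:
            if attempt == max_attempts:
                # Too unstable to finish even one piece: give up.
                raise
            time.sleep(delay_seconds)
            reconnect()
    raise ValueError("max_attempts must be at least 1")


# Usage with a fake node that drops the link twice before succeeding.
calls = {"work": 0, "reconnect": 0}


def flaky_tile() -> str:
    calls["work"] += 1
    if calls["work"] < 3:
        raise ConnectionError("link dropped")
    return "tile-done"


def fake_reconnect() -> None:
    calls["reconnect"] += 1


result = run_with_reconnect(flaky_tile, fake_reconnect, delay_seconds=0)
```

The attempt limit reflects the caveat in the comment above: if a node cannot complete a single piece of work within a few tries, retrying further mostly wastes time.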