UM-Bridge / umbridge

UM-Bridge (the UQ and Model Bridge) provides a unified interface for numerical models that is accessible from virtually any programming language or framework.
https://um-bridge-benchmarks.readthedocs.io/en/docs/
MIT License

Fix issue 48 #71

Open LennoxLiu opened 7 months ago

LennoxLiu commented 7 months ago

The issue

Issue #48 was about sporadic model crashes on Helix. When the load balancer starts the servers, a server occasionally fails to start because its randomly selected port is already in use.

Debugging

I found that some ports were free when checked but occupied by the time the server tried to use them. Most likely another process claimed the port in the window between the check and the server start.
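The race described above can be sketched as follows. This is only an illustration, not the code in job.sh: the function names pick_port, port_in_use and choose_free_port are hypothetical, and the check via bash's /dev/tcp redirection is an assumption standing in for whatever check the script actually performs.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; all function names here are hypothetical.

# pick_port LOW HIGH: print a pseudo-random port number in [LOW, HIGH].
pick_port() {
    echo $(( $1 + RANDOM % ($2 - $1 + 1) ))
}

# port_in_use PORT: succeed if something accepts TCP connections on
# localhost:PORT (assumed check, using bash's /dev/tcp redirection).
port_in_use() {
    (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# choose_free_port LOW HIGH: keep drawing ports until one looks free.
choose_free_port() {
    local port
    port=$(pick_port "$1" "$2")
    while port_in_use "$port"; do
        port=$(pick_port "$1" "$2")
    done
    echo "$port"
}

# Check-then-use: the port looked free when choose_free_port returned,
# but nothing reserves it. Another process can bind it before the server
# started afterwards gets to it, which is the window discussed above.
```

Because the check and the bind are two separate steps, no amount of checking closes the window; it can only be narrowed or handled after the fact.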

During testing, some jobs did select a new port, so the check itself appears to work correctly.

Solution

I tried reserving the port with nc -l $port & during the check and releasing it just before starting the server, but releasing it with fuser -k -n tcp $port was not reliable on Helix, so I did not use this approach.

Instead, I added a timeout check to job.sh: if the script has waited more than $timeout seconds for the server to respond, it calls itself again to restart the server.
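A minimal sketch of such a timeout check is shown below. It is not the actual code in job.sh: the function name wait_for_server and its argument order are hypothetical, and the one-second polling interval is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a wait-with-timeout loop; not the real job.sh.

# wait_for_server TIMEOUT CMD [ARGS...]
# Runs CMD once per second until it succeeds. Returns 0 as soon as CMD
# succeeds, or 1 once TIMEOUT seconds have been spent waiting.
wait_for_server() {
    local timeout=$1 waited=0
    shift
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1    # server did not respond in time
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

The command passed in would be whatever probe the script uses to see whether the server responds, for example an HTTP request to the server's address. On a non-zero return, the script could re-run itself (for example with exec "$0" "$@") so that a new port is selected and the server is started again.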

Since issue #48 occurs only about once in a hundred runs, the total run time should not increase significantly. Also, because job.sh sets a time limit for HQ jobs, the retry cannot repeat indefinitely.

Note that if many servers need to restart, the cause may be that the $timeout set in job.sh is too short for the server to start.

Other changes