Azure / cyclecloud-gridengine

Example Azure CycleCloud Gridengine cluster template
MIT License

Possible fix to race condition with sge_inst #11

Closed mvrequa closed 3 years ago

mvrequa commented 4 years ago

Check this. You can also change the name to Altair Grid Engine (that will be the official name soon).

This might be available in 8.6.7, I can't remember:

SGE_EXECD_KEEP_TRYING_TO_GET_CONFIG: If set to 1, keeps the sge_execd(8) from quitting if it can connect to the sge_qmaster(8) but does not get the configuration. This is the case e.g. if the host the sge_execd(8) is running on is not yet configured as an execution host at the sge_qmaster(8). If not set or if set to 0, the sge_execd(8) shows its normal behaviour, i.e. it quits if it does not get the configuration.

This is in response to you saying there is a race condition. With this variable set, the execd keeps trying to get its configuration in the case where it would normally fail because the qmaster does not yet recognize it as an execution host.

mvrequa commented 3 years ago

Consider adding a retry around the sge_inst -x block to resolve the "sge host not added" error. Would this be annoying and cause unwanted delays for other failures?
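
A minimal sketch of what a bounded retry could look like, assuming the installer signals failure with a non-zero exit code. The helper name `run_with_retry`, the attempt count, and the delay are illustrative and not taken from the cluster template; the command in the usage comment is copied from this thread and a real invocation would need its full path and arguments.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay_seconds=10.0,
                   run=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits 0 or the attempts are used up.

    Returns the exit code of the last attempt. `run` and `sleep` are
    injectable so the loop can be exercised without a real installer.
    """
    returncode = None
    for attempt in range(1, attempts + 1):
        returncode = run(cmd).returncode
        if returncode == 0:
            break
        # Wait before the next try, but not after the final one.
        if attempt < attempts:
            sleep(delay_seconds)
    return returncode


# Hypothetical usage; arguments beyond "-x" are omitted here:
# rc = run_with_retry(["sge_inst", "-x"], attempts=5, delay_seconds=15)
```

Note that this retries every failure, so an unrelated installer error would only surface after all attempts and delays have run, which is the cost the question above is asking about.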

staer commented 3 years ago

Does the installer return a specific error code for this class of error? We could trap just a certain return code and go from there. If not, then I think it's worth the delay, because showing errors on every node that aren't really errors is pretty annoying.
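
Trapping a single return code might look roughly like the sketch below. The value of `HOST_NOT_ADDED_RC` is a placeholder: nothing in this thread confirms that the installer uses a dedicated exit code for the "host not added" case, and `install_with_selective_retry` is a made-up name.

```python
import subprocess
import time

# Placeholder value. Whether the installer has a distinct exit code for
# "host not added" is exactly the open question in this thread.
HOST_NOT_ADDED_RC = 2


def install_with_selective_retry(cmd, retryable_codes=frozenset({HOST_NOT_ADDED_RC}),
                                 attempts=5, delay_seconds=15.0,
                                 run=subprocess.run, sleep=time.sleep):
    """Retry cmd only when it exits with one of retryable_codes.

    Any other non-zero exit code is returned immediately, so unrelated
    failures are reported without the retry delay.
    """
    returncode = None
    for attempt in range(1, attempts + 1):
        returncode = run(cmd).returncode
        if returncode == 0 or returncode not in retryable_codes:
            return returncode
        # Retryable failure: wait, unless this was the last attempt.
        if attempt < attempts:
            sleep(delay_seconds)
    return returncode
```

With this shape, only the race-related failure pays the delay, and every other error shows up on the first attempt.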

ryanhamel commented 3 years ago

Closing as the installation went through several related race condition fixes in 2.0.4-2.0.6. If this re-appears, let's reopen and investigate further.