HEPCloud / decisionengine_modules

Apache License 2.0

Decision Engine requests Glideins in the factory despite jobs being completed #490

Open namrathaurs opened 7 months ago

namrathaurs commented 7 months ago

This was first observed during activity on a Decision Engine (DE) instance that talks to an ITB factory (798, running GlideinWMS 3.10.5-1). Glideins were still being requested even though the job submitted by the DE had completed. I verified that the requests were not coming from the DE client in question, or from any other client, because of jobs remaining in their respective job queues. A request from a client includes two numbers that are of interest for understanding the underlying behavior: ReqMaxGlideins and ReqIdleGlideins. Further investigation of the glideclient and glidefactoryclient classads showed the following:

  1. When jobs were submitted and present in the DE queue (condor_q showed jobs in the running and idle states; 5 processes were submitted, each sleeping for 10 minutes):
    # From the glideclient classad:
    [root@factoryhost ~]# condor_status -any 950589_ITB_CE_EL9_SciToken@gfactory_instance@gfactory_service@declient.de_test -l | grep Req
    WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
    For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
    GlideinParamGLIDECLIENT_ReqNode = "factoryhost.fnal.gov"
    ReqEncIdentity = "4a098979b299b9c2cacf843ac6d3e4a4a34384b1f24b0189e4d2e7ec6d52d5b2dee40e62c3942db795ea32f53be8d9c4"
    ReqEncKeyCode = "64e12146066e3efbe3480fe726ecc4cd0fd196cd712640c7c1cf96a87b56dbdb8ec6c35b6f28c68f551ea639b2642532240422397730765e95bebfb8e0e843bfe0b964aac909c31f5365e586f6d2aef3ea93bebe9d2f9abba786c0bef344484c6e128c06b881a1d31e31a3c01bf782780aaf52afa7c02238c379fb32b7f8a35dd11ae7b534a03f7b689bdf795d5be339457a77555fb75998d838524d0203268e0400d861b1a00bcffd3881fe76ddba3a9e864b37618957ef87f052bac6aeda07ff445bc7af791ed921a237c9859120125c69b7613e9c0fa462c7f4649757e05f7e8cdd2508bc06059aace4642ae3bb1aa40db54d34ae2496104383f26d5ac124"
    ReqGlidein = "ITB_CE_EL9_SciToken@gfactory_instance@gfactory_service"
    ReqIdleGlideins = 1
    ReqMaxGlideins = 6
    # From the glidefactoryclient classad:
    [root@factoryhost ~]# condor_status -any ITB_CE_EL9_SciToken@gfactory_instance@gfactory_service@declient.de_test -l | grep Req
    WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
    For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
    GlideinMonitorRequestedIdle = 1
    GlideinMonitorRequestedIdleCores = 1
    GlideinMonitorRequestedMaxCores = 6
    GlideinMonitorRequestedMaxGlideins = 6
    GlideinMonitorTotalRequestedIdle = 1
    GlideinMonitorTotalRequestedIdleCores = 1
    GlideinMonitorTotalRequestedMaxCores = 6
    GlideinMonitorTotalRequestedMaxGlideins = 6
  2. When 2 jobs had completed and 3 were still running in the DE:
    # From the glideclient classad:
    [root@factoryhost ~]# condor_status -any 950589_ITB_CE_EL9_SciToken@gfactory_instance@gfactory_service@declient.de_test -l | grep Req
    WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
    For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
    GlideinParamGLIDECLIENT_ReqNode = "factoryhost.fnal.gov"
    ReqEncIdentity = "f9ecce3a2fde6d39f57da0198e4dac73b57affcf92d0671b21a34959c1b005bc6c92d2300ff4998558b027365e18dbe0"
    ReqEncKeyCode = "8d86570f2f6b737b03798f7fcb053df7f3d8a755c91620e87073cbba80013ed2e244c870ff16ac3482bcc1f3f625119e8f16d9679317a52f98108d9e7987ce3b75b428603117e215c463f128206011110ab109ef1e6edab90dec833eb3cb9e10b8618e547eadb50d3381f49860b04acb912c3ba574ed4b4e160f103c30dee8e9d31c1a5c6e5f07e88e856c905519a574cf169ef0bdf9e2359088f3361562c04259e77064c8b5516813c793b69c06531e78dd9f79f26f36fac2acb1fa1b4d3386be96c42594aae20d9168822d2111d9ac023e9deffab625139289f546b881f1f56519c87966f95fe64436ad5f20c8eaf5a687e0b38cd1806f53cd76283cf1a1eb"
    ReqGlidein = "ITB_CE_EL9_SciToken@gfactory_instance@gfactory_service"
    ReqIdleGlideins = 1
    ReqMaxGlideins = 2
    # From the glidefactoryclient classad:
    [root@factoryhost ~]# condor_status -any ITB_CE_EL9_SciToken@gfactory_instance@gfactory_service@declient.de_test -l | grep Req
    WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
    For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
    GlideinMonitorRequestedIdle = 1
    GlideinMonitorRequestedIdleCores = 1
    GlideinMonitorRequestedMaxCores = 2
    GlideinMonitorRequestedMaxGlideins = 2
    GlideinMonitorTotalRequestedIdle = 1
    GlideinMonitorTotalRequestedIdleCores = 1
    GlideinMonitorTotalRequestedMaxCores = 2
    GlideinMonitorTotalRequestedMaxGlideins = 2
  3. When all submitted jobs in the DE had completed and condor_q showed an empty DE queue:
    # From the glideclient classad:
    [root@factoryhost ~]# condor_status -any 950589_ITB_CE_EL9_SciToken@gfactory_instance@gfactory_service@declient.de_test -l | grep Req
    WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
    For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
    GlideinParamGLIDECLIENT_ReqNode = "factoryhost.fnal.gov"
    ReqEncIdentity = "f9ecce3a2fde6d39f57da0198e4dac73b57affcf92d0671b21a34959c1b005bc6c92d2300ff4998558b027365e18dbe0"
    ReqEncKeyCode = "8d86570f2f6b737b03798f7fcb053df7f3d8a755c91620e87073cbba80013ed2e244c870ff16ac3482bcc1f3f625119e8f16d9679317a52f98108d9e7987ce3b75b428603117e215c463f128206011110ab109ef1e6edab90dec833eb3cb9e10b8618e547eadb50d3381f49860b04acb912c3ba574ed4b4e160f103c30dee8e9d31c1a5c6e5f07e88e856c905519a574cf169ef0bdf9e2359088f3361562c04259e77064c8b5516813c793b69c06531e78dd9f79f26f36fac2acb1fa1b4d3386be96c42594aae20d9168822d2111d9ac023e9deffab625139289f546b881f1f56519c87966f95fe64436ad5f20c8eaf5a687e0b38cd1806f53cd76283cf1a1eb"
    ReqGlidein = "ITB_CE_EL9_SciToken@gfactory_instance@gfactory_service"
    ReqIdleGlideins = 1
    ReqMaxGlideins = 2
    # From the glidefactoryclient classad:
    [root@factoryhost ~]# condor_status -any ITB_CE_EL9_SciToken@gfactory_instance@gfactory_service@declient.de_test -l | grep Req
    WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
    For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
    GlideinMonitorRequestedIdle = 1
    GlideinMonitorRequestedIdleCores = 1
    GlideinMonitorRequestedMaxCores = 2
    GlideinMonitorRequestedMaxGlideins = 2
    GlideinMonitorTotalRequestedIdle = 1
    GlideinMonitorTotalRequestedIdleCores = 1
    GlideinMonitorTotalRequestedMaxCores = 2
    GlideinMonitorTotalRequestedMaxGlideins = 2
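
The comparison done by hand above could be scripted. The following is only a rough sketch, not code from the DE or GlideinWMS: it parses saved `condor_status -l | grep Req` output and flags the symptom seen in the last case (ReqIdleGlideins still above zero while the DE queue is empty). The helper names `parse_classad_attrs` and `is_stale_request`, and the `queued_jobs` argument (the idle plus running job count read off condor_q), are hypothetical and introduced here for illustration.

```python
import re

# Matches lines such as `ReqIdleGlideins = 1`; warnings, URLs and shell
# prompts in the captured output do not match and are skipped.
_ATTR_RE = re.compile(r"^\s*(\w+)\s*=\s*(.+?)\s*$")


def parse_classad_attrs(text):
    """Parse `Name = value` lines into a dict (hypothetical helper).

    Integer values become ints; anything else is kept as a string with
    surrounding double quotes removed.
    """
    attrs = {}
    for line in text.splitlines():
        match = _ATTR_RE.match(line)
        if not match:
            continue
        name, value = match.groups()
        if re.fullmatch(r"-?\d+", value):
            attrs[name] = int(value)
        else:
            attrs[name] = value.strip('"')
    return attrs


def is_stale_request(attrs, queued_jobs):
    """Return True when idle glideins are requested with no jobs queued.

    `queued_jobs` is an assumption: the number of idle plus running jobs
    that condor_q reports on the DE side.
    """
    return queued_jobs == 0 and attrs.get("ReqIdleGlideins", 0) > 0


# Abbreviated copy of the glideclient output from the empty-queue case.
sample = """\
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
ReqGlidein = "ITB_CE_EL9_SciToken@gfactory_instance@gfactory_service"
ReqIdleGlideins = 1
ReqMaxGlideins = 2
"""

attrs = parse_classad_attrs(sample)
print(attrs["ReqIdleGlideins"], attrs["ReqMaxGlideins"])
print(is_stale_request(attrs, queued_jobs=0))
```

Only ReqIdleGlideins is checked here, on the assumption that a nonzero ReqMaxGlideins could still be legitimate while previously started glideins are running.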

After a period of excess glidein requests, the glideclient classad eventually disappears from the factory, after which no more glideins are requested. This is expected: the classad is removed once it expires, so there is no longer anything in the factory to drive requests.

The same behavior, glideins being requested even though the DE job queue is empty, was also observed in a couple of other instances:

namrathaurs commented 7 months ago

I discussed my findings and observations with Marco; his input follows:

Initially, we thought this might be caused by a bug in the GlideinWMS factory. After the investigation, it seems more likely that the DE itself is requesting more glideins than it needs. The suggestion was to thoroughly review the DE code to avoid this scenario.