HEPCloud / decisionengine_modules

Apache License 2.0

Resource Request transform under-requests glideins in certain circumstances #197

Open DmitryLitvintsev opened 4 years ago

DmitryLitvintsev commented 4 years ago

I am currently seeing the following issue:

Found in channel cms_job_classification

+----+----------------+--------------------------+---------------------------+--------+
|    | Frontend_Group | Job_Bucket_Criteria_Expr | Site_Bucket_Criteria_Expr | Totals |
|----+----------------+--------------------------+---------------------------+--------|
| 0 | cms_jetstream_passthrough | x509UserProxyVOName=='cms' and (DESIRED_Sites.str.contains('T3_US_OSG')) | [u"(GLIDEIN_CMSSite=='T3_US_OSG') and GLIDEIN_Site=='JetStreamTACC' and GLIDEIN_Supported_VOs.str.contains('CMS')"] | 93481 |
| 1 | cms_tacc_passthrough | x509UserProxyVOName=='cms' and DESIRED_Sites.str.contains('T3_US_TACC') and (REQUIRED_OS=='rhel7' or REQUIRED_OS=='any') | [u"(GLIDEIN_CMSSite=='T3_US_TACC') and GLIDEIN_Supported_VOs.str.contains('CMS')"] | 52466 |
| 2 | cms_nersc_passthrough | x509UserProxyVOName=='cms' and DESIRED_Sites.str.contains('T3_US_NERSC') and (REQUIRED_OS=='rhel6') | [u"GLIDEIN_CMSSite=='T3_US_NERSC' and GLIDEIN_Supported_VOs.str.contains('CMS') and GLIDEIN_REQUIRED_OS=='rhel6'"] | 41015 |
| 3 | cms_nersc_passthrough_sl7 | x509UserProxyVOName=='cms' and DESIRED_Sites.str.contains('T3_US_NERSC') and (REQUIRED_OS=='rhel7') | [u"GLIDEIN_CMSSite=='T3_US_NERSC' and GLIDEIN_Supported_VOs.str.contains('CMS') and GLIDEIN_REQUIRED_OS=='rhel7'"] | 52466 |
| 4 | cms_sdsc_passthrough | x509UserProxyVOName=='cms' and (DESIRED_Sites.str.contains('T3_US_SDSC')) | [u"(GLIDEIN_CMSSite=='T3_US_SDSC') and GLIDEIN_Supported_VOs.str.contains('CMS')"] | 93481 |
| 5 | cms_xsede_passthrough | x509UserProxyVOName=='cms' and DESIRED_Sites.str.contains('T3_US_PSC') | [u"(GLIDEIN_CMSSite=='T3_US_PSC') and GLIDEIN_Supported_VOs.str.contains('CMS')"] | 93481 |
+----+----------------+--------------------------+---------------------------+--------+

93000 idle jobs overall, including 41015 for SL6.

The mapping of DE/FE groups to factory entries is 1:1, i.e.:

cms_tacc_passthrough -> CMSHTPC_T3_US_TACC (sl7 only)

cms_xsede_passthrough -> CMSHTPC_T3_US_Bridges (both)

cms_nersc_passthrough -> CMSHTPC_T3_US_NERSC_Cori_KNL (sl6 only)

cms_nersc_passthrough_sl7 -> CMSHTPC_T3_US_NERSC_Cori_KNL_SL7 (sl7 only)

cms_jetstream_passthrough -> OSG_US_TACC_JETSTREAM (both)

cms_sdsc_passthrough -> CMSHTPC_T3_US_SDSC-osg_comet_frontend

For purposes of this ticket the one in question is CMSHTPC_T3_US_NERSC_Cori_KNL, the SL6 NERSC entry.

From cms_resource_request.log we get:

2019-12-16 11:20:29,793 - root - glidein_requests - 43903 - GlideinRequestManifests - INFO - --------------------------------------------
2019-12-16 11:20:29,793 - root - glidein_requests - 43903 - GlideinRequestManifests - INFO - Processing glidein requests for the FE Group: cms_nersc_passthrough
2019-12-16 11:20:29,793 - root - glidein_requests - 43903 - GlideinRequestManifests - INFO - Frontend Group cms_nersc_passthrough job query: x509UserProxyVOName=='cms' and DESIRED_Sites.str.contains('T3_US_NERSC') and (REQUIRED_OS=='rhel6')
2019-12-16 11:20:29,793 - root - glidein_requests - 43903 - GlideinRequestManifests - INFO - Frontend Group cms_nersc_passthrough site matching expression : GLIDEIN_CMSSite=='T3_US_NERSC' and GLIDEIN_Supported_VOs.str.contains('CMS') and GLIDEIN_REQUIRED_OS=='rhel6'
2019-12-16 11:20:29,793 - root - glidein_requests - 43903 - GlideinRequestManifests - INFO - --------------------------------------------
2019-12-16 11:20:29,801 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Number of credentials found from the configuration 2
2019-12-16 11:20:30,120 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Jobs found total 51024 idle 41015 (good 41015, old(10min 40310, 60min 38280), grid 41015, voms 41015) running 10009
2019-12-16 11:20:30,120 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Group slots found total 0 (limit 60000 curb 59000) idle 0 (limit 60000 curb 59000) running 0
2019-12-16 11:20:30,120 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Frontend slots found total 641 (limit 170000 curb 167000) idle 4 (limit 35000 curb 25000) running 641
2019-12-16 11:20:30,121 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Overall slots found total 7339 (limit 170000 curb 167000) idle 800 (limit 35000 curb 25000) running 6684
2019-12-16 11:20:32,564 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Number of credentials found: 2
2019-12-16 11:20:32,660 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Jobs in schedd queues | Slots | Cores | Glidein Req | Factory Entry Information
2019-12-16 11:20:32,660 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Idle (match eff old uniq ) Run ( here max ) | Total Idle Run Fail | Total Idle Run | Idle MaxRun | State FigureOfMerit EntryName
2019-12-16 11:20:32,673 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Request CMSHTPC_T3_US_NERSC_Cori@gfactory_instance_fermifactory02@gfactory_service_fermifactory02: prop jobs 0(mc 0, min 0), available slots 0
2019-12-16 11:20:32,674 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Limits triggered: NoEffectiveIdle: no glidein is needed
2019-12-16 11:20:32,679 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - 0( 0 0 0 0) 0( 0 60000) | 0 0 0 0 | 0 0 0 | 0 0 | Down 0.0060 CMSHTPC_T3_US_NERSC_Cori@gfactory_instance_fermifactory02@gfactory_service_fermifactory02@fermifactory02.fnal.gov
2019-12-16 11:20:32,690 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Request CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory02@gfactory_service_fermifactory02: prop jobs 41015(mc 27.0, min 0), available slots 0
2019-12-16 11:20:32,691 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Limits triggered:
2019-12-16 11:20:32,696 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - 41015(41015 41015 40310 0) 10009( 0 60000) | 0 0 0 0 | 0 0 0 | 17 82 | Up 0.0024 CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory02@gfactory_service_fermifactory02@fermifactory02.fnal.gov
2019-12-16 11:20:32,705 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Request CMSHTPC_T3_US_NERSC_Cori_shared@gfactory_instance_fermifactory02@gfactory_service_fermifactory02: prop jobs 0(mc 0, min 0), available slots 0
2019-12-16 11:20:32,705 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - Limits triggered: NoEffectiveIdle: no glidein is needed
2019-12-16 11:20:32,709 - root - glide_frontend_element - 43903 - GlideinRequestManifests - INFO - 0( 0 0 0 0) 0( 0 60000) | 0 0 0 0 | 0 0 0 | 0 0 | Down 0.0012 CMSHTPC_T3_US_NERSC_Cori_shared@gfactory_instance_fermifactory02@gfactory_service_fermifactory02@fermifactory02.fnal.gov

So we are requesting only 17 idle glideins for a group in which there are 41015 idle jobs:

CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory02@gfactory_service_fermifactory02: prop jobs 41015(mc 27.0, min 0), available slots 0

It should be pointed out that the job content of these six groups is almost the same. The groups differ only by OS: some take both, and some take just one or the other. So the log reports that 10009 jobs of this type are already running somewhere else in the global pool. That statement is true, but it greatly cuts down the number of glideins that we would like submitted to NERSC in this case. If the DE considered this group in isolation, we would need 603 glideins' worth of cores; one third of that should be 201.
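The 603 and 201 figures can be reproduced with a quick back-of-envelope check. This is only a sketch: it assumes single-core jobs and 68 cores per glidein (the Knights Landing core count mentioned later in this thread); neither assumption is stated in the log itself, and the variable names are made up for illustration.

```python
# Back-of-envelope check of the glidein counts quoted in the comment.
# Assumptions (not taken from the log): every idle job needs one core,
# and one glidein provides a whole 68-core KNL node.
IDLE_JOBS = 41015        # idle jobs in cms_nersc_passthrough, from the log
CORES_PER_GLIDEIN = 68   # assumed cores per glidein
REQUESTED_IDLE = 17      # "Glidein Req Idle" column for the Cori_KNL entry

# Whole glideins needed if this group were considered in isolation.
glideins_in_isolation = IDLE_JOBS // CORES_PER_GLIDEIN
one_third = glideins_in_isolation // 3

print("glideins if considered in isolation:", glideins_in_isolation)
print("one third of that:", one_third)
print("actually requested:", REQUESTED_IDLE)
```

Under these assumptions the request of 17 idle glideins is more than an order of magnitude below the one-third figure.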

In earlier periods when I have been watching the decision engine, the "mc" count in the line

CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory02@gfactory_service_fermifactory02: prop jobs 41015(mc 27.0, min 0), available slots 0

will sometimes go much higher than 27. All of a sudden a batch of a few hundred glideins is requested, and then the count goes back down to these levels.

Please investigate why the count of glideins requested is artificially low and whether there is any reason that could explain the fluctuation.
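One way to follow the fluctuation over time would be to pull the "mc" value out of each "Request" line in cms_resource_request.log. The sketch below is an assumption-laden helper, not part of the decision engine or GlideinWMS: the regular expression is derived only from the log lines quoted in this issue, and `parse_request_line` is a hypothetical name.

```python
import re

# Pattern built from the "Request ..." lines quoted in this issue only;
# other versions may format the line differently.
REQUEST_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"Request (?P<entry>[^@\s]+)@\S+: "
    r"prop jobs (?P<prop>[\d.]+)\(mc (?P<mc>[\d.]+), min (?P<min>[\d.]+)\), "
    r"available slots (?P<slots>[\d.]+)"
)


def parse_request_line(line):
    """Return a dict with the request figures, or None if the line does not match."""
    m = REQUEST_RE.search(line)
    if m is None:
        return None
    return {
        "timestamp": m["ts"],
        "entry": m["entry"],
        "prop_jobs": float(m["prop"]),
        "mc": float(m["mc"]),
        "min": float(m["min"]),
        "available_slots": float(m["slots"]),
    }


# Sample line copied from the log excerpt in this issue.
sample = (
    "2019-12-16 11:20:32,690 - root - glide_frontend_element - 43903 - "
    "GlideinRequestManifests - INFO - Request CMSHTPC_T3_US_NERSC_Cori_KNL"
    "@gfactory_instance_fermifactory02@gfactory_service_fermifactory02: "
    "prop jobs 41015(mc 27.0, min 0), available slots 0"
)
record = parse_request_line(sample)
print(record)
```

Applying this to every line of the log and printing timestamp, entry and mc side by side would show when the count jumps and when it drops back.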

We have only seen this behavior thus far in the decision engine (standard library version 0.3.14 which is the current version). There is enough similarity in the glidein request code to make me believe it must also happen in the frontend but I have no direct evidence of that. Factory version is 3.4.5 if it matters.

Steve Timm

DmitryLitvintsev commented 4 years ago

Imported from GitLab: https://hepcloud-git.fnal.gov:8443/hepcloud/decisionengine_modules/issues/185

StevenCTimm commented 4 years ago

Just to note that these behaviors continue in the current production release of the decision engine, 1.1.

StevenCTimm commented 4 years ago

Just also to note that the effect is occurring in production as we speak, in which there are two jobs for GM2 but no glideins submitted at all. In the case of more than one factory entry matched to one group, we tend to split them out in such a way that no glideins are requested from either entry.

StevenCTimm commented 4 years ago

There are three major cases currently that affect production on a regular basis.

1) The issue mentioned above: multiple entries match the group, N(jobs) is less than a full node, and nothing gets submitted.

2) Cases where nodes run out of memory before they run out of cores (Knights Landing nodes at NERSC, 68 cores, 96GB RAM). GlideinWMS logic assumes 2GB per core for all calculations. So it inaccurately reports that there are 23 free cores on all of these nodes, and thinks that it does not need to request any more.

3) Cases where the glidein is old and cannot match any more jobs because all remaining jobs are too long to finish in the remaining glidein time. Nevertheless these nodes, often partially drained, count as idle nodes and keep new glideins from getting requested. This affects HPC and Cloud more than grid resources because we tend to have large waves of jobs starting all at once.
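Cases 1 and 2 can be illustrated with a toy calculation. This is a sketch of the effects as described in the comment, not the actual GlideinWMS or decision engine logic; both function names, the even split across entries, and the 8-core glidein in the first call are hypothetical.

```python
def glideins_for_entry(idle_jobs, matching_entries, cores_per_glidein):
    """Case 1 (toy model): split single-core idle jobs evenly across the
    matching entries, then round down to whole glideins per entry."""
    jobs_per_entry = idle_jobs / matching_entries
    return int(jobs_per_entry // cores_per_glidein)


def cores_stranded_by_memory(node_cores, node_memory_gb, job_memory_gb):
    """Case 2 (toy model): cores that look free but cannot run a job
    because the node's memory is exhausted first."""
    usable_cores = min(node_cores, int(node_memory_gb // job_memory_gb))
    return node_cores - usable_cores


# Case 1: two idle jobs, two matching entries, hypothetical 8-core glideins.
# Each entry sees one job, which is less than a full node, so nothing is requested.
print(glideins_for_entry(idle_jobs=2, matching_entries=2, cores_per_glidein=8))

# Case 2: a 68-core, 96GB node with 2GB per job.
print(cores_stranded_by_memory(node_cores=68, node_memory_gb=96, job_memory_gb=2))
```

With the nominal 96GB the second call gives 20 stranded cores rather than the 23 reported above; the difference presumably comes from the memory actually available to jobs being lower than nominal, which is not stated in this thread.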

StevenCTimm commented 4 years ago

There is now a cross-referenced issue in the glideinwms tracker.

https://cdcvs.fnal.gov/redmine/issues/24610

StevenCTimm commented 4 years ago

Marco claims in the stakeholder meeting that this will be fixed in GlideinWMS 3.6.3. We need to figure out how he plans to do that.