HEPCloud / decisionengine_modules

Apache License 2.0
2 stars 19 forks source link

Resource request plugin should remove the classads when shutdown #200

Open DmitryLitvintsev opened 4 years ago

DmitryLitvintsev commented 4 years ago

Imported from GitLab: https://hepcloud-git.fnal.gov:8443/hepcloud/decisionengine_modules/issues/149

StevenCTimm commented 4 years ago

The history of this is that this is something Parag worked on for a month or two while he was still here and never got it working. As the Decision Engine exits it should de-advertise its request classads but there were some problems trying to put all of that into the destructor method. This should be kept in mind--the module that advertises the classads is due to be totally reworked in the current phase of the project.

shreyb commented 4 years ago

Notes from a meeting we had:

  1. At the framework level, the Publisher class (https://github.com/HEPCloud/decisionengine/blob/master/framework/modules/Publisher.py) should have a new method defined, called "shutdown", that can be overridden by child classes
  2. In the TaskManager code, once a TaskManager sees that it is supposed to shut down, it should finish its loops, and then loop through all of its publishers and run this new shutdown() method on all of them (probably at line 181 of TaskManager.py).
  3. We can then define shutdown methods on any inheriting Publishers that need them (in the immediate case, replacing the del method of the HTCondorManifests class mentioned above)

Note: (1) is done. Working on (2) now.

shreyb commented 4 years ago

All three items are done, and pass all CI tests. Will send Steve build artifacts from CI when he's available to test.

shreyb commented 4 years ago

Passed build artifacts to Steve. Waiting on his testing at his availability.

StevenCTimm commented 3 years ago

Attempted to test Shreyas' build artifacts failed due to problems that as far as I can tell are not related to any changes he made in the code. The decisionengine fails to advertise to the collector at all (with a value error) . This is very similar to a bug we saw in the 1.4 version and there is a chance that the origin may be the same.

Testing the unmodified 1.5rc0 rpm (which had not yet been shown to work even without Shreyas' modification) to see if it has the same problem. If it does I will open a new issue.

StevenCTimm commented 3 years ago

After resolving the issue with the test machine fermicloud117, which had nothing to do with the code, I installed Shreyas' branch again. started up the DE, it advertised classads to the factory just fine, then initiated a stop of the DE. Classads on their own take 1600s to go away after the DE is gone. With the new patch we see: last heard from timestamp is 1610390010 for the globals and 1610390014 for the glide clients. DE was off at 1610390239

The following shows from the resource_request.log at the time of shutdown:

2021-01-11 12:36:50,361 - root - TaskManager - 19742 - MainThread - ERROR - error in decision cycle(publishers) Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/decisionengine/framework/taskmanager/TaskManager.py", line 283, in decision_cycle a_f['actions'], a_f['newfacts'], data_block_t1) File "/usr/lib/python3.6/site-packages/decisionengine/framework/taskmanager/TaskManager.py", line 480, in run_publishers publisher.worker.publish(data_block) File "/usr/lib/python3.6/site-packages/decisionengine_modules/htcondor/publishers/publisher.py", line 126, in publish dataframe = datablock.get(key) File "/usr/lib/python3.6/site-packages/decisionengine/framework/dataspace/datablock.py", line 271, in get return self.getitem(key, default=default) File "/usr/lib/python3.6/site-packages/decisionengine/framework/dataspace/datablock.py", line 359, in getitem raise KeyError KeyError

There is nothing from this stack trace (or the surrounding debug logs) that indicates that we were in the new shutdown method when this happened. Rather it appears that the routine was called asynchronously at a time when some if not all of the data blocks needed to call the publisher may not be available.

It may be helpful at this point to consult Marco Mambelli or one of the other frontend developers. This functionality does work in the glideinwms frontend, although the code path is entirely different since the DE uses the python bindings to condor_advertise and the frontend does a system shell out to the condor_advertise binary.

Again I observed that it took 1600s for the classads to go away out of the collector, same as if there had been no patch.

Next debugging effort, (after talking to Marco) would be to add some debugging to see if the shutdown method is even being called by the framework at all. I have the test system configured at maximum debug level and there is no indication from any of the messages that we ever attempted to call the shutdown method of the publisher in question. But there is also no proof that we didn't. Also understanding the exception would be key.

StevenCTimm commented 3 years ago

Note the modified code remains on fermicloud117 and is available for further tests.

StevenCTimm commented 3 years ago

Discussion in this morning's meeting led me to see that overnight (Jan 13) the code as written had the resource_request channel go offline due to a network error, and in that case it did call the shutdown method of the publisher, you could see it in the logs. So there may be a different program flow when systemctl stop decision-engine is called as opposed to when the framework decides to declare a channel offline. Since the problem happened in the middle of the night I was not able to verify if the classads were successfully de-advertised when the shutdown was called.

StevenCTimm commented 3 years ago

2021-01-13 22:01:37,347 - root - TaskManager - 21943 - MainThread - ERROR - error in decision cycle(publishers) Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/decisionengine/framework/taskmanager/TaskManager.py", line 283, in decision_cycle a_f['actions'], a_f['newfacts'], data_block_t1) File "/usr/lib/python3.6/site-packages/decisionengine/framework/taskmanager/TaskManager.py", line 480, in run_publishers publisher.worker.publish(data_block) File "/usr/lib/python3.6/site-packages/decisionengine_modules/htcondor/publishers/publisher.py", line 126, in publish dataframe = datablock.get(key) File "/usr/lib/python3.6/site-packages/decisionengine/framework/dataspace/datablock.py", line 271, in get return self.getitem(key, default=default) File "/usr/lib/python3.6/site-packages/decisionengine/framework/dataspace/datablock.py", line 359, in getitem raise KeyError KeyError 2021-01-13 22:01:37,365 - root - TaskManager - 21943 - MainThread - INFO - Task Manager 30F3EB74-288A-474D-BB50-4533A07EF929 received stop signal and exits 2021-01-13 22:01:37,365 - root - TaskManager - 21943 - AWSBurnRateSourceProxy - INFO - received stop_running signal for AWSBurnRateSourceProxy 2021-01-13 22:01:37,365 - root - TaskManager - 21943 - AWSBurnRateSourceProxy - INFO - stopped AWSBurnRateSourceProxy 2021-01-13 22:01:42,375 - root - TaskManager - 21943 - BillingInfoSourceProxy - INFO - received stop_running signal for BillingInfoSourceProxy 2021-01-13 22:01:42,376 - root - TaskManager - 21943 - BillingInfoSourceProxy - INFO - stopped BillingInfoSourceProxy 2021-01-13 22:01:47,380 - root - TaskManager - 21943 - FigureOfMeritSourceProxy - INFO - received stop_running signal for FigureOfMeritSourceProxy 2021-01-13 22:01:47,382 - root - TaskManager - 21943 - FigureOfMeritSourceProxy - INFO - stopped FigureOfMeritSourceProxy 2021-01-13 22:01:52,381 - root - TaskManager - 21943 - FinancialParameters - INFO - received stop_running signal for FinancialParameters 2021-01-13 22:01:52,382 - root - TaskManager - 21943 - FinancialParameters - INFO - stopped FinancialParameters 2021-01-13 22:01:57,386 - root - TaskManager - 21943 - GCEBillingInfoSourceProxy - INFO - received stop_running signal for GCEBillingInfoSourceProxy 2021-01-13 22:01:57,387 - root - TaskManager - 21943 - GCEBillingInfoSourceProxy - INFO - stopped GCEBillingInfoSourceProxy 2021-01-13 22:02:02,091 - root - TaskManager - 21943 - JobClusteringSourceProxy - INFO - stopped JobClusteringSourceProxy 2021-01-13 22:02:02,391 - root - TaskManager - 21943 - GceBurnRateSourceProxy - INFO - received stop_running signal for GceBurnRateSourceProxy 2021-01-13 22:02:02,391 - root - TaskManager - 21943 - GceBurnRateSourceProxy - INFO - stopped GceBurnRateSourceProxy 2021-01-13 22:02:07,391 - root - TaskManager - 21943 - GceFigureOfMeritSourceProxy - INFO - received stop_running signal for GceFigureOfMeritSourceProxy 2021-01-13 22:02:07,392 - root - TaskManager - 21943 - GceFigureOfMeritSourceProxy - INFO - stopped GceFigureOfMeritSourceProxy 2021-01-13 22:02:12,397 - root - TaskManager - 21943 - NerscAllocationInfoSourceProxy - INFO - received stop_running signal for NerscAllocationInfoSourceProxy 2021-01-13 22:02:12,398 - root - TaskManager - 21943 - NerscAllocationInfoSourceProxy - INFO - stopped NerscAllocationInfoSourceProxy 2021-01-13 22:02:17,402 - root - TaskManager - 21943 - NerscFigureOfMeritSourceProxy - INFO - received stop_running signal for NerscFigureOfMeritSourceProxy 2021-01-13 22:02:17,404 - root - TaskManager - 21943 - NerscFigureOfMeritSourceProxy - INFO - stopped NerscFigureOfMeritSourceProxy 2021-01-13 22:02:22,407 - root - TaskManager - 21943 - FactoryEntriesSourceProxy - INFO - received stop_running signal for FactoryEntriesSourceProxy 2021-01-13 22:02:22,408 - root - TaskManager - 21943 - FactoryEntriesSourceProxy - INFO - stopped FactoryEntriesSourceProxy 2021-01-13 22:02:27,413 - root - TaskManager - 21943 - FactoryGlobalManifests - INFO - received stop_running signal for FactoryGlobalManifests 2021-01-13 22:02:27,413 - root - TaskManager - 21943 - FactoryGlobalManifests - INFO - stopped FactoryGlobalManifests 2021-01-13 22:02:37,421 - root - TaskManager - 21943 - JobQSourceProxy - INFO - received stop_running signal for JobQSourceProxy 2021-01-13 22:02:37,422 - root - TaskManager - 21943 - JobQSourceProxy - INFO - stopped JobQSourceProxy 2021-01-13 22:02:42,427 - root - TaskManager - 21943 - StartdManifestsSourceProxy - INFO - received stop_running signal for StartdManifestsSourceProxy 2021-01-13 22:02:42,428 - root - TaskManager - 21943 - StartdManifestsSourceProxy - INFO - stopped StartdManifestsSourceProxy 2021-01-13 22:02:57,443 - root - publisher - 21943 - MainThread - INFO - Called HTCondorManifests.shutdown 2021-01-13 22:02:57,443 - root - publisher - 21943 - MainThread - INFO - Invalidating glideclient classads from collector_host cmssrv258.fnal.gov with constraint (glideinmytype == "glideclient") && (stringlistmember(ClientName, "fermicloud117-fnal-gov_hepcloud_decisionengine.cms_all,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_jetstream_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_tacc_passthrough")) 2021-01-13 22:03:33,529 - root - publisher - 21943 - MainThread - INFO - Advertising glideclient classads to collector_host cmssrv258.fnal.gov 2021-01-13 22:03:33,531 - root - publisher - 21943 - MainThread - ERROR - Error running INVALIDATE_AD_GENERIC for glideclient classads to collector_host cmssrv258.fnal.gov 2021-01-13 22:03:33,532 - root - retry_function - 21943 - MainThread - WARNING - Function _condor_advertise failed with Invalid command INVALIDATE_AD_GENERIC on try 0/1. Sleeping 2 seconds 2021-01-13 22:04:09,607 - root - publisher - 21943 - MainThread - INFO - Advertising glideclient classads to collector_host cmssrv258.fnal.gov 2021-01-13 22:04:09,608 - root - publisher - 21943 - MainThread - ERROR - Error running INVALIDATE_AD_GENERIC for glideclient classads to collector_host cmssrv258.fnal.gov 2021-01-13 22:04:09,608 - root - retry_function - 21943 - MainThread - ERROR - Error Function _condor_advertise giving up with Invalid command INVALIDATE_AD_GENERIC after 1 retries 2021-01-13 22:04:09,609 - root - publisher - 21943 - MainThread - ERROR - Error running invalidating glideclient classads from collector_host cmssrv258.fnal.gov 2021-01-13 22:04:09,609 - root - publisher - 21943 - MainThread - INFO - Called HTCondorManifests.shutdown 2021-01-13 22:04:09,609 - root - publisher - 21943 - MainThread - INFO - Invalidating glideclientglobal classads from collector_host cmssrv258.fnal.gov with constraint (glideinmytype == "glideclientglobal") && (stringlistmember(ClientName, "fermicloud117-fnal-gov_hepcloud_decisionengine.cms_all,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_all_but_google,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_all_but_nersc,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_aws_local,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_jetstream_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_nersc_passthrough_sl7,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_sdsc_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_tacc_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_xsede_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.covid19_osg_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.covid19_wilson_osg_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.fife_all,fermicloud117-fnal-gov_hepcloud_decisionengine.fife_all_but_google,fermicloud117-fnal-gov_hepcloud_decisionengine.fife_all_but_nersc,fermicloud117-fnal-gov_hepcloud_decisionengine.fife_aws_local,fermicloud117-fnal-gov_hepcloud_decisionengine.nova_nersc_passthrough_sl6,fermicloud117-fnal-gov_hepcloud_decisionengine.nova_nersc_passthrough_sl7")) 2021-01-13 22:04:45,691 - root - publisher - 21943 - MainThread - INFO - Advertising glideclientglobal classads to collector_host cmssrv258.fnal.gov 2021-01-13 22:04:45,693 - root - publisher - 21943 - MainThread - ERROR - Error running INVALIDATE_AD_GENERIC for glideclientglobal classads to collector_host cmssrv258.fnal.gov 2021-01-13 22:04:45,694 - root - retry_function - 21943 - MainThread - WARNING - Function _condor_advertise failed with Invalid command INVALIDATE_AD_GENERIC on try 0/1. Sleeping 2 seconds 2021-01-13 22:05:21,775 - root - publisher - 21943 - MainThread - INFO - Advertising glideclientglobal classads to collector_host cmssrv258.fnal.gov 2021-01-13 22:05:21,777 - root - publisher - 21943 - MainThread - ERROR - Error running INVALIDATE_AD_GENERIC for glideclientglobal classads to collector_host cmssrv258.fnal.gov 2021-01-13 22:05:21,777 - root - retry_function - 21943 - MainThread - ERROR - Error Function _condor_advertise giving up with Invalid command INVALIDATE_AD_GENERIC after 1 retries 2021-01-13 22:05:21,778 - root - publisher - 21943 - MainThread - ERROR - Error running invalidating glideclientglobal classads from collector_host cmssrv258.fnal.gov 2021-01-13 22:05:21,778 - root - publisher - 21943 - MainThread - INFO - Invalidating glideclientglobal classads from collector_host gfactory-2.opensciencegrid.org with constraint (glideinmytype == "glideclientglobal") && (stringlistmember(ClientName, "fermicloud117-fnal-gov_hepcloud_decisionengine.cms_all,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_all_but_google,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_all_but_nersc,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_aws_local,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_jetstream_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_nersc_passthrough_sl7,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_sdsc_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_tacc_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.cms_xsede_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.covid19_osg_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.covid19_wilson_osg_passthrough,fermicloud117-fnal-gov_hepcloud_decisionengine.fife_all,fermicloud117-fnal-gov_hepcloud_decisionengine.fife_all_but_google,fermicloud117-fnal-gov_hepcloud_decisionengine.fife_all_but_nersc,fermicloud117-fnal-gov_hepcloud_decisionengine.fife_aws_local,fermicloud117-fnal-gov_hepcloud_decisionengine.nova_nersc_passthrough_sl6,fermicloud117-fnal-gov_hepcloud_decisionengine.nova_nersc_passthrough_sl7")) 2021-01-13 22:05:57,851 - root - publisher - 21943 - MainThread - INFO - Advertising glideclientglobal classads to collector_host gfactory-2.opensciencegrid.org 2021-01-13 22:05:57,853 - root - publisher - 21943 - MainThread - ERROR - Error running INVALIDATE_AD_GENERIC for glideclientglobal classads to collector_host gfactory-2.opensciencegrid.org 2021-01-13 22:05:57,853 - root - retry_function - 21943 - MainThread - WARNING - Function _condor_advertise failed with Invalid command INVALIDATE_AD_GENERIC on try 0/1. Sleeping 2 seconds 2021-01-13 22:06:33,935 - root - publisher - 21943 - MainThread - INFO - Advertising glideclientglobal classads to collector_host gfactory-2.opensciencegrid.org 2021-01-13 22:06:33,937 - root - publisher - 21943 - MainThread - ERROR - Error running INVALIDATE_AD_GENERIC for glideclientglobal classads to collector_host gfactory-2.opensciencegrid.org 2021-01-13 22:06:33,938 - root - retry_function - 21943 - MainThread - ERROR - Error Function _condor_advertise giving up with Invalid command INVALIDATE_AD_GENERIC after 1 retries 2021-01-13 22:06:33,938 - root - publisher - 21943 - MainThread - ERROR - Error running invalidating glideclientglobal classads from collector_host gfactory-2.opensciencegrid.org

StevenCTimm commented 3 years ago

Note that when that channel failed there was a print statement in startup.log already saying

Called Publisher.shutdown Called Publisher.shutdown Called Publisher.shutdown

One for each of the 5 channels that went offline.

I have restarted that DE now but saved the file as it was.

StevenCTimm commented 3 years ago

Note also that the above resource_request.log shows that the deadvertise command didn't work, but only because the network was down and it couldn't actually reach the factory to do the advertise.

shreyb commented 3 years ago

Current status: This has been split off into two portions: one for the framework and one for the modules.

1) Framework - DONE We implemented a new method in the Publisher base class, called shutdown, which is meant to run any actions necessary before a channel is actually stopped. This is invoked in the TaskManager running a channel via a set_to_shutdown method that runs a TaskManager's channel's shutdown methods. The set_to_shutdown method is called from the DecisionEngine when a channel is stopped. This code has been merged into 1.6 (decisionengine PR #334), and after a quick rebase and fix to logging, will have a PR.

2) Modules - IN PROGRESS The aforementioned shutdown method is correctly called when the resource_request channel is stopped, but the actual deadvertising doesn't happen. We'll need to continue to investigate why this is the case.