HEPCloud / decisionengine

HEPCloud Decision Engine framework
Apache License 2.0

de-client --stop-channel / --start-channel doesn't work in 2.0rc2 #634

Closed StevenCTimm closed 1 year ago

StevenCTimm commented 2 years ago

de-client --stop-channel resource_request took more than an hour and still didn't shut down the channel. This may simply be because we are trying to shut down ALL the publishers, none of the publishers had any data to publish, and they are all now configured to retry a large number of times.

DE 1.7 and earlier had a configurable timeout after which the shutdown of the channel would give up and just kill it. DE 2.0 has it too, but it does not seem to be working.

Note that systemctl stop decisionengine also takes much too long, 2-3 minutes on average.

StevenCTimm commented 2 years ago

Some time may be saved by combining the functions of several publishers into one. Seven of the nine publishers in the current configuration just write to Graphite, and I believe the underlying publisher can publish multiple data blocks within the same call if configured properly. If it can't, it ought to be possible to modify it to do so.
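To illustrate the batching idea, here is a minimal sketch that relies only on Graphite's plaintext protocol, where each line is `<metric path> <value> <timestamp>` and many lines can be sent over one connection. It is not the decisionengine Graphite publisher; the function names, the shape of `data_blocks`, and the metric prefixes are hypothetical.

```python
import socket


def build_graphite_payload(data_blocks, timestamp):
    """Flatten several data blocks into one plaintext-protocol payload.

    data_blocks is a hypothetical mapping: metric prefix -> {metric name: value}.
    """
    lines = []
    for prefix, metrics in data_blocks.items():
        for name, value in sorted(metrics.items()):
            lines.append(f"{prefix}.{name} {value} {timestamp}")
    return ("\n".join(lines) + "\n").encode("ascii")


def publish_once(payload, host, port=2003, timeout=5.0):
    """Send the whole batch over a single TCP connection (2003 is carbon's default plaintext port)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
```

With this shape, one connection and one timeout/retry budget covers all data blocks, instead of one per publisher.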

StevenCTimm commented 2 years ago

It should also be noted that although the nominal reason for running all the publishers at shutdown time is to de-advertise the decision engine classads from the factory, that functionality is not working at the moment either. (This was a very hard problem that was solved at one point.) The classads persist even though the long shutdown is happening.

knoepfel commented 2 years ago

@StevenCTimm, after doing some digging, I think part of the problem is that the behavior of --stop-channel was changed in DE 1.6.0 to always perform a clean shutdown of the channel. If the desire is to just kill the channel after so many seconds, then --kill-channel should be used instead. From de-client -h:

$ de-client -h
...
Channel-specific options:
  --start-channels      start all channels
  --stop-channels       stop all channels
  --start-channel <channel name>
  --stop-channel <channel name>
                        Attempt clean shutdown of channel.
  --kill-channel <channel name>
                        Same as --stop-channel, except the channel process
                        will be killed once the server's configured shutdown
                        timeout window is exceeded
  -f, --force           May be used with --kill-channel to immediately kill
                        the channel process
  --timeout <seconds>   May be specified with --kill-channel to override the
                        DE server's configured timeout window or max time to
                        wait for --block-while.
...

This doesn't address all of the issues that have been reported here, but it should explain the behavior for some of them.
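The difference between the two options can be sketched with a plain subprocess. This is only an illustration of the semantics described in the help text, not the DE server's implementation; `spawn_fake_channel`, `stop_channel`, and `kill_channel` are hypothetical helpers, and the child process simply sleeps to stand in for a channel that is slow to shut down.

```python
import subprocess
import sys


def spawn_fake_channel(seconds):
    """Start a child process that stands in for a channel taking `seconds` to finish."""
    code = f"import time; time.sleep({seconds})"
    return subprocess.Popen([sys.executable, "-c", code])


def stop_channel(proc):
    """Like --stop-channel: wait for a clean exit, however long it takes."""
    proc.wait()
    return "clean"


def kill_channel(proc, timeout, force=False):
    """Like --kill-channel: wait up to `timeout` seconds, then kill.

    With force=True (compare -f/--force) the wait is skipped entirely.
    """
    if not force:
        try:
            proc.wait(timeout=timeout)
            return "clean"
        except subprocess.TimeoutExpired:
            pass  # the timeout window was exceeded; fall through to the kill
    proc.kill()
    proc.wait()  # reap the child so it does not linger as a zombie
    return "killed"
```

A channel whose publishers keep retrying would block `stop_channel` indefinitely, while `kill_channel(proc, timeout=10)` returns after at most about ten seconds.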

knoepfel commented 2 years ago

Ah...the reason systemctl stop decisionengine is taking so long is that the default timeout of 10 seconds is being applied to each source in addition to the channel. We'll have to fix that.
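To make the arithmetic concrete, here is a small sketch of how per-worker timeouts add up, using threads as stand-ins for sources. The helper names and the count of 12 sources are assumptions for illustration only, and the shared-deadline variant is one common remedy, not necessarily what the eventual fix does.

```python
import threading
import time


def worst_case_wait(n_sources, timeout):
    """One timeout-bounded join per source, plus one more for the channel itself."""
    return (n_sources + 1) * timeout


def make_stuck_workers(n):
    """Create n daemon threads that never finish (they wait on an event that is never set)."""
    never = threading.Event()
    workers = [threading.Thread(target=never.wait, daemon=True) for _ in range(n)]
    for w in workers:
        w.start()
    return workers


def join_each_with_own_timeout(workers, timeout):
    """Each join gets the full timeout, so the total wait grows with the worker count."""
    for w in workers:
        w.join(timeout)


def join_with_shared_deadline(workers, timeout):
    """All joins share one deadline, so the total wait is bounded by `timeout`."""
    deadline = time.monotonic() + timeout
    for w in workers:
        w.join(max(0.0, deadline - time.monotonic()))
```

With a hypothetical 12 sources and the 10-second default, `worst_case_wait(12, 10)` gives 130 seconds, which is in the 2-3 minute range reported above.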

knoepfel commented 2 years ago

PR #636 should address the issue of systemctl stop decisionengine taking 2-3 minutes.

StevenCTimm commented 2 years ago

The initial problem, namely that de-client --stop-channel followed by de-client --start-channel doesn't work, is still present. systemctl stop decisionengine also still takes far too long (about 2 minutes).

knoepfel commented 2 years ago

@StevenCTimm, did you see my comment above re. --stop-channel vs. --kill-channel? Also, --start-channel will block until the channel is STEADY. What is the desired behavior?

I'm surprised to see that systemctl stop decisionengine is still taking 2 minutes. That should have been addressed with #636, but I may have to check that.

StevenCTimm commented 2 years ago

This is now addressed in PR 648. As soon as that is merged and in a release, we should be good.