clean up controlvm_periodic_work()[upstream]

selltc commented 8 years ago

Creating this ticket as a result of some observations concerns pointed out by @davidker:

I’m trying to wrap my head around the function controlvm_periodic_work, specifically the variable poll_count. Is it really just sleeping for the first 250 iterations of the function and then allows it to go through? Why 250? Why do we even need to sleep that long, to wait for other things to come through, I thought that was what clientregwait was (right above it). Just the whole process of poll_count is confusing me at the moment and I’m just waiting for the upstream to ask us about it.

Also what does handle_command_failed do? Are we supposed to error if it fails a second time? Why are we retrying it and I’m not sure how we every respond with failure if we always fail handle_command. Do we never actually fail it?

Maybe I’ve been up to long today (had to get up an hour earlier than normal), but I think there is improvement in this function.

Based upon this excerpt in the https://ustr-svn-1.na.uis.unisys.com/trac/spar/changeset/6443 / r6443 (9 years ago!) log comment, the purpose of poll_count is solely to make testing scenarios work better, therefore I think we can safely remove it:

This changeset also changes the way in which controlvm messages are simulated (for testing purposes) in the ApplianceOS environment. (The simulation applies when the visorchipset driver is started with testvnic=1 and testmsg=1, which are currently the default settings.) Previously, the simulated messages would not start until AFTER the controlvm channel was discovered. For CM2 testing, this meant that you actually needed to start the driver on the host side that was going to send the controlvm messages. This behavior was changed so that the simulated messages start automatically about 1 minute after the visorchipset driver is loaded, regardless of whether or not a controlvm channel is discovered. This provides the added benefit that testing can also be performed in non-CM2 environments.

I believe handle_command_failed attempts to recover from the scenarios where the controlvm channel queue temporarily fills up, resulting in signalinsert() errors. HOWEVER, we never hit this error condition until we started to use the controlvm channel for file transfer operations (CONTROLVM_TRANSMIT_FILE) in https://ustr-svn-1.na.uis.unisys.com/trac/spar/changeset/20223 / r20223. Of course, these file transfer operations are never used in guest environments, so we could arguably simplify things by removing this retry logic.

selltc commented 8 years ago

I'm getting cold feet now about removing the handle_command_failed logic. Just because we can't think of a current scenario where controlvm messages might need to be throttled, isn't a good reason to remove retry logic that has already been proven to work.

Getting rid of poll_count is a "no-brainer" though.

davidker commented 8 years ago

Good, my quick look at this earlier this week had my gut was agreeing with you. There are a lot of changes that need to happen for the handle_command_failed logic and I don't want to completely remove it. It was put there once because it was needed and worked.

selltc commented 8 years ago

Glad we agree.

Commit 25c8f86e6a3d98ea6e752eff2b239067381f7995 (staging: unisys: visorbus: remove unnecessary poll_count logic) regression tests and checkpatches clean.

davidker commented 8 years ago

Thanks Tim.

I've cherry-picked the patch over to upstream-next. I'm in the process of submitting it and Alex's patches. I'm switching this issue to queued/[upstream-next].

selltc commented 8 years ago

Thanks David. I should have issue #10 wrapped up shortly too.

davidker / unisys

clean up controlvm_periodic_work()[upstream] #15