POETSII / Orchestrator

The Orchestrator is the configuration and run-time management system for POETS platforms.

Add HaveIdleWork to avoid excessive CPU cost for spinning. #236

Open m8pple opened 3 years ago

m8pple commented 3 years ago

This is related to #235. Main changes are:

Tested locally, though not while running hardware. The backoff currently bottoms out at a 10 Hz polling rate; the cap could probably be reduced so polling never drops below 100 Hz or even 1 kHz and still give most of the benefit. The backoff schedule could also be adjusted.

(cherry picked from commit 7ce6aefe3adacff19762bfa0cbf8b9701690cc37)
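For illustration, the general shape of the loop being changed is roughly the following. This is an illustrative sketch only, with hypothetical helper names (Quit, PollMPI, OnIdle), not the exact code in this PR:

// Illustrative backoff spinner (hypothetical names; not the patch itself).
// Poll flat-out while there is traffic or idle work, then back off
// exponentially, capping the sleep at 100 ms so polling never drops below 10 Hz.
#include <algorithm>
#include <chrono>
#include <thread>

// Stand-ins for the real spinner plumbing (assumed, not actual Orchestrator API):
bool Quit();          // true when the process is shutting down
bool PollMPI();       // drain pending MPI traffic; true if anything was handled
bool HaveIdleWork();  // the new hook added by this PR
bool OnIdle();        // one slice of idle work; true if anything was done

void SpinLoop()
{
   std::chrono::microseconds delay(0);
   const std::chrono::microseconds maxDelay(100000);    // 100 ms == 10 Hz floor
   while (!Quit()) {
      bool busy = PollMPI();
      if (HaveIdleWork()) busy = OnIdle() || busy;
      if (busy) delay = std::chrono::microseconds(0);   // work done: reset the backoff
      else {
         if (delay.count()) std::this_thread::sleep_for(delay);
         delay = std::min(maxDelay,
                          std::max(std::chrono::microseconds(50), delay * 2));
      }
   }
}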

heliosfa commented 3 years ago

Some care needs to be taken in implementing this and we may even need to look at making it switchable.

@DrongoTheDog will need to confirm, but I believe the design intent here is that the MPI spinners spin as fast as possible to drain the MPI network as quickly as possible given that there could be large bursts of data that overwhelm the statically sized MPI buffers.

On a beefcake of a system this is not a huge issue, but I can definitely see problems on laptops/desktops with fewer cores or running on battery.

DrongoTheDog commented 3 years ago

On 14/06/2021 14:23, Graeme Bragg wrote:


Some care needs to be taken in implementing this and we may even need to look at making it switchable.

@DrongoTheDog will need to confirm, but I believe the design intent here is that the MPI spinners spin as fast as possible to drain the MPI network as quickly as possible given that there could be large bursts of data that overwhelm the statically sized MPI buffers.

The inter-process comms inside the Orchestrator are based on non-blocking (immediate) reads, so there is a message buffer. One can (and one did) chuck away the default buffer and attach one's own; this is done in the CommonBase constructor and it's currently set to 1 GByte. (Notice the use of malloc/free; the standard says not to use new/delete or it'll all fall over. So I tried, obviously, and it all fell over, so I went back to doing what I was told.)
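(For anyone not familiar with that mechanism, the usual shape is something like the sketch below, assuming the standard MPI_Buffer_attach route; the real CommonBase constructor may well differ in detail.)

// Sketch of attaching a user-supplied MPI buffer (illustrative; not the real
// CommonBase code). Note malloc/free rather than new/delete, per the above.
#include <cstdlib>
#include <mpi.h>

const int BUFFER_BYTES = 1 << 30;              // 1 GByte, as mentioned above

void AttachCommsBuffer()
{
   void * buf = malloc(BUFFER_BYTES);
   if (buf != 0) MPI_Buffer_attach(buf, BUFFER_BYTES);
}

void DetachCommsBuffer()
{
   void * buf;
   int    size;
   MPI_Buffer_detach(&buf, &size);             // MPI hands the pointer back
   free(buf);
}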

When I chucked the source over the fence to MLV and GMB, there existed a vestigial (and subtly documented) POETS> test /flood= command, which (amongst other things) induced a message storm in the MPI fabric, with a view to seeing how far I could reasonably abuse MPI and get away with it. The answer turned out to be quite a lot; you can quite comfortably (as far as MPI is concerned) exfiltrate many Mbytes of data in SGLs (sodding great lumps).

On a beefcake of a system this is not a huge issue, but I can definitely see problems on laptops/desktops with fewer cores or running on battery.

The only parameter that matters is memory footprint.


m8pple commented 3 years ago

If a flood of arriving messages is a concern (which I can see), then presumably the correct approach is to move to a blocking MPI receive once HaveIdleWork returns false. Maybe the documentation for HaveIdleWork could be changed to:

/*! Return true if there is something we want to do in OnIdle. If HaveIdleWork
  returns false it is guaranteed it will never return true again unless a message
  is received. */
virtual bool           HaveIdleWork();

That would allow the implementation to switch from MPI_Iprobe to MPI_Probe, which would give the minimum possible latency for receiving messages (helping to avoid flooding) while also maximising the efficiency of the loop.
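Concretely, the spinner's probe step would look something like this (a sketch of the idea only, not a proposed patch):

// Sketch of the Iprobe/Probe switch described above (illustrative only).
#include <mpi.h>

void WaitForTraffic(bool haveIdleWork)
{
   MPI_Status status;
   int flag = 0;
   if (haveIdleWork) {
      // Idle work pending: poll without blocking so OnIdle keeps running.
      MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
   } else {
      // Nothing to do until traffic arrives: block, with no spinning and
      // minimum wake-up latency.
      MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
      flag = 1;
   }
   if (flag) { /* MPI_Recv and dispatch the message as usual */ }
}

One caveat: some MPI implementations busy-poll internally even inside blocking calls like MPI_Probe, so how much CPU the blocking form actually saves is implementation-dependent.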

(Not saying this should be done - it's probably not worth the effort. Just reset the delay to 0 if HaveIdleWork returns true, and you get almost all of the CPU reduction benefit.)

DrongoTheDog commented 3 years ago

Hang on, chaps. Let's get this into perspective.

The design intent is that MPISpinner brokers comms within the MPI universe, which are expected to be rare: user input, logserver traffic, commands from user to motherships and high-level responses back. (Recall that the console read sits in its own thread and injects messages into the spinner on an equal footing to everything else.) Root overloads CommonBase::OnIdle for the batch system to work, and so does LogServer, although I can't remember why.

Using blocking IO will almost certainly bite us in the bum at some point, and there will be a gnailing and a washing of teeth.

To shift huge slabs of data into (and then presumably out of) the MPI universe, MPISpinner is not the mechanism of choice: if you want to write to a datafile, let the motherships do it directly. Instrumentation packets need to go to a Monitor process, to be sure, but we haven't got one yet so it can't possibly be broken. GMB wants to talk to Externals directly via sockets (which I think is daft, but hey...)

It seems to me that the only driver for this is excessive CPU load and GMB's feet getting hot. OK, poke a 100 ms delay in the loop and let's get on with our lives. Dynamic delay adjustment here is not worth worrying about.
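In code terms that is nothing more exotic than the line below (a sketch; where exactly it goes in the real spinner loop is for whoever implements it):

// The "poke a 100 ms delay in the loop" fix, more or less verbatim (sketch).
#include <chrono>
#include <thread>

// Call at the bottom of each spinner pass that handled no traffic:
void IdleNap() { std::this_thread::sleep_for(std::chrono::milliseconds(100)); }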

ADB


m8pple commented 3 years ago

I agree - I don't really think it is worth shifting to blocking or doing much else beyond the current approach - but:

It seems to me that the only driver for this is excessive CPU load and GMB's feet getting hot. OK, poke a 100 ms delay in the loop and let's get on with our lives. Dynamic delay adjustment here is not worth worrying about.

I don't think this is a trivial issue that only applies to people running laptops; it should be taken seriously. Running the stock 1.0.0-alpha Orchestrator on Ayres (which has hardware), it immediately and permanently consumes 600% CPU until it exits. Ayres only has 6 cores and 12 hyper-threads (IIRC it is an i7), so the Orchestrator by default consumes 100% of the true CPU capacity of the server - any other useful task will have to compete with the spinners.

Some of those threads will probably switch to doing real work during various stages, but it's difficult to see how it won't make tasks like compose slower if all your parallel g++ processes have to compete with the spinners. It also means you won't get a frequency boost on the cores that are doing the important work, because the spinners reduce the achievable boost frequency. I guess at least they don't pollute the cache.

So yes - I think the hacky delay fix is fine, but it's not just a case of stopping laptops heating up or Graeme's feet getting warm. Spinning cores wreck performance in a multi-core system through lots of second-order hardware and OS effects, so busy-wait loops should be identified and removed.

mvousden commented 3 years ago

@heliosfa @m8pple what's blocking here?