KirillOsenkov / MSBuildStructuredLog

A logger for MSBuild that records a structured representation of executed targets, tasks, property and item values.

Severe lock contention in BlockingCollection in BinLogReader #684

Closed: rolfbjarne closed this issue 8 months ago

rolfbjarne commented 1 year ago

The BlockingCollection in BinLogReader suffers from severe lock contention:

https://github.com/KirillOsenkov/MSBuildStructuredLog/blob/786b9739f97b66428a2dde45346d99c47098ad88/src/StructuredLogger/BinaryLogger/BinLogReader.cs#L71-L74
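For context, the reader hands records to the consumer through a bounded queue; the pattern is roughly this (a simplified, self-contained sketch with stand-in names like `Record`, `ReadNextRecord`, and `ProcessRecord`, not the actual source):

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Stand-ins for the real reading/processing logic.
Record ReadNextRecord() => null;
void ProcessRecord(Record r) { }

// Bounded handoff between the reader (producer) and the viewer's
// tree builder (consumer); Add blocks once 5000 items are queued.
var queue = new BlockingCollection<Record>(boundedCapacity: 5000);

var producer = Task.Run(() =>
{
    Record record;
    while ((record = ReadNextRecord()) != null)
    {
        queue.Add(record); // blocks (and contends on the lock) when full
    }
    queue.CompleteAdding();
});

foreach (var record in queue.GetConsumingEnumerable())
{
    ProcessRecord(record); // blocks when the buffer is empty
}

producer.Wait();

class Record { }
```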

I added some debug code, and the initial results were:

Read 24924 records in 00:00:06.8023352 = 6802.3352 ms

After raising the bounded capacity from 5000 to 50000, I get:

Read 24924 records in 00:00:00.3312354 = 331.2354 ms

which means that >95% of the time spent loading the binlog is spent waiting for a lock.
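The debug code was along these lines (a rough sketch with stand-in names, not the exact measurement; the real thing is in the gist linked further down this thread):

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;

Record ReadNextRecord() => null; // stand-in for the real record reader

var queue = new BlockingCollection<Record>(boundedCapacity: 5000);
var stopwatch = Stopwatch.StartNew();
int count = 0;

Record record;
while ((record = ReadNextRecord()) != null)
{
    queue.Add(record); // with a small buffer, most time is spent blocked here
    count++;
}
queue.CompleteAdding();
stopwatch.Stop();

Console.WriteLine($"Read {count} records in {stopwatch.Elapsed} = {stopwatch.Elapsed.TotalMilliseconds} ms");

class Record { }
```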

This is on macOS with the Avalonia version, but I'm guessing something similar would happen elsewhere as well.

yuehuang010 commented 1 year ago

This is a buffer to smooth out consumer throughput. The consumer (the constructor) is most likely bottlenecked by a very expensive event. I would like to know which event it is, and then try to speed it up or move it to a background thread.

KirillOsenkov commented 1 year ago

My understanding is that the consumer is now parallel, and it's waiting on the producer.

We can try just increasing the bounded capacity, this way the producer will fill a larger buffer and give the consumer more data to burn through faster.

yuehuang010 commented 1 year ago

Only the ProjectEvents are spun off into a separate thread, because those process a lot of properties. The rest remain serialized. In my last profile, the ProjectEvents took up over 60% of the CPU time (when running in series). There might still be others that could be spun out, like the Task or Target events.
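Roughly, the idea is that only the expensive event kind goes to a dedicated worker, while everything else stays on the reading thread (a simplified sketch; `Event`, `ProcessProjectEvent`, and friends are illustrative names, not the library's API, and real code also has to preserve event ordering):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Illustrative event type and handlers.
IEnumerable<Event> ReadEvents() => Array.Empty<Event>();
void ProcessProjectEvent(Event e) { } // heavy: walks many properties/items
void ProcessEvent(Event e) { }        // cheap path, stays inline

var projectEvents = new BlockingCollection<Event>();
var projectWorker = Task.Run(() =>
{
    foreach (var e in projectEvents.GetConsumingEnumerable())
    {
        ProcessProjectEvent(e); // expensive kind handled off-thread
    }
});

foreach (var e in ReadEvents())
{
    if (e.IsProjectEvent)
    {
        projectEvents.Add(e);
    }
    else
    {
        ProcessEvent(e); // everything else remains serialized
    }
}

projectEvents.CompleteAdding();
projectWorker.Wait();

record Event(bool IsProjectEvent);
```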

KirillOsenkov commented 1 year ago

So in your opinion, who is slowing things down? The producer or the consumer?

yuehuang010 commented 1 year ago

Both 😆. The producer is more hardware-limited, as exposed by Apple's faster hardware. The consumer is algorithm-limited, since all the processing happens on the CPU.

Increasing the boundedCapacity will unblock things for now, until someone creates a binlog large enough to hit the cap again.

KirillOsenkov commented 1 year ago

Rolf, we're not seeing a wall-clock-time improvement on Windows after bumping the buffer from 5000 to 50000. I did make the change, but it needs more thought.

rolfbjarne commented 1 year ago

Would just removing the boundedCapacity be a bad idea?

KirillOsenkov commented 1 year ago

Don’t know until we measure!

KirillOsenkov commented 8 months ago

OK, I've done more research, and it looks like it's bounded by the consumer. The producer is reading records faster than the consumer can process them and place them in the tree. I've tried removing the buffer capacity limit, and the buffer just keeps growing, staying fully saturated.

What this means is that as long as we keep around 50,000 records in the buffer, the consumer will be saturated with work to the max, and will never stall for lack of records to process.

The buffer being empty would mean that the consumer is faster than the producer. However, that is not what I'm seeing on Windows on my hardware.

Next steps: optimize the consumer :) Good to know that the producer is not the bottleneck.

KirillOsenkov commented 8 months ago

I've done a lot of optimizations in the past week, and a lot of measurements.

I concluded that increasing the buffer is unlikely to help much, if at all.

If the buffer is hovering near 0, it means the producer is the bottleneck, and the consumer is idling anyway. In this case increasing the buffer won't make the producer fill it any faster, it'll just sit mostly empty.

If the buffer is hovering near the top end (currently 50,000), it means the consumer is the bottleneck, and the producer is able to saturate the buffer and sits idle otherwise. This means increasing the buffer will not make the consumer drain it any faster, since it already has 50,000 items ready to be fed into it.

The buffer is really only useful when the consumer and the producer quickly alternate between which one is the bottleneck. If the producer is momentarily faster and the buffer is empty, it will fill the buffer, thus "storing potential energy" for the consumer. Then, when the tide changes and the consumer is suddenly faster, it has a full buffer waiting for it, so it doesn't stall, but drains the buffer instead, doing useful work.
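One way to see which side is the bottleneck at any moment is to sample the buffer's occupancy while a binlog loads (a diagnostic sketch; `queue` stands in for the reader's bounded collection):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// `queue` stands in for the reader's bounded collection.
var queue = new BlockingCollection<int>(boundedCapacity: 50000);

var monitor = Task.Run(async () =>
{
    // Sample occupancy while loading:
    //   near 0               => producer is the bottleneck, consumer idles
    //   near BoundedCapacity => consumer is the bottleneck, producer idles
    while (!queue.IsCompleted)
    {
        Console.WriteLine($"buffer: {queue.Count}/{queue.BoundedCapacity}");
        await Task.Delay(100);
    }
});
```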

In my testing I usually see the producer being slow in the first part (when it reads all the strings), and towards the end the buffer gets full and the consumer falls behind instead (strings are rarely read towards the end, because most strings have already been mentioned by then).

I'll close this issue, but if in local testing you see that increasing the buffer beyond 50,000 really helps with end-to-end performance, I'll be happy to investigate what is wrong. I only tested on my own hardware, so it could be that you get different results on different hardware.

KirillOsenkov commented 8 months ago

Also @rolfbjarne I found a problem with the way you're measuring things here: https://gist.github.com/rolfbjarne/87a09825dffe9c415183582c74934cd5

You're basically measuring the wall clock time that it takes the producer to read all the records. If you add a large buffer, the producer will just fill the buffer and be done sooner. This is when you'll see drastically better numbers. However, the consumer will still take the same amount of time to drain that queue. It'll still perform the same work processing each item, so the end-to-end time for the user will not change.
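In other words, the timer has to stop when the consumer finishes draining the queue, not when the producer finishes filling it (a sketch with stand-in names illustrating the two numbers):

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading.Tasks;

Record ReadNextRecord() => null; // illustrative stand-ins
void ProcessRecord(Record r) { }

var queue = new BlockingCollection<Record>(boundedCapacity: 50000);
var stopwatch = Stopwatch.StartNew();

var producer = Task.Run(() =>
{
    Record record;
    while ((record = ReadNextRecord()) != null)
    {
        queue.Add(record);
    }
    queue.CompleteAdding();

    // Stopping the timer here only measures how fast the buffer fills;
    // a bigger buffer makes this number look better without helping the user.
    Console.WriteLine($"producer done after {stopwatch.Elapsed}");
});

foreach (var record in queue.GetConsumingEnumerable())
{
    ProcessRecord(record);
}
producer.Wait();

// This is what the user actually experiences: the queue fully drained
// and every record placed in the tree.
Console.WriteLine($"end-to-end: {stopwatch.Elapsed}");

class Record { }
```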

The best is to just observe the elapsed binlog loading time shown in the UI: [screenshot]

rolfbjarne commented 8 months ago

> It'll still perform the same work processing each item, so the end-to-end time for the user will not change.

The measurement method might not be correct/ideal, but I saw a dramatic change in the time it took to load the binlog in the UI when I increased the bounded capacity.

KirillOsenkov commented 8 months ago

If you can still reproduce this with the latest main branch, where I've done a ton of optimizations, send me the binlog, the time to load it with the old and new versions, and the buffer size you used.

It could be that the OS, the hardware, or the specific binlog you're testing with makes the difference. Happy to investigate more if you show me the data :)