LMAX-Exchange / disruptor

High Performance Inter-Thread Messaging Library
http://lmax-exchange.github.io/disruptor/
Apache License 2.0

Lmax Disruptor Version 4.0.0 with MultiProducers blocks in ringbuffer.next() call #463

Open oanesbogdan opened 1 year ago

oanesbogdan commented 1 year ago

Hello Team,

I'm using the LMAX Disruptor with 3 producers and 30 consumers, using the busy-spin wait strategy. The ring buffer size is 65536. The producers publish events at a high rate of 20,000 events/second, and after a while (when the ring buffer fills up) the disruptor hangs: the producers are blocked and the consumers are no longer processing events.

I have included the thread dumps for the producers and consumers below. Consumer stack traces:

                "Thread-72" - Thread t@124
                   java.lang.Thread.State: RUNNABLE
                        at app//com.lmax.disruptor.BusySpinWaitStrategy.waitFor(BusySpinWaitStrategy.java:39)
                        at app//com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
                        at app//com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:159)
                        at app//com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
                        at java.base@11.0.20/java.lang.Thread.run(Thread.java:829)

                   Locked ownable synchronizers:
                        - None

                "Thread-73" - Thread t@125
                   java.lang.Thread.State: RUNNABLE
                        at app//com.lmax.disruptor.BusySpinWaitStrategy.waitFor(BusySpinWaitStrategy.java:39)
                        at app//com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
                        at app//com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:159)
                        at app//com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
                        at java.base@11.0.20/java.lang.Thread.run(Thread.java:829)

                   Locked ownable synchronizers:
                        - None

Producer stack traces:

              "thread-context-16" - Thread t@142
                 java.lang.Thread.State: TIMED_WAITING
                      at java.base@11.0.20/jdk.internal.misc.Unsafe.park(Native Method)
                      at java.base@11.0.20/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:357)
                      at app//com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:136)
                      at app//com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:105)
                      at app//com.lmax.disruptor.RingBuffer.next(RingBuffer.java:263)
              .............
              Locked ownable synchronizers:
                      - None
               ...........
              "pool-9-thread-1" - Thread t@144
                 java.lang.Thread.State: TIMED_WAITING
                      at java.base@11.0.20/jdk.internal.misc.Unsafe.park(Native Method)
                      at java.base@11.0.20/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:357)
                      at app//com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:136)
                      at app//com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:105)
                      at app//com.lmax.disruptor.RingBuffer.next(RingBuffer.java:263)
              .............
              Locked ownable synchronizers:
                      - locked <15a0ebf2> (a java.util.concurrent.ThreadPoolExecutor$Worker)
              .............
              "DefaultQuartzScheduler_Worker-1" - Thread t@147
                 java.lang.Thread.State: TIMED_WAITING
                      at java.base@11.0.20/jdk.internal.misc.Unsafe.park(Native Method)
                      at java.base@11.0.20/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:357)
                      at app//com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:136)
                      at app//com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:105)
                      at app//com.lmax.disruptor.RingBuffer.next(RingBuffer.java:263)
              .............
              Locked ownable synchronizers:
                      - None

The producers are using the following code snippet for publishing events:

              var ringBuffer = disruptor.getRingBuffer();
              final long sequence = ringBuffer.next();
              try {
                  Command command = disruptor.get(sequence);
                  command.setType("SAMPLE");
                  ....
              } finally {
                  disruptor.getRingBuffer().publish(sequence);
              }

and the producers are blocked in an endless loop inside the ringBuffer.next() method.

The consumers use modulo-based load balancing, processing events according to the command's index.
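For illustration only (the actual consumer code is not shown in this issue), here is a minimal sketch of what such a modulo-sharded handler might look like, reusing the Command type from the snippet above; the ShardedCommandHandler name and shard fields are assumptions. The key point is that a handler registered with the Disruptor still advances its sequence for events it skips, so sharding alone should not stall the gating sequence:

    import com.lmax.disruptor.EventHandler;

    // Illustrative handler: each of the 30 consumers only acts on its own shard,
    // but the surrounding BatchEventProcessor advances its sequence for every event.
    public class ShardedCommandHandler implements EventHandler<Command> {
        private final int shard;          // this consumer's index, 0..consumerCount-1
        private final int consumerCount;  // total number of consumers (30 here)

        public ShardedCommandHandler(int shard, int consumerCount) {
            this.shard = shard;
            this.consumerCount = consumerCount;
        }

        @Override
        public void onEvent(Command command, long sequence, boolean endOfBatch) {
            // Shard selection shown here with the ring-buffer sequence; the reporter
            // shards on a command index field instead.
            if (sequence % consumerCount != shard) {
                return; // not this consumer's event; the sequence still advances
            }
            // process the command ...
        }
    }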

After blocking I can see the following values in the disruptor:

          disruptor.getRingBuffer().getCursor() = 86053
          disruptor.getRingBuffer().getBufferSize() = 65536
          disruptor.getRingBuffer().getMinimumGatingSequence() = 20517 
          disruptor.getRingBuffer().remainingCapacity() = 0 
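For reference, plugging these values into the remaining-capacity calculation used by the multi-producer sequencer (roughly bufferSize - (cursor - minimumGatingSequence)) confirms the buffer is exactly full, which is why next() cannot claim a slot until the slowest gating sequence advances:

    // Values reported above.
    long cursor = 86_053L;                 // ringBuffer.getCursor()
    long minimumGatingSequence = 20_517L;  // ringBuffer.getMinimumGatingSequence()
    int bufferSize = 65_536;               // ringBuffer.getBufferSize()

    long remaining = bufferSize - (cursor - minimumGatingSequence);
    System.out.println(remaining);         // 0 -> matches remainingCapacity()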

Could you please help me to solve this issue?

Thank you in advance! Bogdan

grumpyjames commented 9 months ago

Wild guess: It looks like one of the consumers has failed to notify the ringbuffer that it has progressed past a particular sequence (the minimum gating sequence).

We'd need to see the code you're using to set up the consumers and publishers to know more.
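For comparison, here is a minimal sketch of a setup along the lines described in the issue (multi-producer, busy-spin, 30 modulo-sharded consumers), reusing the illustrative ShardedCommandHandler sketched above; the Command type comes from the reporter's snippet, everything else (class names, event factory, thread factory) is assumed. Handlers registered through the DSL get their sequences added as gating sequences automatically; consumers wired up manually as EventProcessors must be registered via ringBuffer.addGatingSequences(...) and must keep advancing their sequences, or the minimum gating sequence stalls exactly as reported:

    import java.util.concurrent.ThreadFactory;

    import com.lmax.disruptor.BusySpinWaitStrategy;
    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.dsl.ProducerType;

    public class DisruptorSetupSketch {
        public static void main(String[] args) {
            ThreadFactory threadFactory = Thread::new;  // placeholder thread factory
            int consumerCount = 30;

            Disruptor<Command> disruptor = new Disruptor<>(
                    Command::new,                // assumes Command has a no-arg constructor
                    65536,
                    threadFactory,
                    ProducerType.MULTI,          // multiple producers, as in the report
                    new BusySpinWaitStrategy());

            // Each handler registered here runs on its own BatchEventProcessor and its
            // sequence is added as a gating sequence, so the ring buffer cannot lap it.
            for (int i = 0; i < consumerCount; i++) {
                disruptor.handleEventsWith(new ShardedCommandHandler(i, consumerCount));
            }

            // If the real application builds EventProcessors by hand instead, their
            // sequences must be added with ringBuffer.addGatingSequences(...) and must
            // keep moving, otherwise producers block in ringBuffer.next() as observed.
            disruptor.start();
        }
    }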

aogburn commented 5 months ago

I note the issue description producers are at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:136) , which is in effect a LockSupport.parkNanos(1) call. Was this issue produced on a RHEL8+ environment perhaps? If so, sounds like this is similar to https://github.com/LMAX-Exchange/disruptor/issues/219 and the increased CPU that short LockSupport.parkNanos intervals will face on RHEL 8+. I also recently found an issue where producers where consuming high CPU on com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:136) when the buffer was full. Once triggered, this could perpetuate the issue state as multiple producers could potentially spin too much on that CPU busy wait and prevent consumers from getting appropriate CPU themselves to clear the buffers. It appears MultiProducerSequencer would need a larger/configurable interval like was done in the SleepingWaitStrategy.java or actually use any larger interval available from its wait strategy as noted in the TODO for this line.