crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.5k stars 1.62k forks source link

`spec/std/channel_spec.cr` gets stuck on Windows with `-Dpreview_mt` #14222

Open HertzDevil opened 10 months ago

HertzDevil commented 10 months ago

Running that spec file will block at the raises if channel was closed test. Increasing CRYSTAL_WORKERS from the default 4 makes two more specs run and pass, but after that the awakes all waiting selects test still blocks.

This reduction, probably the first Windows-specific one we have, similarly blocks at the CRYSTAL_WORKERS + 1-th loop:

100.times do |i|
  p i

  x = Atomic(Int32).new(0)

  spawn do
    x.set(1)

    spawn(same_thread: true) do
      while true
        Fiber.yield
      end
    end
  end

  while x.get == 0
    Fiber.yield
  end
end

The snippet doesn't block if run without -Dpreview_mt or run on Linux.

HertzDevil commented 3 months ago

It looks like all the worker threads eventually got stuck on the fiber_channel.receive line inside Crystal::Scheduler#run_loop. This fiber_channel is a Crystal::FiberChannel backed by ::IO.pipe. This honestly looks suspicious to me, since non-blocking pipes themselves depend on the event loop. Is a more lightweight option available here? cc @ysbaddaden @oprypin

ysbaddaden commented 3 months ago

All threads waiting on FiberChannel means all threads "parked" themselves but they shouldn't be blocked on it, but blocked on the EventLoop (waiting for completions).

So yeah, suspicious.

Note: with Executioncontext the FiberChannel is no longer used.

HertzDevil commented 3 months ago

With two changes I now manage to get the OP snippet and the specs passing:

module Crystal::System::FileDescriptor
  def self.pipe(read_blocking, write_blocking)
    # ...

    r = IO::FileDescriptor.new(r_pipe.address, read_blocking)
    w = IO::FileDescriptor.new(w_pipe.address, write_blocking)
    r.read_buffering = false # disable buffering on the read end
    w.sync = true

    {r, w}
  end
end

lib LibC
  fun PeekNamedPipe(hNamedPipe : HANDLE, lpBuffer : Void*, nBufferSize : DWORD, lpBytesRead : DWORD*, lpTotalBytesAvail : DWORD*, lpBytesLeftThisMessage : DWORD*) : BOOL
end

struct Crystal::FiberChannel
  def receive
    # explicitly peek the pipe until there is data available
    while true
      LibC.PeekNamedPipe(LibC::HANDLE.new(@worker_out.fd), nil, 0, nil, out bytes_avail, nil)
      break if bytes_avail >= sizeof(UInt64)
      Fiber.yield
    end

    oid = @worker_out.read_bytes(UInt64)
    Pointer(Fiber).new(oid).as(Fiber)
  end
end

What's interesting is both ends of the IO.pipe can now be blocking, which makes me believe pipes are indeed unnecessary here. Everything still works if I wrote:

struct Crystal::FiberChannel
  @lock = Crystal::SpinLock.new

  def initialize
    @queue = Deque(Fiber).new
  end

  def send(fiber : Fiber)
    @lock.sync { @queue << fiber }
  end

  def receive
    while true
      fiber = @lock.sync { @queue.shift? }
      return fiber if fiber
      Fiber.yield
    end
  end
end

And eventually:

class Crystal::Scheduler
  {% flag?(:preview_mt) %}
    def run_loop
      spawn_stack_pool_collector

      loop do
        @lock.lock

        if runnable = @runnables.shift?
          @runnables << Fiber.current
          @lock.unlock
          resume(runnable)
        else
          @lock.unlock
          Fiber.yield
        end
      end
    end

    def send_fiber(fiber : Fiber)
      @lock.sync { @runnables << fiber }
    end
  {% end %}
end

Has the codebase changed so much in these years that we don't need fiber_channel anymore...? The pipe was already there since https://github.com/crystal-lang/crystal/commit/b52bfb7aeb213c8f5110620a4d3919b10107d865 as part of #8112, but there were no mentions of the rationale behind using a pipe.

On the other hand, I wonder if this also implies IO.pipe's semantics are underspecified.