I am running the tests for Pluto.jl on GitHub Actions with Windows on Julia 1.6.0-beta, and they fail ~50% of the time with an error that seems unrelated to my code. On previous Julia versions, Pluto's tests pass very consistently.

I believe that it is caused by starting multiple Distributed processes too quickly, which might only be a problem in our tests, but not in general Pluto use.

The 1.6-compatibility PR is here: https://github.com/fonsp/Pluto.jl/pull/842. The change to support 1.6 was just a single line (unrelated to the failures); all other commits are there only to trigger a rerun of the tests.

To run these tests:

```julia
import Pkg
Pkg.activate(temp=true)
Pkg.add(name="Pluto", rev="julia-1.6-compat")
@time Pkg.test("Pluto")
```

Context

The tests fail in @testset "WorkspaceManager". This is the part of Pluto's codebase that uses Distributed to launch and control worker processes for notebooks. The code for Pluto.WorkspaceManager is here.

Summary of test results

On Windows 1.6, about half of the test runs fail. On other OS/Julia versions all tests pass. The Windows 1.6 failures fall into these categories:

- The WorkspaceManager tests failed after the 6-hour timeout with a ReadOnlyMemoryError() ([1], [2], [3]).
- The WorkspaceManager tests failed with an EXCEPTION_ACCESS_VIOLATION ([1]).
- The WorkspaceManager tests failed with an InitError(mod=:Profile, error=ErrorException("could not allocate space for 10000000 instruction pointers")) ([1]).
- The WorkspaceManager tests failed with an OutOfMemoryError ([1]).
- Some test runs have stalled, but I think they will fall into the first category after the timeout ([1], [2]).

My guess of what happened

Some of the test failures have stack traces that point to: https://github.com/fonsp/Pluto.jl/blob/598bd4384f29444631a7c67da8da971ef545e4db/test/WorkspaceManager.jl#L31 This line (31) and the line before it (30) each create a new Distributed process and initialize the notebook runner environment. This happens synchronously, but in quick succession. Many earlier tests also create a new process, but this is the first test where it happens twice with little code in between.

Why I am posting this issue

I understand that this is far from an MWE, and I only have a vague idea of where the tests fail. My hopes are:

- Perhaps this type of error looks familiar, or you have an idea of what changed between 1.5.3 and 1.6.0 that could cause it.
- Pointers on what debugging steps to try next would be helpful. Maybe we can create a stress-test for starting and initializing Distributed processes?
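
As a starting point for such a stress test, here is a minimal sketch. It is not Pluto's actual initialization code: `init_worker` and `stress_test` are hypothetical names, and loading `InteractiveUtils` on the worker is only an assumed stand-in for the real notebook environment setup. Each round starts two workers back to back, initializes each one right after it starts, and then removes them again, which is roughly the pattern I suspect triggers the failures.

```julia
using Distributed

# Hypothetical stand-in for Pluto's per-notebook worker initialization:
# load a stdlib module on the worker, then round-trip a trivial computation.
function init_worker(pid)
    remotecall_fetch(Core.eval, pid, Main, :(using InteractiveUtils))
    return remotecall_fetch(+, pid, 1, 1)
end

# Run `rounds` rounds; each round starts `per_round` workers in quick
# succession. Returns the number of rounds that threw an exception.
function stress_test(rounds; per_round=2)
    failures = 0
    for i in 1:rounds
        pids = Int[]
        try
            for _ in 1:per_round
                pid = only(addprocs(1))  # start one new worker process
                push!(pids, pid)
                init_worker(pid) == 2 || error("unexpected result from worker $pid")
            end
        catch err
            failures += 1
            @warn "Round failed" round=i exception=(err, catch_backtrace())
        finally
            # Clean up so every round starts from a single-process state.
            isempty(pids) || rmprocs(pids; waitfor=30)
        end
    end
    return failures
end

failures = stress_test(5)
println("failed rounds: ", failures)
```

Increasing `rounds` or `per_round` should make an intermittent failure more likely to show up. A hard crash such as an access violation may take down the whole process rather than being caught here, so running this under CI with full logs is probably still needed.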

JuliaLang / julia

Memory access errors in Pluto.jl on Windows 1.6.0-beta #39270
