JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Memory access errors in Pluto.jl on Windows 1.6.0-beta #39270

Closed: fonsp closed this issue 3 years ago

fonsp commented 3 years ago

I am running tests for Pluto.jl on GitHub Actions with Windows on Julia 1.6.0-beta, and it fails ~50% of the time with an error that seems unrelated to my code. On previous Julia versions, Pluto's tests pass very consistently.

I believe that it is caused by starting multiple Distributed processes too quickly, which might only be a problem in our tests, but not in general Pluto use.

The 1.6-compatibility PR (https://github.com/fonsp/Pluto.jl/pull/842) changes just a single line to support 1.6, and that line is unrelated to the failures; all other commits are only there to trigger a rerun of the tests.

To run these tests:

import Pkg
Pkg.activate(temp=true)
Pkg.add(name="Pluto", rev="julia-1.6-compat")
@time Pkg.test("Pluto")

Context

The tests fail in @testset "WorkspaceManager". This is the part of Pluto's codebase that uses Distributed to launch and control worker processes for notebooks. The code for Pluto.WorkspaceManager is here.

Summary of test results

On Windows 1.6, about half of the tests fail. On other OS/Julia versions all tests pass. Windows 1.6 failures fall in these categories:

My guess of what happened

Some of the test failures have a stack trace that points to https://github.com/fonsp/Pluto.jl/blob/598bd4384f29444631a7c67da8da971ef545e4db/test/WorkspaceManager.jl#L31 (line 31). That line and the line before it (30) both create a new Distributed process and initialize the notebook runner environment. This happens synchronously, but in quick succession. Many earlier tests also create a new process, but this is the first test where it happens twice with little code in between.
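
For readers who want to poke at this outside of Pluto, the pattern described above (two worker launches back to back) can be sketched roughly as follows. This is not Pluto's code: `start_worker` is a hypothetical helper, and the single `remotecall_fetch` call merely stands in for the much larger notebook-runner setup.

```julia
using Distributed

# Hypothetical helper (not part of Pluto): launch one worker process
# and make a round trip to it, as a stand-in for real initialization.
function start_worker()
    pid = only(addprocs(1))        # addprocs returns a vector of new worker ids
    remotecall_fetch(myid, pid)    # blocks until the worker answers
    return pid
end

# Two launches in quick succession, with no code in between.
pid_a = start_worker()
pid_b = start_worker()

# Confirm both workers respond, then shut them down again.
results = [remotecall_fetch(myid, pid) for pid in (pid_a, pid_b)]
rmprocs(pid_a, pid_b)
```

Running this in a loop on a Windows machine with Julia 1.6.0-beta might or might not reproduce the failures; it only isolates the launch pattern, not the rest of the test suite.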

Why I am posting this issue

I understand that this is far from a MWE, and I only have a vague idea of where the tests fail. My hopes are:

fonsp commented 3 years ago

My "guess of what happened" was wrong: the issue is still there after adding sleep calls.
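
The kind of experiment meant here is roughly the following sketch, which pauses between worker launches. The helper name `launch_spaced` and the delay value are assumptions made for illustration; the actual sleep calls were added to Pluto's test file.

```julia
using Distributed

# Hypothetical helper: launch `n` workers one at a time,
# sleeping `delay` seconds between consecutive launches.
function launch_spaced(n::Integer; delay::Real = 1.0)
    pids = Int[]
    for i in 1:n
        append!(pids, addprocs(1))
        i < n && sleep(delay)      # pause before the next launch only
    end
    return pids
end

pids = launch_spaced(2; delay = 0.5)
alive = all(remotecall_fetch(myid, p) == p for p in pids)
rmprocs(pids)
```

Since the failures persisted with delays in place, launch timing alone does not seem to explain them.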

I still don't know what is causing the CI failures, but I will close this issue until I have something more concrete. (My next guess is that the GitHub Actions machine ran out of memory because of a leak.)
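
One way to check the out-of-memory guess would be to log free system memory around each testset. The sketch below assumes nothing about Pluto; `log_free_memory` is a hypothetical helper built on `Sys.free_memory` and `Sys.total_memory` from Base.

```julia
# Hypothetical helper: print and return the free system memory in MiB.
function log_free_memory(label::AbstractString)
    free_mib = Sys.free_memory() / 2^20
    total_mib = Sys.total_memory() / 2^20
    println(label, ": ", round(free_mib; digits = 1), " MiB free of ",
            round(total_mib; digits = 1), " MiB")
    return free_mib
end

before = log_free_memory("before testset")
# ... run a testset here ...
after = log_free_memory("after testset")
```

A steady drop in the reported value across testsets would support the leak theory; a flat value would point elsewhere.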

(I released the next Pluto update which is Julia 1.6 compatible!)

Sorry for the trouble!