facebookresearch / nle

The NetHack Learning Environment

Process crashes when using two NLE instances sequentially (on MacOS for Debug Builds). #254

Open heiner opened 2 years ago

heiner commented 2 years ago

🐛 Bug

The test in #253 should pass but fails on MacOS for Debug builds.

To Reproduce

import random

import gym
import nle  # noqa: F401  (importing nle registers the NetHack environments with gym)

ACTIONS = [0, 1, 2]

def main():
    envs = [gym.make("NetHackScore-v0") for _ in range(2)]

    env, *queue = envs
    env.reset()

    num_resets = 1

    while num_resets < 10:
        _, _, done, _ = env.step(random.choice(ACTIONS))
        if done:
            print("one env done")
            queue.append(env)
            env = queue.pop(0)
            print("about to reset one env")
            env.reset()
            num_resets += 1

main()

Environment

Collecting environment information...
NLE version: 0.7.3+08b9280
PyTorch version: 1.9.0
Is debug build: No
CUDA used to build PyTorch: None

OS: Mac OSX 11.5.1
GCC version: Could not collect
CMake version: version 3.20.0

Python version: 3.8
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] numpysane==0.34
[pip3] torch==1.9.0
[conda] blas 1.0 mkl
[conda] mkl 2019.4 233
[conda] mkl-service 2.3.0 py38h9ed2024_0
[conda] mkl_fft 1.3.0 py38ha059aab_0
[conda] mkl_random 1.1.1 py38h959d312_0
[conda] pytorch 1.9.0 py3.8_0 pytorch

heiner commented 2 years ago

This appears to only trigger on my personal machine, not on CI or for anyone else. Closing for now.

heiner commented 2 years ago

OK, this does break on CI as well, but only (1) on MacOS, and (2) when using a Debug build: https://github.com/facebookresearch/nle/runs/4359406818?check_suite_focus=true

heiner commented 2 years ago

Issue demonstrated in #290.

heiner commented 2 years ago

Related to the dlopen/dlclose dance not actually unloading the library in this specific case (which it is never guaranteed to do), as in this issue.

Possible solution: https://gist.github.com/heiner/bc78064fec32174e1a216dbd5fbc6503
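To illustrate the dlopen behaviour mentioned above, here is a minimal sketch (not taken from the linked gist, and not NLE's actual loading code). It assumes a POSIX system and uses the C library purely as a stand-in for any shared object; `load_twice` is a hypothetical helper name. Opening an already-loaded library typically hands back the same reference-counted handle, so a later dlclose may only decrement the count and leave the library, along with its global state, in the process.

```python
import ctypes
import ctypes.util


def load_twice(name=None):
    """Load the same shared library twice and return both raw handles.

    Hypothetical helper for illustration only.
    """
    if name is None:
        # Stand-in library: the C library. If find_library cannot locate
        # it, this is None and CDLL(None) opens the main program instead.
        name = ctypes.util.find_library("c")
    first = ctypes.CDLL(name)
    second = ctypes.CDLL(name)
    # _handle is the system handle returned by dlopen.
    return first._handle, second._handle


if __name__ == "__main__":
    h1, h2 = load_twice()
    print("same handle:", h1 == h2)
```

If both loads report the same handle, the two "instances" share one copy of the library's globals, which is why code that relies on dlclose to reset state between instances can misbehave.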

JupiLogy commented 1 year ago

Hi, just wondering if it crashes with an error message at all. I'm getting a Segmentation fault when running nle, specifically when the Nethack.reset() function is called - though it's not every time. Not sure if it's a separate issue. My MWE frustratingly didn't have the issue.

EDIT: I got it working by reducing the action space, as I noticed the crash only happened when executing certain actions.