facebookresearch / nle

The NetHack Learning Environment

Bus Error Entering DLVL 10-12 #333

Closed paulkent-um closed 1 year ago

paulkent-um commented 1 year ago

🐛 Bug

NLE intermittently hits a bus error when the agent enters a floor between 10 and 12 of the main dungeon, or between 7 and 9 in dungeon 2 (which I think is the Gnomish Mines?).

To Reproduce

Steps to reproduce the behavior:

  1. Go to nle/nle/env/tasks.py and temporarily comment out line 327, the line that prevents a NetHackChallenge env from being created with wizard mode enabled.
  2. Run the following code and watch it crash with a bus error:
#!/usr/bin/env python3
import nle
import gym
import aicrowd_gym

keyLookup = {
    "a" : 24,
    "b" : 6,
    "c" : 30,
    "d" : 33,
    "e" : 35,
    "f" : 40,
    "g" : 72,
    "h" : 3,
    "i" : 44,
    "j" : 2,
    "k" : 0,
    "l" : 1,
    "m" : 54,
    "n" : 5,
    "o" : 57,
    "p" : 60,
    "q" : 64,
    "r" : 67,
    "s" : 75,
    "t" : 91,
    "u" : 4,
    "v" : 98,
    "w" : 102,
    "x" : 87,
    "y" : 7,
    "z" : 104,
    "A" : 89,
    "B" : 14,
    "C" : 27,
    "D" : 34,
    "E" : 36,
    "F" : 39,
    "G" : 73,
    "H" : 11,
    "I" : 45,
    "J" : 10,
    "K" : 8,
    "L" : 9,
    "M" : 55,
    "N" : 13,
    "O" : 58,
    "P" : 63,
    "Q" : 66,
    "R" : 69,
    "S" : 74,
    "T" : 88,
    "U" : 12,
    "V" : 43,
    "W" : 99,
    "X" : 95,
    "Y" : 15,
    "Z" : 28,
    "." : 18,
    "," : 61,
    "<" : 16,
    ">" : 17,
    ":" : 51,
    "0" : 110,
    "1" : 111,
    "2" : 112,
    "3" : 113,
    "4" : 114,
    "5" : 115,
    "6" : 116,
    "7" : 117,
    "8" : 118,
    "9" : 119,
    "$" : 120,
    "+" : 105,
    "-" : 106,
    " " : 107,
    "*" : 76,
    "#" : 20,
    "~" : 19, # represents enter
    "\\" : 38 # represents escape
}

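# wizard=True is only accepted after the tasks.py edit from step 1.
env = aicrowd_gym.make("NetHackChallenge-v0", wizard=True, savedir=None)

# Wizard-mode level teleport to level 10 ("~" stands for Enter, see keyLookup).
sequence = "#wizlevelport~10~"

# The crash is intermittent, so repeat the teleport over many fresh episodes.
for episode in range(1000):
    print(episode)
    env.reset()
    for key in sequence:
        env.step(keyLookup[key])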
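
As an aside on the hard-coded table above: its indices look like positions in the environment's ordered action list, with each action's value being the key code of the character (for example, index 0 is `k`). Under that assumption, a table like this could be derived instead of typed by hand. The sketch below is hypothetical: `build_key_lookup` and its `aliases` parameter are names invented for illustration, and it is demonstrated on a toy list rather than on NLE itself. Passing the real action list (probably `nle.nethack.ACTIONS`, but that is an assumption to check against the installed version) should work the same way if its entries convert to `int`.

```python
def build_key_lookup(actions, aliases=None):
    """Map characters to action indices.

    `actions` is an ordered sequence of int-like key codes.
    `aliases` maps a stand-in character to the key code it represents
    (the script above uses "~" for Enter and "\\" for Escape).
    """
    index_of = {}
    for index, action in enumerate(actions):
        # Keep the first index if a key code appears more than once.
        index_of.setdefault(int(action), index)

    # Only printable ASCII codes map directly to a character.
    lookup = {chr(code): index for code, index in index_of.items() if 32 <= code < 127}

    # Stand-in characters for non-printable keys.
    for char, code in (aliases or {}).items():
        if code in index_of:
            lookup[char] = index_of[code]
    return lookup


if __name__ == "__main__":
    # Toy action list standing in for the real one: k, l, j, Enter, #, Escape.
    toy_actions = [ord("k"), ord("l"), ord("j"), 13, ord("#"), 27]
    print(build_key_lookup(toy_actions, aliases={"~": 13, "\\": 27}))
```

Deriving the table this way would keep the script valid if the action ordering ever changed, at the cost of relying on the value-equals-key-code assumption.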

Environment

NLE version: 0.8.1
PyTorch version: 1.11.0
Is debug build: No
CUDA used to build PyTorch: None

OS: Mac OSX 12.1
GCC version: Could not collect
CMake version: Could not collect

Python version: 3.8
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] msgpack-numpy==0.4.8
[pip3] numpy==1.22.4
[pip3] torch==1.11.0
[conda] Could not collect

Additional context

If there's any more information you need, I'll try my best to provide it, but my ability to troubleshoot this problem is very limited. The bus error crashes my debugger without giving me a stack trace or anything, and execution seems to get passed to _pynethack.cpython-38-darwin.so, which is not a file format my IDE supports.
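
When a debugger gives nothing back for a crash like this, Python's standard `faulthandler` module is one option for getting at least partial information: it installs handlers for fatal signals, including SIGBUS, and writes the Python-level traceback to a file. The sketch below is a minimal example, not something taken from NLE; the helper name `enable_crash_log` and the log file name are made up for illustration. It would only show Python frames (most likely the `env.step` call), not the C frames inside `_pynethack`, and whether the handler manages to run for this particular crash is not guaranteed.

```python
import faulthandler


def enable_crash_log(path):
    """Write Python tracebacks for fatal signals to `path`; return the open file."""
    # The file must stay open for as long as faulthandler is enabled.
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    # Call this once at the top of the reproduction script, before env.reset().
    crash_log = enable_crash_log("nle_crash.log")
```

For the native frames, a core dump opened in a native debugger such as lldb is still needed.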

heiner commented 1 year ago

Backtrace for this issue. I've added a test for this in #334.

(lldb) target create --core "/cores/core.97379"
Core file '/cores/core.97379' (arm64) was loaded.
(lldb) bt
* thread #1
  * frame #0: 0x00000001b00c6d98 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x00000001b00fbee0 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x00000001afffe680 libsystem_c.dylib`raise + 32
    frame #3: 0x00000001b01134a4 libsystem_platform.dylib`_sigtramp + 56
    frame #4: 0x000000011eb9200c tmpsgbphos6libnethack.so`walkfrom(x=35, y=15, typ='\0') at mkmaze.c:1195:15 [opt]
    frame #5: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=35, y=15, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #6: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=37, y=15, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #7: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=39, y=15, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #8: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=41, y=15, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #9: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=43, y=15, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #10: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=45, y=15, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #11: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=47, y=15, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #12: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=49, y=13, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #13: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=49, y=15, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #14: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=51, y=15, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #15: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=53, y=15, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #16: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=55, y=15, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #17: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=57, y=15, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #18: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=59, y=15, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #19: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=61, y=15, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #20: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=61, y=13, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #21: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=61, y=11, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #22: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=59, y=11, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #23: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=59, y=13, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #24: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=57, y=13, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #25: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=55, y=13, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #26: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=53, y=13, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #27: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=51, y=13, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #28: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=51, y=11, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #29: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=51, y=9, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #30: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=49, y=9, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #31: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=47, y=9, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #32: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=45, y=9, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #33: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=45, y=11, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #34: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=45, y=13, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #35: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=43, y=13, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #36: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=43, y=11, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #37: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=43, y=9, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #38: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=41, y=9, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #39: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=39, y=9, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #40: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=37, y=9, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #41: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=37, y=11, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #42: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=35, y=11, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #43: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=33, y=11, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #44: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=31, y=11, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #45: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=29, y=11, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #46: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=29, y=9, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #47: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=27, y=9, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #48: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=27, y=11, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #49: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=25, y=11, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #50: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=25, y=9, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #51: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=25, y=7, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #52: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=25, y=5, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #53: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=23, y=5, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #54: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=21, y=5, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #55: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=19, y=5, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #56: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=17, y=5, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #57: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=15, y=5, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #58: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=15, y=7, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #59: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=17, y=7, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #60: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=17, y=9, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #61: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=15, y=11, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #62: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=13, y=11, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #63: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=11, y=11, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #64: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=11, y=13, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #65: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=13, y=13, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #66: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=13, y=15, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #67: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=11, y=15, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #68: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=9, y=15, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #69: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=9, y=13, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #70: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=9, y=11, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #71: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=9, y=9, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #72: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=11, y=9, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #73: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=11, y=7, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #74: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=13, y=7, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #75: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=13, y=5, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #76: 0x000000011eb91ef0 tmpsgbphos6libnethack.so`walkfrom(x=11, y=5, typ='\x18') at mkmaze.c:1199:9 [opt]
    frame #77: 0x000000011ec357ec tmpsgbphos6libnethack.so`sp_level_coder [inlined] spo_mazewalk(coder=0x0000600003de8100) at sp_lev.c:4793:5 [opt]
    frame #78: 0x000000011ec35714 tmpsgbphos6libnethack.so`sp_level_coder(lvl=<unavailable>) at sp_lev.c:5494:13 [opt]
    frame #79: 0x000000011ec2d8e8 tmpsgbphos6libnethack.so`load_special(name=<unavailable>) at sp_lev.c:6055:18 [opt]
    frame #80: 0x000000011eb92248 tmpsgbphos6libnethack.so`makemaz(s=<unavailable>) at mkmaze.c:1014:13 [opt]
    frame #81: 0x000000011eb8aef8 tmpsgbphos6libnethack.so`mklev at mklev.c:0 [opt]
    frame #82: 0x000000011eb8acd4 tmpsgbphos6libnethack.so`mklev at mklev.c:1004:5 [opt]
    frame #83: 0x000000011eb0f2c0 tmpsgbphos6libnethack.so`goto_level(newlevel=0x0000000104e57748, at_stairs=<unavailable>, falling='\0', portal='\0') at do.c:1448:9 [opt]
    frame #84: 0x000000011eb0fe54 tmpsgbphos6libnethack.so`deferred_goto at do.c:1756:9 [opt]
    frame #85: 0x000000011ec453ec tmpsgbphos6libnethack.so`level_tele at teleport.c:1025:9 [opt]
    frame #86: 0x000000011eaee318 tmpsgbphos6libnethack.so`wiz_level_tele at cmd.c:946:9 [opt]
    frame #87: 0x000000011eaf17d8 tmpsgbphos6libnethack.so`rhack(cmd="\U00000016") at cmd.c:4929:23 [opt]
    frame #88: 0x000000011eac4574 tmpsgbphos6libnethack.so`moveloop(resuming=<unavailable>) at allmain.c:0 [opt]
    frame #89: 0x000000011ec8d53c tmpsgbphos6libnethack.so`unixmain(argc=1, argv=0x0000000104e57fd0) at unixmain.c:354:5 [opt]
    frame #90: 0x000000011ebbda68 tmpsgbphos6libnethack.so`mainloop(ctx_transfer=<unavailable>) at nle.c:195:5 [opt]
    frame #91: 0x000000011ec9f928 tmpsgbphos6libnethack.so`make_fcontext at make_arm64_aapcs_macho_gas.S:60
heiner commented 1 year ago

This turns out to be a literal stack overflow: level generation here uses more stack than usual (see the deep `walkfrom` recursion in the backtrace), and that overflows the heap-allocated stack we context switch to. Will update #334 with a fix.
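
To illustrate why a maze walk can need so much stack, here is a minimal Python sketch of a generic randomized depth-first maze carver. It is not NetHack's `walkfrom` and not the fix in #334; the function name `max_walk_depth` and the 79x21 grid are assumptions chosen only for illustration. It uses an explicit stack and reports how deep a recursive version of the same walk would have nested.

```python
import random


def max_walk_depth(width, height, seed=0):
    """Carve a maze over odd-coordinate cells; return (max depth, cells visited).

    The explicit `stack` plays the role of the call stack in a recursive
    implementation, so its peak length equals the recursion depth needed.
    """
    rng = random.Random(seed)
    start = (1, 1)
    visited = {start}
    stack = [start]
    max_depth = 1

    while stack:
        x, y = stack[-1]
        # Unvisited cells two steps away, staying inside the grid border.
        options = [
            (x + dx, y + dy)
            for dx, dy in ((2, 0), (-2, 0), (0, 2), (0, -2))
            if 0 < x + dx < width
            and 0 < y + dy < height
            and (x + dx, y + dy) not in visited
        ]
        if not options:
            stack.pop()  # dead end: equivalent to returning from a recursive call
            continue
        nxt = rng.choice(options)
        visited.add(nxt)
        stack.append(nxt)  # equivalent to making a recursive call
        max_depth = max(max_depth, len(stack))

    return max_depth, len(visited)


if __name__ == "__main__":
    depth, cells = max_walk_depth(79, 21, seed=0)
    print(f"visited {cells} cells, peak depth {depth}")
```

The peak depth varies with the random choices, which fits a crash that only shows up on some runs. In a recursive C implementation every level of depth costs a native stack frame, so a fixed-size stack can be exhausted by an unlucky walk; a larger stack or an iterative walk like the one above would both avoid that.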