allenai / ai2thor

An open-source platform for Visual AI.
http://ai2thor.allenai.org
Apache License 2.0

Segfault in AI2-Thor Docker #645

Open SamNPowers opened 3 years ago

SamNPowers commented 3 years ago

I'm seeing this error in the Player.log file:

=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
=================================================================

Caught fatal signal - signo:11 code:1 errno:0 addr:0x7f927cb4d00c
Obtained 12 stack frames.
#0  0x007f927a1e78a0 in funlockfile
#1  0x007f92757a2c6d in _nv044glcore
#2  0x007f927570e714 in _nv044glcore
#3  0x007f927b09c76c in ApiGLES::SetVertexArrayAttrib(unsigned int, unsigned int, VertexFormat, unsigned char, unsigned int, void const*)
#4  0x007f927b07c6fb in SetVertexStateGLES(ShaderChannelMask, VertexChannelsInfo const&, GfxBuffer* const*, unsigned int const*, int, unsigned int, unsigned long)
#5  0x007f927b0884b6 in GfxDeviceGLES::DrawBuffers(GfxBuffer*, unsigned int, GfxBuffer* const*, unsigned int const*, int, DrawBuffersRange const*, int, VertexDeclaration*)
#6  0x007f927b049032 in GfxDeviceWorker::RunCommand(ThreadedStreamBuffer&)
#7  0x007f927b04971b in GfxDeviceWorker::RunExt(ThreadedStreamBuffer&)
#8  0x007f927b03f2f5 in GfxDeviceWorker::RunGfxDeviceWorker(void*)
#9  0x007f927b44561a in Thread::RunThreadWrapper(void*)
#10 0x007f927a1dc6db in start_thread
#11 0x007f9279f05a3f in clone

I'm running on AWS using a modified version of https://github.com/allenai/ai2thor-docker. It seems to work sometimes (no segfault), but I haven't exactly pinned down when it does and doesn't work. Any advice on further steps to take? Thanks!

ekolve commented 3 years ago

Are you running more than one Xorg server on this host? Do you have more than one instance of ai2thor-docker running? We have seen instances where running more than one Xorg server causes failures.
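
One way to answer the first question is to count the X server processes on the host. The following is a minimal sketch, assuming a Linux host where `ps -eo comm=` is available and where the server process is named `Xorg` or `X`; the function names are hypothetical and not part of ai2thor.

```python
import subprocess

# Process names treated as an X server (assumption: no renamed binaries).
X_SERVER_NAMES = {"Xorg", "X"}


def count_xorg(ps_output):
    """Count lines of `ps -eo comm=` output that name an X server."""
    return sum(
        1 for line in ps_output.splitlines() if line.strip() in X_SERVER_NAMES
    )


def running_xorg_count():
    """Query the host with ps and count X servers (Linux only)."""
    output = subprocess.check_output(["ps", "-eo", "comm="]).decode()
    return count_xorg(output)


# Canned sample so the sketch runs anywhere; call running_xorg_count()
# on the real host instead.
sample = "bash\nXorg\npython3\nXorg\n"
print("X servers in sample:", count_xorg(sample))
```

A count above one on the real host would match the failure mode described here.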

SamNPowers commented 3 years ago

Ah okay, I do have more than one ai2thor-docker container running.

Just now I tried to run multiple experiments within the same docker container. The first seemed to start properly, but during the second, it failed with returncode 1, and the Player.log gives:

Mono path[0] = '/root/.ai2thor/releases/thor-Linux64-422ec26508981befae54e9e39ffb2add3beab58b/thor-Linux64-422ec26508981befae54e9e39ffb2add3beab58b_Data/Managed'
Mono config path = '/root/.ai2thor/releases/thor-Linux64-422ec26508981befae54e9e39ffb2add3beab58b/thor-Linux64-422ec26508981befae54e9e39ffb2add3beab58b_Data/MonoBleedingEdge/etc'
Display 0 'NVIDIA VGX  32"': 1024x768 (primary device).
Display 1 'NVIDIA VGX  32"': 1024x768 (secondary device).
Display 2 'NVIDIA VGX  32"': 1024x768 (secondary device).
Display 3 'NVIDIA VGX  32"': 1024x768 (secondary device).
Desktop is 1024 x 768 @ 170 Hz
Unable to find a supported OpenGL core profile
Failed to create valid graphics context: please ensure you meet the minimum requirements
E.g. OpenGL core profile 3.2 or later for OpenGL Core renderer
Vulkan detection: 0
No supported renderers found, exiting
(Filename:  Line: 618)

Is there a recommended way to run multiple experiments on the same machine? Different docker containers aren't working, and using the same one is having issues too. (A kind of confusing issue, though - why would a renderer not be found because of parallelization? Perhaps something else is happening.)

(Thanks for the quick response!)
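
The two Player.log excerpts so far fail with different signatures (a fatal signal in one, no supported renderer in the other). A minimal sketch for telling them apart automatically is shown here; the marker strings are copied from the logs in this thread, while `find_failures` is a hypothetical helper name and the list of markers is not exhaustive.

```python
# Signatures copied from the Player.log excerpts in this thread.
FAILURE_MARKERS = (
    "Caught fatal signal",
    "Unable to find a supported OpenGL core profile",
    "No supported renderers found",
)


def find_failures(log_text):
    """Return (line_number, marker, line) for each line with a known marker."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        for marker in FAILURE_MARKERS:
            if marker in line:
                hits.append((number, marker, line.strip()))
                break  # report each line once, under the first matching marker
    return hits


sample = """Desktop is 1024 x 768 @ 170 Hz
Unable to find a supported OpenGL core profile
Vulkan detection: 0
No supported renderers found, exiting
"""

for hit in find_failures(sample):
    print(hit)
```

For a real run, pass the contents of the Player.log file to `find_failures` instead of the sample string.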

ekolve commented 3 years ago

Are you running multiple ai2thor processes within a single docker container? That would be one way to achieve parallelization.

SamNPowers commented 3 years ago

I am, sorry for not being clear. That's when I got the error in the previous message.

ekolve commented 3 years ago

I think in the near term it may make the most sense to run directly on the host (outside of docker). The only thing you should need to do is start an Xorg server, which can be done with the script below:

import atexit
import os
import platform
import re
import shlex
import subprocess
import tempfile

def pci_records():
    records = []
    command = shlex.split("lspci -vmm")
    output = subprocess.check_output(command).decode()

    for devices in output.strip().split("\n\n"):
        record = {}
        records.append(record)
        for row in devices.split("\n"):
            key, value = row.split("\t")
            record[key.split(":")[0]] = value

    return records

def generate_xorg_conf(devices):
    xorg_conf = []

    device_section = """
Section "Device"
    Identifier     "Device{device_id}"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BusID          "{bus_id}"
EndSection
"""
    server_layout_section = """
Section "ServerLayout"
    Identifier     "Layout0"
    {screen_records}
EndSection
"""
    screen_section = """
Section "Screen"
    Identifier     "Screen{screen_id}"
    Device         "Device{device_id}"
    DefaultDepth    24
    Option         "AllowEmptyInitialConfiguration" "True"
    SubSection     "Display"
        Depth       24
        Virtual 1024 768
    EndSubSection
EndSection
"""
    screen_records = []
    for i, bus_id in enumerate(devices):
        xorg_conf.append(device_section.format(device_id=i, bus_id=bus_id))
        xorg_conf.append(screen_section.format(device_id=i, screen_id=i))
        screen_records.append(
            'Screen {screen_id} "Screen{screen_id}" 0 0'.format(screen_id=i)
        )

    xorg_conf.append(
        server_layout_section.format(screen_records="\n    ".join(screen_records))
    )

    output = "\n".join(xorg_conf)
    return output

def startx(display=0):
    if platform.system() != "Linux":
        raise Exception("Can only run startx on linux")

    devices = []
    for r in pci_records():
        if r.get("Vendor", "") == "NVIDIA Corporation" and r['Device'] != 'GK107GL [Quadro K420]' and r["Class"] in [
            "VGA compatible controller",
            "3D controller",
        ]:
            bus_id = "PCI:" + ":".join(
                map(lambda x: str(int(x, 16)), re.split(r"[:\.]", r["Slot"]))
            )
            devices.append(bus_id)

    if not devices:
        raise Exception("no nvidia cards found")

    fd, path = tempfile.mkstemp()
    try:
        with open(path, "w") as f:
            f.write(generate_xorg_conf(devices))
        command = shlex.split(
            "sudo Xorg -quiet -maxclients 1024 -noreset +extension GLX +extension RANDR +extension RENDER -config %s :%s"
            % (path, display)
        )
        proc = subprocess.Popen(command)
        atexit.register(lambda: proc.poll() is None and proc.kill())
        proc.wait()
    finally:
        os.close(fd)
        os.unlink(path)

startx()
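
The least obvious step in the script is the conversion of the `Slot` value reported by `lspci` (hexadecimal) into the decimal `BusID` that xorg.conf expects. The sketch below isolates that step, assuming the slot has the `bus:device.function` form without a PCI domain prefix; `slot_to_bus_id` is a hypothetical name for the expression used inside `startx`.

```python
import re


def slot_to_bus_id(slot):
    """Convert an lspci slot such as '00:1e.0' (hex) to an Xorg BusID (decimal)."""
    parts = re.split(r"[:\.]", slot)
    return "PCI:" + ":".join(str(int(part, 16)) for part in parts)


# 0x1e is 30 in decimal, so device 1e on bus 00 becomes PCI:0:30:0.
print(slot_to_bus_id("00:1e.0"))
```
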

SamNPowers commented 3 years ago

When I run on the host (no docker at all) and start the X server myself using your script, the first instance starts correctly. The second fails again, this time with returncode -6 and this Player.log:

Mono path[0] = '/home/ubuntu/.ai2thor/releases/thor-Linux64-422ec26508981befae54e9e39ffb2add3beab58b/thor-Linux64-422ec26508981befae54e9e39ffb2add3beab58b_Data/Managed'
Mono config path = '/home/ubuntu/.ai2thor/releases/thor-Linux64-422ec26508981befae54e9e39ffb2add3beab58b/thor-Linux64-422ec26508981befae54e9e39ffb2add3beab58b_Data/MonoBleedingEdge/etc'
Display 0 'NVIDIA VGX  32"': 1024x768 (primary device).
Display 1 'NVIDIA VGX  32"': 1024x768 (secondary device).
Display 2 'NVIDIA VGX  32"': 1024x768 (secondary device).
Display 3 'NVIDIA VGX  32"': 1024x768 (secondary device).
Desktop is 1024 x 768 @ 170 Hz
Unable to find a supported OpenGL core profile
Failed to create valid graphics context: please ensure you meet the minimum requirements
E.g. OpenGL core profile 3.2 or later for OpenGL Core renderer
[Vulkan init] extensions: count=15
[Vulkan init] extensions: name=VK_KHR_device_group_creation, enabled=0
[Vulkan init] extensions: name=VK_KHR_display, enabled=1
[Vulkan init] extensions: name=VK_KHR_external_fence_capabilities, enabled=0
[Vulkan init] extensions: name=VK_KHR_external_memory_capabilities, enabled=0
[Vulkan init] extensions: name=VK_KHR_external_semaphore_capabilities, enabled=0
[Vulkan init] extensions: name=VK_KHR_get_physical_device_properties2, enabled=0
[Vulkan init] extensions: name=VK_KHR_get_surface_capabilities2, enabled=0
[Vulkan init] extensions: name=VK_KHR_surface, enabled=1
[Vulkan init] extensions: name=VK_KHR_xcb_surface, enabled=0
[Vulkan init] extensions: name=VK_KHR_xlib_surface, enabled=1
[Vulkan init] extensions: name=VK_EXT_acquire_xlib_display, enabled=0
[Vulkan init] extensions: name=VK_EXT_debug_report, enabled=0
[Vulkan init] extensions: name=VK_EXT_debug_utils, enabled=0
[Vulkan init] extensions: name=VK_EXT_direct_mode_display, enabled=0
[Vulkan init] extensions: name=VK_EXT_display_surface_counter, enabled=0
Vulkan detection: 2
Initialize engine version: 2019.4.2f1 (20b4642a3455)
[Subsystems] Discovering subsystems at path /home/ubuntu/.ai2thor/releases/thor-Linux64-422ec26508981befae54e9e39ffb2add3beab58b/thor-Linux64-422ec26508981befae54e9e39ffb2add3beab58b_Data/UnitySubsystems
GfxDevice: creating device client; threaded=1
Unable to find a supported OpenGL core profile
Unable to find a supported OpenGL core profile
GfxDevice: creating device client; threaded=1
[Vulkan init] extensions: count=15
[Vulkan init] extensions: name=VK_KHR_device_group_creation, enabled=0
[Vulkan init] extensions: name=VK_KHR_display, enabled=1
[Vulkan init] extensions: name=VK_KHR_external_fence_capabilities, enabled=0
[Vulkan init] extensions: name=VK_KHR_external_memory_capabilities, enabled=0
[Vulkan init] extensions: name=VK_KHR_external_semaphore_capabilities, enabled=0
[Vulkan init] extensions: name=VK_KHR_get_physical_device_properties2, enabled=0
[Vulkan init] extensions: name=VK_KHR_get_surface_capabilities2, enabled=0
[Vulkan init] extensions: name=VK_KHR_surface, enabled=1
[Vulkan init] extensions: name=VK_KHR_xcb_surface, enabled=0
[Vulkan init] extensions: name=VK_KHR_xlib_surface, enabled=1
[Vulkan init] extensions: name=VK_EXT_acquire_xlib_display, enabled=0
[Vulkan init] extensions: name=VK_EXT_debug_report, enabled=0
[Vulkan init] extensions: name=VK_EXT_debug_utils, enabled=0
[Vulkan init] extensions: name=VK_EXT_direct_mode_display, enabled=0
[Vulkan init] extensions: name=VK_EXT_display_surface_counter, enabled=0
[Vulkan init] Graphics queue count=1
[Vulkan init] extensions: count=128
[Vulkan init] extensions: name=VK_KHR_16bit_storage, enabled=0
[Vulkan init] extensions: name=VK_KHR_8bit_storage, enabled=0
[Vulkan init] extensions: name=VK_KHR_bind_memory2, enabled=0
[Vulkan init] extensions: name=VK_KHR_buffer_device_address, enabled=0
[Vulkan init] extensions: name=VK_KHR_copy_commands2, enabled=0
[Vulkan init] extensions: name=VK_KHR_create_renderpass2, enabled=0
[Vulkan init] extensions: name=VK_KHR_dedicated_allocation, enabled=1
[Vulkan init] extensions: name=VK_KHR_deferred_host_operations, enabled=0
[Vulkan init] extensions: name=VK_KHR_depth_stencil_resolve, enabled=0
[Vulkan init] extensions: name=VK_KHR_descriptor_update_template, enabled=1
[Vulkan init] extensions: name=VK_KHR_device_group, enabled=0
[Vulkan init] extensions: name=VK_KHR_draw_indirect_count, enabled=0
[Vulkan init] extensions: name=VK_KHR_driver_properties, enabled=0
[Vulkan init] extensions: name=VK_KHR_external_fence, enabled=0
[Vulkan init] extensions: name=VK_KHR_external_fence_fd, enabled=0
[Vulkan init] extensions: name=VK_KHR_external_memory, enabled=0
[Vulkan init] extensions: name=VK_KHR_external_memory_fd, enabled=0
[Vulkan init] extensions: name=VK_KHR_external_semaphore, enabled=0
[Vulkan init] extensions: name=VK_KHR_external_semaphore_fd, enabled=0
[Vulkan init] extensions: name=VK_KHR_fragment_shading_rate, enabled=0
[Vulkan init] extensions: name=VK_KHR_get_memory_requirements2, enabled=1
[Vulkan init] extensions: name=VK_KHR_image_format_list, enabled=1
[Vulkan init] extensions: name=VK_KHR_imageless_framebuffer, enabled=0
[Vulkan init] extensions: name=VK_KHR_maintenance1, enabled=1
[Vulkan init] extensions: name=VK_KHR_maintenance2, enabled=1
[Vulkan init] extensions: name=VK_KHR_maintenance3, enabled=0
[Vulkan init] extensions: name=VK_KHR_multiview, enabled=1
[Vulkan init] extensions: name=VK_KHR_pipeline_executable_properties, enabled=0
[Vulkan init] extensions: name=VK_KHR_pipeline_library, enabled=0
[Vulkan init] extensions: name=VK_KHR_push_descriptor, enabled=0
[Vulkan init] extensions: name=VK_KHR_relaxed_block_layout, enabled=0
[Vulkan init] extensions: name=VK_KHR_sampler_mirror_clamp_to_edge, enabled=1
[Vulkan init] extensions: name=VK_KHR_sampler_ycbcr_conversion, enabled=0
[Vulkan init] extensions: name=VK_KHR_separate_depth_stencil_layouts, enabled=0
[Vulkan init] extensions: name=VK_KHR_shader_atomic_int64, enabled=0
[Vulkan init] extensions: name=VK_KHR_shader_clock, enabled=0
[Vulkan init] extensions: name=VK_KHR_shader_draw_parameters, enabled=0
[Vulkan init] extensions: name=VK_KHR_shader_float16_int8, enabled=0
[Vulkan init] extensions: name=VK_KHR_shader_float_controls, enabled=0
[Vulkan init] extensions: name=VK_KHR_shader_non_semantic_info, enabled=0
[Vulkan init] extensions: name=VK_KHR_shader_subgroup_extended_types, enabled=0
[Vulkan init] extensions: name=VK_KHR_shader_terminate_invocation, enabled=0
[Vulkan init] extensions: name=VK_KHR_spirv_1_4, enabled=0
[Vulkan init] extensions: name=VK_KHR_storage_buffer_storage_class, enabled=0
[Vulkan init] extensions: name=VK_KHR_swapchain, enabled=1
[Vulkan init] extensions: name=VK_KHR_swapchain_mutable_format, enabled=0
[Vulkan init] extensions: name=VK_KHR_timeline_semaphore, enabled=0
[Vulkan init] extensions: name=VK_KHR_uniform_buffer_standard_layout, enabled=0
[Vulkan init] extensions: name=VK_KHR_variable_pointers, enabled=0
[Vulkan init] extensions: name=VK_KHR_vulkan_memory_model, enabled=0
[Vulkan init] extensions: name=VK_EXT_4444_formats, enabled=0
[Vulkan init] extensions: name=VK_EXT_blend_operation_advanced, enabled=0
[Vulkan init] extensions: name=VK_EXT_buffer_device_address, enabled=0
[Vulkan init] extensions: name=VK_EXT_calibrated_timestamps, enabled=0
[Vulkan init] extensions: name=VK_EXT_conditional_rendering, enabled=0
[Vulkan init] extensions: name=VK_EXT_conservative_rasterization, enabled=0
[Vulkan init] extensions: name=VK_EXT_custom_border_color, enabled=0
[Vulkan init] extensions: name=VK_EXT_depth_clip_enable, enabled=0
[Vulkan init] extensions: name=VK_EXT_depth_range_unrestricted, enabled=0
[Vulkan init] extensions: name=VK_EXT_descriptor_indexing, enabled=0
[Vulkan init] extensions: name=VK_EXT_discard_rectangles, enabled=0
[Vulkan init] extensions: name=VK_EXT_display_control, enabled=0
[Vulkan init] extensions: name=VK_EXT_extended_dynamic_state, enabled=0
[Vulkan init] extensions: name=VK_EXT_external_memory_host, enabled=0
[Vulkan init] extensions: name=VK_EXT_fragment_shader_interlock, enabled=0
[Vulkan init] extensions: name=VK_EXT_global_priority, enabled=0
[Vulkan init] extensions: name=VK_EXT_host_query_reset, enabled=0
[Vulkan init] extensions: name=VK_EXT_image_robustness, enabled=0
[Vulkan init] extensions: name=VK_EXT_index_type_uint8, enabled=0
[Vulkan init] extensions: name=VK_EXT_inline_uniform_block, enabled=0
[Vulkan init] extensions: name=VK_EXT_line_rasterization, enabled=0
[Vulkan init] extensions: name=VK_EXT_memory_budget, enabled=0
[Vulkan init] extensions: name=VK_EXT_pci_bus_info, enabled=0
[Vulkan init] extensions: name=VK_EXT_pipeline_creation_cache_control, enabled=0
[Vulkan init] extensions: name=VK_EXT_pipeline_creation_feedback, enabled=0
[Vulkan init] extensions: name=VK_EXT_post_depth_coverage, enabled=0
[Vulkan init] extensions: name=VK_EXT_private_data, enabled=0
[Vulkan init] extensions: name=VK_EXT_robustness2, enabled=0
[Vulkan init] extensions: name=VK_EXT_sample_locations, enabled=0
[Vulkan init] extensions: name=VK_EXT_sampler_filter_minmax, enabled=0
[Vulkan init] extensions: name=VK_EXT_scalar_block_layout, enabled=0
[Vulkan init] extensions: name=VK_EXT_separate_stencil_usage, enabled=0
[Vulkan init] extensions: name=VK_EXT_shader_atomic_float, enabled=0
[Vulkan init] extensions: name=VK_EXT_shader_demote_to_helper_invocation, enabled=0
[Vulkan init] extensions: name=VK_EXT_shader_image_atomic_int64, enabled=0
[Vulkan init] extensions: name=VK_EXT_shader_subgroup_ballot, enabled=0
[Vulkan init] extensions: name=VK_EXT_shader_subgroup_vote, enabled=0
[Vulkan init] extensions: name=VK_EXT_shader_viewport_index_layer, enabled=0
[Vulkan init] extensions: name=VK_EXT_subgroup_size_control, enabled=0
[Vulkan init] extensions: name=VK_EXT_texel_buffer_alignment, enabled=0
[Vulkan init] extensions: name=VK_EXT_tooling_info, enabled=0
[Vulkan init] extensions: name=VK_EXT_transform_feedback, enabled=0
[Vulkan init] extensions: name=VK_EXT_vertex_attribute_divisor, enabled=0
[Vulkan init] extensions: name=VK_EXT_ycbcr_image_arrays, enabled=0
[Vulkan init] extensions: name=VK_NV_clip_space_w_scaling, enabled=0
[Vulkan init] extensions: name=VK_NV_compute_shader_derivatives, enabled=0
[Vulkan init] extensions: name=VK_NV_cooperative_matrix, enabled=0
[Vulkan init] extensions: name=VK_NV_corner_sampled_image, enabled=0
[Vulkan init] extensions: name=VK_NV_coverage_reduction_mode, enabled=0
[Vulkan init] extensions: name=VK_NV_cuda_kernel_launch, enabled=0
[Vulkan init] extensions: name=VK_NV_dedicated_allocation, enabled=0
[Vulkan init] extensions: name=VK_NV_dedicated_allocation_image_aliasing, enabled=0
[Vulkan init] extensions: name=VK_NV_device_diagnostic_checkpoints, enabled=0
[Vulkan init] extensions: name=VK_NV_device_diagnostics_config, enabled=0
[Vulkan init] extensions: name=VK_NV_device_generated_commands, enabled=0
[Vulkan init] extensions: name=VK_NV_fill_rectangle, enabled=0
[Vulkan init] extensions: name=VK_NV_fragment_coverage_to_color, enabled=0
[Vulkan init] extensions: name=VK_NV_fragment_shader_barycentric, enabled=0
[Vulkan init] extensions: name=VK_NV_fragment_shading_rate_enums, enabled=0
[Vulkan init] extensions: name=VK_NV_framebuffer_mixed_samples, enabled=0
[Vulkan init] extensions: name=VK_NV_geometry_shader_passthrough, enabled=0
[Vulkan init] extensions: name=VK_NV_mesh_shader, enabled=0
[Vulkan init] extensions: name=VK_NV_ray_tracing, enabled=0
[Vulkan init] extensions: name=VK_NV_representative_fragment_test, enabled=0
[Vulkan init] extensions: name=VK_NV_sample_mask_override_coverage, enabled=0
[Vulkan init] extensions: name=VK_NV_scissor_exclusive, enabled=0
[Vulkan init] extensions: name=VK_NV_shader_image_footprint, enabled=0
[Vulkan init] extensions: name=VK_NV_shader_sm_builtins, enabled=0
[Vulkan init] extensions: name=VK_NV_shader_subgroup_partitioned, enabled=0
[Vulkan init] extensions: name=VK_NV_shading_rate_image, enabled=0
[Vulkan init] extensions: name=VK_NV_viewport_array2, enabled=0
[Vulkan init] extensions: name=VK_NV_viewport_swizzle, enabled=0
[Vulkan init] extensions: name=VK_NVX_binary_import, enabled=0
[Vulkan init] extensions: name=VK_NVX_image_view_handle, enabled=0
[Vulkan init] extensions: name=VK_NVX_multiview_per_view_attributes, enabled=0
[Vulkan init] extensions: name=VK_KHR_acceleration_structure, enabled=0
[Vulkan init] extensions: name=VK_KHR_ray_query, enabled=0
[Vulkan init] extensions: name=VK_KHR_ray_tracing_pipeline, enabled=0
Caught fatal signal - signo:11 code:1 errno:0 addr:(nil)
Obtained 14 stack frames.
#0  0x007f11b850e980 in funlockfile
#1  0x007f11b00410cb in vkGetDeviceProcAddr
#2  0x007f11b9416b55 in vulkan::LoadVulkanLibraryPhase3(VkInstance_T*, VkDevice_T*)
#3  0x007f11b93d29e0 in vk::Initialize()
#4  0x007f11b93d99f9 in CreateVKGfxDevice()
#5  0x007f11b97dc71f in CreateRealGfxDevice(GfxDeviceRenderer)
#6  0x007f11b934c47e in CreateClientGfxDevice(GfxDeviceRenderer, GfxCreateDeviceFlags)
#7  0x007f11b97dd017 in CreateGfxDevice(GfxDeviceRenderer, GfxCreateDeviceFlags)
#8  0x007f11b97dd3a1 in InitializeGfxDevice()
#9  0x007f11b96c7666 in InitializeEngineGraphics(bool)
#10 0x007f11b96d7670 in PlayerInitEngineGraphics(bool)
#11 0x007f11b986dbb7 in PlayerMain(int, char**)
#12 0x007f11b812cbf7 in __libc_start_main
#13 0x00000000400569 in _start

Thanks again for the help on this.
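
The return code -6 reported above can be read with the convention Python's `subprocess` module uses on POSIX: a negative value means the child was terminated by that signal number. The following is a minimal sketch under the assumption that the reported return code comes from `subprocess`; `describe_returncode` is a hypothetical helper.

```python
import signal


def describe_returncode(returncode):
    """Translate a subprocess-style return code into a readable description."""
    if returncode < 0:
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number not known on this platform.
            return f"killed by unknown signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


for code in (1, -6, -11):
    print(code, "->", describe_returncode(code))
```

On Linux, signal 6 is SIGABRT and signal 11 is SIGSEGV, which matches the `signo:11` line in the crash logs.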

ekolve commented 3 years ago

Could you provide the following info?

We run on P2 and P3 instances using the Nvidia driver that is included with CUDA 11.2 (460.32.03) on Ubuntu 18.04 and don't encounter any segfaults.

SamNPowers commented 3 years ago

Current AWS: g4dn.8xlarge (I also tried on a different one, not sure of its type)
OS: Ubuntu 18.04
Driver: Originally I was using 450, but I upgraded to 460 and I'm still seeing issues.
AI2Thor version: 2.7.4

PyTorch does not currently support CUDA 11.2 so I can't try it; instead I'm using 11.1. (I had been using 11.0)

I cleared everything out and reinstalled everything (after the driver update), and now I'm consistently seeing these issues:

  1. Intermittently (but reproducibly), instantiating a new Controller hangs. The Player.log at that point says "No supported renderers found, exiting" (even though a supported renderer is found when the system is not in whatever bad state this is). Exactly when it happens is not consistent. Increasing MaxClients in the xorg.conf may have increased how long it takes to fail, but I'm not sure. The number of Thor processes running is nowhere near the max (at my current failure point, there are 62 Thor instances running, and MaxClients is 512). This one seems to be related to suspending an existing Thor process and then starting up a new one: the new one hangs. See the repro below.
  2. Sometimes when I try to reset an existing environment I'm seeing: "Unity process has exited - check Player.log for errors. Last action message: b'{"action": "Reset", "sceneName": "FloorPlan21_physics", "sequenceId": 0}', returncode=-11". I run this reset a lot, but it only fails sometimes. In this case the Player.log does not seem to show any errors. (I can paste the full log if desired.)
  3. Rarely, I see a segmentation fault. I think this may have been happening when old Thor instances were not being killed properly, and over time the GPU would run out of memory and segfault. I haven't been seeing this one very often, and a reboot seems to fix it.
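
For item 3, one way to reduce the chance of leftover Thor instances is to guarantee that `stop()` is called even when the surrounding code raises. This is a minimal sketch, assuming the controller object exposes a `stop()` method (ai2thor's `Controller` does); `stopping` and `FakeController` are hypothetical names, and the fake class only exists so the sketch runs without ai2thor installed.

```python
from contextlib import contextmanager


@contextmanager
def stopping(controller):
    """Yield the controller and always call its stop() on exit."""
    try:
        yield controller
    finally:
        controller.stop()


class FakeController:
    """Hypothetical stand-in for ai2thor.controller.Controller."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


fake = FakeController()
try:
    with stopping(fake):
        raise RuntimeError("simulated crash in the experiment loop")
except RuntimeError:
    pass

print("stopped after crash:", fake.stopped)
```

This does not help when the Python process itself is killed with SIGKILL, since no cleanup code runs in that case.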

I'm not using docker at all (all of this is on the host), using the x server script you provided.

Here is a repro for 2. It works with fewer than 128 processes as well; more just makes it happen faster/more consistently.

from ai2thor.controller import Controller
from multiprocessing import Process
import time

def create_controller(controller_id):
    scene_name = "FloorPlan21"
    print(f"Creating controller {controller_id}")
    controller = Controller(scene=scene_name, gridSize=0.25, width=84, height=84)
    print(f"Controller {controller_id} created")
    while True:
        controller.step(action="MoveAhead", moveMagnitude=0)
        time.sleep(0.1)
        controller.reset(scene_name)

processes = []
for i in range(128):
    proc = Process(target=create_controller, args=(i,))
    processes.append(proc)
    proc.start()

for proc in processes:
    proc.join()

Here is a repro for 1:

from ai2thor.controller import Controller
from torch.multiprocessing import Pool
from multiprocessing import Process
import time
import psutil

def create_controller(controller_id):
    scene_name = "FloorPlan21"
    print(f"Creating controller {controller_id}")
    controller = Controller(scene=scene_name, gridSize=0.25, width=84, height=84)
    print(f"Controller {controller_id} created")
    while True:
        controller.step(action="MoveAhead", moveMagnitude=0)
        time.sleep(0.1)
        controller.reset(scene_name)

processes = []
for i in range(64):
    proc = Process(target=create_controller, args=(i,))
    processes.append(proc)
    proc.start()

time.sleep(30)  # Let the controllers start

for proc_id, proc in enumerate(processes):
    psutil.Process(proc.pid).suspend()
    print(f"Actor {proc_id} suspended")

time.sleep(5)  # Just to give a moment to see that everything suspended

# In here is where controller instantiation hangs.
async_objs = []
with Pool(processes=32) as pool:
    for controller_id in range(32):
        async_obj = pool.apply_async(create_controller, (controller_id,))
        async_objs.append(async_obj)

    for obj in async_objs:
        _ = obj.get()

Note that the repro for 1 starts with the repro for 2, so sometimes you'll see both in the same run.
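
Because the repro for 1 blocks forever in `obj.get()` when a controller hangs, it can be easier to work with if the hang is detected instead. The sketch below uses the `timeout` argument of `AsyncResult.get`, which raises `multiprocessing.TimeoutError` when the deadline passes. It uses a `ThreadPool` and a sleeping task purely as stand-ins so it runs without ai2thor; `collect_with_timeout` and `_task` are hypothetical names.

```python
import time
from multiprocessing import TimeoutError
from multiprocessing.pool import ThreadPool


def collect_with_timeout(async_results, timeout_s):
    """Return (results, hung): finished values by index, and indices that timed out.

    The timeout applies to each result in turn, so the total wait can be up to
    len(async_results) * timeout_s.
    """
    results, hung = {}, []
    for index, async_result in enumerate(async_results):
        try:
            results[index] = async_result.get(timeout=timeout_s)
        except TimeoutError:
            hung.append(index)
    return results, hung


def _task(delay):
    # Stand-in for controller creation; a long delay imitates a hang.
    time.sleep(delay)
    return delay


with ThreadPool(processes=2) as pool:
    pending = [pool.apply_async(_task, (delay,)) for delay in (0.0, 0.6)]
    results, hung = collect_with_timeout(pending, timeout_s=0.2)

print("finished:", results)
print("timed out:", hung)
```

The same `get(timeout=...)` call is available on the results returned by the `Pool.apply_async` call in the repro.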

SamNPowers commented 3 years ago

A sleep of 1.0 second meant my code didn't immediately hang, but it did after about 900k frames. So far a sleep of 3.0 seconds hasn't hung though, so...so far so good on that front. Edit: spoke too soon. Hung not long after I posted.

Glad to hear you tracked it down in Unity.

ekolve commented 3 years ago

I was able to reproduce the segfault by using the default project that Unity provides with version 2020.3 and submitted the issue to Unity's issue tracker. Today they acknowledged the issue and confirmed that they are able to reproduce the error on their side:

https://issuetracker.unity3d.com/issues/running-too-many-instances-of-the-standalone-player-causes-new-ones-to-crash