drwhut / tabletop-club

An open-source platform for playing tabletop games in a physics-based 3D environment for Windows, macOS, and Linux! Made with the Godot Engine.
https://tabletopclub.net
MIT License

Rare signal disconnect causes crash loop, creates huge log files. #131

Open Saiv46 opened 1 year ago

Saiv46 commented 1 year ago

Describe the bug The game keeps every log file, and sometimes produces a huge one (in the case of an error).

To Reproduce Steps to reproduce the behavior:

  1. Open game
  2. Leave it open for a long time
  3. ???
  4. Low Disk Space on "filesystem root"

Expected behavior The game should rotate its logs, or at least crash if something goes wrong.

Environment

System:
  Kernel: 5.15.0-53-generic x86_64 bits: 64 compiler: gcc v: 11.2.0 Desktop: Cinnamon 5.4.12
    tk: GTK 3.24.33 wm: Mutter dm: LightDM Distro: Linux Mint 21 Vanessa base: Ubuntu 22.04 jammy
Machine:
  Type: Desktop Mobo: Gigabyte model: H410M S2H v: x.x serial: <superuser required>
    UEFI: American Megatrends v: F3 date: 08/28/2020
CPU:
  Info: quad core model: Intel Core i3-10100 bits: 64 type: MT MCP arch: Comet Lake rev: 3 cache:
    L1: 256 KiB L2: 1024 KiB L3: 6 MiB
  Speed (MHz): avg: 819 high: 954 min/max: 800/4300 cores: 1: 800 2: 800 3: 800 4: 800 5: 800
    6: 800 7: 800 8: 954 bogomips: 57600
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Graphics:
  Device-1: Intel CometLake-S GT2 [UHD Graphics 630] vendor: Gigabyte driver: i915 v: kernel
    ports: active: DP-1 empty: HDMI-A-1,HDMI-A-2,HDMI-A-3 bus-ID: 00:02.0 chip-ID: 8086:9bc8
  Display: x11 server: X.Org v: 1.21.1.3 driver: X: loaded: modesetting unloaded: fbdev,vesa
    gpu: i915 display-ID: :0 screens: 1
  Screen-1: 0 s-res: 1366x768 s-dpi: 96
  Monitor-1: DP-1 model: Samsung S19C150 res: 1366x768 dpi: 85 diag: 470mm (18.5")
  OpenGL: renderer: Mesa Intel UHD Graphics 630 (CML GT2) v: 4.6 Mesa 22.0.5 direct render: Yes
Audio:
  Device-1: Intel Comet Lake PCH-V cAVS vendor: Gigabyte driver: snd_hda_intel v: kernel
    bus-ID: 00:1f.3 chip-ID: 8086:a3f0
  Sound Server-1: ALSA v: k5.15.0-53-generic running: yes
  Sound Server-2: PulseAudio v: 15.99.1 running: yes
  Sound Server-3: PipeWire v: 0.3.48 running: yes
Network:
  Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet vendor: Gigabyte driver: r8169
    v: kernel pcie: speed: 2.5 GT/s lanes: 1 port: 3000 bus-ID: 02:00.0 chip-ID: 10ec:8168
  IF: enp2s0 state: up speed: 100 Mbps duplex: full mac: <filter>
  IF-ID-1: lxcbr0 state: down mac: <filter>
  IF-ID-2: tun0 state: unknown speed: 10 Mbps duplex: full mac: N/A
Partition:
  ID-1: / size: 233.07 GiB used: 125.13 GiB (53.7%) fs: ext4 dev: /dev/dm-0 mapped: recrypt
  ID-2: /boot size: 456 MiB used: 138.7 MiB (30.4%) fs: ext4 dev: /dev/nvme0n1p3
  ID-3: /boot/efi size: 119.1 MiB used: 5.2 MiB (4.4%) fs: vfat dev: /dev/nvme0n1p1
Swap:
  Alert: No swap data was found.
Sensors:
  System Temperatures: cpu: 16.8 C mobo: 16.8 C
  Fan Speeds (RPM): N/A
Info:
  Processes: 310 Uptime: 2h 39m Memory: 15.37 GiB used: 4.41 GiB (28.7%) Init: systemd v: 249
  runlevel: 5 Compilers: gcc: 11.3.0 alt: 11/9 clang: 14.0.0-1ubuntu1 Client: Cinnamon v: 5.4.12
  inxi: 3.3.13

Version v0.1.0 Beta 1

Additional context The game can freeze when its window is being captured with OBS Studio (latest).

drwhut commented 1 year ago

Sounds like something (probably an error) is being spammed in the logs - are you able to somehow post a snippet of the logs, or get it to occur again?

Saiv46 commented 1 year ago

> Sounds like something (probably an error) is being spammed in the logs - are you able to somehow post a snippet of the logs, or get it to occur again?

Some time later I started the game, then launched and focused OBS Studio.

Start time: 17:43:39
End time: 18:02:30 (+18:51)
File size: 43 GB

Message that was spammed:

ERROR: Disconnecting nonexistent signal 'changed', slot: 6329:_mesh_changed.
   at: _disconnect (core/object.cpp:1538) - Condition "!s->slot_map.has(target)" is true.

drwhut commented 1 year ago

Hmm... my suspicion is that this is a threading issue. godotengine/godot#60098 seems to be the likely culprit - godotengine/godot#58715 and godotengine/godot#34752 may also be related.

As for the logs being spammed, this has already been reported in godotengine/godot#45181, but we may need to implement a fix ourselves for the time being.

drwhut commented 1 year ago

I've been trying to reproduce the issue on the master branch (after v0.1.0-beta3), but to no avail. I've been leaving it on the main menu for upwards of 30 minutes, increasing and decreasing the number of pieces that fall in the background, trying debug vs release builds, and focusing on OBS - it just doesn't want to crash. I think I need help from others to figure out if there is a way to get this issue to consistently happen.

However, I've also taken a deep dive into the Godot source code to see where the problem may lie. The error originates from an attempt to disconnect the Mesh.changed signal from what I'm guessing is a MeshInstance; that failure causes an infinite loop, which in turn spams the logs. Here's the beginning of Object._disconnect(), which contains the line that is erroring:

void Object::_disconnect(const StringName &p_signal, Object *p_to_object, const StringName &p_to_method, bool p_force) {
    ERR_FAIL_NULL(p_to_object);
    Signal *s = signal_map.getptr(p_signal);
    if (!s) {
        bool signal_is_valid = ClassDB::has_signal(get_class_name(), p_signal) ||
                (!script.is_null() && Ref<Script>(script)->has_script_signal(p_signal));
        ERR_FAIL_COND_MSG(signal_is_valid,
                vformat("Attempt to disconnect a nonexistent connection to signal '%s' in %s, with target '%s' in %s.",
                        p_signal, to_string(), p_to_method, p_to_object->to_string()));
    }
    ERR_FAIL_COND_MSG(!s, vformat("Disconnecting nonexistent signal '%s' in %s.", p_signal, to_string()));

    Signal::Target target(p_to_object->get_instance_id(), p_to_method);

    // The line that is erroring:
    ERR_FAIL_COND_MSG(!s->slot_map.has(target), "Disconnecting nonexistent signal '" + p_signal + "', slot: " + itos(target._id) + ":" + target.method + ".");

From what I can see, there are two possible places that this signal is being disconnected:

  1. When setting the mesh property in MeshInstance:

    void MeshInstance::set_mesh(const Ref<Mesh> &p_mesh) {
        if (mesh == p_mesh) {
            return;
        }

        if (mesh.is_valid()) {
            mesh->disconnect(CoreStringNames::get_singleton()->changed, this, SceneStringNames::get_singleton()->_mesh_changed);
        }
  2. During the Object::~Object() destructor:

    Object::~Object() {
        // ...

        if (_emitting) {
            //@todo this may need to actually reach the debugger prioritarily somehow because it may crash before
            ERR_PRINT("Object " + to_string() + " was freed or unreferenced while a signal is being emitted from it. Try connecting to the signal using 'CONNECT_DEFERRED' flag, or use queue_free() to free the object (if this object is a Node) to avoid this error and potential crashes.");
        }

        while ((S = signal_map.next(nullptr))) {
            Signal *s = &signal_map[*S];

            //brute force disconnect for performance
            int slot_count = s->slot_map.size();
            const VMap<Signal::Target, Signal::Slot>::Pair *slot_list = s->slot_map.get_array();

            for (int i = 0; i < slot_count; i++) {
                slot_list[i].value.conn.target->connections.erase(slot_list[i].value.cE);
            }

            signal_map.erase(*S);
        }

        //signals from nodes that connect to this node
        while (connections.size()) {
            Connection c = connections.front()->get();
            c.source->_disconnect(c.signal, c.target, c.method, true);
        }

I think it's the second case that is causing the error: since it's a destructor, the engine may be repeatedly trying to destroy the object and failing because of the error in the disconnect method. In addition, the game already has a bit of code in ResourceManager that I think prevents the first case from ever becoming a problem:

func free_object(object: Object) -> void:
    # Lock the resource mutex here, just in case a resource used by the object
    # is being created as it is being freed here.
    _res_mutex.lock()
    if object is Node:
        # This is a workaround for a crash that sometimes happens when a
        # MeshInstance tries to disconnect its mesh's _mesh_changed signal.
        var node_stack = [object]
        while not node_stack.empty():
            var node: Node = node_stack.pop_back()
            for child in node.get_children():
                node_stack.push_back(child)

            if node is MeshInstance:
                node.mesh = null

        object.queue_free()
    else:
        object.free()
    _res_mutex.unlock()

Whenever something like a piece is freed from memory, the game first goes through all of the node's children looking for MeshInstances - for each one it finds, it sets the mesh being displayed to null before queueing the node to be freed, all while the resource mutex is locked (so resources are guaranteed not to be loading at the same time). Since both the Mesh and the MeshInstance are still alive at that point, the connection should still exist when set_mesh is called.

So my current working theory is that some threading issue or race condition is invalidating the signal map before Object::_disconnect is called from Object::~Object; and since it's the destructor that receives the error, it ends up in an infinite loop.

If we keep struggling to replicate the issue, one thing we could try is to detect when this error happens in the destructor and make sure the destructor completes anyway, so the error is not spammed. Either that, or we wait for the fix upstream (which may take a while, given that the issue mentioned above has a milestone for 4.x). Ideally the issue would be easily reproducible so we could find the root cause, but since it occurs so rarely, it may be worth just preventing the log spam for the time being - in my opinion, a log file growing until it fills the filesystem is a much bigger problem than the game freezing.