CleverRaven / Cataclysm-DDA

Cataclysm - Dark Days Ahead. A turn-based survival game set in a post-apocalyptic world.
http://cataclysmdda.org

Repeated CI failure in different parts of the code base #58459

Closed Willenbrink closed 1 year ago

Willenbrink commented 2 years ago

Describe the bug

I'm currently running into repeated failures on the CI which I can't reproduce and which occur in different parts of the code. I can't exclude that the failures are caused by changes in my PRs but I don't see any obvious cause as I'm not working with low-level memory management. As it happens in two different PRs I wanted to collect information here on the failures and see if others also run into this issue.

Steps to reproduce

  1. Submit a PR
  2. It fails due to backtraces / segfaults.
  3. Push again
  4. It fails again at a different place

Expected behavior

  1. Submit a PR
  2. It runs through without failure

If the failures were reproducible locally by using the same seed it would also be fine.

Screenshots

No response

Versions and configuration

See the attached links. It seems to happen with multiple different CI actions, at least:

Additional context

The PRs that fail repeatedly: #56705 and #58414

Some of the failed runs and the failures:

| Last non-debug function in backtrace | Run |
| --- | --- |
| MapgenRemovePartHandler::add_item_or_charges | https://github.com/CleverRaven/Cataclysm-DDA/runs/6916312836 |
| map::furn_set | https://github.com/CleverRaven/Cataclysm-DDA/runs/6889292009 |
| item::is_corpse() | https://github.com/CleverRaven/Cataclysm-DDA/runs/6885041134 |
| MapgenRemovePartHandler::add_item_or_charges | https://github.com/CleverRaven/Cataclysm-DDA/runs/6887371874 |
| Segfault without backtrace | https://github.com/CleverRaven/Cataclysm-DDA/runs/6887943506 |
| Failing overmap test | https://github.com/CleverRaven/Cataclysm-DDA/runs/6917109222 |
| Failing assertion related to items and vehicle_parts | https://github.com/CleverRaven/Cataclysm-DDA/runs/7196578084 |

descan commented 2 years ago

I have a VERY simple PR going, #58431, that changes one line in each of two files, neither of which is code, just player-facing UI. It's also failing on GCC 9, Curses, LTO. Unless these lines are being used somewhere else, I don't think they should be causing issues like that.

It might be the simplest PR you can get that's also causing the failure, which might be useful for finding out what the issue is.

Willenbrink commented 2 years ago

Cool, for some reason I can't edit my issue anymore as Github complains about changes to the text during editing. Anyway, thanks for the info. That will hopefully be useful. It seems that all failures are related to maps but share no other similarities. Perhaps this is some sort of map corruption.

I'm unfortunately a bit busy for the next few days but if someone wants to take a look, I would start with the test failure below as that might be easier to reproduce than the segfaults.

| Last function | Run |
| --- | --- |
| map::add_vehicle | https://github.com/CleverRaven/Cataclysm-DDA/runs/6890574747 |
| MapgenRemovePartHandler::add_item_or_charges | https://github.com/CleverRaven/Cataclysm-DDA/runs/6929376451 |
| Test failure | https://github.com/CleverRaven/Cataclysm-DDA/runs/6917109222 |

Willenbrink commented 2 years ago

Looks like the test failure might be unrelated; it is being fixed in #58442.

kevingranade commented 2 years ago

I don't think this is the same thing. #58442 addresses a recurring consistency check failure in structure creation; the issue you report here is less routine.

BrettDong commented 2 years ago

It might be worth mentioning here that I am also experiencing segmentation faults in explosion tests in the binary compiled with GCC 11.2 with LTO enabled.

Willenbrink commented 2 years ago

@BrettDong Can you provide the exact commands that lead to the segfault? I just tried to reproduce it but couldn't with GCC 12.1.1 and TILES=1 SOUND=1 LTO=1. I've got a new CPU now and am ready for some recompiling to pin down the issue. I don't think it's related to the specific version of GCC as it also occurs with Clang.

BrettDong commented 2 years ago

We also see some random crashes in the GCC LTO CI tests on GitHub Actions recently, see discussions in #59148.

BrettDong commented 2 years ago

> @BrettDong Can you provide the exact commands that lead to the segfault? I just tried to reproduce it but couldn't with GCC 12.1.1 and TILES=1 SOUND=1 LTO=1. I've got a new CPU now and am ready for some recompiling to pin down the issue. I don't think it's related to the specific version of GCC as it also occurs with Clang.

It is not a deterministic crash. I get roughly one crash in every 20-50 runs.
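For a crash this rare, a small rerun loop can save a lot of manual retries. The following is only a minimal sketch: the helper name `run_until_failure` is hypothetical, and the test command shown in the comment is an assumed invocation, not the exact one used in this thread.

```python
import subprocess
import sys


def run_until_failure(command, max_runs=100):
    """Run `command` repeatedly.

    Returns (attempt, returncode) for the first run that exits non-zero,
    or None if all `max_runs` runs succeed.
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # On POSIX, a process killed by a signal reports a negative
        # return code (for example -11 for SIGSEGV).
        if result.returncode != 0:
            return attempt, result.returncode
    return None


if __name__ == "__main__":
    # Assumed usage (binary path and test tag are placeholders):
    #   python rerun.py tests/cata_test "[explosion]"
    # With no arguments, a trivial always-passing command is used.
    command = sys.argv[1:] or [sys.executable, "-c", "pass"]
    outcome = run_until_failure(command, max_runs=50)
    if outcome is None:
        print("no failure observed")
    else:
        print("failed on attempt %d with return code %d" % outcome)
```

Output is discarded here to keep the loop quiet; to capture the backtrace of the failing run, the `stdout`/`stderr` arguments could be pointed at a per-attempt log file instead.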

Willenbrink commented 2 years ago

Hmm, okay. That's quite rare. I will try it a few more times.
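As a rough guide to how many "more times" are needed: assuming each run fails independently with a fixed probability (an assumption, since the real cause is unknown), the number of runs required to see at least one crash with a given confidence can be estimated as follows. The function name `runs_needed` is hypothetical.

```python
import math


def runs_needed(failure_rate, confidence=0.95):
    """Smallest n such that 1 - (1 - failure_rate)**n >= confidence.

    Assumes runs are independent and fail with the same probability.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    for rate in (1 / 20, 1 / 50):
        print("failure rate %.3f: %d runs" % (rate, runs_needed(rate)))
```

Under these assumptions, a crash rate of one in 20 needs about 59 runs, and one in 50 needs about 149 runs, before a clean streak is reasonably convincing at the 95% level.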

jbytheway commented 2 years ago

In this comment @BrettDong points out one failure that had the message

free(): invalid next size (fast)

which indicates heap corruption. If there is a heap corruption bug, that would explain why we see random nondeterministic failures in various places, because heap corruption can manifest in many bizarre ways.

Usually, the best way to investigate heap corruption is to compile with AddressSanitizer (ASan). We do have ASan builds in CI. Have any of the above issues been on ASan builds?

Willenbrink commented 2 years ago

Yes, this one, this two, this three. Unless I'm misunderstanding you? I've considered approaching this with the rr-debugger but haven't looked into ASan yet.

jbytheway commented 2 years ago

> Yes, this one, this two, this three.

I believe all three of those are examples of the bug I fixed in #59141, so they are unrelated to the crashing bugs seen in some of these other examples. I was hoping for something that had segfaulted under ASan.

BrettDong commented 2 years ago

> Yes, this one, this two, this three. Unless I'm misunderstanding you? I've considered approaching this with the rr-debugger but haven't looked into ASan yet.

These are vehicle placement bugs, not the heap corruption errors we are discussing.

Willenbrink commented 2 years ago

Ah, got you. No, I haven't noticed any heap corruption with ASan.

mqrause commented 2 years ago

https://github.com/CleverRaven/Cataclysm-DDA/runs/7335244668?check_suite_focus=true#step:16:671 This might be of interest here.

Stadler76 commented 2 years ago

Here are two Windows builds failing with bad allocation:

I assume these are related?

Stadler76 commented 2 years ago

Here's a successful build, but with the TEST_CASE( "overmap_terrain_coverage", "[overmap][slow]" ) (the very same test case that crashed with bad allocation on the Windows builds I've linked) running extremely slowly (~11 minutes): https://github.com/CleverRaven/Cataclysm-DDA/runs/7568231915#step:16:610

Note that I'm not blaming the test case itself, but that might help narrow down the culprit.

Zireael07 commented 2 years ago

@Stadler76: Most likely related to #55104 then

Stadler76 commented 2 years ago

> @Stadler76: Most likely related to #55104 then

So it is the test to be blamed? I was suspecting the mapgen function itself. Anyway, thanks for the pointer.

Zireael07 commented 2 years ago

The test uncovered a problem in one of the mapgen functions. I think the problem was fixed just recently, so tomorrow's build hopefully should not exhibit this particular problem.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Please do not bump or comment on this issue unless you are actively working on it. Stale issues, and stale issues that are closed are still considered.