bupticybee / TexasSolver

🚀 A very efficient Texas Holdem GTO solver :spades::hearts::clubs::diamonds:
https://bupticybee.github.io/texassolver_page
GNU Affero General Public License v3.0

[BUG REPORT] Random failure in CI workflow #43

Open Endle opened 3 years ago

Endle commented 3 years ago

I found that some of my unrelated changes appeared to break the CI, so I created a test that repeats the original test cases:

https://github.com/Endle/TexasSolver/pull/2

I have now confirmed that the test case on the master branch can fail randomly:

https://github.com/Endle/TexasSolver/pull/2/checks?check_run_id=3492197937

I'm not sure whether this is caused by a multi-threading race condition.

Endle commented 3 years ago

My test code:

        for i in {1..20}
        do
           echo "Test ID: $i"
           ./console_solver -i resources/text/commandline_sample_input.txt || exit 1
        done

bupticybee commented 3 years ago

This is very weird. I have come across this error in Google Colab a few times, and it seems pretty random to me.

However, I still can't replicate the error on my PC or Mac, let alone debug it.

Any idea what's going on here?

Endle commented 3 years ago

@bupticybee I have no idea yet. How about increasing the number of repeats? At least that would show us the error rate.
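One way to measure the error rate mentioned above, as a minimal sketch: unlike the earlier loop, which exits on the first failure, this keeps going and counts every non-zero exit. `count_failures` is a hypothetical helper name, not something shipped with TexasSolver.

```shell
# count_failures RUNS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND the given number of times and
# prints how many of those runs exited with a non-zero status.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        i=$((i + 1))
        # Discard the command's output; only the exit status matters here.
        # A crash (segfault, abort) also counts as a non-zero exit.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
    done
    echo "failures: $failures/$runs"
}
```

For example, `count_failures 100 ./console_solver -i resources/text/commandline_sample_input.txt` would report the number of failing runs out of 100. A run that hangs would stall the loop, so this sketch only covers runs that terminate.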

bupticybee commented 3 years ago

Can you replicate it on your PC? I'm having trouble replicating it on mine.

Endle commented 3 years ago

On my PC (Fedora Linux):

[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestCase
[ RUN      ] TestCase.test_poker_solver_bench
[##################################################] 100%
unknown file: Failure
C++ exception with description "[json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal" thrown in the test body.
[  FAILED  ] TestCase.test_poker_solver_bench (21061 ms)
[----------] 1 test from TestCase (21061 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (21061 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] TestCase.test_poker_solver_bench

 1 FAILED TEST

However, I can't reproduce the issue with:

build/console_solver -i resources/text/commandline_sample_input.txt

bupticybee commented 3 years ago

It's a JSON parse issue? How can that be? This project doesn't read data from any JSON file when it is started through the console solver; I only use the JSON parser once, when dumping the strategy.

This doesn't seem to be the same issue as the one in the CI.

And you ran it from build? Does it run into the same bug from install?

jcbrtl commented 3 years ago

@Endle, it's not about the changes you made at all.

I've just compiled both versions of the code (the release version and the master branch) on macOS Big Sur (11.4) using llvm@12. Neither version passed the test (the output was exactly like yours), and console_solver couldn't run even once without crashing. So yeah, not good. Here are some (silent) outputs:

$ ./console_solver -i ../resources/text/commandline_sample_input.txt 
EXEC FROM FILE

<<<START SOLVING>>>
Using 8 threads
Iter: 0
Segmentation fault: 11

$ ./console_solver -i ../resources/text/commandline_sample_input.txt 
EXEC FROM FILE

<<<START SOLVING>>>
Using 8 threads
Iter: 0
console_solver(9243,0x116a28e00) malloc: Double free of object 0x7fdb07406c20

$ ./console_solver -i ../resources/text/benchmark_texassolver.txt 
EXEC FROM FILE

<<<START SOLVING>>>
Using 6 threads
Iter: 0
console_solver(9502,0x10c5f0e00) malloc: *** error for object 0x7fdfa0f05050: pointer being freed was not allocated
console_solver(9502,0x10c5f0e00) malloc: *** set a breakpoint in malloc_error_break to debug
Segmentation fault: 11

$ ./console_solver -i ../resources/text/benchmark_texassolver.txt 
EXEC FROM FILE

<<<START SOLVING>>>
Using 6 threads
Iter: 0
console_solver(9532,0x10d0f8e00) malloc: Double free of object 0x7ffc17d062b0
console_solver(9532,0x70000e569000) malloc: *** error for object 0x7ffc17d06300: pointer being freed was not allocated
console_solver(9532,0x70000e569000) malloc: *** set a breakpoint in malloc_error_break to debug
Abort trap: 6

$ ./console_solver -i ../resources/text/benchmark_texassolver.txt 
EXEC FROM FILE

<<<START SOLVING>>>
Using 6 threads
Iter: 0
console_solver(9541,0x700012000000) malloc: Incorrect checksum for freed object 0x7fc9187070d8: probably modified after being freed.
Corrupt value: 0x40b0000004000
console_solver(9541,0x700012000000) malloc: *** set a breakpoint in malloc_error_break to debug
Abort trap: 6

$ ./console_solver -i ../resources/text/benchmark_texassolver.txt 
EXEC FROM FILE

<<<START SOLVING>>>
Using 6 threads
Iter: 0
<runs endlessly w/ high CPU usage (but very low MEM usage)...>

@bupticybee, I hope this gives you a clue.

That's all, folks.
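Double frees and "modified after being freed" reports like the ones above are the kind of error AddressSanitizer is designed to pinpoint. The following is a sketch only: `-fsanitize=address`, `-fno-omit-frame-pointer`, and `-g` are standard GCC/Clang flags, but `build_with_asan` and the `build-asan` directory are hypothetical names, and it is an assumption that the project's CMakeLists.txt honors `CMAKE_CXX_FLAGS` and that the toolchain in use ships the sanitizer runtime.

```shell
# Standard GCC/Clang flags: instrument heap accesses and keep usable stack traces.
ASAN_FLAGS="-fsanitize=address -fno-omit-frame-pointer -g"

# build_with_asan SRC_DIR
# Hypothetical helper: configures and builds SRC_DIR into SRC_DIR/build-asan
# with AddressSanitizer enabled. Assumes CMakeLists.txt sits in SRC_DIR.
build_with_asan() {
    src_dir=$1
    mkdir -p "$src_dir/build-asan" || return 1
    # Run in a subshell so the caller's working directory is not changed.
    (
        cd "$src_dir/build-asan" &&
        cmake .. -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS="$ASAN_FLAGS" &&
        cmake --build .
    )
}
```

After `build_with_asan .` in the repository root, running `build-asan/console_solver -i resources/text/commandline_sample_input.txt` should make an instrumented binary stop at the first invalid free or use-after-free with stack traces for both the allocation and the free. `-fsanitize=thread` is the analogous flag for data races, although it can report false positives when the OpenMP runtime itself is not instrumented.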

bupticybee commented 3 years ago

Weird discovery. I will probably make another Mac release in a few days. I actually forgot to link some library on Mac; I'm not sure whether that is related.

However, I have never run into the same error as you. Very weird indeed.

I compile on macOS Catalina (10.15.7) using CLion with gcc-7, and I have never run into any kind of error.

[Screenshot 2021-09-03, 12:25:51 PM]

CMake options:

-D CMAKE_CXX_COMPILER=/usr/local/Cellar/gcc@7/7.5.0_3/bin/g++-7 -D CMAKE_C_COMPILER=/usr/local/Cellar/gcc@7/7.5.0_3/bin/gcc-7

If you are still interested in using the code of this project, you can try using the same toolchain as I did.

jcbrtl commented 3 years ago

> If you are still interested in using the code of this project, you can try using the same toolchain as I did.

@bupticybee, just to let you know, it also didn't work with gcc@11. The very same issues.

But (curiously) with gcc@7 the behavior changed a little (all the rest is the same):

$ ./test 
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestCase
[ RUN      ] TestCase.test_poker_solver_bench
[##################################################] 100%
Abort trap: 6

Both gcc versions found (and used) the same OpenMP 4.5 API and the code is the same master branch version compiled on macOS Big Sur (11.4).

Hope it helps. Thanks!

bupticybee commented 3 years ago

Thank you for your effort. I forgot to mention that the test is expected not to work; it requires further local modifications that I didn't submit to the codebase. I now test with this command, and this command only:

cd install 
./console_solver -i resources/text/commandline_sample_input.txt

It was my fault not to mention this in the README; the README is fixed in a4129e82f1862875f5e9abba778ae70ea042828c.

Could you be so kind as to check whether the error still shows up? I test the code on macOS, Windows, and Linux, so there should be no reason to come across the kind of error you just mentioned.

BTW, did you check https://colab.research.google.com/drive/1NWDb53ypcKpkb3g3orzEBDeHAEkAIC7y ?

jcbrtl commented 3 years ago

> Could you be so kind as to check whether the error still shows up?

The same errors as before.

> BTW, did you check https://colab.research.google.com/drive/1NWDb53ypcKpkb3g3orzEBDeHAEkAIC7y ?

I installed Ubuntu 18.04 mini as a guest OS on VirtualBox (same macOS Big Sur (11.4) host) so I could follow your steps, and I followed them line by line.

The same errors occur: segfaults, double-free aborts, and memory corruption.

I believe you (or anyone) can reproduce it w/ this easy setup. Thanks!

P.S.: Let's be practical, I uploaded the VM files (content of ~/VirtualBox VMs) to G Drive. Please check. User/password: poker

bupticybee commented 3 years ago

I will try to reproduce the error. Thanks for the reproduction procedure.

I just checked the notebook I provided, and it still works in Google Colab. Following the same commands line by line doesn't mean you will get the same result, because Colab has a different environment setup; you have to check whether the CMake, GCC, OpenMP versions, etc. are the same as yours.
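A quick way to compare the versions mentioned above across two environments, as a sketch: `print_toolchain` is a hypothetical helper, and it assumes the compiler is reachable as `gcc`/`g++` on the PATH (a Homebrew install would need names such as `gcc-7` instead). The `_OPENMP` macro encodes the supported OpenMP release date; `201511` corresponds to OpenMP 4.5.

```shell
# print_toolchain
# Hypothetical helper: prints the first version line of each build tool
# and the _OPENMP macro that gcc defines when -fopenmp is enabled.
print_toolchain() {
    for tool in cmake gcc g++; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s: %s\n' "$tool" "$("$tool" --version | head -n 1)"
        else
            printf '%s: not found\n' "$tool"
        fi
    done
    # Dump the predefined macros for an empty C file and keep only _OPENMP.
    openmp=$(gcc -fopenmp -x c -dM -E /dev/null 2>/dev/null | grep _OPENMP || true)
    echo "${openmp:-_OPENMP: not defined}"
}
```

Running `print_toolchain` in Colab, on the host, and inside the VM, then diffing the three outputs, would show whether the environments really match.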

jcbrtl commented 3 years ago

> (...) you have to check whether the CMake, GCC, OpenMP versions, etc. are the same as yours.

Oh, the setup was built to be as close as possible, as you'll see. So CMake, GCC, and OpenMP are identical.

bupticybee commented 3 years ago

Thanks for the information. I will definitely look into this.