actions / runner-images

GitHub Actions runner images

Actions running same script produce different artifact results #3022

Closed manongjohn closed 3 years ago

manongjohn commented 3 years ago

Description
I'm getting different results for the gcc-compiled Linux artifacts generated by Actions runs that follow the same steps in this script, depending on whether I run the "gcc" configuration alone or the "gcc + clang" configurations together:

https://github.com/tahoma2d/tahoma2d/blob/master/.github/workflows/linux_build.yml

Up front, my question is: Is there something different about the environment that is not written to the logs when I run "gcc+clang" configurations at the same time vs when I run "gcc" by itself?

Area for Triage:
Artifacts

Question, Bug, or Feature?:
Question/Bug

Virtual environments affected

Image version 20210317.1

Expected behavior
Artifacts from a "gcc"-only run should work like the "gcc" artifact from a "gcc+clang" run.

Actual behavior
When I run the above script with both "gcc + clang" configurations enabled, the gcc artifact works fine. If I run the same script but only run "gcc" configuration, the gcc artifact segfaults when I start it.

I would expect the gcc artifacts to work in either case, since the only difference in the script is whether clang is also building at the same time.

So I compared the "gcc" output from both runs to see what might be different:

"gcc + clang" generating good gcc artifact: https://github.com/tahoma2d/tahoma2d/runs/2172015838?check_suite_focus=true

"gcc only" generating bad gcc artifact: https://github.com/tahoma2d/tahoma2d/runs/2171845696?check_suite_focus=true

I compared them and, for the most part, they are extremely similar, with the exception of this output from log #2, which I think is related to the segfault:

rng-double.c:64:3: note: loop vectorized
rng-double.c:64:3: note: loop versioned for vectorization because of possible aliasing
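A comparison like this is easier to repeat if the per-line timestamps are stripped from the two downloaded raw logs before diffing them. The following is only a rough sketch: the function names are made up for illustration, and it assumes every raw log line starts with an ISO 8601 UTC timestamp followed by a space.

```shell
# Sketch: compare two downloaded job logs while ignoring timestamps.
# Assumption: lines start with a prefix like "2021-03-22T17:27:08.1234567Z ".

strip_timestamps() {
  # Drop a leading "YYYY-MM-DDTHH:MM:SS(.fraction)Z " prefix when present.
  sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(\.[0-9]+)?Z //' "$1"
}

compare_logs() {
  # Print the differing lines; the exit status is that of diff
  # (0 = identical after stripping, 1 = different).
  tmp_a=$(mktemp) || return 2
  tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 2; }
  strip_timestamps "$1" > "$tmp_a"
  strip_timestamps "$2" > "$tmp_b"
  diff "$tmp_a" "$tmp_b"
  status=$?
  rm -f "$tmp_a" "$tmp_b"
  return "$status"
}
```

With the two logs saved under placeholder names such as good.log and bad.log, `compare_logs good.log bad.log` would print only the lines that differ beyond their timestamps.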

I also build this locally in my own Linux environment, using the same scripts/steps and gcc. I don't see this message when compiling, and my build doesn't segfault.

Side note: The "clang" artifact gives me the same issue. It is failing in the same place as the fully gcc compiled artifact. The clang log, again, looks very similar to the gcc log with the noted warning above.

Any insight into environmental differences you can provide would be appreciated.

AlenaSviridenko commented 3 years ago

Hi @manongjohn, we need some time for an initial investigation. We will get back to you once we have results.

dsame commented 3 years ago

@manongjohn are you sure ccache is not the problem? I see you keep intermediate objects between the builds, and they are definitely not compatible across compiler toolchains. Can you please remove the - uses: actions/cache@v2 step and check the resulting binaries?

manongjohn commented 3 years ago

I don't build that particular library (libmypaint) using ccache. The scripts to configure and make libmypaint don't support ccache very well, so I don't even force it the way I have to with other build scripts in the same run.

manongjohn commented 3 years ago

Oh, I should also mention: even before I introduced ccache, I was having the same issue.

dsame commented 3 years ago

@manongjohn can you please clarify which binary segfaults?

manongjohn commented 3 years ago

Tahoma2D.AppImage

dsame commented 3 years ago

@manongjohn I reviewed the compiler options for both the "gcc only" and "gcc+clang" builds and they are identical. Moreover, I was not able to reproduce the segfault by running Tahoma2D.AppImage in the pipeline https://github.com/dsame/tahoma2d/runs/2246340007?check_suite_focus=true

In order to identify the reason for the segfault, can you please send us a postmortem core dump of the failed binary?

The steps to get it:

sudo /bin/sh -c 'echo "core" > /proc/sys/kernel/core_pattern'
ulimit -c unlimited

Running the app and having it segfault produces a file named core in the current directory.

In order to get a traceback of the segfault, you might need to install gdb:

sudo apt-get install gdb

And the following command will print the traceback we need to find out the reason of the crash:

echo 'bt full'|gdb path_to_the_binary -c core
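If piping the command into gdb gives little output, a batch-mode invocation may be easier to capture in a log. This is just a sketch: print_backtrace is a made-up helper name, and path_to_the_binary and core remain the same placeholders as in the steps above.

```shell
# Sketch: print a full backtrace from a core file without an interactive session.
print_backtrace() {
  binary=$1
  core=${2:-core}   # defaults to the file name set via core_pattern above
  if [ ! -f "$binary" ] || [ ! -f "$core" ]; then
    echo "usage: print_backtrace BINARY [CORE] (both files must exist)" >&2
    return 1
  fi
  # -batch makes gdb exit after running the commands passed with -ex.
  gdb -batch -ex 'bt full' "$binary" "$core"
}
```

For example, `print_backtrace path_to_the_binary core > backtrace.txt 2>&1` would save the output to a file (backtrace.txt is an arbitrary name) that can be attached to the issue.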

manongjohn commented 3 years ago

Here is a zip containing the coredump and a log of the command/results of the gdb backtrace.

Since your command didn't provide much output, I executed the app through gdb to get the backtrace

coredump.zip

dsame commented 3 years ago

I confirm the builds use the very same makefiles and options, but I noticed that a different set of dependencies is installed in the very first step. This can cause the different build log outputs. I am going to dig into the differences I noticed, but meanwhile, is it possible to reproduce the segfault within the workflow? I tried to launch the app with the xvfb X11 server, but it just waits for some user input until the timeout. Does that mean the produced binary is not expected to segfault at all?

manongjohn commented 3 years ago

Thank you for continuing to look further into this.

I've never tried to have it start in the workflow.

Normally the bad version will segfault within a few seconds after I try to start it. The splash screen doesn't even come up. The good version will show a splash screen then load the application and provide a dialog box prompting the user for input.

Could you share the list of dependencies that are different between the 2 builds? I can see if I have or don't have it in my local build environment and maybe find what may be causing the difference.

dsame commented 3 years ago

@manongjohn these builds of the forked repo

https://github.com/dsame/tahoma2d/runs/2234450603?check_suite_focus=true
https://github.com/dsame/tahoma2d/runs/2235408221?check_suite_focus=true

have different apt logs


gcc + clang:
0 upgraded, 173 newly installed, 0 to remove and 38 not upgraded.
gcc:
0 upgraded, 174 newly installed, 0 to remove and 21 not upgraded.

but I doubt this is related to the issue.

We definitely have different SSE2 instructions used during the build, and this might be caused by default gcc settings, by 3rd party dependencies, or by differences between the environments the builds run in.
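To see exactly which packages differ between two runs, each run could record its installed package versions and the two lists could be compared afterwards. This is a sketch only, assuming a Debian/Ubuntu image where dpkg-query is available; the function and file names are made up for illustration.

```shell
# Sketch: record installed package versions in each run, then compare offline.

snapshot_packages() {
  # One "name version" line per installed package, sorted so comm can be used.
  dpkg-query -W -f='${Package} ${Version}\n' | LC_ALL=C sort > "$1"
}

diff_package_lists() {
  # Unindented lines exist only in the first list; tab-indented lines exist
  # only in the second. Lines common to both lists are suppressed.
  LC_ALL=C comm -3 "$1" "$2"
}
```

Each workflow run would call something like `snapshot_packages packages.txt` after the dependency install step and upload the file as an artifact; `diff_package_lists` can then be run locally on the two downloaded files.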

I double checked the libmypaint source and confirmed it has not changed since 2019

My next step is to remove caching, just to avoid possible conflicts, and to reduce the build by removing steps one at a time in order to find the component that causes the difference in the libmypaint build output.

dsame commented 3 years ago

It is confirmed that the difference in the build log output is not related to the clang/gcc combination.

A GCC-only build https://github.com/dsame/tahoma2d/runs/2318343127?check_suite_focus=true has no rng-double.c:64:3: note: loop vectorized, while the very same build (triggered with git commit --allow-empty ...) https://github.com/dsame/tahoma2d/runs/2318341236?check_suite_focus=true does have rng-double.c:64:3: note: loop vectorized
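Checking several downloaded logs for this note can be scripted. The sketch below is only an illustration; the function names are made up, and the log file names passed to it are whatever the raw logs were saved as.

```shell
# Sketch: report how often the vectorization note appears in each build log.

count_vectorization_notes() {
  # grep -c prints the number of matching lines. It exits non-zero when the
  # count is 0, so the status is deliberately ignored.
  grep -c 'rng-double\.c:[0-9]*:[0-9]*: note: loop vectorized' "$1" || true
}

report_logs() {
  # Print "file: count" for every log passed as an argument.
  for log in "$@"; do
    printf '%s: %s\n' "$log" "$(count_vectorization_notes "$log")"
  done
}
```

The pattern does not match the "loop versioned for vectorization" line, so a count of 0 means the loop was not reported as vectorized in that run.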

dsame commented 3 years ago

@madhurig I ran the build 4 times with ccache removed (not the cache step, but the ccache call with cc/cxx) and not once did I get rng-double.c:64:3: note: loop vectorized

https://github.com/dsame/tahoma2d/runs/2320577994?check_suite_focus=true
https://github.com/dsame/tahoma2d/runs/2320577535?check_suite_focus=true
https://github.com/dsame/tahoma2d/runs/2320576850?check_suite_focus=true
https://github.com/dsame/tahoma2d/runs/2320574952?check_suite_focus=true

My guess is that the different steps (inside the single job) use the same cache, and that some 3rd party dependencies bring in different headers from different mirrors. It sounds crazy, but I see nothing else that could have changed in the build.

Can you please either use step-dedicated caches or remove the cache from some step? I am not an expert in ccache and am not able to do this effectively or with the least performance impact.

For now, my only solution is to remove ccache and hope the problem does not come back.
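One possible shape for dedicated caches is a separate ccache directory per compiler configuration, selected through the CCACHE_DIR environment variable that ccache reads. This is an untested sketch rather than a recommendation: the helper names, the CCACHE_BASE variable, and the directory layout are assumptions, and the path used by the cache step in the workflow would have to be adjusted to match.

```shell
# Sketch: give each compiler configuration its own ccache directory.
# CCACHE_BASE is a hypothetical variable introduced here; it is not a ccache setting.

ccache_dir_for() {
  # Build a per-configuration path such as ~/.cache/ccache-gcc.
  printf '%s/ccache-%s\n' "${CCACHE_BASE:-$HOME/.cache}" "$1"
}

use_dedicated_ccache() {
  # Point ccache at the directory for the given configuration label.
  CCACHE_DIR=$(ccache_dir_for "$1")
  export CCACHE_DIR
  mkdir -p "$CCACHE_DIR"
}
```

A build step would call `use_dedicated_ccache gcc` (or `clang`) before invoking the compiler, so that objects cached by one configuration are never served to another.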

manongjohn commented 3 years ago

Not sure what might have changed, but for whatever reason I am suddenly unable to reproduce this problem. Because of this, there is no point in continuing the investigation, so I will close the issue.

Thanks to all who took the time to investigate.