actions / runner-images

GitHub Actions runner images

Segmentation fault (core dumped) on Linux Agent #3034

Closed dingmeng-xue closed 3 years ago

dingmeng-xue commented 3 years ago

Description
We have recently started hitting segmentation faults at random, regardless of whether we build the latest source code or older source code. We searched online, and most of the comments point to a system-level issue. We need your help investigating.

https://dev.azure.com/azure-sdk/public/_build/results?buildId=805984&view=logs&j=2f953adc-c56d-55c4-a64a-eab7c4b02235&t=fc7ea605-a507-5208-bc88-3e6a658c906b

Area for Triage:

No idea

Question, Bug, or Feature?:

Question

Virtual environments affected

Image version

Image: ubuntu-18.04
Current agent version: '2.183.1'

Expected behavior
Build csharp project successfully.

Actual behavior
Failed due to segmentation fault

Repro steps
The failure can be seen in this public build log: https://dev.azure.com/azure-sdk/public/_build/results?buildId=805984&view=logs&j=2f953adc-c56d-55c4-a64a-eab7c4b02235&t=fc7ea605-a507-5208-bc88-3e6a658c906b

LeonidLapshin commented 3 years ago

Hey, @dingmeng-xue! We need some time to investigate and will be back with details soon :) Thank you!

miketimofeev commented 3 years ago

@dingmeng-xue have you tried the Ubuntu 20 image? Does the issue reproduce there?

dingmeng-xue commented 3 years ago

Not yet. Currently, we only use Ubuntu 18 for Linux builds, and we would still like to stick to that version.

dsame commented 3 years ago

@dingmeng-xue

I was able to get core dump and investigate it.

It looks like the cause of the segfault is a memory block being freed more than once. It is very unlikely that the problem originates in the image.

The backtrace indicates that the fault happens in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1, which in turn is called from /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.4/libcoreclr.so

@miketimofeev unless there have been changes in libcrypto.so.1.1, I believe the issue should be investigated by the .NET team

2021-03-30T05:26:57.9386665Z GNU gdb (Ubuntu 8.2-0ubuntu1~18.04) 8.2
2021-03-30T05:26:57.9388245Z Copyright (C) 2018 Free Software Foundation, Inc.
2021-03-30T05:26:57.9391342Z License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
2021-03-30T05:26:57.9392599Z This is free software: you are free to change and redistribute it.
2021-03-30T05:26:57.9393671Z There is NO WARRANTY, to the extent permitted by law.
2021-03-30T05:26:57.9394705Z Type "show copying" and "show warranty" for details.
2021-03-30T05:26:57.9396253Z This GDB was configured as "x86_64-linux-gnu".
2021-03-30T05:26:57.9397289Z Type "show configuration" for configuration details.
2021-03-30T05:26:57.9398308Z For bug reporting instructions, please see:
2021-03-30T05:26:57.9399273Z <http://www.gnu.org/software/gdb/bugs/>.
2021-03-30T05:26:57.9400398Z Find the GDB manual and other documentation resources online at:
2021-03-30T05:26:57.9401492Z     <http://www.gnu.org/software/gdb/documentation/>.
2021-03-30T05:26:57.9402116Z 
2021-03-30T05:26:57.9402888Z For help, type "help".
2021-03-30T05:26:57.9403835Z Type "apropos word" to search for commands related to "word"...
2021-03-30T05:26:57.9412060Z Reading symbols from /usr/bin/dotnet...(no debugging symbols found)...done.
2021-03-30T05:26:57.9691950Z [New LWP 3805]
2021-03-30T05:26:57.9693670Z [New LWP 3795]
2021-03-30T05:26:57.9694545Z [New LWP 3794]
2021-03-30T05:26:57.9696327Z [New LWP 3796]
2021-03-30T05:26:57.9699468Z [New LWP 3800]
2021-03-30T05:26:57.9699989Z [New LWP 3797]
2021-03-30T05:26:57.9700331Z [New LWP 3801]
2021-03-30T05:26:57.9706351Z [New LWP 3804]
2021-03-30T05:26:57.9706844Z [New LWP 3809]
2021-03-30T05:26:57.9707274Z [New LWP 3793]
2021-03-30T05:26:57.9707682Z [New LWP 3798]
2021-03-30T05:26:57.9708105Z [New LWP 3799]
2021-03-30T05:26:57.9708524Z [New LWP 3803]
2021-03-30T05:26:57.9708942Z [New LWP 3806]
2021-03-30T05:26:57.9709342Z [New LWP 3807]
2021-03-30T05:26:57.9709764Z [New LWP 3808]
2021-03-30T05:26:57.9710180Z [New LWP 3810]
2021-03-30T05:26:57.9710596Z [New LWP 3811]
2021-03-30T05:26:57.9745642Z [Thread debugging using libthread_db enabled]
2021-03-30T05:26:57.9747186Z Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
2021-03-30T05:26:58.6823436Z Core was generated by `dotnet sln /home/vsts/work/1/s/artifacts/Azure.PowerShell.sln add /home/vsts/wo'.
2021-03-30T05:26:58.6824685Z Program terminated with signal SIGSEGV, Segmentation fault.
2021-03-30T05:26:58.6825335Z #0  __pthread_rwlock_wrlock_full (abstime=0x0, rwlock=0x0)
2021-03-30T05:26:58.6825902Z     at pthread_rwlock_common.c:576
2021-03-30T05:26:58.6826628Z 576    pthread_rwlock_common.c: No such file or directory.
2021-03-30T05:26:58.6827216Z [Current thread is 1 (Thread 0x7fe479b88700 (LWP 3805))]
2021-03-30T05:26:58.6827821Z (gdb) #0  __pthread_rwlock_wrlock_full (abstime=0x0, rwlock=0x0)
2021-03-30T05:26:58.6828396Z     at pthread_rwlock_common.c:576
2021-03-30T05:26:58.7191887Z         may_share_futex_used_flag = <optimized out>
2021-03-30T05:26:58.7192858Z         wpf = <optimized out>
2021-03-30T05:26:58.7193483Z         ready = <optimized out>
2021-03-30T05:26:58.7194056Z         r = <optimized out>
2021-03-30T05:26:58.7194700Z         may_share_futex_used_flag = <optimized out>
2021-03-30T05:26:58.7195304Z         r = <optimized out>
2021-03-30T05:26:58.7195867Z         wpf = <optimized out>
2021-03-30T05:26:58.7196427Z         ready = <optimized out>
2021-03-30T05:26:58.7197912Z         __value = <optimized out>
2021-03-30T05:26:58.7198542Z         prefer_writer = <optimized out>
2021-03-30T05:26:58.7199695Z         private = <optimized out>
2021-03-30T05:26:58.7200293Z         wf = <optimized out>
2021-03-30T05:26:58.7200858Z         err = <optimized out>
2021-03-30T05:26:58.7201985Z         w = <optimized out>
2021-03-30T05:26:58.7202579Z         w = <optimized out>
2021-03-30T05:26:58.7203129Z         private = <optimized out>
2021-03-30T05:26:58.7203701Z         err = <optimized out>
2021-03-30T05:26:58.7204776Z         w = <optimized out>
2021-03-30T05:26:58.7205368Z         wf = <optimized out>
2021-03-30T05:26:58.7205908Z         wf = <optimized out>
2021-03-30T05:26:58.7206475Z         __value = <optimized out>
2021-03-30T05:26:58.7209120Z #1  __GI___pthread_rwlock_wrlock (rwlock=0x0) at pthread_rwlock_wrlock.c:27
2021-03-30T05:26:58.7209805Z         result = <optimized out>
2021-03-30T05:26:58.7210357Z #2  0x00007fe47860e989 in CRYPTO_THREAD_write_lock ()
2021-03-30T05:26:58.7211539Z    from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
2021-03-30T05:26:58.7212164Z No symbol table info available.
2021-03-30T05:26:58.7212692Z #3  0x00007fe4785d0013 in RAND_get_rand_method ()
2021-03-30T05:26:58.7213797Z    from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
2021-03-30T05:26:58.7214361Z No symbol table info available.
2021-03-30T05:26:58.7214875Z #4  0x00007fe4785d02f0 in RAND_bytes ()
2021-03-30T05:26:58.7215607Z    from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
2021-03-30T05:26:58.7216151Z No symbol table info available.
2021-03-30T05:26:58.7216652Z #5  0x00007fe47858d49f in ?? ()
2021-03-30T05:26:58.7217360Z    from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
2021-03-30T05:26:58.7217899Z No symbol table info available.
2021-03-30T05:26:58.7218447Z #6  0x00007fe47859ba97 in EVP_CIPHER_CTX_ctrl ()
2021-03-30T05:26:58.7219186Z    from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
2021-03-30T05:26:58.7219707Z No symbol table info available.
2021-03-30T05:26:58.7220594Z #7  0x00007fe4789596b4 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
2021-03-30T05:26:58.7221149Z No symbol table info available.
2021-03-30T05:26:58.7221872Z #8  0x00007fe47894b4fa in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
2021-03-30T05:26:58.7223769Z No symbol table info available.
2021-03-30T05:26:58.7224655Z #9  0x00007fe478945f16 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
2021-03-30T05:26:58.7225237Z No symbol table info available.
2021-03-30T05:26:58.7225758Z #10 0x00007fe4789324c4 in SSL_do_handshake ()
2021-03-30T05:26:58.7226493Z    from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
2021-03-30T05:26:58.7227029Z No symbol table info available.
2021-03-30T05:26:58.7227525Z #11 0x00007fe49ec91343 in ?? ()
2021-03-30T05:26:58.7228193Z No symbol table info available.
2021-03-30T05:26:58.7228686Z #12 0x00007fe479b86d70 in ?? ()
2021-03-30T05:26:58.7229942Z No symbol table info available.
2021-03-30T05:26:58.7230457Z #13 0x00000000000f98f4 in ?? ()
2021-03-30T05:26:58.7230987Z No symbol table info available.
2021-03-30T05:26:58.7232043Z #14 0x00007fe513e11848 in ?? ()
2021-03-30T05:26:58.7232637Z    from /usr/share/dotnet/shared/Microsoft.NETCore.App/5.0.4/libcoreclr.so
2021-03-30T05:26:58.7233216Z No symbol table info available.
2021-03-30T05:26:58.7234204Z #15 0x00007fe479b87c80 in ?? ()
2021-03-30T05:26:58.7234714Z No symbol table info available.
2021-03-30T05:26:58.7235193Z #16 0x00007fe49ed30900 in ?? ()
2021-03-30T05:26:58.7236234Z No symbol table info available.
2021-03-30T05:26:58.7236740Z #17 0x00007fe49ed30900 in ?? ()
2021-03-30T05:26:58.7237246Z No symbol table info available.
2021-03-30T05:26:58.7237724Z #18 0x00007fe479b86d70 in ?? ()
2021-03-30T05:26:58.7242339Z No symbol table info available.
2021-03-30T05:26:58.7245851Z #19 0x00007fe49ec91343 in ?? ()
2021-03-30T05:26:58.7248225Z No symbol table info available.
2021-03-30T05:26:58.7249241Z #20 0x00007fe479b86e00 in ?? ()
2021-03-30T05:26:58.7251192Z No symbol table info available.
2021-03-30T05:26:58.7252669Z #21 0x00007fe49ed309d8 in ?? ()
2021-03-30T05:26:58.7258734Z No symbol table info available.
2021-03-30T05:26:58.7259932Z #22 0x00007fe49ed30900 in ?? ()
2021-03-30T05:26:58.7260595Z No symbol table info available.
2021-03-30T05:26:58.7261218Z #23 0x00007fe47a672ed8 in ?? ()
2021-03-30T05:26:58.7261861Z No symbol table info available.
2021-03-30T05:26:58.7262493Z #24 0x14061d5200000001 in ?? ()
2021-03-30T05:26:58.7263128Z No symbol table info available.
2021-03-30T05:26:58.7263753Z #25 0x0000000000001333 in ?? ()
2021-03-30T05:26:58.7264389Z No symbol table info available.
2021-03-30T05:26:58.7265024Z #26 0x0000000000000000 in ?? ()
2021-03-30T05:26:58.7266045Z No symbol table info available.
2021-03-30T05:26:58.7266640Z (gdb) quit
2021-03-30T05:26:58.7476760Z ##[section]Finishing: Bash
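
For anyone who wants to produce a similar backtrace from their own core file, here is a minimal sketch. It is not necessarily how the log above was generated. The `core_backtrace` helper name is hypothetical, and the sketch assumes `gdb` is installed on the agent and that a core file has already been written to a known path.

```shell
# Hypothetical helper: print a full backtrace (frames plus local variables)
# from a core file, using gdb in non-interactive batch mode.
# Usage: core_backtrace <binary> <core-file>
core_backtrace() {
  if [ ! -r "$2" ]; then
    echo "core file not readable: $2" >&2
    return 1
  fi
  # "bt full" prints every frame together with its local variables.
  gdb --batch -ex "bt full" "$1" "$2"
}
```

A possible invocation would be `core_backtrace /usr/bin/dotnet ./core`, where `./core` is a placeholder path. Core dumps usually have to be enabled in the same shell before the crashing command runs (for example with `ulimit -c unlimited`), and where the core file ends up depends on the system's core pattern configuration.
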
miketimofeev commented 3 years ago

@dingmeng-xue could you please try another .NET Core version? Does that help?

dingmeng-xue commented 3 years ago

Sure. We plan to try another version of .NET Core. Since the failure is random for us, it may take a couple of days to understand the result.

dingmeng-xue commented 3 years ago

After testing with dotnet 2.1 for a week, we have not seen the issue again.

miketimofeev commented 3 years ago

@dingmeng-xue could you raise the issue with the .NET team then?

maxim-lobanov commented 3 years ago

Closing the issue for now. Please let us know if you have any concerns or if it should be reopened after discussion with the .NET team.

aprilmintacpineda commented 2 years ago

We're experiencing this.

Run yarn eslint . --ext .js,.ts
yarn run v1.22.17
$ eslint . --ext .js,.jsx,.ts,.tsx --fix . --ext .js,.ts
Segmentation fault (core dumped)
error Command failed with exit code 139.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
Error: Process completed with exit code 139.

name: PR linter check

# Controls when the action will run. Triggers the workflow on pull requests
# that target any of the branches listed below
on:
  pull_request:
    branches:
      - main
      - master
      - dev
      - stg
      - uat

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "lint-check"
  lint-check:
    # The type of runner that the job will run on
    runs-on: ubuntu-20.04
    continue-on-error: false

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@master

      - uses: actions/cache@v2
        with:
          path: '**/node_modules'
          key: ${{ runner.os }}-node-modules-${{ hashFiles('**/yarn.lock') }}

      - name: Install node modules
        run: yarn --prefer-offline

      - name: Lint
        run: yarn eslint . --ext .js,.ts

      # skipLibCheck is temporary because it also excludes our own declaration files
      # https://github.com/microsoft/TypeScript/issues/40426
      - name: TypeScript
        run: tsc --noEmit --skipLibCheck

We use this for PR checks and this happens very often, in around 8 out of 10 runs, on either the Lint step or the TypeScript step.
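
For context on the `exit code 139` in the log above: by shell convention, a status above 128 usually means the process was killed by signal number (status - 128), and signal 11 on Linux is SIGSEGV. The sketch below decodes a status that way. The function names are made up for illustration, and a program can also return such a status itself, so treat the result as a hint.

```shell
# Sketch: decode a shell exit status into the terminating signal number.
# Convention: status > 128 means "killed by signal (status - 128)".
signal_from_status() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  else
    echo 0   # not terminated by a signal
  fi
}

# Print a short human-readable description of an exit status.
describe_status() {
  sig=$(signal_from_status "$1")
  case "$sig" in
    0)  echo "exit status $1: normal exit" ;;
    11) echo "exit status $1: killed by signal 11 (SIGSEGV)" ;;
    *)  echo "exit status $1: killed by signal $sig" ;;
  esac
}

describe_status 139   # prints "exit status 139: killed by signal 11 (SIGSEGV)"
```
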

kaedenwile commented 2 years ago

+1 Also seeing this on our yarn eslint step. Occurs on ubuntu-20.04 and ubuntu-22.04.

I used mxschmitt/action-tmate@v3 to ssh into the box (after a segmentation fault had occurred) and manually ran yarn eslint. I ran the command multiple times, one after the other, without changing any code. All runs within the first minute or so failed with a segmentation fault. After that first minute, running yarn eslint succeeded and did not segfault.

I'm assuming some race condition resolves???
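
In case it helps anyone pin down the timing, here is a rough sketch of how the repeated runs could be recorded instead of done by hand. `repeat_and_log` is a hypothetical helper. It runs a command a fixed number of times and prints the elapsed seconds and exit status of each attempt, assuming a `date` that supports `+%s`.

```shell
# Sketch: run a command repeatedly and log when it starts succeeding.
# Usage: repeat_and_log <attempts> <pause-seconds> <command...>
# The body runs in a subshell so its variables do not leak into the caller.
repeat_and_log() (
  attempts=$1
  pause=$2
  shift 2
  start=$(date +%s)
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Capture the exit status without aborting on failure.
    if "$@" >/dev/null 2>&1; then status=0; else status=$?; fi
    elapsed=$(( $(date +%s) - start ))
    echo "attempt=$i elapsed=${elapsed}s status=$status"
    i=$(( i + 1 ))
    if [ "$i" -le "$attempts" ]; then
      sleep "$pause"
    fi
  done
)
```

Something like `repeat_and_log 12 10 yarn eslint .` would then show whether status 139 really stops appearing after roughly the first minute.
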

CharlieGreenman commented 9 months ago

For anyone else who comes across this error: we had something similar. In our case the action runner had been upgraded to Node 20, but the packages the runner was using were still on an earlier Node version. That mismatch caused a segmentation fault because the packages were incompatible. Once everything is aligned on the same version, the problem should go away.
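
A small guard step can make this kind of mismatch fail loudly before the real work runs. The sketch below is only an illustration: the function names are hypothetical, and the expected major version is whatever your packages were installed or built against.

```shell
# Extract the major version from a Node version string, e.g. "v20.11.1" -> "20".
node_major() {
  v=${1#v}           # drop the leading "v" if present
  echo "${v%%.*}"    # keep everything before the first dot
}

# Compare an expected major version with an actual version string.
# Usage: check_node_major <expected-major> <actual-version>
check_node_major() {
  actual=$(node_major "$2")
  if [ "$actual" = "$1" ]; then
    echo "ok: Node major $actual matches"
  else
    echo "mismatch: expected Node $1, found $actual"
    return 1
  fi
}
```

In a workflow step this could be called as `check_node_major 20 "$(node --version)"`, which returns a non-zero status when the runner's Node version differs from the expected one.
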