Maven Connection timed out

BrightRan commented 4 years ago

Associated community ticket: https://github.community/t/maven-connection-timed-out/129040

Recently the customer has been receiving many Connection timed out errors in CI builds when Maven pulls dependencies from https://repo.maven.apache.org and similar public repositories. He has never experienced this error locally, even after purging the local .m2/repository. The customer is using GitHub-hosted runners in his CI workflow.

andy-mishechkin commented 4 years ago

Hello, @BrightRan. Would you please provide a link to the workflow where you got the connection timed out errors? Could you also clarify when these errors occurred? These details will help us investigate the issue. Thank you.

Ginxo commented 4 years ago

Hi, I'm facing the same issue in several projects. Here is an example: https://github.com/kiegroup/appformer/pull/1037/checks?sha=29e3f45162a5e1310d38ec1982adbe0102378662 (I've attached the log, in case it gets removed, and the pull_request.yml file: appformer_log_and_flow.zip)

We faced the same problem in the past in our internal CI environment (running on Jenkins). It seems Maven Central bans or holds requests when there are too many of them; we solved the problem by adding a Nexus proxy mirroring Maven Central. Do you think it's the same here? Do you (GitHub) have any kind of internal proxy to deal with it?
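For readers who want to try the mirror approach on a hosted runner, here is a minimal sketch of a workflow step that points Maven at a proxy of Central. The mirror id `central-mirror` and the URL `https://nexus.example.com/repository/maven-central/` are placeholders (assumptions, not values from this thread), and the step overwrites any existing `~/.m2/settings.xml`, so it is only a starting point.

```yaml
      - name: Point Maven at a mirror of Central
        shell: bash
        run: |
          # Hypothetical mirror id and URL; replace them with your own proxy.
          # Note: this overwrites any settings.xml written by earlier steps.
          mkdir -p ~/.m2
          cat > ~/.m2/settings.xml <<'EOF'
          <settings>
            <mirrors>
              <mirror>
                <id>central-mirror</id>
                <mirrorOf>central</mirrorOf>
                <url>https://nexus.example.com/repository/maven-central/</url>
              </mirror>
            </mirrors>
          </settings>
          EOF
```

With `<mirrorOf>central</mirrorOf>`, requests that would go to the repository with id `central` are routed through the mirror instead.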

Ginxo commented 4 years ago

Just to give you more info: the same job works well on a self-hosted runner (my laptop); this applies to java8 only, since the java11 failure is a known issue unrelated to this. https://github.com/kiegroup/appformer/pull/1037/checks?check_run_id=1040516642 So it seems there's some GitHub stuff in the middle :thinking:

Darleev commented 4 years ago

Hello @Ginxo. How often do you see this problem? Does it appear only for a particular package? I believe the issue could be raised directly with Maven Central Community Support, but we need to clarify the details above first.

We are looking forward to your reply.

Ginxo commented 4 years ago

Hi @Darleev, we have several cases here:

- appformer case: it's very consistent and happens at the same point and for the same artifact every time the job runs. It can't get the org.codehaus.plexus:plexus-interpolation:jar:1.25 artifact while running org.apache.maven.plugins:maven-war-plugin:3.2.2:war
  - you can take any Maven Build (8) from here as an example: https://github.com/kiegroup/appformer/pull/1037/checks?check_run_id=1050196122
- kie-wb-common case: the failing artifact is not consistent, but the job fails every time it runs
  - Failed to collect dependencies at org.seleniumhq.selenium:selenium-java:jar:3.13.0: Failed to read artifact descriptor for org.seleniumhq.selenium:selenium-java:jar:3.13.0: Could not transfer artifact org.seleniumhq.selenium:selenium-java:pom:3.13.0 from/to central
  - Failed to collect dependencies at commons-jxpath:commons-jxpath:jar:1.3: Failed to read artifact descriptor for commons-jxpath:commons-jxpath:jar:1.3: Could not transfer artifact commons-jxpath:commons-jxpath:pom:1.3 from/to central
  - you can take any Maven Build (8) from here: https://github.com/kiegroup/kie-wb-common/pull/3401/checks?check_run_id=1050198863
- drools case: this is even less consistent than the kie-wb-common case. It works most of the time but fails sometimes
  - Failed to collect dependencies at org.asciidoctor:asciidoctorj:jar:2.2.0 -> org.jruby:jruby:jar:9.2.9.0 -> org.jruby:jruby-core:jar:9.2.9.0 -> org.jruby.joni:joni:jar:2.1.30: Failed to read artifact descriptor for org.jruby.joni:joni:jar:2.1.30: Could not transfer artifact org.jruby.joni:joni:pom:2.1.30 from/to central
  - Failed to read artifact descriptor for org.antlr:antlr4-maven-plugin:jar:4.8: Could not transfer artifact org.antlr:antlr4-maven-plugin:pom:4.8 from/to central
  - you can take any Maven Build (8) from here: https://github.com/kiegroup/drools/pull/3063

@Darleev Except for the appformer case, which is very consistent, it's not happening for the same artifact, and as I told you, I tried a self-hosted runner and couldn't reproduce the error, so it makes me wonder whether you have some kind of proxy configuration on your side. @Darleev thanks for the support, this is blocking our CI. @mareknovotny ^

Ginxo commented 4 years ago

Just to give you more information: I ran the build inside a docker container from a GitHub action using build-chain@openjdk8 and it works: https://github.com/kiegroup/appformer/pull/1037/checks?check_run_id=1055423092 Every step I take points to something "weird" on your (GitHub) side.

Darleev commented 4 years ago

@Ginxo thank you for the information provided. We are in the midst of the investigation. I will keep you informed.

Darleev commented 4 years ago

@Ginxo let me clarify some details to speed up the investigation: 1) Is it possible to implement retry logic in your pipeline for the Maven build operation? 2) How can we reproduce the issue on our side? Could you provide step-by-step instructions or a repository with a code sample where the issue occurs?

Since everything works fine in docker, I believe a network issue on the agent machines is not the cause here.
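As an illustration of the retry logic asked about in question 1, a workflow step could wrap the Maven call in a small shell loop. This is only a sketch under assumptions: it uses a plain `mvn` step with a shortened command rather than the build-chain action used in this thread, and the attempt count (3) and the pause (30 seconds) are arbitrary.

```yaml
      - name: Maven build with retries
        shell: bash
        run: |
          # Hypothetical retry wrapper: run the build up to 3 times,
          # stopping at the first success.
          for attempt in 1 2 3; do
            mvn -e -nsu install && exit 0
            echo "Maven attempt ${attempt} failed"
            sleep 30
          done
          # All attempts failed: fail the step.
          exit 1
```

A wrapper like this only helps with intermittent failures; a build that fails at the same artifact on every run will simply fail three times.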

Ginxo commented 4 years ago

> @Ginxo let me clarify some details to speed up the investigation:
>
> 1) Is it possible to implement retry logic in your pipeline for the Maven build operation?
>
> 2) How can we reproduce the issue on our side? Could you provide step-by-step instructions or a repository with a code sample where the issue occurs?
>
> Since everything works fine in docker, I believe a network issue on the agent machines is not the cause here.

Hi @Darleev, replying to your questions:

1) Yes it is, but I'm afraid it does not make sense, since the appformer one is consistently failing.
2) Do this:

- Fork https://github.com/kiegroup/appformer
- In your forked project, add this action in a file called pull_request.yml:
```
name: Build Chain

on: [pull_request]

jobs:
  build-chain:
    strategy:
      matrix:
        java-version: [8]
      fail-fast: false
    runs-on: ubuntu-latest
    name: Maven Build
    steps:
      - name: Set up JDK
        uses: actions/setup-java@v1
        with:
          java-version: ${{ matrix.java-version }}
      - name: Build Chain ${{ matrix.java-version }}
        id: build-chain
        uses: kiegroup/github-action-build-chain@v1.4
        with:
          build-command: 'mvn -e -nsu -Dfull -Pwildfly install -Prun-code-coverage -Dcontainer.profile=wildfly -Dcontainer=wildfly -Dintegration-tests=true -Dmaven.test.failure.ignore=true'
          workflow-file-name: "pull_request.yml"
        env:
          GITHUB_TOKEN: "${{ secrets.GITHUB_TOKEN }}"
```
- Create a PR over this forked project
- Wait for job completion/failure

In case you want to try the docker example, this would be the action:

name: Build Chain

on: [pull_request]

jobs:
  build-chain:
    runs-on: ubuntu-latest
    name: Maven Build (8)
    steps:
      - name: Build Chain
        id: build-chain
        uses: kiegroup/github-action-build-chain@openjdk8
        with:
          build-command: 'mvn -e -nsu -Dfull -Pwildfly install -Prun-code-coverage  -Dcontainer.profile=wildfly -Dcontainer=wildfly -Dintegration-tests=true -Dmaven.test.failure.ignore=true'
          workflow-file-name: "pull_request.yml"
        env:
          GITHUB_TOKEN: "${{ secrets.GITHUB_TOKEN }}"

Let me know if you need anything else. Cheers, Kike.

LeonidLapshin commented 4 years ago

Hi, @Ginxo! Thank you for the detailed build instructions. I tried to reproduce the problem (network timeouts) with Appformer (where the problem was persistent), but had no luck: the build stage is successful :( Link to the successful builds (openjdk-8 only) in a forked repo: https://github.com/LeonidLapshin/appformer/actions

Steps used:

1) Forked the Appformer repo as well as some others that are involved in the build process:

- droolsjbpm-build-bootstrap
- kie-soup
- lienzo-tests
- lienzo-core

2) Made a PR with the provided workflow (the one without docker):
```
name: Build Chain
on: [push]
jobs:
  build-chain:
    strategy:
      matrix:
        java-version: [8]
      fail-fast: false
    runs-on: ubuntu-latest
    name: Maven Build
    steps:
      - name: Set up JDK
        uses: actions/setup-java@v1
        with:
          java-version: ${{ matrix.java-version }}
      - name: Build Chain ${{ matrix.java-version }}
        id: build-chain
        uses: kiegroup/github-action-build-chain@v1.4
        with:
          build-command: 'mvn -e -nsu -Dfull -Pwildfly install -Prun-code-coverage -Dcontainer.profile=wildfly -Dcontainer=wildfly -Dintegration-tests=true -Dmaven.test.failure.ignore=true'
          workflow-file-name: "pull_request.yml"
        env:
          GITHUB_TOKEN: "${{ secrets.GITHUB_TOKEN }}"
```
3) The build stage for openjdk-8 completed successfully 3 times in a row. openjdk-11 failed, but as I understand it that is a separate issue; please correct me if I am wrong.

I guess it is possible that the successful docker builds happened at a moment when the regular build would have succeeded too (just a hypothesis). It seems that the network problem was temporary, or that it happens under specific conditions which are unknown. Could you please try building Appformer once again with the GitHub Actions ubuntu-latest image (without docker) to determine whether the problem is still there? Thank you!

Ginxo commented 4 years ago

I have created a new PR on my own, let's see how it goes https://github.com/LeonidLapshin/appformer/pull/4

Ginxo commented 4 years ago

So that one (https://github.com/LeonidLapshin/appformer/pull/4) works, and suddenly this one (which was consistently failing) also works: https://github.com/kiegroup/appformer/pull/1037/checks?check_run_id=1070610025 Now the question is: what was failing randomly before? Can we trust the GitHub runner? Thanks guys @LeonidLapshin @Darleev

Ginxo commented 4 years ago

This case (same flow, different project) persists https://github.com/kiegroup/drools/pull/3063/checks?check_run_id=1071097964

LeonidLapshin commented 4 years ago

Hey, @Ginxo! I did some research into the problem, and it seems to be more Azure-related than GitHub-related. I guess that the root cause of these read timeouts is the SNAT behavior for network connections on Azure.

Maven creates long-lived connections, and if they are idle for more than 4 minutes (while Maven is busy with something else) they are flushed from the Azure VM balancer's SNAT. However, no RST packet is sent to Maven (on the VM side) or to the remote host (the package source), so the socket stays open but no data is sent over it.

A few assumptions behind that error:

- Maven keeps connections in a pool
- Connections are sometimes idle for more than 4 minutes
- Maven doesn't implement application-layer health checks, so no data is sent over the open connection (not sure)
- The Azure balancer's SNAT flushes connections that are idle for more than 4 minutes and does not send RST
- The socket is open, but no data comes from the destination
- Maven throws an error because no data arrives

You can use a workaround: please add -Dhttp.keepAlive=false -Dmaven.wagon.http.pool=false -Dmaven.wagon.httpconnectionManager.ttlSeconds=120 to your build command. It will force Maven to create new TCP connections from scratch every 2 min (against the 4 min threshold).

For now I hope that you can try these Maven flags. They will slow the build process down, but the time spent on recreating TCP connections will be tiny (less than 1% of total build time, I guess).
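As a sketch of how this could look in the workflow from this thread, the three flags can be appended to the existing `build-command` value. Only the flags are new here; the rest of the step is copied from the workflow posted earlier, and it is an assumption that the build-chain action passes the command through to Maven unchanged.

```yaml
      - name: Build Chain
        id: build-chain
        uses: kiegroup/github-action-build-chain@v1.4
        with:
          # Original build command with the three workaround flags appended at the end.
          build-command: 'mvn -e -nsu -Dfull -Pwildfly install -Prun-code-coverage -Dcontainer.profile=wildfly -Dcontainer=wildfly -Dintegration-tests=true -Dmaven.test.failure.ignore=true -Dhttp.keepAlive=false -Dmaven.wagon.http.pool=false -Dmaven.wagon.httpconnectionManager.ttlSeconds=120'
          workflow-file-name: "pull_request.yml"
        env:
          GITHUB_TOKEN: "${{ secrets.GITHUB_TOKEN }}"
```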

The SNAT feature (I guess this feature is absent on current VMs): https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-tcp-reset

In future we'll discuss the SNAT properties with the team and will try to implement this feature. Thank you!

Ginxo commented 4 years ago

Thanks @LeonidLapshin, https://issues.redhat.com/browse/BXMSPROD-996 has been created to test it. I'll let you know as soon as we test it.

Ginxo commented 4 years ago

@LeonidLapshin the workaround seems to be working fine.

LeonidLapshin commented 4 years ago

@Ginxo, happy to hear it, please feel free to open a new ticket if the problem persists :)

lhotari commented 4 years ago

> You can use a workaround, please add: -Dhttp.keepAlive=false -Dmaven.wagon.http.pool=false -Dmaven.wagon.httpconnectionManager.ttlSeconds=120 to your build command, it will force Maven to create new TCP connections from scratch every 2 min (with 4 min threshold).

@LeonidLapshin It might not be necessary to disable the maven connection pool completely. Setting -Dmaven.wagon.httpconnectionManager.ttlSeconds=120 should be sufficient. If you disable the pool or keep alive, that setting will have no effect.

Apache Pulsar now uses https://github.com/apache/pulsar/blob/c5705f247f865b1a24f1309c22dc2d08fbba966a/.github/workflows/ci-unit.yaml#L29-L30

env:
  MAVEN_OPTS: -Dmaven.wagon.httpconnectionManager.ttlSeconds=25 -Dmaven.wagon.http.retryHandler.count=3

This seemed to resolve most of the Maven connection timeout / reset issues.

However I just now noticed yet another connection reset when using MAVEN_OPTS=-Dmaven.wagon.httpconnectionManager.ttlSeconds=25 -Dmaven.wagon.http.retryHandler.count=3 https://github.com/apache/pulsar/runs/1325014611?check_suite_focus=true

Perhaps the recommendation should be -Dhttp.keepAlive=false -Dmaven.wagon.http.pool=false without specifying the -Dmaven.wagon.httpconnectionManager.ttlSeconds=120.

jeacott1 commented 2 years ago

@Ginxo this issue is not resolved for me. I have a large, long-running build with hundreds of modules. The suggested workaround does not work for me; indeed, it has become noticeably worse in the past few months.

Ginxo commented 2 years ago

@jeacott1 could you please share your job URL, or paste your job content? I have had this working for almost two years for really huge Maven builds, so maybe I can help :thinking:

jeacott1 commented 2 years ago

@Ginxo it's a private repo. Will a job URL help? Also, I'm running into GHA stalling for hours like this:

2022-05-12T05:04:56.2647854Z Waiting for a runner to pick up this job...
2022-05-12T05:04:56.9134609Z Job is waiting for a hosted runner to come online.
2022-05-12T05:04:58.9053850Z Job is about to start running on the hosted runner: Hosted Agent (hosted)

Checking a failed run from earlier today that I had to kill. Note the 2.5 hours between the last Maven log line and me killing it:

2022-05-12T02:38:22.0146228Z Downloading from central: https://repo1.maven.org/maven2/org/eclipse/platform/org.eclipse.core.contenttype/maven-metadata.xml
2022-05-12T02:38:22.0159911Z Downloading from nexus-3rd-party: http://172.22.22.9/nexus/content/repositories/thirdparty/org/eclipse/platform/org.eclipse.core.contenttype/maven-metadata.xml
2022-05-12T02:38:23.5242763Z Progress (1): 818 B
2022-05-12T05:04:50.1727931Z ***[error]The operation was canceled.

Ginxo commented 2 years ago

@jeacott1 you can share the GHA workflow YAML content. Anyway, the logs you shared do not seem to be related to the Maven timeout issue but to the job waiting for an available runner. This could be because you have already consumed your GHA quota. In that case you can always increase the quota or use your own runners, see https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners but this is a different topic.

jeacott1 commented 2 years ago

@Ginxo check above (sorry I edited it).

The config file is large, but this is the crux of it at the failure point (always in the mvn deploy ... ). FYI, this config used to work just fine; lately it doesn't work at all. I used to run it with mvn -T 8, but that seems to make things much worse. With -T 8 it used to take just 30 minutes to run.

on: 
  pull_request:
      types: [opened, synchronize, reopened]
  push:
    branches:
      - main

env:
    MAVEN_OPTS: >-
        -Dhttp.keepAlive=false
        -Dmaven.wagon.http.pool=false
        -Dmaven.wagon.httpconnectionManager.ttlSeconds=120
jobs:
  build:
  ...
    - name: Set up JDK 8
      uses: actions/setup-java@v2
      with:
        distribution: 'temurin'
        java-version: 8
        # cache: 'maven'

    - name: Get version
      id: get_version
      run: |
          VERSION=$( mvn -pl :generator -P !include-agg help:evaluate -Dexpression=project.version -Dbuild.num=${{steps.deploy_qual.outputs.value}} -q -DforceStdout --file ls-mvn/pom.xml )
          echo "::set-output name=version::$VERSION"

    - name: Get version coords
      id: get-coords
      run: |
          VERSION=$(bash -l -c 's="${{ steps.get_version.outputs.version }}"; echo ${s%-*}')
          echo "::set-output name=version::$VERSION"

    - name: Build models
      if:  (steps.deploy_qual.outputs.build == 'true')
      run: mvn -am -P generate-build -pl ":generator" -Dbuild.num=${{ steps.deploy_qual.outputs.value }}  compile  --file ls-mvn/pom.xml

    - name: Build with Maven
      if:  (steps.deploy_qual.outputs.build == 'true')
      run: mvn -T 1 deploy -Ddeploy.skip=false -Dmaven.install.skip=true -DfailIfNoTests=false  -Dbuild.num=${{ steps.deploy_qual.outputs.value }} --file ls-mvn/pom.xml

Ginxo commented 2 years ago

@jeacott1 thanks for sharing. I would say your error is not related to this topic. I suggest you open a new query/request/issue with GitHub support.

jeacott1 commented 2 years ago

FWIW, removing the other suggested options and just setting -Dmaven.wagon.httpconnectionManager.ttlSeconds=60 has largely fixed my issue. The other options just break the thing altogether. @lhotari was right here, I think: -Dhttp.keepAlive=false -Dmaven.wagon.http.pool=false aren't useful, and in my case make things worse.
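For reference, the reduced configuration described here would look roughly like this at workflow level. This is a sketch based on the `env` block posted earlier in the thread; 60 is the value that worked for this build, not a general recommendation.

```yaml
env:
  # Only the connection TTL is set; keep-alive and pooling are left at their defaults.
  # 60 seconds stays well below the 4-minute idle threshold discussed earlier.
  MAVEN_OPTS: -Dmaven.wagon.httpconnectionManager.ttlSeconds=60
```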

actions / runner-images

Maven Connection timed out #1499