GoogleCloudPlatform / artifact-registry-maven-tools

Apache License 2.0
22 stars 23 forks source link

maven wagon does not retry on network errors #49

Closed JohanLorenzo closed 2 years ago

JohanLorenzo commented 3 years ago

Hi there!

I stumbled upon this error:

[ERROR] Failed to execute goal on project some-maven-module: Could not resolve dependencies for project some.groupId:some-maven-module:war:1.0.0-SNAPSHOT: Could not transfer artifact some.groupId:some.artifactId:jar:1.0.0-20210630.130551-52 from/to artifact-registry (artifactregistry://europe-west1-maven.pkg.dev/some-gcp-project/some-maven-registry): Failed to send request to remote server.: Read timed out -> [Help 1]

The root cause is a network error. I think the maven wagon would benefit from an exponential backoff[1]. I didn't anything related in the wagon except retries are supported[2]. I don't know what this means in the context of maven.

Full stacktace:

[INFO] ArtifactRegistry Maven Wagon: Retrieving credentials...
[INFO] Trying Application Default Credentials...
[INFO] Using Application Default Credentials.
[INFO] ArtifactRegistry Maven Wagon: Retrieving credentials...
[INFO] Trying Application Default Credentials...
[INFO] Using Application Default Credentials.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:10 min (Wall Clock)
[INFO] Finished at: 2021-07-01T09:55:51Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project some-maven-module: Could not resolve dependencies for project some.groupId:some-maven-module:war:1.0.0-SNAPSHOT: Could not transfer artifact some.groupId:some.artifactId:jar:1.0.0-20210630.130551-52 from/to artifact-registry (artifactregistry://europe-west1-maven.pkg.dev/some-gcp-project/some-maven-registry): Failed to send request to remote server.: Read timed out -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal on project some-maven-module: Could not resolve dependencies for project some.groupId:some-maven-module:war:1.0.0-SNAPSHOT: Could not transfer artifact some.groupId:some.artifactId:jar:1.0.0-20210630.130551-52 from/to artifact-registry (artifactregistry://europe-west1-maven.pkg.dev/some-gcp-project/some-maven-registry): Failed to send request to remote server.
    at org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.getDependencies (LifecycleDependencyResolver.java:269)
    at org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.resolveProjectDependencies (LifecycleDependencyResolver.java:147)
    at org.apache.maven.lifecycle.internal.MojoExecutor.ensureDependenciesAreResolved (MojoExecutor.java:248)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:202)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:190)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:186)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:515)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
    at java.lang.Thread.run (Thread.java:829)
Caused by: org.apache.maven.project.DependencyResolutionException: Could not resolve dependencies for project some.groupId:some-maven-module:war:1.0.0-SNAPSHOT: Could not transfer artifact some.groupId:some.artifactId:jar:1.0.0-20210630.130551-52 from/to artifact-registry (artifactregistry://europe-west1-maven.pkg.dev/some-gcp-project/some-maven-registry): Failed to send request to remote server.
    at org.apache.maven.project.DefaultProjectDependenciesResolver.resolve (DefaultProjectDependenciesResolver.java:209)
    at org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.getDependencies (LifecycleDependencyResolver.java:243)
    at org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.resolveProjectDependencies (LifecycleDependencyResolver.java:147)
    at org.apache.maven.lifecycle.internal.MojoExecutor.ensureDependenciesAreResolved (MojoExecutor.java:248)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:202)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:190)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:186)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:515)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
    at java.lang.Thread.run (Thread.java:829)
Caused by: org.eclipse.aether.resolution.DependencyResolutionException: Could not transfer artifact some.groupId:some.artifactId:jar:1.0.0-20210630.130551-52 from/to artifact-registry (artifactregistry://europe-west1-maven.pkg.dev/some-gcp-project/some-maven-registry): Failed to send request to remote server.
    at org.eclipse.aether.internal.impl.DefaultRepositorySystem.resolveDependencies (DefaultRepositorySystem.java:357)
    at org.apache.maven.project.DefaultProjectDependenciesResolver.resolve (DefaultProjectDependenciesResolver.java:202)
    at org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.getDependencies (LifecycleDependencyResolver.java:243)
    at org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.resolveProjectDependencies (LifecycleDependencyResolver.java:147)
    at org.apache.maven.lifecycle.internal.MojoExecutor.ensureDependenciesAreResolved (MojoExecutor.java:248)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:202)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:190)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:186)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:515)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
    at java.lang.Thread.run (Thread.java:829)
Caused by: org.eclipse.aether.resolution.ArtifactResolutionException: Could not transfer artifact some.groupId:some.artifactId:jar:1.0.0-20210630.130551-52 from/to artifact-registry (artifactregistry://europe-west1-maven.pkg.dev/some-gcp-project/some-maven-registry): Failed to send request to remote server.
    at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolve (DefaultArtifactResolver.java:424)
    at org.eclipse.aether.internal.impl.DefaultArtifactResolver.resolveArtifacts (DefaultArtifactResolver.java:229)
    at org.eclipse.aether.internal.impl.DefaultRepositorySystem.resolveDependencies (DefaultRepositorySystem.java:340)
    at org.apache.maven.project.DefaultProjectDependenciesResolver.resolve (DefaultProjectDependenciesResolver.java:202)
    at org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.getDependencies (LifecycleDependencyResolver.java:243)
    at org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.resolveProjectDependencies (LifecycleDependencyResolver.java:147)
    at org.apache.maven.lifecycle.internal.MojoExecutor.ensureDependenciesAreResolved (MojoExecutor.java:248)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:202)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:190)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:186)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:515)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
    at java.lang.Thread.run (Thread.java:829)
Caused by: org.eclipse.aether.transfer.ArtifactTransferException: Could not transfer artifact some.groupId:some.artifactId:jar:1.0.0-20210630.130551-52 from/to artifact-registry (artifactregistry://europe-west1-maven.pkg.dev/some-gcp-project/some-maven-registry): Failed to send request to remote server.
    at org.eclipse.aether.connector.basic.ArtifactTransportListener.transferFailed (ArtifactTransportListener.java:52)
    at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run (BasicRepositoryConnector.java:369)
    at org.eclipse.aether.util.concurrency.RunnableErrorForwarder$1.run (RunnableErrorForwarder.java:75)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
    at java.lang.Thread.run (Thread.java:829)
Caused by: org.apache.maven.wagon.TransferFailedException: Failed to send request to remote server.
    at com.google.cloud.artifactregistry.wagon.ArtifactRegistryWagon.getInputStream (ArtifactRegistryWagon.java:71)
    at com.google.cloud.artifactregistry.wagon.ArtifactRegistryWagon.getIfNewer (ArtifactRegistryWagon.java:126)
    at com.google.cloud.artifactregistry.wagon.ArtifactRegistryWagon.get (ArtifactRegistryWagon.java:98)
    at org.eclipse.aether.transport.wagon.WagonTransporter$GetTaskRunner.run (WagonTransporter.java:567)
    at org.eclipse.aether.transport.wagon.WagonTransporter.execute (WagonTransporter.java:435)
    at org.eclipse.aether.transport.wagon.WagonTransporter.get (WagonTransporter.java:412)
    at org.eclipse.aether.connector.basic.BasicRepositoryConnector$GetTaskRunner.runTask (BasicRepositoryConnector.java:457)
    at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run (BasicRepositoryConnector.java:364)
    at org.eclipse.aether.util.concurrency.RunnableErrorForwarder$1.run (RunnableErrorForwarder.java:75)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
    at java.lang.Thread.run (Thread.java:829)
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0 (Native Method)
    at java.net.SocketInputStream.socketRead (SocketInputStream.java:115)
    at java.net.SocketInputStream.read (SocketInputStream.java:168)
    at java.net.SocketInputStream.read (SocketInputStream.java:140)
    at sun.security.ssl.SSLSocketInputRecord.read (SSLSocketInputRecord.java:478)
    at sun.security.ssl.SSLSocketInputRecord.readHeader (SSLSocketInputRecord.java:472)
    at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket (SSLSocketInputRecord.java:70)
    at sun.security.ssl.SSLSocketImpl.readApplicationRecord (SSLSocketImpl.java:1364)
    at sun.security.ssl.SSLSocketImpl$AppInputStream.read (SSLSocketImpl.java:973)
    at java.io.BufferedInputStream.fill (BufferedInputStream.java:252)
    at java.io.BufferedInputStream.read1 (BufferedInputStream.java:292)
    at java.io.BufferedInputStream.read (BufferedInputStream.java:351)
    at sun.net.www.http.HttpClient.parseHTTPHeader (HttpClient.java:754)
    at sun.net.www.http.HttpClient.parseHTTP (HttpClient.java:689)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0 (HttpURLConnection.java:1615)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream (HttpURLConnection.java:1520)
    at java.net.HttpURLConnection.getResponseCode (HttpURLConnection.java:527)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode (HttpsURLConnectionImpl.java:334)
    at com.google.api.client.http.javanet.NetHttpResponse.<init> (NetHttpResponse.java:36)
    at com.google.api.client.http.javanet.NetHttpRequest.execute (NetHttpRequest.java:144)
    at com.google.api.client.http.javanet.NetHttpRequest.execute (NetHttpRequest.java:79)
    at com.google.api.client.http.HttpRequest.execute (HttpRequest.java:995)
    at com.google.cloud.artifactregistry.wagon.ArtifactRegistryWagon.getInputStream (ArtifactRegistryWagon.java:64)
    at com.google.cloud.artifactregistry.wagon.ArtifactRegistryWagon.getIfNewer (ArtifactRegistryWagon.java:126)
    at com.google.cloud.artifactregistry.wagon.ArtifactRegistryWagon.get (ArtifactRegistryWagon.java:98)
    at org.eclipse.aether.transport.wagon.WagonTransporter$GetTaskRunner.run (WagonTransporter.java:567)
    at org.eclipse.aether.transport.wagon.WagonTransporter.execute (WagonTransporter.java:435)
    at org.eclipse.aether.transport.wagon.WagonTransporter.get (WagonTransporter.java:412)
    at org.eclipse.aether.connector.basic.BasicRepositoryConnector$GetTaskRunner.runTask (BasicRepositoryConnector.java:457)
    at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run (BasicRepositoryConnector.java:364)
    at org.eclipse.aether.util.concurrency.RunnableErrorForwarder$1.run (RunnableErrorForwarder.java:75)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:628)
    at java.lang.Thread.run (Thread.java:829)
[ERROR] 
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException

[1] https://cloud.google.com/iot/docs/how-tos/exponential-backoff [2] https://github.com/GoogleCloudPlatform/artifact-registry-maven-tools/blob/c959d88dda3014ae4fe6dfb1ca935db6c33835a9/artifactregistry-maven-wagon/src/main/java/com/google/cloud/artifactregistry/wagon/ArtifactRegistryWagon.java#L160

JohanLorenzo commented 3 years ago

What do you think @Gridasov @lindsayismith?

lindsayismith commented 3 years ago

I agree in principle but I don't know if wagons are supposed to do this - like does Maven already do any retrying at all at the layer above? Or are we supposed to do that ourselves here?

JohanLorenzo commented 3 years ago

Hi @lindsayismith!

Thank you very much for your rapid answer! These are very good questions πŸ‘ I confess I hadn't thought of that when I filed this ticket πŸ˜…

For the context, I'm no maven expert. Everything I'm going to say comes from researches I did this morning. I might misunderstand some things and I'm more than happy to discuss them πŸ™‚

TL;DR: I believe it should happen in wagon because the Artifact Registry Wagon depends on AbstractWagon[1].

There is a retry mechanism in the maven resolver[2]. That said, it's only triggered when a checksum error happens. It kind of makes sense to me because the resolver doesn't know which wagon it used, at this stage. So, if it used a wagon that doesn't open any network connection (like MemWagon[3], if I'm not mistaken), then there would be no point in catching TransferFailedException.

In order to validate my understanding, I took a look at the HttpWagon. TransferFailedException is raised at many places, including when something goes wrong on the socket level[4]. The good news is: HttpWagon depends on AbstractHttpClientWagon[5], which does provide a retry mechanism[6], which can be configured via system properties[7].

Hence, I think the Artifact Registry Wagon should:

  1. either implement its own retry mechanism, just like Maven's HttpWagon does
  2. or inherit from AbstractHttpClientWagon to get this feature "for free"

I get the feeling option 2 wouldn't be trivial. Does option 1 sound reasonable to you, @lindsayismith?

[1] https://github.com/GoogleCloudPlatform/artifact-registry-maven-tools/blob/d419bd303fce4da79bd193c840caa899845b9f90/artifactregistry-maven-wagon/src/main/java/com/google/cloud/artifactregistry/wagon/ArtifactRegistryWagon.java#L50 [2] https://github.com/apache/maven-resolver/blob/17ea285d40e8cdd183d45b074f74aae6d9f3d3fc/maven-resolver-connector-basic/src/main/java/org/eclipse/aether/connector/basic/BasicRepositoryConnector.java#L466 [3] https://github.com/apache/maven-resolver/blob/17ea285d40e8cdd183d45b074f74aae6d9f3d3fc/maven-resolver-transport-wagon/src/test/java/org/eclipse/aether/transport/wagon/MemWagon.java#L45 [4] https://github.com/apache/maven-wagon/blob/ab39b7209f121fd4c4025447efb90d39f0ce89f6/wagon-providers/wagon-http/src/main/java/org/apache/maven/wagon/providers/http/HttpWagon.java#L125 [5] https://github.com/apache/maven-wagon/blob/ab39b7209f121fd4c4025447efb90d39f0ce89f6/wagon-providers/wagon-http/src/main/java/org/apache/maven/wagon/providers/http/HttpWagon.java#L47 [6] https://github.com/apache/maven-wagon/blob/ab39b7209f121fd4c4025447efb90d39f0ce89f6/wagon-providers/wagon-http-shared/src/main/java/org/apache/maven/wagon/shared/http/AbstractHttpClientWagon.java#L446 [7] https://maven.apache.org/wagon/wagon-providers/wagon-http/ https://github.com/apache/maven-wagon/blob/ab39b7209f121fd4c4025447efb90d39f0ce89f6/wagon-providers/wagon-http/src/site/apt/index.apt#L63

yihanzhen commented 2 years ago

According to https://stackoverflow.com/questions/14399619/maven-build-failure-read-time-out it seems Read timed out happens when maven couldn't download the artifact due to network issues. @JohanLorenzo would you mind sharing the artifact name you were trying to upload (it seems these were obfuscated in the logs you provided) and possibly the dependency tree of it? I wanted to check what caused the time out first before we decide to add retries. For example, if the time out is indeed a 504, then adding retries most likely won't work.

Thanks!

JohanLorenzo commented 2 years ago

Hello @yihanzhen!

Thank you for your answer and your investigation! πŸ˜ƒ

I'm sorry, I don't have the logs anymore. We have a 30-day-retention-policy. I'll have to reproduce the problem to get you some more. That said, I can say it was an intermittent bug (I'm sorry, I should have said it explicitly in the first message). I remember rerunning the job and it turned green.

I'd love to provide more context about my request πŸ™‚ I've worked on CI systems at scale for the past 6 years. My previous company had tens of thousands jobs running at the same time and requesting the same HTTP resources. At this scale, network errors become visible πŸ˜…, even though we're technically in the same datacenter, with a network config that is meant to let these machines constantly talk to each other. The approach we used was to retry on HTTP 5XX errors. It saved us many hours of human work. Before then, someone had to manually rerun failed jobs. Of course, jobs did fail when long outages occurred. But at least, it covered us whenever small blips happened.

So, in my opinion, network issues are something we have to live with because the root cause may be at many different places, many of which we don't have any control on (e.g.: the underlying OS or one of routers along the way). Retries are a way circumvent this class of bugs. I personally think 504 errors are worth retrying too because we don't know what caused this instance of gateway timeout.

If you wish to, we can have a chat on a video call. This way, it'll be easier to explain each other's point of view πŸ™‚

yihanzhen commented 2 years ago

@JohanLorenzo Thanks for using Artifact Registry!

There's another fix for Read timed out that might have solved your problem #53, and version 2.1.2 is released with that fix. Feel free to let us know if you run a similar problem in the future.

We would like to keep the discussion on Github if possible so that everyone can contribute to and benefit from the questions (plus I'm a shy person :stuck_out_tongue: ). If you need to share private data with us feel free to email ar-contact@google.com or contact the support team.

Since no more actions are needed on this issue, I'll close it here.

JohanLorenzo commented 2 years ago

πŸ‘‹ @yihanzhen

Okay, that sounds good for now. Thanks for fixing this root cause! Worst case scenario, I'll ask to reopen this ticket if we see this error again. I'll keep you tuned.

Ojaschopade commented 1 year ago

Hello @JohanLorenzo ,

Could you please provide the flags or system properties used to retry on 5XX errors in maven? I saw some options here https://stackoverflow.com/questions/14399619/maven-build-failure-read-time-out but didnt observe any change in behavior of maven build after adding those.

Thank you in advance!

JohanLorenzo commented 1 year ago

Hi @Ojaschopade! I'm sorry, I don't have access to this infra anymore. I don't remember enough of what happened, either πŸ˜… Based on my last comment, things were likely fine after the patches. Otherwise, I would have reopened the ticket.