apache / incubator-kie-issues

drools.weekly-deploy jobs frequently fail with Request Timeout (408) #1444

Open tkobayas opened 3 weeks ago

tkobayas commented 3 weeks ago

https://ci-builds.apache.org/job/KIE/job/drools/job/main/job/other/job/drools.weekly-deploy/

07-14: SUCCESS
07-21: FAILURE
07-28: FAILURE
08-04: FAILURE
08-11: FAILURE
08-18: FAILURE

(for example)


[2024-08-18T04:42:15.583Z] [INFO] Retrying deployment attempt 5 of 5
[2024-08-18T04:44:56.049Z] [WARNING] Failed to upload checksum to org/drools/drools-tms/999-20240818-SNAPSHOT/drools-tms-999-20240818-20240818.030803-1-sources.jar.sha1
[2024-08-18T04:44:56.049Z] org.apache.http.client.HttpResponseException: status code: 408, reason phrase: Request Timeout (408)
[2024-08-18T04:44:56.049Z]     at org.eclipse.aether.transport.http.HttpTransporter.handleStatus (HttpTransporter.java:619)
[2024-08-18T04:44:56.049Z]     at org.eclipse.aether.transport.http.HttpTransporter.execute (HttpTransporter.java:488)
[2024-08-18T04:44:56.049Z]     at org.eclipse.aether.transport.http.HttpTransporter.implPut (HttpTransporter.java:469)
[2024-08-18T04:44:56.050Z]     at org.eclipse.aether.spi.connector.transport.AbstractTransporter.put (AbstractTransporter.java:107)
[2024-08-18T04:44:56.050Z]     at org.eclipse.aether.connector.basic.BasicRepositoryConnector$PutTaskRunner.uploadChecksum (BasicRepositoryConnector.java:608)
[2024-08-18T04:44:56.050Z]     at org.eclipse.aether.connector.basic.BasicRepositoryConnector$PutTaskRunner.uploadChecksums (BasicRepositoryConnector.java:591)
[2024-08-18T04:44:56.050Z]     at org.eclipse.aether.connector.basic.BasicRepositoryConnector$PutTaskRunner.runTask (BasicRepositoryConnector.java:565)
[2024-08-18T04:44:56.050Z]     at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run (BasicRepositoryConnector.java:414)
[2024-08-18T04:44:56.050Z]     at org.eclipse.aether.util.concurrency.RunnableErrorForwarder.lambda$wrap$0 (RunnableErrorForwarder.java:66)
[2024-08-18T04:44:56.050Z]     at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1136)
[2024-08-18T04:44:56.050Z]     at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:635)
[2024-08-18T04:44:56.050Z]     at java.lang.Thread.run (Thread.java:840)
...
tkobayas commented 3 weeks ago

The 10.0.x/other/drools.weekly-deploy jobs have the same issue, but let's focus on main for now.

tkobayas commented 3 weeks ago

(thought)

https://ci-builds.apache.org/job/KIE/job/drools/job/main/job/nightly/job/drools.build-and-deploy/

The nightly job also deploys. I see:

08-17: SUCCESS
08-18: FAILURE (Request Timeout (408))
08-19: SUCCESS
08-20: SUCCESS
08-21: SUCCESS

Hmm, Sunday night may be a high-load period (even within drools alone, both the nightly and the weekly jobs ran "deploy" around 4:00 AM on 08-18).

tkobayas commented 2 weeks ago

https://ci-builds.apache.org/job/KIE/job/drools/job/main/job/other/job/drools.weekly-deploy/14/

08-25: SUCCESS (65 WARNINGs in 5 attempts)

However, we still see lots of timeout WARNINGs and retries.

Also, I doubt that the configured 300 sec timeout took effect: in the log below, the 408 arrives within 120 sec of the retry.

[2024-08-25T05:24:54.309Z] [INFO] Retrying deployment attempt 4 of 5
[2024-08-25T05:26:31.776Z] [WARNING] Failed to upload checksum to org/kie/kie-core-bom/999-20240825-SNAPSHOT/kie-core-bom-999-20240825-20240825.030947-1.pom.sha1
[2024-08-25T05:26:31.776Z] org.apache.http.client.HttpResponseException: status code: 408, reason phrase: Request Timeout (408)
[2024-08-25T05:26:31.776Z]     at org.eclipse.aether.transport.http.HttpTransporter.handleStatus (HttpTransporter.java:619)
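
As a rough check of the "within 120 sec" observation, the gap between the two timestamps in the excerpt above can be computed directly. This is only a sketch: the helper name `elapsed_seconds` is hypothetical, and the 300 s figure is the configured timeout mentioned in the comment, not something read from the build configuration.

```python
from datetime import datetime

def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two Jenkins-style timestamps, e.g. 2024-08-25T05:24:54.309Z."""
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()

# Timestamps copied from the log excerpt above.
retry_started = "2024-08-25T05:24:54.309Z"  # "Retrying deployment attempt 4 of 5"
first_408 = "2024-08-25T05:26:31.776Z"      # "Failed to upload checksum ..." with 408

gap = elapsed_seconds(retry_started, first_408)
print(f"{gap:.3f} s between the retry and the 408")
print("shorter than the configured 300 s:", gap < 300)
```

The gap is an upper bound on how long the failing request was outstanding, because the request cannot have started before the retry did.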
tkobayas commented 1 week ago

09-01: SUCCESS

3 WARNINGs in the 1st attempt. 2nd attempt successful.

(Note: "Failed to upload checksum" doesn't stop the whole task. "Could not transfer artifact" stops the task and triggers a retry.)

[2024-09-01T04:17:31.407Z] [INFO] --- deploy:3.1.1:deploy (default-deploy) @ drools-distribution ---
[2024-09-01T04:26:48.086Z] [WARNING] Failed to upload checksum to org/drools/drools-examples/999-20240901-SNAPSHOT/drools-examples-999-20240901-20240901.030335-1-javadoc.jar.md5
[2024-09-01T04:26:48.086Z] org.apache.http.client.HttpResponseException: status code: 408, reason phrase: Request Timeout (408)
[2024-09-01T04:26:48.086Z]     ...
[2024-09-01T04:28:36.241Z] [WARNING] Failed to upload checksum to org/drools/kiebase-inclusion/999-20240901-SNAPSHOT/kiebase-inclusion-999-20240901-20240901.030335-1-tests.jar.md5
[2024-09-01T04:28:36.241Z] org.apache.http.client.HttpResponseException: status code: 408, reason phrase: Request Timeout (408)
[2024-09-01T04:28:36.241Z]     ...
[2024-09-01T04:46:57.694Z] [WARNING] Encountered issue during deployment: Failed to deploy artifacts: Could not transfer artifact org.drools:drools-canonical-model:jar:999-20240901-20240901.030335-1 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): status code: 408, reason phrase: Request Timeout (408)
[2024-09-01T04:46:57.694Z] [INFO] Retrying deployment attempt 2 of 5
[2024-09-01T05:20:53.254Z] [INFO] ------------------------------------------------------------------------
[2024-09-01T05:20:53.254Z] [INFO] Reactor Summary for Drools :: Parent 999-20240901-SNAPSHOT:

The change to aether.connector.basic.parallelPut=false seems to have been effective, but let's see next week.

tkobayas commented 5 days ago

On 09-05, Jan and Rodrigo manually triggered the job.

09-05 (1st): SUCCESS. No WARNING.
09-05 (2nd): Upload was successful. No WARNING. The job failed because of a duplicate tag name, which is not related to uploading.

09-08: FAILURE. The job was cancelled because it hit the job timeout (3 hours) in the middle of the 2nd upload attempt. The 1st upload attempt hit 30 WARNINGs.

[2024-09-08T03:02:07.906Z] Timeout set to expire in 3 hr 0 min
...
[2024-09-08T04:37:57.001Z] [INFO] --- deploy:3.1.1:deploy (default-deploy) @ drools-distribution ---
[2024-09-08T04:39:20.860Z] [WARNING] Failed to upload checksum to org/kie/kie-core-bom/999-20240908-SNAPSHOT/kie-core-bom-999-20240908-20240908.030343-1.pom.md5
[2024-09-08T04:39:20.860Z] org.apache.http.client.HttpResponseException: status code: 408, reason phrase: Request Timeout (408)
...
[2024-09-08T05:52:38.558Z] [INFO] Retrying deployment attempt 2 of 5
...
[2024-09-08T06:02:07.906Z] Cancelling nested steps due to timeout

With aether.connector.basic.parallelPut=false, usually one round of uploading all artifacts takes around 35 minutes (e.g. 09-01 Sunday). But on 09-08, the 1st attempt took around 75 minutes. The network was probably unusually unstable.

So far, aether.connector.basic.parallelPut=false seems to have a positive effect, but not yet perfect.
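
The time budget on 09-08 can be worked out from the timestamps in the log above. A small sketch follows; the variable names are hypothetical, and the 35-minute figure is the typical round duration quoted above rather than a measured value.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%fZ"

def minutes_between(start: str, end: str) -> float:
    """Minutes between two Jenkins-style timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60

# Timestamps copied from the 09-08 log excerpt.
job_start = "2024-09-08T03:02:07.906Z"       # "Timeout set to expire in 3 hr 0 min"
deploy_start = "2024-09-08T04:37:57.001Z"    # deploy:3.1.1:deploy begins
second_attempt = "2024-09-08T05:52:38.558Z"  # "Retrying deployment attempt 2 of 5"
job_cancelled = "2024-09-08T06:02:07.906Z"   # "Cancelling nested steps due to timeout"

budget_min = minutes_between(job_start, job_cancelled)
build_min = minutes_between(job_start, deploy_start)
first_attempt_min = minutes_between(deploy_start, second_attempt)
upload_budget_min = budget_min - build_min   # time left for all upload attempts

TYPICAL_ROUND_MIN = 35  # typical duration of one upload round, as quoted above

print(f"build before deploy: {build_min:.1f} min")
print(f"1st upload attempt:  {first_attempt_min:.1f} min")
print(f"upload budget:       {upload_budget_min:.1f} min")
print("full rounds that fit at 35 min each:", int(upload_budget_min // TYPICAL_ROUND_MIN))
print("full rounds that fit at the 09-08 pace:", int(upload_budget_min // first_attempt_min))
```

Under these numbers the build leaves roughly 84 minutes for uploading, so two typical rounds fit inside the 3-hour limit but only one round at the 09-08 pace.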

How to improve further?

A) Increase the job timeout. But note that the drools weekly deployment is a dependency of other projects' weekly deployments.

B) Disable deployAtEnd.

jstastny-cz commented 5 days ago

Please also discuss this on the mailing list. -DdeployAtEnd was a decision taken to unify how we deploy things across the KIE projects, and disabling it would again deviate from that goal. See https://lists.apache.org/thread/d6oxh6qtm6mm4hc2zv1pwcqqb2kfmv70

tkobayas commented 5 days ago

Sorry that I missed the discussion, @jstastny-cz. I won't push the "Disable deployAtEnd" option. Instead, I'll watch the timeout trend for a while.

jstastny-cz commented 5 days ago

What I don't understand is why the nightly deploy takes minutes and the weekly one takes hours. I think we can compare the maven commands used by the two and check whether they differ in any significant aspect.

tkobayas commented 1 day ago

Hi @jstastny-cz ,

nightly

mvn dependency:tree clean deploy -DdeployAtEnd -Dapache.repository.username=**** -Dapache.repository.password=**** -DretryFailedDeploymentCount=5 -s /home/jenkins/jenkins-agent/workspace/KIE/drools/main/nightly/drools.build-and-deploy@tmp/config17784539189338421883tmp -Dmaven.wagon.http.ssl.insecure=true -Dmaven.test.failure.ignore=true -nsu -ntp -fae -e -Dhttp.keepAlive=false -Dmaven.wagon.http.pool=false -Dmaven.wagon.httpconnectionManager.ttlSeconds=120 -Dmaven.wagon.http.retryHandler.count=3 -Dfull -Dorg.slf4j.simpleLogger.log.org.apache.maven.cli.transfer.Slf4jMavenTransferListener=warn -B

weekly

mvn -B -s /home/jenkins/jenkins-agent/workspace/KIE/drools/main/other/drools.weekly-deploy@tmp/config6925231661979507139tmp -fae -ntp -Dfull clean deploy -DdeployAtEnd -Dapache.repository.username=**** -Dapache.repository.password=**** -DretryFailedDeploymentCount=5 -Daether.connector.basic.parallelPut=false -Dfull -Dmaven.test.failure.ignore=true -DskipTests=false
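
The two command lines above can also be compared mechanically. The sketch below tokenizes both commands as pasted and prints the arguments that appear in only one of them; the `-s` settings path is skipped because it is a per-job temporary file. The function name `arguments` is hypothetical.

```python
import shlex

# Both commands copied verbatim from above.
NIGHTLY = (
    "mvn dependency:tree clean deploy -DdeployAtEnd "
    "-Dapache.repository.username=**** -Dapache.repository.password=**** "
    "-DretryFailedDeploymentCount=5 "
    "-s /home/jenkins/jenkins-agent/workspace/KIE/drools/main/nightly/"
    "drools.build-and-deploy@tmp/config17784539189338421883tmp "
    "-Dmaven.wagon.http.ssl.insecure=true -Dmaven.test.failure.ignore=true "
    "-nsu -ntp -fae -e -Dhttp.keepAlive=false -Dmaven.wagon.http.pool=false "
    "-Dmaven.wagon.httpconnectionManager.ttlSeconds=120 "
    "-Dmaven.wagon.http.retryHandler.count=3 -Dfull "
    "-Dorg.slf4j.simpleLogger.log.org.apache.maven.cli.transfer."
    "Slf4jMavenTransferListener=warn -B"
)
WEEKLY = (
    "mvn -B "
    "-s /home/jenkins/jenkins-agent/workspace/KIE/drools/main/other/"
    "drools.weekly-deploy@tmp/config6925231661979507139tmp "
    "-fae -ntp -Dfull clean deploy -DdeployAtEnd "
    "-Dapache.repository.username=**** -Dapache.repository.password=**** "
    "-DretryFailedDeploymentCount=5 -Daether.connector.basic.parallelPut=false "
    "-Dfull -Dmaven.test.failure.ignore=true -DskipTests=false"
)

def arguments(command: str) -> set:
    """Return the set of arguments, ignoring 'mvn' and the -s settings path."""
    result = set()
    skip_next = False
    for token in shlex.split(command)[1:]:  # [1:] drops the leading "mvn"
        if skip_next:
            skip_next = False
            continue
        if token == "-s":
            skip_next = True  # the next token is the temporary settings file
            continue
        result.add(token)
    return result

only_nightly = sorted(arguments(NIGHTLY) - arguments(WEEKLY))
only_weekly = sorted(arguments(WEEKLY) - arguments(NIGHTLY))

print("only in nightly:")
for arg in only_nightly:
    print("  ", arg)
print("only in weekly:")
for arg in only_weekly:
    print("  ", arg)
```

This only compares the command lines as pasted; it says nothing about which of the differing options actually affect the upload.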

nightly has:

As I understand it, Wagon is no longer used by default (since Maven 3.9.0). https://stackoverflow.com/questions/71099771/how-do-i-use-transport-http-instead-of-wagon-in-maven

-Dhttp.keepAlive=false has pros and cons. It may help in an unstable network environment.

weekly has:


Btw, I think the day of the week and the time of day seem to matter.

nightly 09-01 (Sunday) was slow and unstable.

[2024-09-01T04:09:11.322Z] [INFO] --- install:3.1.1:install (default-install) @ drools-distribution ---
[2024-09-01T04:15:13.638Z] [WARNING] Encountered issue during deployment: Failed to deploy artifacts: Could not transfer artifact org.drools:kiemodulemodel-example:jar:javadoc:999-20240901.023157-65 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): status code: 502, reason phrase: Proxy Error (502)
[2024-09-01T04:15:13.639Z] [INFO] Retrying deployment attempt 2 of 5
[2024-09-01T04:21:48.410Z] [WARNING] Encountered issue during deployment: Failed to deploy artifacts: Could not transfer artifact org.kie:kie-pmml-evaluator-api:pom:999-20240901.023157-65 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): status code: 408, reason phrase: Request Timeout (408)
[2024-09-01T04:21:48.410Z] [INFO] Retrying deployment attempt 3 of 5
...
[2024-09-01T04:30:06.008Z] [WARNING] Encountered issue during deployment: Failed to deploy artifacts: Could not transfer artifact org.drools.testcoverage:test-integration-ruleunits-tests:jar:tests:999-20240901.023157-65 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): status code: 408, reason phrase: Request Timeout (408)
[2024-09-01T04:30:06.008Z] [INFO] Retrying deployment attempt 4 of 5
[2024-09-01T04:36:40.259Z] [WARNING] Encountered issue during deployment: Failed to deploy artifacts: Could not transfer artifact org.kie:efesto-compilation-manager-core:jar:999-20240901.023157-65 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): status code: 408, reason phrase: Request Timeout (408)
[2024-09-01T04:36:40.259Z] [INFO] Retrying deployment attempt 5 of 5
...
[2024-09-01T04:43:04.104Z] [INFO] ------------------------------------------------------------------------
[2024-09-01T04:43:04.104Z] [INFO] Reactor Summary for Drools :: Parent 999-SNAPSHOT:
...
[2024-09-01T04:43:04.108Z] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-deploy-plugin:3.1.1:deploy (default-deploy) on project drools-distribution: Failed to deploy artifacts: Could not transfer artifact org.drools:kiemodulemodel-example:jar:javadoc:999-20240901.023157-65 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): status code: 502, reason phrase: Proxy Error (502) -> [Help 1]

And the weekly run on 09-12 (Thursday), manually triggered by Rodrigo, succeeded without any timeout.

[2024-09-12T12:18:06.159Z] [INFO] --- deploy:3.1.1:deploy (default-deploy) @ drools-distribution ---
[2024-09-12T12:50:46.172Z] [INFO] ------------------------------------------------------------------------
...
[2024-09-12T12:50:46.176Z] [INFO] BUILD SUCCESS

I guess that not only the KIE projects but also many other Apache projects contribute to this "unstable Sunday night" (I don't know whether we have a CPU/network quota). If many projects run a nightly deployment every night and also a weekly deployment on Sunday night, the load on Sunday night would be double.
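
The day-of-week pattern can be tallied from the nightly results quoted in this thread. The sketch below uses only the six nightly runs mentioned above (08-17 to 08-21 and 09-01). That is far too small a sample to prove anything; it only makes the pattern explicit. The variable names are hypothetical.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Nightly drools.build-and-deploy results as reported in this thread.
nightly_runs = [
    (date(2024, 8, 17), "SUCCESS"),
    (date(2024, 8, 18), "FAILURE"),  # Request Timeout (408)
    (date(2024, 8, 19), "SUCCESS"),
    (date(2024, 8, 20), "SUCCESS"),
    (date(2024, 8, 21), "SUCCESS"),
    (date(2024, 9, 1), "FAILURE"),   # Proxy Error (502) / Request Timeout (408)
]

tally = defaultdict(lambda: {"SUCCESS": 0, "FAILURE": 0})
for day, result in nightly_runs:
    tally[WEEKDAYS[day.weekday()]][result] += 1  # weekday(): Monday is 0

for name in WEEKDAYS:
    if name in tally:
        print(f"{name:9s} {tally[name]}")
```

In this sample both failures fall on a Sunday and every other day succeeded, which is consistent with the "unstable Sunday night" guess.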

So... how about moving the weekly build to Saturday daytime or Sunday daytime (or weekday daytime)? Do you think it's a good idea?