Open tkobayas opened 2 months ago
The 10.0.x/other/drools.weekly-deploy jobs have the same issue, but let's focus on main for now.
thought)
https://ci-builds.apache.org/job/KIE/job/drools/job/main/job/nightly/job/drools.build-and-deploy/
The nightly job also deploys. Its recent results:
08-17: SUCCESS
08-18: FAILURE (Request Timeout (408))
08-19: SUCCESS
08-20: SUCCESS
08-21: SUCCESS
Hmm, Sunday night may be a high-load period (even within drools, both the nightly and the weekly job ran "deploy" around 4:00 AM on 08-18).
https://ci-builds.apache.org/job/KIE/job/drools/job/main/job/other/job/drools.weekly-deploy/14/
08-25: SUCCESS (65 WARNINGs in 5 attempts)
However, we still see lots of timeout WARNINGs and retries.
I also doubt that the configured 300-second timeout took effect: in the log below, the 408 arrives within 120 seconds of the retry starting.
[2024-08-25T05:24:54.309Z] [INFO] Retrying deployment attempt 4 of 5
[2024-08-25T05:26:31.776Z] [WARNING] Failed to upload checksum to org/kie/kie-core-bom/999-20240825-SNAPSHOT/kie-core-bom-999-20240825-20240825.030947-1.pom.sha1
[2024-08-25T05:26:31.776Z] org.apache.http.client.HttpResponseException: status code: 408, reason phrase: Request Timeout (408)
[2024-08-25T05:26:31.776Z] at org.eclipse.aether.transport.http.HttpTransporter.handleStatus (HttpTransporter.java:619)
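To check the "within 120 sec" observation, the gap between two log timestamps can be computed directly. This is a minimal sketch: `secs_of_day` and `elapsed` are hypothetical helper names, and it assumes both timestamps fall on the same UTC day.

```shell
# Hypothetical helper: convert a Jenkins timestamp such as
# [2024-08-25T05:24:54.309Z] to whole seconds since midnight.
# Fractional seconds are ignored.
secs_of_day() {
  printf '%s\n' "$1" | awk -F'[T:.]' '{ print $2 * 3600 + $3 * 60 + $4 }'
}

# Hypothetical helper: whole seconds between two timestamps.
# Assumes both are on the same UTC day.
elapsed() {
  echo $(( $(secs_of_day "$2") - $(secs_of_day "$1") ))
}

# Retry start vs. first 408 in the 08-25 excerpt above.
elapsed "[2024-08-25T05:24:54.309Z]" "[2024-08-25T05:26:31.776Z]"   # prints 97
```

A 408 after roughly 97 seconds is consistent with the suspicion that the 300-second setting is not what ends the request, although the excerpt alone does not show when that particular upload started.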
09-01: SUCCESS
3 WARNINGs in the 1st attempt. 2nd attempt successful.
(Note: "Failed to upload checksum" doesn't stop the whole task. "Could not transfer artifact" stops the task and triggers a retry.)
[2024-09-01T04:17:31.407Z] [INFO] --- deploy:3.1.1:deploy (default-deploy) @ drools-distribution ---
[2024-09-01T04:26:48.086Z] [WARNING] Failed to upload checksum to org/drools/drools-examples/999-20240901-SNAPSHOT/drools-examples-999-20240901-20240901.030335-1-javadoc.jar.md5
[2024-09-01T04:26:48.086Z] org.apache.http.client.HttpResponseException: status code: 408, reason phrase: Request Timeout (408)
[2024-09-01T04:26:48.086Z] ...
[2024-09-01T04:28:36.241Z] [WARNING] Failed to upload checksum to org/drools/kiebase-inclusion/999-20240901-SNAPSHOT/kiebase-inclusion-999-20240901-20240901.030335-1-tests.jar.md5
[2024-09-01T04:28:36.241Z] org.apache.http.client.HttpResponseException: status code: 408, reason phrase: Request Timeout (408)
[2024-09-01T04:28:36.241Z] ...
[2024-09-01T04:46:57.694Z] [WARNING] Encountered issue during deployment: Failed to deploy artifacts: Could not transfer artifact org.drools:drools-canonical-model:jar:999-20240901-20240901.030335-1 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): status code: 408, reason phrase: Request Timeout (408)
[2024-09-01T04:46:57.694Z] [INFO] Retrying deployment attempt 2 of 5
[2024-09-01T05:20:53.254Z] [INFO] ------------------------------------------------------------------------
[2024-09-01T05:20:53.254Z] [INFO] Reactor Summary for Drools :: Parent 999-20240901-SNAPSHOT:
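The two kinds of WARNING described in the note above can be tallied from a saved console log. The sketch below uses a hypothetical helper, `tally_warnings`, that reads log text on stdin; the sample input is copied from the 09-01 excerpt.

```shell
# Hypothetical helper: count non-fatal checksum warnings, fatal transfer
# failures, and retry announcements in a Maven deploy log read from stdin.
tally_warnings() {
  awk '
    /\[WARNING\] Failed to upload checksum/        { checksum++ }
    /\[WARNING\].*Could not transfer artifact/     { transfer++ }
    /\[INFO\] Retrying deployment attempt/         { retries++ }
    END { printf "checksum=%d transfer=%d retries=%d\n", checksum, transfer, retries }
  '
}

# Sample lines taken from the 09-01 weekly log above.
tally_warnings <<'EOF'
[2024-09-01T04:26:48.086Z] [WARNING] Failed to upload checksum to org/drools/drools-examples/999-20240901-SNAPSHOT/drools-examples-999-20240901-20240901.030335-1-javadoc.jar.md5
[2024-09-01T04:28:36.241Z] [WARNING] Failed to upload checksum to org/drools/kiebase-inclusion/999-20240901-SNAPSHOT/kiebase-inclusion-999-20240901-20240901.030335-1-tests.jar.md5
[2024-09-01T04:46:57.694Z] [WARNING] Encountered issue during deployment: Failed to deploy artifacts: Could not transfer artifact org.drools:drools-canonical-model:jar:999-20240901-20240901.030335-1 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): status code: 408, reason phrase: Request Timeout (408)
[2024-09-01T04:46:57.694Z] [INFO] Retrying deployment attempt 2 of 5
EOF
# prints: checksum=2 transfer=1 retries=1
```

Run against a full console log, the same function would give the per-job totals quoted in this thread (for example the 3 WARNINGs in the 1st attempt on 09-01).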
The change aether.connector.basic.parallelPut=false seems to be effective, but let's see next week.
On 09-05, Jan and Rodrigo manually triggered the job.
09-05 (1st): SUCCESS. No WARNINGs.
09-05 (2nd): The upload was successful with no WARNINGs. The job failed because of a duplicate tag name, which is unrelated to uploading.
09-08: FAILURE. The job was cancelled when it hit the job timeout (3 hours), in the middle of the 2nd upload attempt. The 1st attempt hit 30 WARNINGs.
[2024-09-08T03:02:07.906Z] Timeout set to expire in 3 hr 0 min
...
[2024-09-08T04:37:57.001Z] [INFO] --- deploy:3.1.1:deploy (default-deploy) @ drools-distribution ---
[2024-09-08T04:39:20.860Z] [WARNING] Failed to upload checksum to org/kie/kie-core-bom/999-20240908-SNAPSHOT/kie-core-bom-999-20240908-20240908.030343-1.pom.md5
[2024-09-08T04:39:20.860Z] org.apache.http.client.HttpResponseException: status code: 408, reason phrase: Request Timeout (408)
...
[2024-09-08T05:52:38.558Z] [INFO] Retrying deployment attempt 2 of 5
...
[2024-09-08T06:02:07.906Z] Cancelling nested steps due to timeout
With aether.connector.basic.parallelPut=false, one round of uploading all artifacts usually takes around 35 minutes (e.g. 09-01, a Sunday). But on 09-08, the 1st attempt took around 75 minutes. The network was probably unusually unstable.
So far, aether.connector.basic.parallelPut=false seems to have a positive effect, but it is not yet perfect.
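The 09-08 timestamps show how little of the 3-hour budget was left for a second upload round. This is a rough sketch with hypothetical helpers (`to_secs`, `minutes_between`); it takes HH:MM:SS values from the log above and truncates to whole minutes.

```shell
# Hypothetical helper: HH:MM:SS to seconds since midnight.
to_secs() {
  printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Hypothetical helper: whole minutes (truncated) between two same-day times.
minutes_between() {
  echo $(( ( $(to_secs "$2") - $(to_secs "$1") ) / 60 ))
}

# Times taken from the 09-08 log excerpt above.
minutes_between 03:02:07 04:37:57   # job start -> deploy start: prints 95
minutes_between 04:37:57 05:52:38   # 1st upload attempt:        prints 74
minutes_between 05:52:38 06:02:07   # left for the 2nd attempt:  prints 9
```

With about 95 minutes spent before deploy starts, a single slow round of around 75 minutes leaves under 10 minutes, well short of the roughly 35 minutes a normal round needs.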
How to improve further?
A) Increase the job timeout. But note that the drools weekly deployment is a dependency of other projects' weekly deployments.
B) Disable deployAtEnd
Please also discuss this on the mailing list, because -DdeployAtEnd was a decision taken to unify how we deploy things across the KIE project; this would again deviate from that goal. See https://lists.apache.org/thread/d6oxh6qtm6mm4hc2zv1pwcqqb2kfmv70
Sorry that I missed the discussion, @jstastny-cz. I won't push the "Disable deployAtEnd" solution. Rather, I'll watch the timeout trend for a while.
What I don't understand is why the nightly deploy takes minutes and the weekly takes hours. I think we can compare the Maven commands used by the two and check whether they differ in significant aspects.
Hi @jstastny-cz ,
nightly
mvn dependency:tree clean deploy -DdeployAtEnd -Dapache.repository.username=**** -Dapache.repository.password=**** -DretryFailedDeploymentCount=5 -s /home/jenkins/jenkins-agent/workspace/KIE/drools/main/nightly/drools.build-and-deploy@tmp/config17784539189338421883tmp -Dmaven.wagon.http.ssl.insecure=true -Dmaven.test.failure.ignore=true -nsu -ntp -fae -e -Dhttp.keepAlive=false -Dmaven.wagon.http.pool=false -Dmaven.wagon.httpconnectionManager.ttlSeconds=120 -Dmaven.wagon.http.retryHandler.count=3 -Dfull -Dorg.slf4j.simpleLogger.log.org.apache.maven.cli.transfer.Slf4jMavenTransferListener=warn -B
weekly
mvn -B -s /home/jenkins/jenkins-agent/workspace/KIE/drools/main/other/drools.weekly-deploy@tmp/config6925231661979507139tmp -fae -ntp -Dfull clean deploy -DdeployAtEnd -Dapache.repository.username=**** -Dapache.repository.password=**** -DretryFailedDeploymentCount=5 -Daether.connector.basic.parallelPut=false -Dfull -Dmaven.test.failure.ignore=true -DskipTests=false
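One way to compare the two commands above mechanically is to diff their flag sets. The sketch below is only an illustration: `flags` and `only_in_first` are hypothetical helper names, and the two strings are copied from the commands above with the -s settings path and the masked credentials left out.

```shell
# Hypothetical helper: print the dash-prefixed flags of a command line
# read from stdin, one per line, sorted and de-duplicated.
flags() {
  tr ' ' '\n' | grep -e '^-' | LC_ALL=C sort -u
}

# Hypothetical helper: flags present in the first command but not the second.
only_in_first() {
  a=$(mktemp); b=$(mktemp)
  printf '%s\n' "$1" | flags > "$a"
  printf '%s\n' "$2" | flags > "$b"
  LC_ALL=C comm -23 "$a" "$b"
  rm -f "$a" "$b"
}

# Copied from the commands above (settings path and credentials omitted).
NIGHTLY='mvn dependency:tree clean deploy -DdeployAtEnd -DretryFailedDeploymentCount=5 -Dmaven.wagon.http.ssl.insecure=true -Dmaven.test.failure.ignore=true -nsu -ntp -fae -e -Dhttp.keepAlive=false -Dmaven.wagon.http.pool=false -Dmaven.wagon.httpconnectionManager.ttlSeconds=120 -Dmaven.wagon.http.retryHandler.count=3 -Dfull -Dorg.slf4j.simpleLogger.log.org.apache.maven.cli.transfer.Slf4jMavenTransferListener=warn -B'
WEEKLY='mvn -B -fae -ntp -Dfull clean deploy -DdeployAtEnd -DretryFailedDeploymentCount=5 -Daether.connector.basic.parallelPut=false -Dfull -Dmaven.test.failure.ignore=true -DskipTests=false'

echo "only in nightly:"; only_in_first "$NIGHTLY" "$WEEKLY"
echo "only in weekly:";  only_in_first "$WEEKLY" "$NIGHTLY"
```

Note that this compares flags only; goals such as dependency:tree do not start with a dash and are filtered out.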
nightly has:
-Dmaven.wagon.http.ssl.insecure=true
-Dmaven.wagon.http.pool=false
-Dmaven.wagon.httpconnectionManager.ttlSeconds=120
-Dmaven.wagon.http.retryHandler.count=3
-Dhttp.keepAlive=false
As I understand it, Wagon is no longer used by default (since Maven 3.9.0). https://stackoverflow.com/questions/71099771/how-do-i-use-transport-http-instead-of-wagon-in-maven
-Dhttp.keepAlive=false has pros and cons. It may help in an unstable network environment.
weekly has:
-Daether.connector.basic.parallelPut=false
It was added by https://github.com/apache/incubator-kie-drools/pull/6056 . It seems to help avoid the timeout, although it is slower than the default. If we can avoid the timeout with a different approach, we can remove this option.
Btw, I think the day of the week and the time of day seem to matter.
nightly 09-01 (Sunday) was slow and unstable.
[2024-09-01T04:09:11.322Z] [INFO] --- install:3.1.1:install (default-install) @ drools-distribution ---
[2024-09-01T04:15:13.638Z] [WARNING] Encountered issue during deployment: Failed to deploy artifacts: Could not transfer artifact org.drools:kiemodulemodel-example:jar:javadoc:999-20240901.023157-65 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): status code: 502, reason phrase: Proxy Error (502)
[2024-09-01T04:15:13.639Z] [INFO] Retrying deployment attempt 2 of 5
[2024-09-01T04:21:48.410Z] [WARNING] Encountered issue during deployment: Failed to deploy artifacts: Could not transfer artifact org.kie:kie-pmml-evaluator-api:pom:999-20240901.023157-65 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): status code: 408, reason phrase: Request Timeout (408)
[2024-09-01T04:21:48.410Z] [INFO] Retrying deployment attempt 3 of 5
...
[2024-09-01T04:30:06.008Z] [WARNING] Encountered issue during deployment: Failed to deploy artifacts: Could not transfer artifact org.drools.testcoverage:test-integration-ruleunits-tests:jar:tests:999-20240901.023157-65 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): status code: 408, reason phrase: Request Timeout (408)
[2024-09-01T04:30:06.008Z] [INFO] Retrying deployment attempt 4 of 5
[2024-09-01T04:36:40.259Z] [WARNING] Encountered issue during deployment: Failed to deploy artifacts: Could not transfer artifact org.kie:efesto-compilation-manager-core:jar:999-20240901.023157-65 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): status code: 408, reason phrase: Request Timeout (408)
[2024-09-01T04:36:40.259Z] [INFO] Retrying deployment attempt 5 of 5
...
[2024-09-01T04:43:04.104Z] [INFO] ------------------------------------------------------------------------
[2024-09-01T04:43:04.104Z] [INFO] Reactor Summary for Drools :: Parent 999-SNAPSHOT:
...
[2024-09-01T04:43:04.108Z] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-deploy-plugin:3.1.1:deploy (default-deploy) on project drools-distribution: Failed to deploy artifacts: Could not transfer artifact org.drools:kiemodulemodel-example:jar:javadoc:999-20240901.023157-65 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): status code: 502, reason phrase: Proxy Error (502) -> [Help 1]
By contrast, the weekly run on 09-12 (Thursday), manually triggered by Rodrigo, succeeded without any timeout.
[2024-09-12T12:18:06.159Z] [INFO] --- deploy:3.1.1:deploy (default-deploy) @ drools-distribution ---
[2024-09-12T12:50:46.172Z] [INFO] ------------------------------------------------------------------------
...
[2024-09-12T12:50:46.176Z] [INFO] BUILD SUCCESS
I guess not only KIE projects but also many other Apache projects contribute to this "unstable Sunday night" (I don't know whether we have a CPU/network quota). If many projects run a nightly deployment every night plus a weekly deployment on Sunday night, the load would double on Sunday night.
So... how about moving the weekly build to Saturday daytime or Sunday daytime (or weekday daytime)? Do you think it's a good idea?
https://ci-builds.apache.org/job/KIE/job/drools/job/main/job/other/job/drools.weekly-deploy/
07-14: SUCCESS 07-21: FAILURE 07-28: FAILURE 08-04: FAILURE 08-11: FAILURE 08-18: FAILURE
for example)