GoogleCloudDataproc / initialization-actions

Run in all nodes of your cluster before the cluster starts - lets you customize your cluster
https://cloud.google.com/dataproc/init-actions
Apache License 2.0
588 stars, 512 forks

[oozie] Enable oozie on 2.1 images #1068

Closed cjac closed 1 year ago

cjac commented 1 year ago
* detect version of Dataproc image
* clean up find -delete command
* use non-deprecated version of dfsadmin
* copy Dataproc-version-specific curator jars to /usr/lib/oozie/lib
* update copyright for oozie/oozie.sh
* install fluentd configuration
* accept many more configuration options
* set default MySQL credentials
* delay retries after the first attempt
* configure more default properties
* improve support for rocky8 images
* handle some race conditions
* pre-populate sharelib
* apply proactive log4j 1.2 mitigations
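
As an illustration of the "retries are delayed after the first" item, here is a minimal sketch of that pattern. It is not the code from oozie/oozie.sh; the function name `retry_command` and its arguments are hypothetical and exist only for this example.

```shell
# Hypothetical helper, not taken from oozie/oozie.sh.
# Usage: retry_command MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to MAX_ATTEMPTS times. The first attempt runs
# immediately; every later attempt is preceded by a DELAY_SECONDS sleep.
retry_command() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Sleep only before retries, never before the first attempt.
    if [ "$attempt" -gt 1 ]; then
      sleep "$delay"
    fi
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
  done
  # Every attempt failed.
  return 1
}
```

Under these assumptions, `retry_command 5 10 oozie admin -sharelibupdate` would try the command up to five times and wait 10 seconds before each retry. The variables are not function-local (plain POSIX sh has no `local`), so the wrapped command should not be a shell function that reuses the same names.
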
cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

passing: 1.5-rocky8 2.0-rocky8 2.1-rocky8

failing: 1.5-debian10

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

passing: 1.5-rocky8 2.0-rocky8 2.1-rocky8 2.1-ubuntu20 2.1-debian11 1.5-ubuntu18 2.0-ubuntu18 1.5-debian10

failing: 2.0-debian10

cjac commented 1 year ago

2023-07-11T18:37:35.361698945Z + echo -e '\nStarting validation on test-oozie-ha-2-0-20230711-182856-8iin-m-2:'
2023-07-11T18:37:35.361706478Z + oozie admin -sharelibupdate
2023-07-11T18:37:35.361731919Z Connection exception has occurred [ java.net.ConnectException Error while authenticating with endpoint: http://test-oozie-ha-2-0-20230711-182856-8iin-m-2.c.cloud-dataproc-ci.internal:11000/oozie/versions ]. Trying after 1 sec. Retry count = 1
2023-07-11T18:37:35.361744021Z Connection exception has occurred [ java.net.ConnectException Error while authenticating with endpoint: http://test-oozie-ha-2-0-20230711-182856-8iin-m-2.c.cloud-dataproc-ci.internal:11000/oozie/versions ]. Trying after 2 sec. Retry count = 2
2023-07-11T18:37:35.369534342Z Connection exception has occurred [ java.net.ConnectException Error while authenticating with endpoint: http://test-oozie-ha-2-0-20230711-182856-8iin-m-2.c.cloud-dataproc-ci.internal:11000/oozie/versions ]. Trying after 4 sec. Retry count = 3
2023-07-11T18:37:35.369607103Z Connection exception has occurred [ java.net.ConnectException Error while authenticating with endpoint: http://test-oozie-ha-2-0-20230711-182856-8iin-m-2.c.cloud-dataproc-ci.internal:11000/oozie/versions ]. Trying after 8 sec. Retry count = 4
2023-07-11T18:37:35.369617997Z java.net.ConnectException: Error while authenticating with endpoint: http://test-oozie-ha-2-0-20230711-182856-8iin-m-2.c.cloud-dataproc-ci.internal:11000/oozie/versions
2023-07-11T18:37:35.369626864Z at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2023-07-11T18:37:35.369651197Z at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
2023-07-11T18:37:35.369659917Z at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2023-07-11T18:37:35.369666923Z at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
2023-07-11T18:37:35.369674404Z at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.wrapExceptionWithMessage(KerberosAuthenticator.java:232)
2023-07-11T18:37:35.369681341Z at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:216)
2023-07-11T18:37:35.369688102Z at org.apache.oozie.client.AuthOozieClient.createConnection(AuthOozieClient.java:197)
2023-07-11T18:37:35.369696248Z at org.apache.oozie.client.OozieClient$1.doExecute(OozieClient.java:515)
2023-07-11T18:37:35.369704258Z at org.apache.oozie.client.retry.ConnectionRetriableClient.execute(ConnectionRetriableClient.java:44)
2023-07-11T18:37:35.369712300Z at org.apache.oozie.client.OozieClient.createRetryableConnection(OozieClient.java:517)
2023-07-11T18:37:35.369751478Z at org.apache.oozie.client.OozieClient.getSupportedProtocolVersions(OozieClient.java:397)
2023-07-11T18:37:35.369760504Z at org.apache.oozie.client.OozieClient.validateWSVersion(OozieClient.java:357)
2023-07-11T18:37:35.369768323Z at org.apache.oozie.client.OozieClient.createURL(OozieClient.java:468)
2023-07-11T18:37:35.369775043Z at org.apache.oozie.client.OozieClient.access$000(OozieClient.java:88)
2023-07-11T18:37:35.369782420Z at org.apache.oozie.client.OozieClient$ClientCallable.call(OozieClient.java:562)
2023-07-11T18:37:35.369790100Z at org.apache.oozie.client.OozieClient.updateShareLib(OozieClient.java:2162)
2023-07-11T18:37:35.369797391Z at org.apache.oozie.cli.OozieCLI.adminCommand(OozieCLI.java:2032)
2023-07-11T18:37:35.369820361Z at org.apache.oozie.cli.OozieCLI.processCommand(OozieCLI.java:733)
2023-07-11T18:37:35.369828963Z at org.apache.oozie.cli.OozieCLI.run(OozieCLI.java:682)
2023-07-11T18:37:35.369836361Z at org.apache.oozie.cli.OozieCLI.main(OozieCLI.java:245)
2023-07-11T18:37:35.369844056Z Caused by: java.net.ConnectException: Connection refused (Connection refused)
2023-07-11T18:37:35.369860519Z at java.net.PlainSocketImpl.socketConnect(Native Method)
2023-07-11T18:37:35.369869668Z at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
2023-07-11T18:37:35.369877625Z at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
2023-07-11T18:37:35.369885310Z at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
2023-07-11T18:37:35.369919624Z at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
2023-07-11T18:37:35.369928124Z at java.net.Socket.connect(Socket.java:607)
2023-07-11T18:37:35.369936240Z at java.net.Socket.connect(Socket.java:556)
2023-07-11T18:37:35.369944737Z at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
2023-07-11T18:37:35.369952861Z at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
2023-07-11T18:37:35.369960558Z at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
2023-07-11T18:37:35.369968193Z at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
2023-07-11T18:37:35.369992230Z at sun.net.www.http.HttpClient.New(HttpClient.java:339)
2023-07-11T18:37:35.370000590Z at sun.net.www.http.HttpClient.New(HttpClient.java:357)
2023-07-11T18:37:35.370008248Z at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1228)
2023-07-11T18:37:35.370015854Z at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162)
2023-07-11T18:37:35.370023390Z at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
2023-07-11T18:37:35.370031234Z at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990)
2023-07-11T18:37:35.370038995Z at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:189)
2023-07-11T18:37:35.370046632Z ... 14 more
2023-07-11T18:37:35.370069121Z Error: IO_ERROR : java.io.IOException: Error while connecting Oozie server. No of retries = 4. Exception = Error while authenticating with endpoint: http://test-oozie-ha-2-0-20230711-182856-8iin-m-2.c.cloud-dataproc-ci.internal:11000/oozie/versions
2023-07-11T18:37:35.370077305Z
2023-07-11T18:37:35.370085067Z Recommendation: To check for possible causes of SSH connectivity issues and get
2023-07-11T18:37:35.370114953Z recommendations, rerun the ssh command with the --troubleshoot option.
2023-07-11T18:37:35.370122876Z
2023-07-11T18:37:35.370130378Z gcloud compute ssh test-oozie-ha-2-0-20230711-182856-8iin-m-2 --project=cloud-dataproc-ci --zone=us-central1-f --troubleshoot
2023-07-11T18:37:35.370157336Z
2023-07-11T18:37:35.370165621Z Or, to investigate an IAP tunneling issue:
2023-07-11T18:37:35.370173067Z
2023-07-11T18:37:35.370181321Z gcloud compute ssh test-oozie-ha-2-0-20230711-182856-8iin-m-2 --project=cloud-dataproc-ci --zone=us-central1-f --troubleshoot --tunnel-through-iap
2023-07-11T18:37:35.370188376Z
2023-07-11T18:37:35.370195766Z ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].
2023-07-11T18:37:35.370202764Z
2023-07-11T18:37:35.370209699Z
2023-07-11T18:37:35.370217167Z ----------------------------------------------------------------------
2023-07-11T18:37:35.370239956Z Ran 1 test in 534.694s
2023-07-11T18:37:35.370247290Z
2023-07-11T18:37:35.370255487Z FAILED (failures=1)
2023-07-11T18:37:35.370262968Z ================================================================================
2023-07-11T18:37:35.561290979Z Target //oozie:test_oozie up-to-date:
2023-07-11T18:37:35.565725757Z bazel-bin/oozie/test_oozie
2023-07-11T18:37:36.683784446Z INFO: Elapsed time: 1767.117s, Critical Path: 1720.94s
2023-07-11T18:37:36.763499803Z INFO: 8 processes: 3 internal, 5 local.
2023-07-11T18:37:37.169522741Z //oozie:test_oozie FAILED in 3 out of 5 in 667.7s
2023-07-11T18:37:37.169591146Z Stats over 5 runs: max = 667.7s, min = 493.3s, avg = 576.2s, dev = 65.3s
2023-07-11T18:37:37.172664632Z /home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/main/bazel-out/k8-fastbuild/testlogs/oozie/test_oozie/shard_3_of_3/test.log
2023-07-11T18:37:37.174834419Z /home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/main/bazel-out/k8-fastbuild/testlogs/oozie/test_oozie/shard_3_of_3/test_attempts/attempt_1.log
2023-07-11T18:37:37.176742473Z /home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/main/bazel-out/k8-fastbuild/testlogs/oozie/test_oozie/shard_3_of_3/test_attempts/attempt_2.log
2023-07-11T18:37:37.179396842Z
2023-07-11T18:37:37.187063574Z Executed 1 out of 1 test: 1 fails locally.

cjac commented 1 year ago

I can reproduce this:

Starting validation on cluster-1676904778-m-2:
+ oozie admin -sharelibupdate
Connection exception has occurred [ java.net.ConnectException Error while authenticating with endpoint: http://cluster-1676904778-m-2.c.cjac-2021-00.internal:11000/oozie/versions ]. Trying after 1 sec. Retry count = 1
Connection exception has occurred [ java.net.ConnectException Error while authenticating with endpoint: http://cluster-1676904778-m-2.c.cjac-2021-00.internal:11000/oozie/versions ]. Trying after 2 sec. Retry count = 2
Connection exception has occurred [ java.net.ConnectException Error while authenticating with endpoint: http://cluster-1676904778-m-2.c.cjac-2021-00.internal:11000/oozie/versions ]. Trying after 4 sec. Retry count = 3
Connection exception has occurred [ java.net.ConnectException Error while authenticating with endpoint: http://cluster-1676904778-m-2.c.cjac-2021-00.internal:11000/oozie/versions ]. Trying after 8 sec. Retry count = 4
java.net.ConnectException: Error while authenticating with endpoint: http://cluster-1676904778-m-2.c.cjac-2021-00.internal:11000/oozie/versions
cjac commented 1 year ago

I think the endpoint host needs to be the cluster name, not the master node name.

In http://cluster-1676904778-m-2.c.cjac-2021-00.internal:11000/oozie/versions, I believe the host component cluster-1676904778-m-2 should instead be cluster-1676904778.
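
The hostname rewrite suggested here can be sketched as a small shell helper. This is only an illustration of the idea, not the change made in oozie/oozie.sh: the function name `cluster_name_from_host` is hypothetical, and it assumes master nodes are named either `<cluster>-m` or `<cluster>-m-<index>`, as in the hostnames in the logs above.

```shell
# Hypothetical helper, not taken from oozie/oozie.sh.
# Derives the cluster name from a master node's short hostname by
# stripping a trailing "-m" or "-m-<index>" suffix.
# Assumption: master nodes follow the <cluster>-m / <cluster>-m-<index>
# naming seen in this thread; other hostnames are returned unchanged.
cluster_name_from_host() {
  printf '%s\n' "$1" | sed -E 's/-m(-[0-9]+)?$//'
}

# Example with the hostname from the log above.
cluster_name_from_host "cluster-1676904778-m-2"
```

Whether the bare cluster name actually resolves to a reachable Oozie endpoint is a separate question that the test run has to answer; the sketch only covers the string manipulation.
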

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

well, that unexpectedly worked.

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

That one worked for all but 2.1-ubuntu20.

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

I enabled rocky8 tests in the last run. 2.0 passed, but 2.1 is missing unzip and 1.5 didn't finish before the failure.

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

Okay. Kuldeep, if you can give me an LGTM, we can get this merged!

I didn't include rocky support for this PR. I think we can get it working in another one soon.

kuldeepkk-dev commented 1 year ago

I tested this manually on Dataproc 2.1-debian11, and ZooKeeper fails to restart properly. As a result, the init action fails, which in turn causes cluster creation to fail.

Line 577 needs to be replaced as below, followed by more testing on different variants for HA setup.

systemctl restart zookeeper-server

One more issue I noticed during testing: oozie.services.ext doesn't contain all the configs required for HA as per https://oozie.apache.org/docs/5.2.1/AG_Install.html#Pre-requisites. Can we please double-check this?
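
One way to double-check this is to compare the configured value against the services the Oozie HA documentation asks for. The sketch below is hedged: the function name `missing_ha_services` is hypothetical, the four class names are my reading of the linked install guide and should be verified against it, and reading the actual value out of oozie-site.xml is left out.

```shell
# Hypothetical check, not part of oozie/oozie.sh.
# Assumption: these are the services the Oozie 5.2.1 install guide lists
# for oozie.services.ext in an HA setup; verify against the linked page.
REQUIRED_HA_SERVICES="org.apache.oozie.service.ZKLocksService
org.apache.oozie.service.ZKXLogStreamingService
org.apache.oozie.service.ZKJobsConcurrencyService
org.apache.oozie.service.ZKUUIDService"

# Prints each required HA service that is absent from the given
# comma-separated oozie.services.ext value. Prints nothing if all are set.
missing_ha_services() {
  # Normalize the value: one service per line, spaces and tabs removed.
  configured=$(printf '%s\n' "$1" | tr ',' '\n' | tr -d ' \t')
  for svc in $REQUIRED_HA_SERVICES; do
    # -F: fixed string, -x: whole-line match, -q: no output.
    if ! printf '%s\n' "$configured" | grep -Fqx "$svc"; then
      printf '%s\n' "$svc"
    fi
  done
}
```

Calling `missing_ha_services "$value"` with the current property value lists whatever still needs to be added; empty output means every service in the assumed list is present.
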

cjac commented 1 year ago

Okay. Thanks for the review. Do you want me to grant you permission to update the PR while I'm afk? I can get online for 20m tomorrow and do that if it's urgent.

On Thu, Jul 13, 2023, 13:40 kuldeepkk-dev wrote:

Tested this manually on dataproc 2.1-debian11 and zookeeper is failing to restart properly. Due to which init action is failing followed by cluster creation failures.

Line 577 needs to be replaced as below, followed by more testing on different variants for HA setup.

systemctl restart zookeeper-server


kuldeepkk-dev commented 1 year ago

Sure, CJ. If you can grant me the permissions, I can make the necessary changes and continue testing.

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

/gcbrun

cjac commented 1 year ago

/gcbrun