apache / incubator-streampark

Make stream processing easier! Easy-to-use streaming application development framework and operation platform.
https://streampark.apache.org/
Apache License 2.0

[Bug] The running status of the Flink job is stuck in STARTING, but the k8s job is actually running normally and the k8s ingress has been generated. #2404

Open sq8852161 opened 1 year ago

sq8852161 commented 1 year ago

Search before asking

Java Version

1.8

Scala Version

2.12.x

StreamPark Version

v2.0.0

Flink Version

1.15.1

deploy mode

kubernetes-application

What happened

The running status of the Flink job is stuck in STARTING, even though the k8s job is actually running normally and the k8s ingress has been generated.

Background report warn

ForkJoinPool-1-worker-1 | org.apache.streampark.flink.kubernetes.watcher.FlinkJobStatusWatcher:228] Failed to visit remote flink jobs on kubernetes-native-mode cluster, and the retry access logic is performed.

My actions:

  1. Configure the ingress domain name.
  2. Create a Flink job, select ClusterIP, and run it.
  3. The running status of the Flink job gets stuck in STARTING, even though the k8s job is running normally and the k8s ingress has been generated.
  4. The backend logs the following warning:

ForkJoinPool-1-worker-1 | org.apache.streampark.flink.kubernetes.watcher.FlinkJobStatusWatcher:228] Failed to visit remote flink jobs on kubernetes-native-mode cluster, and the retry access logic is performed.

  5. After removing the ingress domain configuration, re-creating the job, and switching to NodePort, the task runs normally. The backend still logs a warning: The retry fetch failed, final status failed, errorStack=Connect to http://xx.xx.xx.xxx:32185 [/xx.xx.xx.xxx] failed: Connection refused (Connection refused). After three such warnings, however, it works normally.

Error Exception

No response

Screenshots

No response

Are you willing to submit PR?

Code of Conduct

phoeph commented 1 year ago

streamPark:2.0.0 flink:1.16.1 on k8s

+1

stuck in INITIALIZING and STARTING.


moranrr commented 1 year ago

streamPark:2.0.0 flink:1.15.2 on k8s

+1

Sometimes the running status remains INITIALIZING even though the job is running normally. When the job finishes or is stopped, the status does not show FINISHED or CANCELED, but instead shows FAILED.


wolfboys commented 1 year ago

Hi, the reason is already given in the log. Please check whether the network connection is OK.

The retry fetch failed, final status failed, errorStack=Connect to http://xx.xx.xx.xxx:32185/ [/xx.xx.xx.xxx] failed: Connection refused (Connection refused).

Shmilyqjj commented 1 year ago

I have encountered and resolved this error for several different reasons. Your situation may be different from mine, but I still hope this helps.

My Env: (streampark2.1.0 + flink1.14.5 + k8s Major:"1", Minor:"22+")

1. Check the K8s version to ensure it is 1.19 or above.

2. Verify that .kube/config has the appropriate permissions (I encountered this issue due to insufficient permissions for the service account).

3. The Flink REST endpoint cannot be accessed or requested. I found that I could make a successful REST request using curl, but StreamPark couldn't retrieve the REST address. After investigation, I discovered an error in the method org.apache.streampark.flink.kubernetes.ingress.IngressController#ingressUrlAddress, where the K8s version was parsed incorrectly. In org.apache.streampark.flink.kubernetes.ingress.IngressStrategy, changing `val version = s"${versionInfo.getMajor}.${versionInfo.getMinor}".toDouble` to `val version = s"${versionInfo.getMajor}.${versionInfo.getMinor}".replace("+", "").toDouble` resolved the issue.
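The one-line Scala fix above boils down to stripping the non-numeric suffix before parsing. A minimal Java sketch of the same logic (the class and method names here are illustrative, not StreamPark's actual API):

```java
// Illustrative sketch of the version-parsing fix described above; the class
// and method names are hypothetical, not StreamPark's real API.
public class K8sVersionParse {
    static double parseK8sVersion(String major, String minor) {
        // Some K8s distributions report a minor version like "22+", so the
        // raw string "1.22+" would make Double.parseDouble throw a
        // NumberFormatException. Strip the "+" first, mirroring the
        // .replace("+", "") added in IngressStrategy.
        return Double.parseDouble((major + "." + minor).replace("+", ""));
    }

    public static void main(String[] args) {
        System.out.println(parseK8sVersion("1", "22+")); // 1.22
        System.out.println(parseK8sVersion("1", "19"));  // 1.19
    }
}
```

With the original code, the "1.22+" case would have thrown instead of returning 1.22, which is why ingressUrlAddress failed to resolve the REST address.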

4. When submitting a Flink on K8s job, ensure that the streampark.workspace.remote parameter is configured correctly. You can specify an HDFS (or file://) address, and the Flink parameter jobmanager.archive.fs.dir should also point to this address (StreamPark appends this parameter automatically). (In my case, I mounted a shared-directory PVC that was accessible in both the Flink pod and the StreamPark pod, which solved the issue.)
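As a rough illustration of the point above (the hdfs:// path is a placeholder, not a value from this issue): streampark.workspace.remote lives in StreamPark's configuration, while jobmanager.archive.fs.dir is a standard Flink option, and both should point at the same shared storage.

```yaml
# StreamPark configuration (placeholder path): the shared remote workspace
streampark.workspace.remote: hdfs:///streampark/workspace

# Flink configuration (flink-conf.yaml): StreamPark appends this automatically,
# pointing at the same address so both sides see the job archives
jobmanager.archive.fs.dir: hdfs:///streampark/workspace
```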

Hope this helps.

wolfboys commented 1 year ago


Thanks for your feedback. There was indeed a bug in the version parsing; we have already fixed it. Please see here.