GoogleCloudPlatform / flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.
Apache License 2.0
658 stars 266 forks source link

Improve submitting/tracking job and fix job recovery bug #372

Closed elanv closed 3 years ago

elanv commented 3 years ago

This PR improves submitting/tracking job and fixes job recovery bugs. I made this PR for now because there are some changes and contents to discuss. (It still works even though there are some minor works.)

Changes

Discussion

About pyFlink support: REST API does not support python job submission at this time. There is a related PR in upstream, but it seems no further progress (https://github.com/apache/flink/pull/8532#issuecomment-730962739). If we need to support pyFlink now, we need to change the job submitter to use flink CLI, which requires a change in implementation because Flink CLI does not allow specifying the ID of the job to submit. (In this case, we could make the job submitter with two containers like submitter containers and result containers - result container writes the job ID to output log and the operator reads it.)

elanv commented 3 years ago

I thought about it more, but it would be nice to implement it by using CLI and getting the ID via the pod log. It is better to apply the REST API method later. What do you think? @functicons

functicons commented 3 years ago

Thanks for the PR! I will need a few days before getting back to you.

functicons commented 3 years ago

it would be nice to implement it by using CLI and getting the ID via the pod log. It is better to apply the REST API method later.

@elanv what's the reason behind it? I feel getting ID from the pod log is unreliable.

For PyFlink, we don't have to support it now, but it would be nice if the upstream Flink CLI could be improved to accept client specified ID. I guess the implementation should be easy. That way Flink CLI could be used for both. Or we could wait for https://github.com/apache/flink/pull/8532#issuecomment-730962739 to be resolved before adding PyFlink support.

elanv commented 3 years ago

It's not now, but I've found out because I need to support pyFlink later and it was mentioned in another issue. (Anyway, I found it's better to use the container termination message recorded in the Pod resource instead of requesting log to running Pod.)

If we can wait for support from upstream, I would like to finish the remaining work of this PR and ask you to review again.

functicons commented 3 years ago

Seems the problem for now is about whether PyFlink support is a priority, I don't have a preference, but if you think it is important, then it is okay to first use CLI and getting the ID via the pod log.

elanv commented 3 years ago

Replaced with #379