Closed elanv closed 3 years ago
I thought about it more, but it would be nice to implement it by using CLI and getting the ID via the pod log. It is better to apply the REST API method later. What do you think? @functicons
Thanks for the PR! I will need a few days before getting back to you.
it would be nice to implement it by using CLI and getting the ID via the pod log. It is better to apply the REST API method later.
@elanv what's the reason behind it? I feel getting ID from the pod log is unreliable.
For PyFlink, we don't have to support it now, but it would be nice if the upstream Flink CLI could be improved to accept client specified ID. I guess the implementation should be easy. That way Flink CLI could be used for both. Or we could wait for https://github.com/apache/flink/pull/8532#issuecomment-730962739 to be resolved before adding PyFlink support.
It's not now, but I've found out because I need to support pyFlink later and it was mentioned in another issue. (Anyway, I found it's better to use the container termination message recorded in the Pod resource instead of requesting log to running Pod.)
If we can wait for support from upstream, I would like to finish the remaining work of this PR and ask you to review again.
Seems the problem for now is about whether PyFlink support is a priority, I don't have a preference, but if you think it is important, then it is okay to first use CLI and getting the ID via the pod log.
Replaced with #379
This PR improves submitting/tracking job and fixes job recovery bugs. I made this PR for now because there are some changes and contents to discuss. (It still works even though there are some minor works.)
Changes
Submit a job Submit via REST API instead of flink CLI. Submit a job with the job ID generated by the operator, and the operator tracks the job with that ID.
Pros: When using the CLI, the client side generates and sends a job graph to the jobmanager, but when using the REST API, the jobmanager side generates a job graph, reducing the resources used by the job submitter. Because curl is used, the job submitter image can also be lightened.
Cons: The REST API does not yet support pyFlink.
Job tracking Previously, the job submitter was responsible for monitoring the job, therefore it was not terminated after submitting the job. But this PR changed the operator to track it instead of the job submitter. The advantage is that it saves the resources allocated to it by terminating the job submitter.
Fix job recovery bug Fix bug that job recovery does not work if more than one job history remains in the job manager
Discussion
About pyFlink support: REST API does not support python job submission at this time. There is a related PR in upstream, but it seems no further progress (https://github.com/apache/flink/pull/8532#issuecomment-730962739). If we need to support pyFlink now, we need to change the job submitter to use flink CLI, which requires a change in implementation because Flink CLI does not allow specifying the ID of the job to submit. (In this case, we could make the job submitter with two containers like submitter containers and result containers - result container writes the job ID to output log and the operator reads it.)