cylondata / twister2

A composable framework for fast and scalable data analytics
https://twister2.org
Apache License 2.0

restoring k8s jobs from checkpoints #910

Closed ahmet-uyar closed 4 years ago

ahmet-uyar commented 4 years ago

I implemented job restarts from checkpoints. Failed or killed checkpointed jobs can be restarted from persistent storage. To restore a job, users must provide the jobID of the previous job as the value of the parameter twister2.job.id, and must also set the config parameter twister2.checkpointing.restore.job.
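As a rough illustration of the two settings, the sketch below collects them in a plain Java map. The key names come from this comment. The class name, the map-based helper, the placeholder job ID, and the assumption that the restore flag takes a boolean value are all illustrative and are not the actual Twister2 configuration API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RestoreConfigSketch {
  // Key names as given in this PR.
  static final String JOB_ID = "twister2.job.id";
  static final String RESTORE_JOB = "twister2.checkpointing.restore.job";

  // Hypothetical helper: builds the two overrides needed to restore a job.
  static Map<String, Object> restoreOverrides(String previousJobId) {
    if (previousJobId == null || previousJobId.isEmpty()) {
      throw new IllegalArgumentException("jobID of the previous job is required");
    }
    Map<String, Object> overrides = new LinkedHashMap<>();
    overrides.put(JOB_ID, previousJobId);
    // Assumption: the restore flag is a boolean.
    overrides.put(RESTORE_JOB, true);
    return overrides;
  }

  public static void main(String[] args) {
    // "my-previous-job-id" is a placeholder, not a real job ID.
    System.out.println(restoreOverrides("my-previous-job-id"));
  }
}
```

Rejecting an empty jobID up front is a design choice of the sketch: without the previous jobID there is nothing to restore from.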

TCPChannel now also throws an exception when a send/receive error occurs, so that workers can fail and restart properly.
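The fail-fast idea can be sketched as follows. This is not the TCPChannel code from this PR: the Sender interface, the sendOrFail helper, and the class name are hypothetical stand-ins that only show an I/O error being rethrown instead of swallowed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

public class FailFastChannelSketch {
  // Hypothetical stand-in for a low-level send operation.
  interface Sender {
    void send(byte[] data) throws IOException;
  }

  // Propagates I/O errors instead of logging and ignoring them.
  static void sendOrFail(Sender sender, byte[] data) {
    try {
      sender.send(data);
    } catch (IOException e) {
      // Rethrow so the worker fails and can then be restarted.
      throw new UncheckedIOException("send failed, failing the worker", e);
    }
  }

  public static void main(String[] args) {
    // Simulate a broken connection.
    Sender broken = data -> {
      throw new IOException("connection reset");
    };
    try {
      sendOrFail(broken, new byte[] {1, 2, 3});
    } catch (UncheckedIOException e) {
      System.out.println("worker would exit: " + e.getMessage());
    }
  }
}
```

A worker that keeps running after a swallowed channel error may hang instead of exiting, which is why the sketch lets the exception escape.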

supunkamburugamuve commented 4 years ago

@ahmet-uyar is this supported via the command line? There is a restart command available.

ahmet-uyar commented 4 years ago

> @ahmet-uyar is this supported via the command line? There is a restart command available.

@supunkamburugamuve Users submit the job as usual with the submit command. However, two configuration parameters need to be set: the job ID of the job to be restored (twister2.job.id) and the restore flag (twister2.checkpointing.restore.job).

supunkamburugamuve commented 4 years ago

@chathurawidanage can we use the restart command for this purpose? I think specifying these as configuration parameters can be hard for the user.

chathurawidanage commented 4 years ago

Yes, the restart command should work here.

./bin/twister2 restart

ahmet-uyar commented 4 years ago

> @chathurawidanage can we use the restart command for this purpose? I think specifying these as configuration parameters can be hard for the user.

@supunkamburugamuve let me look into that.

ahmet-uyar commented 4 years ago

@supunkamburugamuve and @chathurawidanage

I implemented the restart command as:

$ twister2 restart $cluster-type jobID

I also implemented clear and clearall:

$ twister2 clear $cluster-type jobID
$ twister2 clearall $cluster-type
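To make the argument shapes of the three commands explicit, here is a small Java sketch that validates them. It is not the twister2 launcher code: the class name, the describe helper, and the sample values "kubernetes" and "my-job-id" are hypothetical, and the sketch only echoes what it parsed rather than performing any action.

```java
public class JobCommandSketch {
  // Validates the argument shapes listed above and echoes the parsed command.
  static String describe(String... args) {
    if (args.length < 2) {
      throw new IllegalArgumentException("usage: <command> <cluster-type> [jobID]");
    }
    String command = args[0];
    String clusterType = args[1];
    switch (command) {
      case "restart":
      case "clear":
        // restart and clear both take a cluster type and a jobID.
        if (args.length != 3) {
          throw new IllegalArgumentException(command + " requires a jobID");
        }
        return command + " " + clusterType + " " + args[2];
      case "clearall":
        // clearall takes only a cluster type.
        if (args.length != 2) {
          throw new IllegalArgumentException("clearall takes no jobID");
        }
        return command + " " + clusterType;
      default:
        throw new IllegalArgumentException("unknown command: " + command);
    }
  }

  public static void main(String[] args) {
    // Sample values are placeholders.
    System.out.println(describe("restart", "kubernetes", "my-job-id"));
    System.out.println(describe("clearall", "kubernetes"));
  }
}
```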

I think we can merge it now.

chathurawidanage commented 4 years ago

Ahmet, could you please merge the latest changes from master into your branch? It seems merging is blocked due to conflicts.

ahmet-uyar commented 4 years ago

@chathurawidanage and @supunkamburugamuve I merged the master branch and also made a few improvements. It seems OK now.