eosnetworkfoundation / chicken-dance

Chicken Dance distributed replay of transactions
MIT License

Add cron job to do HTTP PUT `/jobtimeoutcheck` that checks for timeout conditions and performs corrective action #80

Open ericpassmore opened 9 months ago

ericpassmore commented 9 months ago

The timeout check needs to:

### Tasks
- [ ] set status to `TIMEOUT` when the threshold is exceeded on last_updated, or on start_time when last_update is not set
- [ ] call `/hosts?reap=all` to kill off hosts associated with the timed-out job
- [ ] re-init the job back to status `WAITING_4_WORKER`
- [ ] post `/hosts?count=` to start new hosts to take the place of the hosts terminated above
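
A rough sketch of how the tasks above might fit together, not the actual implementation. The job dictionary layout, the field name `last_updated`, the step labels, and the function name `plan_timeout_recovery` are all assumptions for illustration, as is limiting the check to jobs in `WORKING` status. The threshold value is left to the caller since none is specified here.

```python
from datetime import datetime, timedelta
from typing import List


def plan_timeout_recovery(job: dict, now: datetime, threshold: timedelta) -> List[str]:
    """Return the ordered recovery steps for a job, or [] if no timeout occurred.

    Hypothetical job layout: {"status": str, "start_time": datetime,
    "last_updated": datetime or None}.
    """
    # Assumption: only jobs that are actively being worked on can time out.
    if job["status"] != "WORKING":
        return []

    # Use last_updated when present, otherwise fall back to start_time.
    reference = job.get("last_updated")
    if reference is None:
        reference = job["start_time"]

    if now - reference <= threshold:
        return []

    # Step labels are illustrative; they mirror the task list order.
    return [
        "set_status:TIMEOUT",
        "reap_hosts",
        "set_status:WAITING_4_WORKER",
        "start_replacement_hosts",
    ]
```

Returning a plan rather than performing the calls keeps the timeout decision testable without a running orchestration service.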

ericpassmore commented 6 months ago

To support AWS spot instances we need to recover jobs that are orphaned. This is accomplished via a timeout check. Job status and times are updated every few minutes with the most recent block, so every WORKING job should have a recent update time. Add a cron job to do an HTTP PUT to `/jobtimeoutcheck` that checks for timeout conditions and performs corrective action.
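
A minimal sketch of a script the cron job could run, using only the Python standard library. The base URL, the function names, and the 30 second request timeout are assumptions; only the PUT method and the `/jobtimeoutcheck` path come from this issue.

```python
import urllib.request


def build_timeout_check_request(base_url: str) -> urllib.request.Request:
    """Build the HTTP PUT request for the timeout check endpoint."""
    url = base_url.rstrip("/") + "/jobtimeoutcheck"
    return urllib.request.Request(url, method="PUT")


def run_timeout_check(base_url: str, timeout_seconds: float = 30.0) -> int:
    """Send the PUT request and return the HTTP status code.

    base_url is a placeholder for wherever the orchestration service listens.
    """
    request = build_timeout_check_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.status
```

A crontab entry would then call `run_timeout_check` with the service address on whatever interval suits the update cadence; the exact schedule is not specified here.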

ericpassmore commented 6 months ago

Tagged with faster-replay because spot instances should allow 2x more nodes.