Timeout for work machines and server

DistributedTaskScheduling / JobAdder

Source code of the JobAdder project

GNU General Public License v3.0


Timeout for work machines and server #120

Closed FellowPlanter closed 4 years ago

FellowPlanter commented 4 years ago

Marks work machines as offline whenever the Dispatcher could not send its commands in due time (120 seconds) and informs the user if a connection to the server couldn't be established.

fixes #106
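A minimal sketch of the timeout behaviour described above, not the code from this PR. `WorkMachine`, `dispatch_with_timeout`, and the `send` callback are hypothetical names chosen for illustration; only the 120-second limit comes from the description. The sketch assumes that a failed or timed-out send surfaces as an `OSError` (which covers `TimeoutError` and `ConnectionError`).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

SEND_TIMEOUT_SECONDS = 120.0  # the limit quoted in the PR description


@dataclass
class WorkMachine:
    """Hypothetical stand-in for a work machine record."""
    name: str
    online: bool = True


def dispatch_with_timeout(
    machines: Iterable[WorkMachine],
    send: Callable[[WorkMachine, float], None],
    timeout: float = SEND_TIMEOUT_SECONDS,
) -> List[WorkMachine]:
    """Send to every online machine and return those newly marked offline.

    `send` is an assumed callback that raises OSError (for example
    TimeoutError) when the machine cannot be reached within `timeout`.
    """
    newly_offline = []
    for machine in machines:
        if not machine.online:
            continue  # already known to be offline, nothing to send
        try:
            send(machine, timeout)
        except OSError:
            machine.online = False
            newly_offline.append(machine)
    return newly_offline
```

Returning the machines that went offline lets the caller decide what to do with the jobs that were running on them.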

FellowPlanter commented 4 years ago

@ammen99 It does now, but the return type of set_distribution had to be changed. The Dispatcher also sends a dummy command to all work machines in order to detect offline machines.

FellowPlanter commented 4 years ago

The second commit fixes #174: the first iteration through the new job distribution skips all jobs that are set to be run, and the second iteration dispatches or resumes them. That should be faster than sorting the distribution.
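The two-iteration idea could look roughly like the sketch below. It is an illustration under assumptions, not the code from the commit: `Action`, `Entry`, `dispatch_in_two_passes`, and the `dispatch` callback are hypothetical, and the split into "run" actions (start, resume) and everything else is inferred from the comment above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Action(Enum):
    """Hypothetical set of commands a distribution entry can carry."""
    START = "start"
    RESUME = "resume"
    PAUSE = "pause"
    CANCEL = "cancel"


# Actions that put a job on a machine; these wait for the second pass.
RUN_ACTIONS = {Action.START, Action.RESUME}


@dataclass
class Entry:
    job_id: int
    action: Action


def dispatch_in_two_passes(
    distribution: List[Entry],
    dispatch: Callable[[Entry], None],
) -> None:
    """Dispatch entries that stop jobs first, then entries that run jobs."""
    # First pass: skip every job that is set to be run.
    for entry in distribution:
        if entry.action not in RUN_ACTIONS:
            dispatch(entry)
    # Second pass: dispatch or resume the jobs skipped above.
    for entry in distribution:
        if entry.action in RUN_ACTIONS:
            dispatch(entry)
```

Both passes are linear in the length of the distribution, which is the basis for the claim that this avoids the cost of a sort.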

ammen99 commented 4 years ago

By the way, sorting won't be a problem since I already sort the distribution once in the scheduler, which means that complexity will remain the same. However, I think sorting makes the logic simpler.

ammen99 commented 4 years ago

I think we should separate the changes in this PR into two parts and prioritize the one that sorts job entries before dispatching. Otherwise, we may get thrashing on one worker if one of the commands is CANCEL, or a command may fail because it tries to acquire a license which hasn't been freed yet.
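For comparison, the sorting variant might be sketched as follows. The names `Command`, `RUN_COMMANDS`, and `sort_for_dispatch` are hypothetical, and apart from CANCEL the command names are assumptions. The sketch relies on Python's sort being stable, so commands keep their relative order within each group.

```python
from typing import List, Tuple

# A command is modelled as (job_id, command_name); this shape is assumed.
Command = Tuple[int, str]

# Commands that start work on a machine and may need a free license.
RUN_COMMANDS = {"START", "RESUME"}


def sort_for_dispatch(commands: List[Command]) -> List[Command]:
    """Order commands so that releasing ones precede running ones.

    The key is False for CANCEL/PAUSE and True for START/RESUME, and
    False sorts before True, so resources are freed before being claimed.
    """
    return sorted(commands, key=lambda command: command[1] in RUN_COMMANDS)
```

With this ordering, a CANCEL for one job reaches a worker before the START that would reuse its slot or license.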

ammen99 commented 4 years ago

@M1keReck I have included a few fixes in the timeout-2 branch; I think you should include them. For example:

I changed it so that you send the dummy command only to jobs which you don't check otherwise.
I made it so that the dispatcher doesn't return multiple machines.
I made it so that you only set the running jobs to crashed, not old jobs.
I fixed the offline scheduler test; it was wrong.

ammen99 commented 4 years ago

I think you need to rebase.

FellowPlanter commented 4 years ago

Build