I think this is a potential new race condition that may have been introduced with #204. The new behavior that came with #204 is that terminate() is used to perform a graceful shutdown of cstar_perf_stress (which calls stress_compare). This graceful shutdown is needed to grab the logs in case the job failed or got cancelled. The side effect is that it takes a bit longer to write the results to the DB and eventually mark the job as completed.
If we look closely at the logs, line 4 clearly shows that the job completed. In line 5 the client asks for another job by executing get_work() from cluster_api.py. However, if the client detects that there are still jobs in_progress, it marks them as failed, which isn't correct IMO. Line 9 shows that the job's state indeed changed to failed.
If the completed state isn't written to the DB fast enough, a job can be marked as failed by the client even though there was no failure (this is what happened in my case). A minimal sketch of this timing issue follows the log excerpt below.
1. DEBUG:cstar_perf.client:Sending job completion message for b674c67c-1da5-11e6-976c-0cc47a4e5c68 ...
2. DEBUG:cstar_perf.socket_comms:Sending command : {"action": "test_done", "status": "completed", "test_id": "b674c67c-1da5-11e6-976c-0cc47a4e5c68", "type": "command", "command_id": "44db6632-1da6-11e6-bf82-0cc47a4e5c68"}
3. DEBUG:cstar_perf.socket_comms:Received response : {u'type': u'response', u'message': u'test_update', u'test_id': u'b674c67c-1da5-11e6-976c-0cc47a4e5c68', u'done': True, u'command_id': u'44db6632-1da6-11e6-bf82-0cc47a4e5c68'}
4. DEBUG:cstar_perf.client:Server confirms job b674c67c-1da5-11e6-976c-0cc47a4e5c68 is complete.
5. DEBUG:cstar_perf.client:Asking for a new job...
6. DEBUG:cstar_perf.socket_comms:Sending command : {"action": "get_work", "type": "command", "command_id": "44e217e8-1da6-11e6-bf82-0cc47a4e5c68"}
7. DEBUG:urllib3.connectionpool:Setting read timeout to None
8. DEBUG:urllib3.connectionpool:"GET /api/tests/status/id/b674c67c-1da5-11e6-976c-0cc47a4e5c68 HTTP/1.1" 200 24
9. DEBUG:cstar_perf.client:JobCancellationTracker -- status of test_id b674c67c-1da5-11e6-976c-0cc47a4e5c68 is: {u'status': u'failed'}
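To make the timing easier to see, here is a minimal, self-contained sketch of the suspected race. The names (finish_job_slowly, get_work, the status dict) are hypothetical stand-ins for illustration only and do not mirror the real client, cluster_api.py, or JobCancellationTracker code:

```python
# Hypothetical, self-contained sketch of the suspected race -- the names and
# timing below are illustrative only, not the actual cstar_perf code paths.
import threading
import time

status = {'b674c67c': 'in_progress'}  # stands in for the job's status row in the DB
lock = threading.Lock()

def finish_job_slowly(test_id):
    # Simulates the slower completion path after #204: terminate() shuts
    # cstar_perf_stress down gracefully and collects logs before the
    # 'completed' status finally gets written to the DB.
    time.sleep(0.5)
    with lock:
        if status[test_id] == 'in_progress':
            status[test_id] = 'completed'

def get_work(test_id):
    # Simulates the check on the next get_work(): any job still listed as
    # in_progress is assumed to be dead and gets marked failed.
    with lock:
        if status[test_id] == 'in_progress':
            status[test_id] = 'failed'  # the job was actually fine, just slow to report

finisher = threading.Thread(target=finish_job_slowly, args=('b674c67c',))
finisher.start()
get_work('b674c67c')  # runs before the completion write lands
finisher.join()
print(status)  # {'b674c67c': 'failed'} -- a completed job ends up marked failed
```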
The changes in model.py only improve debugging by writing more meaningful information to the logs.