datastax / cstar_perf

Apache Cassandra performance testing platform
Apache License 2.0
72 stars 34 forks source link

Don't mark jobs as failed when the client is asking for a new job #213

Closed nastra closed 8 years ago

nastra commented 8 years ago

I think this is a potential new race condition, which might be introduced with #204. The new behavior that came with #204 is that terminate() is used in order to perform a graceful shutdown of cstar_perf_stress (which calls stress_compare). This graceful shutdown is needed to grab the logs in case the job failed / got cancelled. The side effect of this is that it takes a bit longer to write results to the DB and eventually mark a job as completed.

If we look closely at the logs, Line 4 clearly tells that the job completed. In Line 5 the client asks for another job by executing get_work() from cluster_api.py. However, if the client detects that there are still jobs in_progress, it will mark them as failed, which isn't correct IMO. Line 9 shows that the new state of the job indeed changed to failed.

If the completed state wasn't written fast enough to the DB, a job can be marked as failed by the client, even though there was no failure (this is what was happening in my case)

1. DEBUG:cstar_perf.client:Sending job completion message for b674c67c-1da5-11e6-976c-0cc47a4e5c68 ...
2. DEBUG:cstar_perf.socket_comms:Sending command : {"action": "test_done", "status": "completed", "test_id": "b674c67c-1da5-11e6-976c-0cc47a4e5c68", "type": "command", "command_id": "44db6632-1da6-11e6-bf82-0cc47a4e5c68"}
3. DEBUG:cstar_perf.socket_comms:Received response : {u'type': u'response', u'message': u'test_update', u'test_id': u'b674c67c-1da5-11e6-976c-0cc47a4e5c68', u'done': True, u'command_id': u'44db6632-1da6-11e6-bf82-0cc47a4e5c68'}
4. DEBUG:cstar_perf.client:Server confirms job b674c67c-1da5-11e6-976c-0cc47a4e5c68 is complete.
5. DEBUG:cstar_perf.client:Asking for a new job...
6. DEBUG:cstar_perf.socket_comms:Sending command : {"action": "get_work", "type": "command", "command_id": "44e217e8-1da6-11e6-bf82-0cc47a4e5c68"}
7. DEBUG:urllib3.connectionpool:Setting read timeout to None
8. DEBUG:urllib3.connectionpool:"GET /api/tests/status/id/b674c67c-1da5-11e6-976c-0cc47a4e5c68 HTTP/1.1" 200 24
9. DEBUG:cstar_perf.client:JobCancellationTracker -- status of test_id b674c67c-1da5-11e6-976c-0cc47a4e5c68 is: {u'status': u'failed'}

The changes in model.py are only to improve debugging and write more meaningful information to the logs.

nastra commented 8 years ago

In fact I think the problem's origin is something different. I need to do more digging