datastax / cstar_perf

Apache Cassandra performance testing platform
Apache License 2.0

CSTAR-85: Improve JobCancellationTracker / Gracefully handle termination in stress_compare #204

Closed nastra closed 8 years ago

nastra commented 8 years ago

Commit 1 - making Cancel actually work

Previously, API calls to /api/tests/status/id/ or /api/tests/progress/id failed with a 401 (Unauthorized), because the API user test-cluster was not in the users table and also had no role assigned.

The queries below show that the API user test-cluster is not in the users table by default. I also could not find any documentation for that requirement.

select * from cstar_perf.users;

 user_id           | full_name       | roles
-------------------+-----------------+-------------------
  user@example.com |  User Full Name |          {'user'}
 admin@example.com | Admin Full Name | {'admin', 'user'}

(2 rows)

cqlsh> select * from cstar_perf.api_pubkeys ;

 name         | pubkey                                                           | user_type
--------------+------------------------------------------------------------------+-----------
 test-cluster | LrDzaqtOF5t8RDFkqZE3N57biLAaTu+aqXVKK2iuNBCecmnUmHx4tuN89/p56yzB |   cluster

After talking to @aboudreault, we came to the conclusion that we really shouldn't do the role check for the API endpoints mentioned here. A simple authentication check should be enough at this point.
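To illustrate the difference between a role check and a plain authentication check, here is a minimal, framework-free sketch. It is not the actual cstar_perf code: the `requires_auth` decorator, the `USERS` dict (standing in for the users table), the `Unauthorized` exception (standing in for a 401 response) and the view functions are all hypothetical names chosen for this example.

```python
import functools


class Unauthorized(Exception):
    """Hypothetical stand-in for an HTTP 401 response."""


# Hypothetical stand-in for the cstar_perf.users table.
# Note that the API user test-cluster has no entry here.
USERS = {
    "user@example.com": {"user"},
    "admin@example.com": {"admin", "user"},
}


def requires_auth(*roles):
    """Require a logged-in session; check roles only if any are given."""
    def decorator(view):
        @functools.wraps(view)
        def wrapper(session, *args, **kwargs):
            if not session.get("logged_in"):
                raise Unauthorized("not logged in")
            if roles:
                user_roles = USERS.get(session.get("user_id"), set())
                if not user_roles.intersection(roles):
                    raise Unauthorized("missing role")
            return view(session, *args, **kwargs)
        return wrapper
    return decorator


@requires_auth("user")
def status_with_role_check(session, test_id):
    # Role check: rejects any user that is not in USERS.
    return "status of " + test_id


@requires_auth()
def status_auth_only(session, test_id):
    # Authentication only: any logged-in session is accepted.
    return "status of " + test_id
```

With a session such as `{"logged_in": True, "user_id": "test-cluster"}`, the role-checked view raises `Unauthorized`, while the authentication-only view returns normally. That mirrors the behaviour described above under the stated assumptions.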

The log excerpt below shows that the API call to /api/tests/status/id/ was indeed made as the API user test-cluster and failed with a 401.

INFO:geventwebsocket.handler:172.17.0.2 - - [2016-05-10 17:26:58] "GET /api/tests/status/id/422b7d06-16d4-11e6-8007-0242ac110004 HTTP/1.1" 401 6843 0.051804
DEBUG:geventwebsocket.handler:Initializing WebSocket
DEBUG:geventwebsocket.handler:Validating WebSocket request
INFO:geventwebsocket.handler:172.17.0.2 - - [2016-05-10 17:26:58] "GET /api/login HTTP/1.1" 200 560 0.026101
DEBUG:geventwebsocket.handler:Initializing WebSocket
DEBUG:geventwebsocket.handler:Validating WebSocket request
DEBUG:geventwebsocket.handler:Can only upgrade connection if using GET method.
INFO:geventwebsocket.handler:172.17.0.2 - - [2016-05-10 17:26:58] "POST /api/login HTTP/1.1" 200 451 0.070820
DEBUG:geventwebsocket.handler:Initializing WebSocket
DEBUG:geventwebsocket.handler:Validating WebSocket request
ERROR:cstar_perf.controllers:REQUIRES AUTH
INFO:cstar_perf.controllers:test-cluster
INFO:cstar_perf.controllers:True
INFO:cstar_perf.controllers:<SecureCookieSession {u'logged_in': True, u'_csrf_token': 'T4X5SS94K56O9NV6AJDYCGR85PYIO7LQ', u'user_id': u'test-cluster', u'bypass_csrf': True, u'unsigned_access_token': 'PNT834USH58HNDM0P9B4A0EBN9U6YJLR'}>
INFO:cstar_perf.controllers:<cstar_perf.frontend.lib.crypto.APIKey object at 0x7f61776edc50>
INFO:geventwebsocket.handler:172.17.0.2 - - [2016-05-10 17:26:58] "GET /api/tests/status/id/422b7d06-16d4-11e6-8007-0242ac110004 HTTP/1.1" 401 6843 0.024216

PS: This issue happens on a dockerized cstar_perf deployment as well as on the DSE cstar_perf deployment that we have. I think it doesn't happen on cstar.datastax.com because the API user was added to the users table.

Commit 2 - Handling termination from Cancel in a graceful manner

After solving the Cancel problem in the first commit, I finally know why it never attempts to copy back the log files. The problem is that the JobCancellationTracker kills both cassandra-stress and stress_compare (which handles creating the stats files, copying back the log files, copying the flamegraphs, ...). Not killing the stress_compare process would actually mean that once you hit the Cancel button:

The second commit now handles the termination of stress_compare gracefully and tries to finish processing the files of the currently running revision.
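The general idea of graceful termination can be sketched as follows. This is only an illustration of the pattern, not the code from the commit: `CancellationFlag`, `install`, `run_revisions` and the `run_one` / `collect_artifacts` callbacks are hypothetical names, and the use of SIGTERM is an assumption.

```python
import signal


class CancellationFlag:
    """Records that a termination signal arrived instead of exiting at once."""

    def __init__(self):
        self.cancelled = False

    def handle(self, signum, frame):
        # Signal handler: only set a flag, the main loop decides when to stop.
        self.cancelled = True


def install(flag):
    # Assumption: cancellation arrives as SIGTERM. Must run in the main thread.
    signal.signal(signal.SIGTERM, flag.handle)


def run_revisions(revisions, run_one, collect_artifacts, flag):
    """Run each revision; after a cancel, finish the current one and stop."""
    processed = []
    for revision in revisions:
        if flag.cancelled:
            break  # do not start further revisions after a cancel
        try:
            run_one(revision)
        finally:
            # Always collect stats / logs for the revision that was running.
            collect_artifacts(revision)
        processed.append(revision)
    return processed
```

If the signal arrives while the second of three revisions is running, this sketch still collects the files for the first two revisions and never starts the third.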

Commit 3 - Correctly set the job status to cancelled if it was cancelled by the user

The third commit now correctly sets the job status to cancelled in the DB if the job was cancelled by the user.
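As a rough sketch of the status decision, assuming the status strings `cancelled`, `completed` and `failed` and a hypothetical helper named `final_status` (neither is taken from the actual commit):

```python
def final_status(cancelled_by_user, stats_available):
    """Pick the status to store in the DB once a job has stopped running."""
    if cancelled_by_user:
        # A user cancel wins, even if partial stats files were produced.
        return "cancelled"
    return "completed" if stats_available else "failed"
```

Checking the cancel flag first is a deliberate choice in this sketch: a cancelled job may still have partial stats and would otherwise be reported as completed.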

aboudreault commented 8 years ago

LGTM Eduard!

nastra commented 8 years ago

@mshuler one thing I forgot to mention is that a cancelled test now shows up as completed (previously it would just have failed because of missing stats files). However, it will only contain partial data from graphs / logs, so people might get confused. I think we should consider adding a new status Cancelled or something. wdyt?

mshuler commented 8 years ago

I would be supportive of displaying a Cancelled status. In fact, I would be supportive of displaying Failed tests, too, since it is nearly impossible to find other users' failed tests.

nastra commented 8 years ago

@mshuler I totally agree that we should have an overview of Failed and Cancelled tests too (which should be done in a separate PR).

Just to make things clear: if a user now hits the Cancel button, the job will eventually have the Completed status in the UI, but will only have partial graphs, logs and such.

[Image: completed_status]

This might lead to some confusion.

Therefore I'm suggesting that jobs which got cancelled actually get the cancelled status in the UI. The question is whether we want to add this cancelled status as part of CSTAR-85 or in a separate ticket & PR.

nastra commented 8 years ago

@mshuler if you're done reviewing, please don't merge yet.

mshuler commented 8 years ago

At this time, I still don't have a dev env to test on without setting up on real server(s). https://github.com/datastax/cstar_perf/issues/209

I trust it'll be fine, and we'll fix it if not :)