Though going this route (the non-flat hash instead of a flat array) would require a separate render template when rendering the JSON, since we would want to omit the native hash and other unused fields from the result before converting it to a JSON string and sending it in the response.
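For illustration only, here is a minimal sketch of that kind of trimming, assuming `OodCore::Job::Info#to_h` exposes the native hash under a `:native` key; the list of unused fields is a hypothetical placeholder, not the app's actual selection:

```ruby
require 'json'

# Hypothetical list of fields to drop before serializing; the real selection
# would depend on which columns the table displays.
UNUSED_FIELDS = [:native, :procs, :submission_time].freeze

def trimmed_job_json(info)
  info.to_h                                           # full attribute hash from OodCore::Job::Info
      .reject { |key, _| UNUSED_FIELDS.include?(key) } # drop the native hash and other unused fields
      .to_json                                         # serialize only what we keep
end
```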
Before doing this, we should do some sort of measurement to determine how much time is spent in get_jobs excluding the calls to job_adapter.info*. Some of the instances where we experienced significant slowdown were due to the adapter code itself (or the actual qstat call).
I remember we originally had a hash, and I argued for having a flat array of jobs (as well as the separate model object) because that was simpler to reason about in the code and simpler to handle client side. Also, FWIW, this is the code that turns the info objects into a hash with all the params: https://github.com/OSC/ood_core/blob/c9077615951e91bddf126db21247684596b69495/lib/ood_core/job/info.rb#L105-L125. So this is going to be called anyway on each info object if we don't do it ourselves (with our own code, which is what the Jobstatusdata model currently does).
Also, these discussions ignore the fact that we don't have a streaming solution, so there is a limit to the number of jobs that can be handled through this approach. But since we deal with 10s and 100s of users and 100s and 1000s of jobs (not usually 10,000s or 100,000s) we have been able to get away with our current approach.
Though reducing the amount of code that is actually exercised in get_jobs could make it easier to shift to a streaming solution in the future (less code to modify).
I've run the benchmarks.
Neither of these suggestions will significantly improve the speed of this app, and the Jobstatusdata object is itself an optimization. Currently the bottlenecks are in getting the info from the adapters, transferring the data to the users, and frontend processing of the data.
Here are my results of requesting all jobs from all servers at OSC. All times are in milliseconds.
"benchmark":
Time to get adapters: 1073;
Time to convert to objects: 147;
Time to filter user: 156;
Total server time: 1377;
Got data from server in: 2010 ms;
Frontend operations complete in: 4065 ms;
- `Time to get adapters` is the cumulative time of requesting the `info_all` data.
- `Time to convert to objects` is the cumulative time of processing all of that data into `Jobstatusdata` objects.
- `Time to filter user` is the total time to filter the user's jobs up to the top of the array.
- `Total server time` is the total time spent in the `info_all` method.
- `Got data from server in` is the time between the beginning of the server request and the data arriving completely at the browser.
- `Frontend operations complete in` is the amount of time between the beginning of the ajax request on the client side and the end of the datatables rendering process.

Right now that breaks down to roughly … `Jobstatusdata` objects …

Server-side in-memory processing is not the most expensive component of the `get_jobs` method and leads to speed improvements in the other bottlenecks. By not pre-processing data on the server side we'd see increases in both data transfer and frontend processing times.
Additional benchmarks:
For a request for "All jobs" on "Owens"
"benchmark":
Time to get adapters: 500;
Time to get objects: 57;
Time to filter user: 55;
Total server time: 612;
Got data from server in: 991 ms;
Frontend operations complete in: 1830 ms;
When I request "Your Jobs" on "All Clusters" and I have one running job:
"benchmark":
Time to get adapters: 18;
Time to get objects: 0;
Time to filter user: 0;
Total server time: 18;
Got data from server in: 412 ms;
Frontend operations complete in: 441 ms;
> Here are my results of requesting all jobs from all servers at OSC. All times are in milliseconds.
For those metrics, is this based on a recent request? I.e. currently Oakley has 1631 jobs, Owens has 1328 jobs, and Ruby has 226 jobs, so 3185 total jobs. I wonder what happens if this changed to 20,000 jobs.
We could definitely improve the performance of the adapter code itself, though.
Also, how are you producing the metrics?
I've pushed up a PR at https://github.com/OSC/ood-activejobs/pull/143
To get the server-side numbers, check the last key in the JSON response. To see the client-side numbers, check the browser console.
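As a sketch of how such numbers could be gathered (this is not the app's actual instrumentation; the method and key names are illustrative), each phase can be wrapped in Ruby's `Benchmark.realtime` and the timings appended as the last key of the JSON payload:

```ruby
require 'benchmark'
require 'json'

# Illustrative only: time each server-side phase in milliseconds and return
# the timings alongside the job data, mirroring the "benchmark" key above.
def jobs_json_with_benchmark(clusters)
  timings = {}

  infos = nil
  timings['Time to get adapters'] =
    (Benchmark.realtime { infos = clusters.flat_map { |c| c.job_adapter.info_all } } * 1000).round

  jobs = nil
  timings['Time to convert to objects'] =
    (Benchmark.realtime { jobs = infos.map(&:to_h) } * 1000).round # stand-in for the Jobstatusdata conversion

  { data: jobs, benchmark: timings }.to_json
end
```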
Using the very useful rack-mini-profiler, here are the numbers after adding profiling for getting around 1500 Owens jobs.
Action | duration (ms) | from start (ms)
---|---|---
GET http://localhost:80/pun/dev/activejobs/js... | 12.6 | +0.0
Executing action: json | 1.3 | +11.0
owens.job_adapter#info_all | 72.5 | +12.0
Torque::Batch#get_jobs | 506.1 | +12.0
generate Jobstatusdata for 1535 results | 83.1 | +591.0
sorting jobs by username | 137.9 | +674.0
render jobs as json | 261.8 | +812.0
Total | 1075.4 ms |
Action | duration (ms) | from start (ms) |
---|---|---|
GET http://localhost:80/pun/dev/activejobs/js... | 11.8 | +0.0 |
Executing action: json | 1.0 | +10.0 |
owens.job_adapter#info_all | 70.0 | +11.0 |
Torque::Batch#get_jobs | 494.5 | +11.0 |
generate Jobstatusdata for 1537 results | 72.4 | +575.0 |
sorting jobs by username | 115.3 | +648.0 |
render jobs as json | 208.5 | +763.0 |
Total | 973.4 ms |
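For reference, custom rows like `owens.job_adapter#info_all` and `generate Jobstatusdata for ... results` can be produced by wrapping code in rack-mini-profiler steps. A minimal sketch, with the surrounding controller code and the `Jobstatusdata` constructor call assumed rather than copied from the app:

```ruby
# Wrap each phase in Rack::MiniProfiler.step to get a named row in the
# profiler output; the step returns the block's value.
infos = Rack::MiniProfiler.step("owens.job_adapter#info_all") do
  cluster.job_adapter.info_all
end

jobs = Rack::MiniProfiler.step("generate Jobstatusdata for #{infos.size} results") do
  infos.map { |info| Jobstatusdata.new(info) } # constructor arguments assumed
end
```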
The bulk of the work is in `Torque::Batch#get_jobs`. After adding this monkey patch, so that the Torque C library only returns the attributes we will display in the table instead of all the job attributes, the ~500 ms method call drops to ~100 ms, a huge improvement:
```ruby
# Monkey patch OodCore's Torque adapter so the Torque C library is asked only
# for the attributes the table displays, instead of every job attribute.
class OodCore::Job::Adapters::Torque::Batch
  alias_method :orig_get_jobs, :get_jobs

  def get_jobs(id: '', filters: [ :job_state, :queue, :Job_Name, :Account_Name, :job_id, :resources_used ])
    orig_get_jobs(id: id, filters: filters)
  end
end
```
Now the bulk of our time is spent after the info_all call: generating Jobstatusdata, sorting jobs by username, and rendering json:
Action | duration (ms) | from start (ms)
---|---|---
GET http://localhost:80/pun/dev/activejobs/js... | 14.7 | +0.0
Executing action: json | 1.0 | +13.0
owens.job_adapter#info_all | 37.8 | +14.0
Torque::Batch#get_jobs | 105.9 | +14.0
generate Jobstatusdata for 1538 results | 47.8 | +158.0
sorting jobs by username | 94.4 | +206.0
render jobs as json | 227.1 | +300.0
Total | 528.7 ms |
Action | duration (ms) | from start (ms)
---|---|---
GET http://localhost:80/pun/dev/activejobs/js... | 14.8 | +0.0
Executing action: json | 1.0 | +13.0
owens.job_adapter#info_all | 35.1 | +14.0
Torque::Batch#get_jobs | 106.5 | +14.0
generate Jobstatusdata for 1538 results | 51.3 | +155.0
sorting jobs by username | 86.1 | +206.0
render jobs as json | 244.9 | +293.0
Total | 539.6 ms |
Bypassing Jobstatusdata and just doing a map over the info_all results to get the minimal JSON required is a further improvement (a sketch of this approach follows the tables below):
Action | duration (ms) | from start (ms)
---|---|---
GET http://localhost:80/pun/dev/activejobs/js... | 13.9 | +0.0
Executing action: json | 0.9 | +12.0
owens.job_adapter#info_all | 69.3 | +13.0
Torque::Batch#get_jobs | 94.1 | +13.0
render 1505 jobs as json | 119.6 | +176.0
Total | 297.7 ms |
Action | duration (ms) | from start (ms)
---|---|---
GET http://localhost:80/pun/dev/activejobs/js... | 14.0 | +0.0
Executing action: json | 0.9 | +13.0
owens.job_adapter#info_all | 69.6 | +13.0
Torque::Batch#get_jobs | 102.7 | +13.0
render 1502 jobs as json | 136.3 | +186.0
Total | 323.5 ms |
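Here is a rough sketch of that bypass (not the code that produced the numbers above; the field selection and helper name are assumptions):

```ruby
require 'json'

# Hypothetical sketch: skip Jobstatusdata and build the minimal JSON payload
# directly from the OodCore::Job::Info objects returned by info_all.
def minimal_jobs_json(cluster)
  jobs = cluster.job_adapter.info_all.map do |info|
    {
      pbsid:    info.id,
      jobname:  info.job_name,
      account:  info.accounting_id,
      username: info.job_owner,
      status:   info.status.state.to_s,
      queue:    info.queue_name,
      cluster:  cluster.id.to_s
    }
  end

  { data: jobs, errors: [] }.to_json
end
```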
At this point, however, the next steps to consider include:
This exploration also shows that:
Two ideas:
1. Stop sorting the user's jobs to the top: https://github.com/OSC/ood-activejobs/blob/e7a0c471991d9af492d9fec72531a68fd5e1311a/app/controllers/pages_controller.rb#L128-L131. Since there is an explicit user filter in the dropdowns, assume that if users want to see their jobs they will filter by user.
2. Stop using `Jobstatusdata` for `get_jobs`. In `Jobstatusdata`, most of the attributes we just set directly from attributes on the info object (`Jobstatusdata#pbsid` gets `info.id`, `Jobstatusdata#username` gets `info.job_owner`). These are the attributes where we do some sort of transformation, and for each of these the transformation can be done client side in JavaScript at display time, or just ignored completely:
   - `self.status = status_label(info.status.state.to_s)` - use JavaScript to render this label
   - `self.cluster = cluster.id.to_s` - see below
   - `self.cluster_title = cluster.metadata.title || cluster.id.to_s.titleize` - see below
   - `self.walltime_used = info.wallclock_time.to_i > 0 ? pretty_time(info.wallclock_time) : ''` - use JavaScript to display walltime in the desired format
   - `self.nodes = node_array(info.allocated_nodes)` and `self.starttime = info.dispatch_time.to_i` - we don't display nodes in the columns, so this is only needed when displaying extended data, i.e. `get_job`, and only needed if Ganglia graphs are available
   - `self.extended_available = %w(torque slurm lsf pbspro).include?(cluster.job_config[:adapter])` - see below

   For `cluster`, `cluster_title`, and `extended_available`, all we really need to know for a given job record is the cluster id; a separate JavaScript object could indicate what the title for the given cluster id is, and whether that cluster has `extended_available` set to true. By changing the structure of the object returned by `get_jobs` we could remove the necessity of modifying each item we get from `adapter#info_all`. Instead of `{ data: [job1, job2, job3], errors: [] }` it could be `{ data: { ruby: [job1, job2], oakley: [job3, job4], owens: [job5, job6] } }`.

   That would still leave the Ruby server-side filtering of the `info_all` results: https://github.com/OSC/ood-activejobs/blob/e7a0c471991d9af492d9fec72531a68fd5e1311a/app/controllers/pages_controller.rb#L103-L108. We might just want to replace that "feature" with something more appropriate, or just eliminate the feature altogether.
The result of these three steps would be that get_jobs would look something like this:
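One possible shape for that refactored `get_jobs`, sketched under the assumptions above (per-cluster grouping of minimal job hashes plus a one-time cluster metadata map for titles and `extended_available`); the names and structure are illustrative, not the snippet from the original comment:

```ruby
# Illustrative sketch only: group minimal job hashes by cluster id and send
# cluster-level metadata once, instead of repeating it on every job row.
def get_jobs(clusters)
  errors = []

  data = clusters.each_with_object({}) do |cluster, hash|
    begin
      hash[cluster.id] = cluster.job_adapter.info_all.map do |info|
        { pbsid: info.id, username: info.job_owner, status: info.status.state.to_s, queue: info.queue_name }
      end
    rescue StandardError => e
      errors << "#{cluster.id}: #{e.message}"
    end
  end

  clusters_meta = clusters.map do |cluster|
    [cluster.id, {
      title: cluster.metadata.title || cluster.id.to_s.titleize, # titleize via ActiveSupport in the Rails app
      extended_available: %w(torque slurm lsf pbspro).include?(cluster.job_config[:adapter])
    }]
  end.to_h

  { data: data, clusters: clusters_meta, errors: errors }
end
```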