OSC / ood-activejobs

[MOVED] Active Jobs provides details of scheduled jobs on an HPC cluster.
https://osc.github.io/Open-OnDemand/
MIT License

Improve performance of `PagesController#get_jobs` #142

Closed: ericfranz closed this issue 5 years ago

ericfranz commented 7 years ago

Three ideas:

  1. stop sorting the user's jobs to the top: https://github.com/OSC/ood-activejobs/blob/e7a0c471991d9af492d9fec72531a68fd5e1311a/app/controllers/pages_controller.rb#L128-L131. Since there is an explicit user filter in the dropdowns, assume that users who want to see their own jobs will filter by user

  2. stop using Jobstatusdata for get_jobs. In Jobstatusdata, most of the attributes are set directly from attributes on the info object (Jobstatusdata#pbsid gets info.id, Jobstatusdata#username gets info.job_owner). The following are the attributes where we do some sort of transformation, and for each of these the transformation can be done client side in JavaScript at display time, or ignored completely:

    1. self.status = status_label(info.status.state.to_s) - use JavaScript to render this label
    2. self.cluster = cluster.id.to_s - see below
    3. self.cluster_title = cluster.metadata.title || cluster.id.to_s.titleize - see below
    4. self.walltime_used = info.wallclock_time.to_i > 0 ? pretty_time(info.wallclock_time) : '' - use JavaScript to display walltime in desired format
    5. self.nodes = node_array(info.allocated_nodes) and self.starttime = info.dispatch_time.to_i - we don't display nodes in the columns, so this is only needed for when displaying extended data i.e. get_job and only needed if ganglia graphs are available
    6. self.extended_available = %w(torque slurm lsf pbspro).include?(cluster.job_config[:adapter]) - see below

For cluster, cluster_title, and extended_available, all we really need to know for a given job record is the cluster id; a separate JavaScript object could indicate the title for a given cluster id and whether that cluster has extended_available set to true. By changing the structure of the object returned by get_jobs we could avoid modifying each item we get from adapter#info_all. Instead of { data: [job1, job2, job3], errors: [] } it could be { data: { ruby: [job1, job2], oakley: [job3, job4], owens: [job5, job6] }, errors: [] }.
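As a sketch of that idea (the cluster names, metadata keys, and values here are hypothetical, not existing code), the client would receive jobs grouped by cluster id plus a small per-cluster lookup object, instead of repeating the title and extended_available flag on every job:

```ruby
require 'json'

# Hypothetical per-cluster metadata the client uses for display,
# instead of cluster_title/extended_available on each job record.
clusters_meta = {
  ruby:   { title: 'Ruby',   extended_available: true },
  oakley: { title: 'Oakley', extended_available: true }
}

# Proposed response shape: jobs keyed by cluster id.
response = {
  data: {
    ruby:   [{ id: '123.ruby-batch',   status: 'R' }],
    oakley: [{ id: '456.oakley-batch', status: 'Q' }]
  },
  errors: []
}

# Simulate the client side: parse the JSON and look titles up
# from the metadata object rather than from each job.
json = JSON.parse(JSON.generate(response))
json['data'].each_key do |cluster_id|
  meta = clusters_meta[cluster_id.to_sym]
  puts "#{meta[:title]}: #{json['data'][cluster_id].size} job(s)"
end
```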

That would still leave the Ruby server-side filtering of the info_all results: https://github.com/OSC/ood-activejobs/blob/e7a0c471991d9af492d9fec72531a68fd5e1311a/app/controllers/pages_controller.rb#L103-L108. We might want to replace that "feature" with something more appropriate, or eliminate it altogether.

  3. stop filtering completed jobs out of the list server-side: https://github.com/OSC/ood-activejobs/blob/e7a0c471991d9af492d9fec72531a68fd5e1311a/app/controllers/pages_controller.rb#L115-L119

The result of these three steps would be that get_jobs would look something like this:

```ruby
def get_jobs
  jobs = {}
  errors = []
  selected_filter = get_filter
  selected_cluster = get_cluster

  OODClusters.select { |c| selected_cluster == "all" || selected_cluster == c.id.to_s }.each do |cluster|
    jobs[cluster.id] = []

    begin
      if selected_filter == 'user'
        jobs[cluster.id] = cluster.job_adapter.info_where_owner(OodSupport::User.new.name)
      elsif selected_filter == 'group'
        jobs[cluster.id] = cluster.job_adapter.info_where_group(OodSupport::User.new.group.name)
      else
        jobs[cluster.id] = cluster.job_adapter.info_all
      end
    rescue => e
      errors << e.message
    end
  end

  { data: jobs, errors: errors }
end
```
ericfranz commented 7 years ago

Though going this route (a hash keyed by cluster instead of a flat array) would require a separate render template when rendering the JSON, since we would want to omit the native hash and other unused fields from the result before converting it to a JSON string and sending the response.

ericfranz commented 7 years ago

Before doing this, we should measure how much time is spent in get_jobs excluding the calls to job_adapter.info*. Some of the instances where we experienced significant slowdown were due to the adapter code itself (or the actual qstat call).

I remember we originally had a hash, and I argued for a flat array of jobs (as well as the separate model object), since that was simpler to reason about in the code and simpler to handle client side. Also, FWIW, this is the code that turns the info objects into a hash with all the params: https://github.com/OSC/ood_core/blob/c9077615951e91bddf126db21247684596b69495/lib/ood_core/job/info.rb#L105-L125. So this is going to be called anyway on each info object if we don't do it ourselves, which is what the Jobstatusdata model currently does with our own code.

ericfranz commented 7 years ago

Also, these discussions ignore the fact that we don't have a streaming solution, so there is a limit to the number of jobs that can be handled through this approach. But since we deal with 10s and 100s of users and 100s and 1000s of jobs (not usually 10,000s or 100,000s) we have been able to get away with our current approach.

Though reducing the amount of code that actually is used in get_jobs could make it easier to shift to a streaming solution in the future (less code to modify).

brianmcmichael commented 7 years ago

I've run the benchmarks.

Neither of these suggestions will significantly improve the speed of this app, and the JobStatusData object is itself an optimization. Currently the bottlenecks are in getting the info from the adapters, transferring the data to the users, and frontend processing of the data.

Here are my results of requesting all jobs from all servers at OSC. All times are in milliseconds.

```
"benchmark":
  Time to get adapters:            1073
  Time to convert to objects:       147
  Time to filter user:              156
  Total server time:               1377
  Got data from server in:         2010 ms
  Frontend operations complete in: 4065 ms
```

Right now that breaks down to roughly: ~1.1 s getting data from the adapters, ~0.3 s of server-side in-memory processing, ~0.6 s transferring the data, and ~2 s of frontend processing.

Server-side in-memory processing is therefore not the most expensive component of the get_jobs method, and the pre-processing it does speeds up the other bottlenecks: if we stopped pre-processing data on the server side, we would see increases in both data transfer and frontend processing times.

brianmcmichael commented 7 years ago

Additional benchmarks:

For a request for "All jobs" on "Owens"

```
"benchmark":
  Time to get adapters:            500
  Time to get objects:              57
  Time to filter user:              55
  Total server time:               612
  Got data from server in:         991 ms
  Frontend operations complete in: 1830 ms
```

When I request "Your Jobs" on "All Clusters" and I have one running job:

```
"benchmark":
  Time to get adapters:            18
  Time to get objects:             0
  Time to filter user:             0
  Total server time:               18
  Got data from server in:         412 ms
  Frontend operations complete in: 441 ms
```
ericfranz commented 7 years ago

> Here are my results of requesting all jobs from all servers at OSC. All times are in milliseconds.

For those metrics, is this based on a recent request? Currently Oakley has 1631 jobs, Owens has 1328 jobs, and Ruby has 226 jobs, so 3185 total jobs. I wonder what happens when this grows to 20,000 jobs.

We could definitely improve the performance of the adapter code itself, though.

Also, how are you producing the metrics?

brianmcmichael commented 7 years ago

I've pushed up a PR at https://github.com/OSC/ood-activejobs/pull/143

To get the server-side numbers, check the last key in the json response. To check the client-side numbers, check the browser console.

ericfranz commented 5 years ago

Using the very useful rack-mini-profiler, here are the numbers after adding profiling for getting around 1500 Owens jobs.

```
Action                                            duration (ms)  from start (ms)
GET http://localhost:80/pun/dev/activejobs/js...           12.6            +0.0
Executing action: json                                      1.3           +11.0
owens.job_adapter#info_all                                 72.5           +12.0
Torque::Batch#get_jobs                                    506.1           +12.0
generate Jobstatusdata for 1535 results                    83.1          +591.0
sorting jobs by username                                  137.9          +674.0
render jobs as json                                       261.8          +812.0
Total                                                    1075.4 ms
```

```
Action                                            duration (ms)  from start (ms)
GET http://localhost:80/pun/dev/activejobs/js...           11.8            +0.0
Executing action: json                                      1.0           +10.0
owens.job_adapter#info_all                                 70.0           +11.0
Torque::Batch#get_jobs                                    494.5           +11.0
generate Jobstatusdata for 1537 results                    72.4          +575.0
sorting jobs by username                                  115.3          +648.0
render jobs as json                                       208.5          +763.0
Total                                                     973.4 ms
```

The bulk of the work is in Torque::Batch#get_jobs. After adding the monkey patch below, so that the Torque C library only fetches the attributes we will display in the table instead of all the job attributes, the ~500 ms method call drops to ~100 ms, a huge improvement:

```ruby
# Restrict the attributes requested from the Torque C library to the
# ones the Active Jobs table actually displays.
class OodCore::Job::Adapters::Torque::Batch
  alias_method :orig_get_jobs, :get_jobs
  def get_jobs(id: '', filters: [:job_state, :queue, :Job_Name, :Account_Name, :job_id, :resources_used])
    orig_get_jobs(id: id, filters: filters)
  end
end
```

Now the bulk of our time is spent after the info_all call: generating Jobstatusdata, sorting jobs by username, and rendering json:

```
Action                                            duration (ms)  from start (ms)
GET http://localhost:80/pun/dev/activejobs/js...           14.7            +0.0
Executing action: json                                      1.0           +13.0
owens.job_adapter#info_all                                 37.8           +14.0
Torque::Batch#get_jobs                                    105.9           +14.0
generate Jobstatusdata for 1538 results                    47.8          +158.0
sorting jobs by username                                   94.4          +206.0
render jobs as json                                       227.1          +300.0
Total                                                     528.7 ms
```

```
Action                                            duration (ms)  from start (ms)
GET http://localhost:80/pun/dev/activejobs/js...           14.8            +0.0
Executing action: json                                      1.0           +13.0
owens.job_adapter#info_all                                 35.1           +14.0
Torque::Batch#get_jobs                                    106.5           +14.0
generate Jobstatusdata for 1538 results                    51.3          +155.0
sorting jobs by username                                   86.1          +206.0
render jobs as json                                       244.9          +293.0
Total                                                     539.6 ms
```

Bypassing Jobstatusdata, and just doing a map over the info_all call to get the minimal json required is a further improvement:

```
Action                                            duration (ms)  from start (ms)
GET http://localhost:80/pun/dev/activejobs/js...           13.9            +0.0
Executing action: json                                      0.9           +12.0
owens.job_adapter#info_all                                 69.3           +13.0
Torque::Batch#get_jobs                                     94.1           +13.0
render 1505 jobs as json                                  119.6          +176.0
Total                                                     297.7 ms
```

```
Action                                            duration (ms)  from start (ms)
GET http://localhost:80/pun/dev/activejobs/js...           14.0            +0.0
Executing action: json                                      0.9           +13.0
owens.job_adapter#info_all                                 69.6           +13.0
Torque::Batch#get_jobs                                    102.7           +13.0
render 1502 jobs as json                                  136.3          +186.0
Total                                                     323.5 ms
```
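The map over info_all described above could look roughly like this (a sketch: the info object is stubbed with a Struct here, and which attributes to keep is an assumption based on the table columns, not the actual change):

```ruby
# Stub standing in for OodCore::Job::Info objects returned by
# job_adapter.info_all; in the app these come from the adapter.
Info = Struct.new(:id, :job_owner, :job_name, :queue_name, :status,
                  :wallclock_time, keyword_init: true)

infos = [
  Info.new(id: '123.owens-batch', job_owner: 'efranz', job_name: 'test',
           queue_name: 'batch', status: :running, wallclock_time: 3600)
]

# Build the minimal hashes to render as JSON, skipping Jobstatusdata.
jobs = infos.map do |info|
  {
    id:       info.id,
    username: info.job_owner,
    jobname:  info.job_name,
    queue:    info.queue_name,
    status:   info.status.to_s,        # raw state; label rendered client side
    walltime: info.wallclock_time.to_i # seconds; formatted client side
  }
end
```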

At this point, the next steps to consider include:

  1. Follow the suggestions at the top, shifting responsibility for some of this code to the client
  2. Shift more of the responsibility to ood_core (including the "native attrs", "human readable strings", etc. for job details)
  3. Try a streaming approach as opposed to reading 100% of the data up front. Does this have a negative impact?
  4. Repeat this work with the other adapters. A Vagrant image with those adapters, systems running those adapters, or mock data could help here.
  5. Consider https://github.com/ohler55/oj after all, though it was originally dropped from the Dashboard app due to concerns about adding extra dependencies.

This exploration also shows that:

  1. Optimizing this app's code will benefit all adapters
  2. It is worth spending some time with ood_core itself, and possibly writing an automated performance test with mock data.
  3. Even with this work, at some point it isn't worth optimizing much further when we could instead explore a caching solution; that will be easier to implement in the coming year than it is right now.
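To illustrate the caching idea in item 3, here is a minimal sketch of a short-TTL in-memory cache that could sit in front of the info_all calls (hypothetical code, not part of the app; the TTL value is an arbitrary example):

```ruby
# Sketch: cache the last result per cluster for a few seconds so that
# repeated page loads reuse it instead of re-running qstat each time.
class TtlCache
  def initialize(ttl: 10)
    @ttl = ttl
    @store = {}
    @mutex = Mutex.new
  end

  # Returns the cached value for key if it is fresher than ttl seconds,
  # otherwise runs the block and caches its result.
  def fetch(key)
    @mutex.synchronize do
      entry = @store[key]
      now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      if entry && now - entry[:at] < @ttl
        entry[:value]
      else
        value = yield
        @store[key] = { value: value, at: now }
        value
      end
    end
  end
end

cache = TtlCache.new(ttl: 10)
calls = 0
# Second fetch within the TTL reuses the cached result; the block
# (standing in for cluster.job_adapter.info_all) runs only once.
2.times { cache.fetch(:owens) { calls += 1; [:job1, :job2] } }
```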