OSC / ood-activejobs

[MOVED] Active Jobs provides details of scheduled jobs on an HPC cluster.
https://osc.github.io/Open-OnDemand/
MIT License

Improve performance of `PagesController#get_jobs` #142

Closed: ericfranz closed this issue 5 years ago

ericfranz commented 7 years ago

Three ideas:

  1. stop sorting the user's jobs to the top: https://github.com/OSC/ood-activejobs/blob/e7a0c471991d9af492d9fec72531a68fd5e1311a/app/controllers/pages_controller.rb#L128-L131. Since there is an explicit user filter in the dropdowns, assume that users who want to see their own jobs will filter by user

  2. stop using Jobstatusdata for get_jobs. In Jobstatusdata, most of the attributes are set directly from attributes on the info object (Jobstatusdata#pbsid gets info.id, Jobstatusdata#username gets info.job_owner). The following are the attributes where we do some sort of transformation, and for each of these the transformation can be done client side in JavaScript at display time, or ignored completely:

    1. self.status = status_label(info.status.state.to_s) - use JavaScript to render this label
    2. self.cluster = cluster.id.to_s - see below
    3. self.cluster_title = cluster.metadata.title || cluster.id.to_s.titleize - see below
    4. self.walltime_used = info.wallclock_time.to_i > 0 ? pretty_time(info.wallclock_time) : '' - use JavaScript to display walltime in desired format
    5. self.nodes = node_array(info.allocated_nodes) and self.starttime = info.dispatch_time.to_i - we don't display nodes in the columns, so this is only needed for when displaying extended data i.e. get_job and only needed if ganglia graphs are available
    6. self.extended_available = %w(torque slurm lsf pbspro).include?(cluster.job_config[:adapter]) - see below

For cluster, cluster_title, and extended_available, all we really need to know for a given job record is the cluster id; a separate JavaScript object could indicate the title for a given cluster id and whether that cluster has extended_available set to true. By changing the structure of the object returned by get_jobs we could avoid modifying each item we get from adapter#info_all. Instead of { data: [job1, job2, job3], errors: [] } it could be { data: { ruby: [job1, job2], oakley: [job3, job4], owens: [job5, job6] }, errors: [] }.
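As a sketch of that idea (the cluster names, metadata keys, and values here are hypothetical, not existing code), the client would receive jobs grouped by cluster id plus a small per-cluster lookup object, instead of repeating the title and extended_available flag on every job:

```ruby
require 'json'

# Hypothetical per-cluster metadata the client uses for display,
# instead of cluster_title/extended_available on each job record.
clusters_meta = {
  ruby:   { title: 'Ruby',   extended_available: true },
  oakley: { title: 'Oakley', extended_available: true }
}

# Proposed response shape: jobs keyed by cluster id.
response = {
  data: {
    ruby:   [{ id: '123.ruby-batch',   status: 'R' }],
    oakley: [{ id: '456.oakley-batch', status: 'Q' }]
  },
  errors: []
}

# Simulate the client side: parse the JSON and look titles up
# from the metadata object rather than from each job.
json = JSON.parse(JSON.generate(response))
json['data'].each_key do |cluster_id|
  meta = clusters_meta[cluster_id.to_sym]
  puts "#{meta[:title]}: #{json['data'][cluster_id].size} job(s)"
end
```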

That would still leave the Ruby server-side filtering of the info_all results: https://github.com/OSC/ood-activejobs/blob/e7a0c471991d9af492d9fec72531a68fd5e1311a/app/controllers/pages_controller.rb#L103-L108. We might want to replace that "feature" with something more appropriate, or eliminate it altogether.

  3. stop filtering completed jobs out of the list server-side: https://github.com/OSC/ood-activejobs/blob/e7a0c471991d9af492d9fec72531a68fd5e1311a/app/controllers/pages_controller.rb#L115-L119

The result of these three steps would be that get_jobs would look something like this:

```ruby
def get_jobs
  jobs = {}
  errors = []
  selected_filter = get_filter
  selected_cluster = get_cluster

  OODClusters.select { |c| selected_cluster == "all" || selected_cluster == c.id.to_s }.each do |cluster|
    jobs[cluster.id] = []

    begin
      if selected_filter == 'user'
        jobs[cluster.id] = cluster.job_adapter.info_where_owner(OodSupport::User.new.name)
      elsif selected_filter == 'group'
        jobs[cluster.id] = cluster.job_adapter.info_where_group(OodSupport::User.new.group.name)
      else
        jobs[cluster.id] = cluster.job_adapter.info_all
      end
    rescue => e
      errors << e.message
    end
  end

  { data: jobs, errors: errors }
end
```
ericfranz commented 7 years ago

Though going this route (a hash keyed by cluster instead of a flat array) would require a separate render template when rendering the JSON, since we would want to omit the native hash and other unused fields from the result before converting it to a JSON string and sending the response.

ericfranz commented 7 years ago

Before doing this, we should measure how much time is spent in get_jobs excluding the calls to job_adapter.info*. Some of the instances where we experienced significant slowdown were due to the adapter code itself (or the actual qstat call).

I remember we originally had a hash, and I argued for a flat array of jobs (as well as the separate model object), since that was simpler to reason about in the code and simpler to handle client side. Also, FWIW, this is the code that turns the info objects into a hash with all the params: https://github.com/OSC/ood_core/blob/c9077615951e91bddf126db21247684596b69495/lib/ood_core/job/info.rb#L105-L125. So this is going to be called anyway on each info object if we don't do it ourselves, which is what the Jobstatusdata model currently does with our own code.

ericfranz commented 7 years ago

Also, these discussions ignore the fact that we don't have a streaming solution, so there is a limit to the number of jobs that can be handled through this approach. But since we deal with 10s and 100s of users and 100s and 1000s of jobs (not usually 10,000s or 100,000s) we have been able to get away with our current approach.

Though reducing the amount of code that actually is used in get_jobs could make it easier to shift to a streaming solution in the future (less code to modify).

brianmcmichael commented 7 years ago

I've run the benchmarks.

Neither of these suggestions will significantly improve the speed of this app, and the JobStatusData object is itself an optimization. Currently the bottlenecks are in getting the info from the adapters, transferring the data to the users, and frontend processing of the data.

Here are my results of requesting all jobs from all servers at OSC. All times are in milliseconds.

```
"benchmark":
  Time to get adapters:            1073
  Time to convert to objects:       147
  Time to filter user:              156
  Total server time:               1377
  Got data from server in:         2010 ms
  Frontend operations complete in: 4065 ms
```

Right now that breaks down to roughly: ~1.1 s getting data from the adapters, ~0.3 s of server-side in-memory processing, ~0.6 s transferring the data, and ~2 s of frontend processing.

Server-side in-memory processing is therefore not the most expensive component of the get_jobs method, and the pre-processing it does speeds up the other bottlenecks: if we stopped pre-processing data on the server side, we would see increases in both data transfer and frontend processing times.

brianmcmichael commented 7 years ago

Additional benchmarks:

For a request for "All jobs" on "Owens"

```
"benchmark":
  Time to get adapters:            500
  Time to get objects:              57
  Time to filter user:              55
  Total server time:               612
  Got data from server in:         991 ms
  Frontend operations complete in: 1830 ms
```

When I request "Your Jobs" on "All Clusters" and I have one running job:

```
"benchmark":
  Time to get adapters:            18
  Time to get objects:             0
  Time to filter user:             0
  Total server time:               18
  Got data from server in:         412 ms
  Frontend operations complete in: 441 ms
```
ericfranz commented 7 years ago

> Here are my results of requesting all jobs from all servers at OSC. All times are in milliseconds.

For those metrics, is this based on a recent request? Currently Oakley has 1631 jobs, Owens has 1328 jobs, and Ruby has 226 jobs, so 3185 total jobs. I wonder what happens when this grows to 20,000 jobs.

We could definitely improve the performance of the adapter code itself, though.

Also, how are you producing the metrics?

brianmcmichael commented 7 years ago

I've pushed up a PR at https://github.com/OSC/ood-activejobs/pull/143

To get the server-side numbers, check the last key in the json response. To check the client-side numbers, check the browser console.

ericfranz commented 5 years ago

Using the very useful rack-mini-profiler, here are the numbers after adding profiling for getting around 1500 Owens jobs.

```
Action                                            duration (ms)  from start (ms)
GET http://localhost:80/pun/dev/activejobs/js...           12.6            +0.0
Executing action: json                                      1.3           +11.0
owens.job_adapter#info_all                                 72.5           +12.0
Torque::Batch#get_jobs                                    506.1           +12.0
generate Jobstatusdata for 1535 results                    83.1          +591.0
sorting jobs by username                                  137.9          +674.0
render jobs as json                                       261.8          +812.0
Total                                                    1075.4 ms
```

```
Action                                            duration (ms)  from start (ms)
GET http://localhost:80/pun/dev/activejobs/js...           11.8            +0.0
Executing action: json                                      1.0           +10.0
owens.job_adapter#info_all                                 70.0           +11.0
Torque::Batch#get_jobs                                    494.5           +11.0
generate Jobstatusdata for 1537 results                    72.4          +575.0
sorting jobs by username                                  115.3          +648.0
render jobs as json                                       208.5          +763.0
Total                                                     973.4 ms
```

The bulk of the work is in Torque::Batch#get_jobs. After adding the monkey patch below, so that the Torque C library only fetches the attributes we will display in the table instead of all the job attributes, the ~500 ms method call drops to ~100 ms, a huge improvement:

```ruby
# Restrict the attributes requested from the Torque C library to the
# ones the Active Jobs table actually displays.
class OodCore::Job::Adapters::Torque::Batch
  alias_method :orig_get_jobs, :get_jobs
  def get_jobs(id: '', filters: [:job_state, :queue, :Job_Name, :Account_Name, :job_id, :resources_used])
    orig_get_jobs(id: id, filters: filters)
  end
end
```

Now the bulk of our time is spent after the info_all call: generating Jobstatusdata, sorting jobs by username, and rendering json:

```
Action                                            duration (ms)  from start (ms)
GET http://localhost:80/pun/dev/activejobs/js...           14.7            +0.0
Executing action: json                                      1.0           +13.0
owens.job_adapter#info_all                                 37.8           +14.0
Torque::Batch#get_jobs                                    105.9           +14.0
generate Jobstatusdata for 1538 results                    47.8          +158.0
sorting jobs by username                                   94.4          +206.0
render jobs as json                                       227.1          +300.0
Total                                                     528.7 ms
```

```
Action                                            duration (ms)  from start (ms)
GET http://localhost:80/pun/dev/activejobs/js...           14.8            +0.0
Executing action: json                                      1.0           +13.0
owens.job_adapter#info_all                                 35.1           +14.0
Torque::Batch#get_jobs                                    106.5           +14.0
generate Jobstatusdata for 1538 results                    51.3          +155.0
sorting jobs by username                                   86.1          +206.0
render jobs as json                                       244.9          +293.0
Total                                                     539.6 ms
```

Bypassing Jobstatusdata, and just doing a map over the info_all call to get the minimal json required is a further improvement:

```
Action                                            duration (ms)  from start (ms)
GET http://localhost:80/pun/dev/activejobs/js...           13.9            +0.0
Executing action: json                                      0.9           +12.0
owens.job_adapter#info_all                                 69.3           +13.0
Torque::Batch#get_jobs                                     94.1           +13.0
render 1505 jobs as json                                  119.6          +176.0
Total                                                     297.7 ms
```

```
Action                                            duration (ms)  from start (ms)
GET http://localhost:80/pun/dev/activejobs/js...           14.0            +0.0
Executing action: json                                      0.9           +13.0
owens.job_adapter#info_all                                 69.6           +13.0
Torque::Batch#get_jobs                                    102.7           +13.0
render 1502 jobs as json                                  136.3          +186.0
Total                                                     323.5 ms
```
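The map over info_all described above could look roughly like this (a sketch: the info object is stubbed with a Struct here, and which attributes to keep is an assumption based on the table columns, not the actual change):

```ruby
# Stub standing in for OodCore::Job::Info objects returned by
# job_adapter.info_all; in the app these come from the adapter.
Info = Struct.new(:id, :job_owner, :job_name, :queue_name, :status,
                  :wallclock_time, keyword_init: true)

infos = [
  Info.new(id: '123.owens-batch', job_owner: 'efranz', job_name: 'test',
           queue_name: 'batch', status: :running, wallclock_time: 3600)
]

# Build the minimal hashes to render as JSON, skipping Jobstatusdata.
jobs = infos.map do |info|
  {
    id:       info.id,
    username: info.job_owner,
    jobname:  info.job_name,
    queue:    info.queue_name,
    status:   info.status.to_s,        # raw state; label rendered client side
    walltime: info.wallclock_time.to_i # seconds; formatted client side
  }
end
```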

At this point, the next steps to consider include:

  1. Follow the suggestions at the top, shifting responsibility for some of this code to the client
  2. Shift more of the responsibility to ood_core (including the "native attrs", "human readable strings", etc. for job details)
  3. Try a streaming approach as opposed to reading 100% of the data up front. Does this have a negative impact?
  4. Repeat this work with the other adapters. A Vagrant image with those adapters, systems running those adapters, or mock data could help here.
  5. Consider https://github.com/ohler55/oj after all, though it was originally dropped from the Dashboard app due to concerns about adding extra dependencies.

This exploration also shows that:

  1. Optimizing this app's code will benefit all adapters
  2. It is worth spending some time with ood_core itself, and possibly writing an automated performance test with mock data.
  3. Even with this work, at some point it isn't worth optimizing much further when we could instead explore a caching solution; that will be easier to implement in the coming year than it is right now.
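To illustrate the caching idea in item 3, here is a minimal sketch of a short-TTL in-memory cache that could sit in front of the info_all calls (hypothetical code, not part of the app; the TTL value is an arbitrary example):

```ruby
# Sketch: cache the last result per cluster for a few seconds so that
# repeated page loads reuse it instead of re-running qstat each time.
class TtlCache
  def initialize(ttl: 10)
    @ttl = ttl
    @store = {}
    @mutex = Mutex.new
  end

  # Returns the cached value for key if it is fresher than ttl seconds,
  # otherwise runs the block and caches its result.
  def fetch(key)
    @mutex.synchronize do
      entry = @store[key]
      now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      if entry && now - entry[:at] < @ttl
        entry[:value]
      else
        value = yield
        @store[key] = { value: value, at: now }
        value
      end
    end
  end
end

cache = TtlCache.new(ttl: 10)
calls = 0
# Second fetch within the TTL reuses the cached result; the block
# (standing in for cluster.job_adapter.info_all) runs only once.
2.times { cache.fetch(:owens) { calls += 1; [:job1, :job2] } }
```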