- Focus only on the domain...
Your second option appears strange with a key and an empty value (e.g., login: and resource_mgr:). This could be made less strange along the lines of:
---
metadata:
  title: Oakley
servers:
  - id: login
    host: "oakley.osc.edu"
  - id: resource_mgr
    type: torque
    host: "oak-batch.osc.edu"
    lib: "/opt/torque/lib64"
    bin: "/opt/torque/bin"
    version: "6.0.1"
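(One consequence of this list form is that a server is looked up by scanning for its id rather than by hash key; a minimal sketch, assuming config is the parsed YAML hash:)

def server(config, id)
  # find the server entry whose "id" matches, e.g. :resource_mgr
  config['servers'].find { |s| s['id'] == id.to_s }
end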
I am fine with the first two options, although I am unsure whether type: is necessary; but that can be fleshed out more when we have a "fuller" picture of the final product.
Maybe the solution is to provide a feature-based config. See how Rails handles database connection information in http://guides.rubyonrails.org/configuring.html#configuring-a-database for some inspiration. Without a separate connection for each environment, the database.yml file would look like:
---
adapter: sqlite3
database: db/development.sqlite3
pool: 5
timeout: 5000
or
---
adapter: postgresql
encoding: unicode
database: blog_development
pool: 5
or
---
adapter: mysql2
encoding: utf8
database: blog_development
pool: 5
username: root
password:
socket: /tmp/mysql.sock
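(For comparison, the per-environment form in the Rails guide simply nests those same keys under environment names, roughly:)

development:
  adapter: sqlite3
  database: db/development.sqlite3
  pool: 5
  timeout: 5000
production:
  adapter: postgresql
  encoding: unicode
  database: blog_production
  pool: 5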
So in a cluster config, instead of having a config describe the servers available and the connection info needed to connect to them (i.e., these are the resource managers available, this is the scheduler available, etc.), perhaps it is appropriate to just specify a separate config block for each feature that requires connection info (such as jobs and reservations).
For example:
metadata:
  title: Oakley
login: "oakley.osc.edu"
jobs:
  adapter: torque
  host: "oak-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
  version: "6.0.1"
reservations:
  adapter: torque+moab
  torque:
    host: "oak-batch.osc.edu"
    lib: "/opt/torque/lib64"
    bin: "/opt/torque/bin"
    version: "6.0.1"
  moab:
    host: "oak-batch.osc.edu"
    bin: "/opt/moab/bin"
    version: "9.0.1"
    moabhomedir: "/var/spool/moab"
The drawback is we lose the idea of a config describing a cluster and its resources, with multiple features then enabling automatically based on what is available, and the YAML will have duplicate connection information in multiple places. The benefits are:

- adapter: could also specify a class name (capitalized first word).
- Instead of "My Jobs" looking for a resource_mgr, it could filter out clusters that don't specify a jobs config. In fact, it would now be possible, given a config, to ask if a corresponding Adapter object exists for this config. "My Jobs" could filter out clusters that can't find a corresponding Adapter.

We can of course just start with class names instead of using keywords:
metadata:
  title: Oakley
login: "oakley.osc.edu"
jobs:
  adapter: "OodJob::Adapters::Torque"
  host: "oak-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
  version: "6.0.1"
reservations:
  adapter: "OodReservations::Queries::TorqueMoab"
  torque:
    host: "oak-batch.osc.edu"
    lib: "/opt/torque/lib64"
    bin: "/opt/torque/bin"
    version: "6.0.1"
  moab:
    host: "oak-batch.osc.edu"
    bin: "/opt/moab/bin"
    version: "9.0.1"
    moabhomedir: "/var/spool/moab"
With this approach, we would turn the original:
---
v1:
  title: "Oakley"
  cluster:
    type: "OodCluster::Cluster"
    data:
      servers:
        login:
          type: "OodCluster::Servers::Ssh"
          data:
            host: "oakley.osc.edu"
        resource_mgr:
          type: "OodCluster::Servers::Torque"
          data:
            host: "oak-batch.osc.edu"
            lib: "/opt/torque/lib64"
            bin: "/opt/torque/bin"
            version: "6.0.1"
into
metadata:
  title: Oakley
login: "oakley.osc.edu"
jobs:
  adapter: "OodJob::Adapters::Torque"
  host: "oak-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
  version: "6.0.1"
I do prefer the keyword approach as default and would like to investigate exactly how Rails manages this configuration and associates the keywords with the gems or classes that get instantiated.
metadata:
  title: Oakley
login: "oakley.osc.edu"
jobs:
  adapter: torque
  host: "oak-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
  version: "6.0.1"
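From a quick look at how Rails does this, ActiveRecord resolves the adapter keyword roughly like so (a paraphrased sketch of the linked resolver code, not the exact source):

# the "adapter" keyword names a file to require and a method to call
path_to_adapter = "active_record/connection_adapters/#{spec[:adapter]}_adapter"
require path_to_adapter
ActiveRecord::Base.public_send("#{spec[:adapter]}_connection", spec)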
Login might be better as key-value pairs too:
metadata:
  title: Oakley
login:
  host: "oakley.osc.edu"
jobs:
  adapter: torque
  host: "oak-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
  version: "6.0.1"
I like this approach:
metadata:
  title: Oakley
login:
  host: "oakley.osc.edu"
jobs:
  class: "OodJob::Adapters::Torque"
  opts:
    host: "oak-batch.osc.edu"
    lib: "/opt/torque/lib64"
    bin: "/opt/torque/bin"
    version: "6.0.1"
That way OodAppkit doesn't need to know the implementation of the corresponding object. It could create the object as something along the lines of:
class ClusterDecorator
  ...

  # Exception raised if adapter is not specified or is missing
  class MissingAdapter < StandardError; end

  def has_jobs?
    @config.has_key?('jobs')
  end

  # note: keys match the config above ("class" and "opts")
  def jobs(opts = {})
    return nil if !has_jobs? || !@config['jobs']['class']
    @config['jobs']['class'].constantize.new((@config['jobs']['opts'] || {}).merge opts)
  rescue NameError => e
    raise MissingAdapter, e.message
  end
end
Also we'd have to include things such as validation...
metadata:
  title: Oakley
login:
  host: "oakley.osc.edu"
reservations:
  class: "OodReservations::Queries::TorqueMoab"
  opts:
    torque:
      host: "oak-batch.osc.edu"
      lib: "/opt/torque/lib64"
      bin: "/opt/torque/bin"
      version: "6.0.1"
    moab:
      host: "oak-batch.osc.edu"
      bin: "/opt/moab/bin"
      version: "9.0.1"
      moabhomedir: "/var/spool/moab"
  validators:
    - class: "OodAppkit::Validators::Groups"
      opts:
        groups:
          - "sysp"
          - "hpcsoft"
        allow: false
One of the possibilities for the ClusterDecorator would then be:
class ClusterDecorator
  ...

  def reservations_valid?
    return false unless has_reservations?
    @config['reservations'].fetch('validators', []).all? do |v|
      v.fetch('class', 'OpenStruct').constantize.new(v['opts'] || {}).success?
    end
  end
end
Could definitely clean up the code, but that is one possible idea.
That way OodAppkit doesn't need to know the implementation of the corresponding object.
OodAppkit doesn't need to know regardless of whether we do "torque" or "OodJob::Adapters::Torque". A factory method on ood_job could handle this (mapping "torque" to "OodJob::Adapters::Torque"). Actually, it might be better to just have OodAppkit provide an object that gives you access to this config information. There is no ClusterDecorator#adapter. Rather, ood_job_rails instantiates the adapter and uses the config provided by OodAppkit.
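Such a factory method might look like this sketch (build_adapter and the lookup scheme are hypothetical names, not existing ood_job API):

module OodJob
  # Map a config keyword like "torque" to OodJob::Adapters::Torque and
  # instantiate it with the remaining connection info.
  def self.build_adapter(config)
    keyword = config.fetch('adapter') # e.g. "torque"
    require "ood_job/adapters/#{keyword}"
    Adapters.const_get(keyword.split('_').map(&:capitalize).join).new(config)
  end
end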
It's a better user experience to write a config like this:
metadata:
  title: Oakley
login:
  host: "oakley.osc.edu"
jobs:
  adapter: torque
  host: "oak-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
  version: "6.0.1"
as opposed to this:
metadata:
  title: Oakley
login:
  host: "oakley.osc.edu"
jobs:
  class: "OodJob::Adapters::Torque"
  opts:
    host: "oak-batch.osc.edu"
    lib: "/opt/torque/lib64"
    bin: "/opt/torque/bin"
    version: "6.0.1"
And my guess is the majority case will be "torque" or "slurm" not a custom adapter.
Also, using a completely flat hash does eliminate the option of having the keyword "adapter" be an argument to the adapter class. That's okay. There are no cases in our current code that require this, and for the few possible future cases that would want to specify an "adapter" argument, it's easy to be more specific and prefix the argument keyword with the type of adapter.
@nickjer I'm going to chime in here and say that I feel the flatter format without the class names in it looks more like what a configuration file should be. The class names smell too strongly of implementation details leaking into somewhere they don't belong, whereas the flat config format is more about specifying the information that is crucial.
If you think this merits further discussion, I'm happy if someone wants to schedule a meeting so all the merits of both sides can be discussed.
There are a couple of issues with the flat design. The first is gleaned from the following statement:
Rather, ood_job_rails instantiates the adapter and uses the config provided by OodAppkit.
This means that either the underlying library knows about OodAppkit and how it organizes the information it needs to instantiate itself, or we create a second 'rails'-specific library for every underlying library (e.g., ood_job_rails). The latter option means that ood_reservations would have an ood_reservations_rails gem as well that required ood_appkit.
So each resource would need a factory library that includes both the resource library and the configuration library used to generate the resource object from a configuration object.
resource_library (ood_job) <=> factory_library (ood_job_rails) <=> config_library (ood_appkit)
The other issue is that the flat design doesn't address user authorization for a given cluster resource. The examples being:
And my guess is the majority case will be "torque" or "slurm" not a custom adapter.
I am not confident enough to make that statement; I err on the side of caution and feel it is better to be initially flexible, and after we have more experience with the various centers' infrastructures, to then introduce generic torque and slurm keywords.
I am not opposed to the keywords torque and slurm, but I am not confident enough to disable support of class names and remove that flexibility. Also, I will have to look into how Rails handles the db keywords, as well as how it would handle a custom db adapter.
If you think this merits further discussion, I'm happy if someone wants to schedule a meeting so all the merits of both sides can be discussed.
I am fine with a deep-dive on this. Although I'll let @ericfranz schedule it if he feels it is also necessary.
@nickjer is this how validations work? Did I miss anything?
Two classes:
They are used by specifying them in the cluster config:
validators:
  rsv_query:
    - type: "OodAppkit::Validators::Groups"
      data:
        groups:
          - "sysp"
          - "hpcsoft"
        allow: false
or
validators:
  cluster:
    - type: "OodAppkit::Validators::Groups"
      data:
        groups:
          - "ruby"
        allow: true
  rsv_query:
    - type: "OodAppkit::Validators::Groups"
      data:
        groups:
          - "sysp"
          - "hpcsoft"
        allow: false
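For reference, a minimal sketch of what such a groups validator class might look like (hypothetical; the actual OodAppkit::Validators::Groups implementation may differ — here group membership is read via the id shell utility):

module OodAppkit
  module Validators
    # Succeeds when the current user's group membership matches the
    # allow/deny rule supplied in the "data" hash of the config.
    class Groups
      def initialize(groups: [], allow: true, **_)
        @groups = groups.map(&:to_s)
        @allow  = allow
      end

      # @return [Boolean] whether the current user passes this validator
      def success?
        in_group = (@groups & user_groups).any?
        @allow ? in_group : !in_group
      end

      private

      # names of the groups the current user belongs to
      def user_groups
        `id -Gn`.split
      end
    end
  end
end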
In code:
Each cluster has one or more validators:
# @param validators [Hash{#to_sym=>Array<Validator>}] hash of validators
def initialize(cluster:, id:, title: "", url: "", validators: {}, **_)
Validations occur when calling ClusterDecorator#valid?:
# Whether the given method is valid (i.e., passes all supplied validators)
# @param method [#to_sym] method to check if valid
# @return [Boolean] whether this method is valid
def valid?(method = :cluster)
  @validators.fetch(method.to_sym, []).all? { |v| v.success? }
end
The argument method in valid?(method = :cluster) acts as a "scope" on the validators to choose which validator array to use.
Usage:

- valid? to filter out cluster instances users don't have access to
- valid?(:rsv_query) to filter out cluster instances that should not be included in the reservation query. This way we can prevent certain users in groups from querying reservations, which causes problems.

To add a new filter, these are the steps:
1. Create an OodAppkit::Validator subclass. The class must be added to the RUBY_PATH of a Ruby Passenger app (whether it's part of the app code, added to the gem, or added some other way). Does not work with non-Ruby Passenger apps.
2. Specify that OodAppkit::Validator subclass in the cluster config.

@nickjer is this how validations work?
Yes.
Did I miss anything?
Nothing that stands out right off the top of my head.
This is how the config currently works. Notice we violate "what changes together should go together".
Here is an example config for OSC Ruby cluster:
---
v1:
  title: "Ruby"
  url: "https://www.osc.edu/supercomputing/computing/ruby"
  validators:
    cluster:
      - type: "OodAppkit::Validators::Groups"
        data:
          groups:
            - "ruby"
          allow: true
    rsv_query:
      - type: "OodAppkit::Validators::Groups"
        data:
          groups:
            - "sysp"
            - "hpcsoft"
          allow: false
  cluster:
    type: "OodCluster::Cluster"
    data:
      hpc_cluster: true
      servers:
        login:
          type: "OodCluster::Servers::Ssh"
          data:
            host: "ruby.osc.edu"
        resource_mgr:
          type: "OodCluster::Servers::Torque"
          data:
            host: "ruby-batch.osc.edu"
            lib: "/opt/torque/lib64"
            bin: "/opt/torque/bin"
            version: "6.0.1"
        scheduler:
          type: "OodCluster::Servers::Moab"
          data:
            host: "ruby-batch.osc.edu"
            bin: "/opt/moab/bin"
            version: "9.0.1"
            moabhomedir: "/var/spool/moab"
        ganglia:
          type: "OodCluster::Servers::Ganglia"
          data:
            host: "cts05.osc.edu"
            scheme: "https://"
            segments:
              - "gweb"
              - "graph.php"
            req_query:
              c: "Ruby"
            opt_query:
              h: "%{h}.ten.osc.edu"
            version: "3"
- v1: is to "version" the config, since right now the config is a moving target
- title and url attributes are set on ClusterDecorator
- validators are set on OodAppkit::ClusterDecorator and used for ClusterDecorator#valid?
- cluster provides the config for OodCluster::Cluster instantiation; we instantiate the OodCluster::Cluster instance as specified and pass data as arguments
- hpc_cluster is the one attribute that is set on OodCluster::Cluster itself: https://github.com/OSC/ood_cluster/blob/0eb898639a0de94146402ec7b2979b2eec9dd949/lib/ood_cluster/cluster.rb#L27-L29
- servers: is a "hash" of servers to create; OodCluster::Cluster has named servers: torque, moab, ssh, ganglia https://github.com/OSC/ood_cluster/tree/0eb898639a0de94146402ec7b2979b2eec9dd949/lib/ood_cluster/servers
- each OodCluster::Servers::Server subclass defines attributes that can be set
- ood_job, the dashboard, ood_reservations, and individual apps use various server instances, accessing them by the keys ganglia, scheduler, resource_mgr, login, but not in the ood_cluster gem itself
- helper predicate methods on OodCluster::Cluster are auto-generated, i.e. ganglia_server? and login_server?
Details:
- OodAppkit::ConfigParser (https://github.com/OSC/ood_appkit/blob/b91a54d59173f222b6fed1fc30be232c508d50b2/lib/ood_appkit/config_parser.rb) reads the yaml files and creates a ClusterDecorator object for each cluster config file, placing them in an OodAppkit::Clusters list
- these are exposed through the OodAppkit.clusters attribute: https://github.com/OSC/ood_appkit/blob/b91a54d59173f222b6fed1fc30be232c508d50b2/lib/ood_appkit/configuration.rb#L67-L76
- apps set OodAppkit.clusters on load
- apps then use OodAppkit.clusters (which is our config object) to connect to and provide access to resources outside of the app

Examples:
In ApplicationHelper:
def clusters
  OodAppkit::Clusters.new(OodAppkit.clusters.select(&:valid?).select(&:hpc_cluster?))
end

def login_clusters
  OodAppkit::Clusters.new(clusters.select(&:login_server?))
end
In ERB view:
<% elsif app.role == "shell" %>
  <%= nav_link("Shell Access", "terminal", OodAppkit.shell.url, target: "_blank") if login_clusters.count == 0 %>
  <% login_clusters.each do |c| %>
    <%= nav_link "#{c.title} Shell Access", "terminal", OodAppkit.shell.url(host: c.login_server.host), target: "_blank" %>
  <% end %>
In initializer:
# config/initializers/ood_appkit.rb
OODClusters = OodAppkit.clusters.select do |c|
c.valid? && c.hpc_cluster? && c.resource_mgr_server? && c.resource_mgr_server.is_a?(OodCluster::Servers::Torque)
end.each_with_object({}) { |c, h| h[c.id] = c }
# the controller will update status manually
OscMacheteRails.update_status_of_all_active_jobs_on_each_request = false
In the rest of the app, using the OODClusters global: https://github.com/OSC/ood-myjobs/search?utf8=%E2%9C%93&q=OODClusters

# app/models/manifest.rb
def default_host
  OODClusters.first ? OODClusters.first[0].to_s : ""
end
In app/views/workflows/_form.html.erb:
<%= f.select :batch_host, OODClusters.map { |key, val| [ "#{val.title} (#{val.resource_mgr_server.host})", key ] }, { label: "Batch Server:" }, { class: "selectpicker", id: "batch_host_select", required: true } %>
Creating the ood_job instance in "My Jobs": see ResourceMgrAdapter https://github.com/OSC/ood-myjobs/blob/04eca4613d7493333c7ec90ca2d5ecec1996fc7d/app/models/resource_mgr_adapter.rb#L54-L59:
Get cluster instance for cluster id:
def cluster_for_host_id(host)
  raise PBS::Error, "host nil" if host.nil?
  raise PBS::Error, "host is invalid value: #{host}" unless OODClusters.has_key?(host.to_sym)
  OODClusters[host.to_sym]
end
Using the cluster instance to instantiate an ood_job adapter instance:
def adapter
  OodJob::Adapters::Torque
end

def qsub(script_path, host: nil, depends_on: {}, account_string: nil)
  script_path = Pathname.new(script_path)
  raise OSC::Machete::Job::ScriptMissingError, "#{script_path} does not exist or cannot be read" unless script_path.file? && script_path.readable?
  cluster = cluster_for_host_id(host)
  script = OodJob::Script.new(content: script_path.read, accounting_id: account_string)
  adapter.new(cluster: cluster).submit(script: script, **depends_on)
rescue OodJob::Adapter::Error => e
  raise PBS::Error, e.message
end
Notice, we currently have the issue of hardcoding the adapter to use, OodJob::Adapters::Torque.
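One possible fix (a sketch; jobs_config is a hypothetical accessor for the jobs section of the cluster config, not existing API) is to look the adapter class up from the config rather than returning a constant:

def adapter(cluster)
  # e.g. "OodJob::Adapters::Torque" read from the cluster config
  cluster.jobs_config.fetch('adapter').constantize
end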
Creating the ood_job instance in ood_job_rails. Note: the final version of ood_job_rails is not yet determined.
Clusters is set in the adapter initializer:
# OodJobRails::Adapter
def initialize(clusters: OodAppkit.clusters, default_script: OodJobRails.default_script)
  @clusters = clusters
  @default_script = default_script.to_h
end
The default ood_job adapter used is specified during app initialization:
https://github.com/OSC/ood_job_rails/blob/a03167f40b2ef9ca6dc44db3022daa707324cab7/lib/ood_job_rails/configuration.rb
def set_default_configuration
  self.adapter = OodJob::Adapters::Torque # sets OodJobRails.adapter
  self.default_script = {}
end
# code to submit in OodJobRails::Adapter
def submit(cluster_id:, script:, after: [], afterok: [], afternotok: [], afterany: [], **_)
  cluster = clusters[cluster_id]
  script = OodJob::Script.new default_script.merge(script.to_h)
  OodJobRails.adapter.new(cluster: cluster).submit(
    script: script,
    after: after,
    afterok: afterok,
    afternotok: afternotok,
    afterany: afterany
  )
rescue OodJob::Adapter::Error => e
  raise Error, e.message
end
How reservations work:
In vncsim, we filter out the clusters that have a validation scoped to :rsv_query that returns true for the user, AND OodReservations itself uses multiple cluster servers (both Torque and Moab):
# instantiate a reservations query for a cluster
OodReservations::Query.build(cluster: c)
build actually returns a query instance only if the cluster has all the required resource managers, as specified in this code:
# in OodReservations::Query
def self.build(**kwargs)
  if Queries::TorqueMoab.match(**kwargs)
    Queries::TorqueMoab.new(**kwargs)
  else
    nil
  end
end

# in OodReservations::Queries::TorqueMoab
def self.match(cluster:, **_)
  cluster.resource_mgr_server? &&
    cluster.scheduler_server? &&
    cluster.resource_mgr_server.is_a?(OodCluster::Servers::Torque) &&
    cluster.scheduler_server.is_a?(OodCluster::Servers::Moab)
end
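So, for example, collecting a query object for every cluster that supports reservations might look like (assuming OodAppkit.clusters as elsewhere in this thread):

# clusters lacking both Torque and Moab servers yield nil and are dropped
queries = OodAppkit.clusters.map { |c| OodReservations::Query.build(cluster: c) }.compact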
We discussed offline. This is the suggested approach:

- Merge ood_cluster, ood_job, ood_reservations, and ood_appkit into a single gem ood_core, which is not a Rails Engine.
- Reservations move to ood_core/reservations, with specific adapters under ood_core/reservations/adapters.
- Jobs move to ood_core/jobs, with the specific adapters under ood_core/jobs/adapters.
- OodCore::Cluster will not be accompanied by any other special classes like "OodCluster::Server" or subclasses, and instead will have these methods:
- #jobs - a hash of connection information, including the adapter type to load
- #jobs_adapter - instantiates an adapter (and requires the adapter code necessary) based on the connection information, passing the connection information in to the adapter it instantiates
- #jobs? - whether, given conn info, an adapter is available; also applies validations with contexts :cluster and :jobs
- #reservations - a hash of connection information, including the adapter type to load
- #reservations_adapter - instantiates an adapter (and requires the adapter code necessary) based on the connection information, passing the connection information in to the adapter it instantiates
- #reservations? - whether, given conn info, an adapter is available; also applies validations with contexts :cluster and :reservations
- #login - hash of information required for connection, i.e. { host: "oakley.osc.edu" }
- #login? - whether the connection info required is available
- #id - cluster id, like before, i.e. "oakley", "ruby"
- #metadata - ? or we just put this on the cluster itself, like #title etc.

Adapters are instantiated through Cluster#jobs_adapter or Cluster#reservations_adapter. All jobs adapters must be under the OodJob::Adapters module.

Example:
def jobs_adapter
  # TODO: or return a NullAdapter object
  return nil unless jobs?
  require "ood_job/adapters/#{jobs.type}" # e.g. "slurm"
  "OodJob::Adapters::#{jobs.type.classify}".constantize.build(jobs)
end
Need to decide whether Cluster#jobs, Cluster#login, and Cluster#reservations return a Struct, OpenStruct, Hash, or something else. Also, methods like Cluster#jobs? seem to require loading the adapter code in order to determine it exists. Not sure about that.
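For illustration, the OpenStruct option would give dot-access to the raw connection hash without defining a dedicated class (a sketch):

require 'ostruct'

# wrap the parsed "jobs" section of the config for dot-access
jobs = OpenStruct.new(adapter: 'torque', host: 'oak-batch.osc.edu',
                      lib: '/opt/torque/lib64', bin: '/opt/torque/bin')
jobs.adapter #=> "torque"
jobs.host    #=> "oak-batch.osc.edu"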
This new design does these things:
Inspiration is from Rails itself:
https://github.com/rails/rails/blob/ecca24e2d76f647f342e6bdf8c68a693ff49ae9a/activerecord/lib/active_record/connection_adapters/sqlite3_adapter.rb#L14-L44 and https://github.com/rails/rails/blob/ecca24e2d76f647f342e6bdf8c68a693ff49ae9a/activerecord/lib/active_record/connection_adapters/connection_specification.rb#L170-L191
So what does the final polished yaml file look like?
Is this it?
metadata:
  title: Ruby
  url: "https://www.osc.edu/supercomputing/computing/ruby"
validators:
  default:
    - adapter: groups
      groups:
        - "ruby"
      allow: true
  rsvs:
    - adapter: groups
      groups:
        - "sysp"
        - "hpcsoft"
      allow: false
login: "ruby.osc.edu"
jobs:
  adapter: torque
  host: "ruby-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
rsvs:
  adapter: torque_moab
  torque_host: "ruby-batch.osc.edu"
  torque_lib: "/opt/torque/lib64"
  torque_bin: "/opt/torque/bin"
  moab_host: "ruby-batch.osc.edu"
  moab_bin: "/opt/moab/bin"
  moab_homedir: "/var/spool/moab"
Note that Validators can probably be replaced with OodSupport::Acl as it pretty much does the same thing. But maybe for the future.
Also, what do we do with #hpc_cluster? Currently the only apps that use this are the Dashboard, MyJobs, and SystemStatus apps, as they want to loop through all the clusters. Although I am not entirely sure the Dashboard even needs this, as it seems to only care about login servers. SystemStatus may not need it either, since it looks for clusters with Ganglia support. Maybe support for this should be removed, and the MyJobs app just have a .env.local file that a sysadmin can use to blacklist specific clusters.
Another question, I noticed the SystemStatus app looks for a cluster with a scheduler. How will this be supported in the new format?
Another question, I noticed the SystemStatus app looks for a cluster with a scheduler. How will this be supported in the new format?
Maybe we handle this later the same way we handled ood_job, with a universal interface called ood_scheduler. But that is further down the road.
Until then, the SystemStatus app may have to parse out the moab settings in OodCluster#rsvs.
Maybe support for this should be removed, and the MyJobs app just have a .env.local file that a sysadmin can blacklist specific clusters.
Let's use a separate section for vdi. If we think of configuring features, then the quick cluster config may omit both the login and jobs sections but have a vdi section. The config would look like this:
---
metadata:
  title: "Quick"
vdi:
  adapter: torque
  host: "quick-batch.ten.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
  version: "6.0.2"
As for the SystemStatus app using the scheduler... the "generic version" of the data being used by SystemStatus is not yet determined. It is a Moab- and Ganglia-specific app right now. The config above essentially shifts from describing "servers available" and the config to connect to them, to describing "features" and the config to use those features.
So just providing conn info for moab and ganglia might make the most sense.
I guess this is the challenge: if you are choosing a specific adapter to use, instead of letting the config indicate which adapter to use, how do you get connection information from the config? One way would be to optionally provide configs for specific servers like before.
Something like:
metadata:
  title: Ruby
  url: "https://www.osc.edu/supercomputing/computing/ruby"
validators:
  default:
    - adapter: groups
      groups:
        - "ruby"
      allow: true
  rsvs:
    - adapter: groups
      groups:
        - "sysp"
        - "hpcsoft"
      allow: false
login: "ruby.osc.edu"
jobs:
  adapter: torque
  host: "ruby-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
rsvs:
  adapter: torque_moab
  torque_host: "ruby-batch.osc.edu"
  torque_lib: "/opt/torque/lib64"
  torque_bin: "/opt/torque/bin"
  moab_host: "ruby-batch.osc.edu"
  moab_bin: "/opt/moab/bin"
  moab_homedir: "/var/spool/moab"
# conn info for specific servers that apps that are defined to use these servers can pull from
moab:
  host: "oak-batch.osc.edu"
  bin: "/opt/moab/bin"
  version: "9.0.1"
  moabhomedir: "/var/spool/moab"
ganglia:
  host: "cts05.osc.edu"
  scheme: "https://"
  segments:
    - "gweb"
    - "graph.php"
  req_query:
    c: "Ruby"
  opt_query:
    h: "%{h}.ten.osc.edu"
  version: "3"
Also I'm not opposed to this:
rsvs:
  adapter: torque_moab
  torque:
    host: "ruby-batch.osc.edu"
    lib: "/opt/torque/lib64"
    bin: "/opt/torque/bin"
  moab:
    host: "ruby-batch.osc.edu"
    bin: "/opt/moab/bin"
    homedir: "/var/spool/moab"
in place of
rsvs:
  adapter: torque_moab
  torque_host: "ruby-batch.osc.edu"
  torque_lib: "/opt/torque/lib64"
  torque_bin: "/opt/torque/bin"
  moab_host: "ruby-batch.osc.edu"
  moab_bin: "/opt/moab/bin"
  moab_homedir: "/var/spool/moab"
I guess then we could define servers in the config below, have them also be anchors, and then use the anchors in the feature sections above like rsvs and jobs.
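A sketch of that idea using YAML anchors and the merge key (key names hypothetical; note YAML requires anchors to be defined before the aliases that reference them, so the servers block would actually have to come first):

servers:
  torque: &torque
    host: "ruby-batch.osc.edu"
    lib: "/opt/torque/lib64"
    bin: "/opt/torque/bin"
  moab: &moab
    host: "ruby-batch.osc.edu"
    bin: "/opt/moab/bin"
    homedir: "/var/spool/moab"
jobs:
  adapter: torque
  <<: *torque
rsvs:
  adapter: torque_moab
  torque: *torque
  moab: *moab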
But let's use the word reservations instead of the abbreviation rsvs:
metadata:
  title: Ruby
  url: "https://www.osc.edu/supercomputing/computing/ruby"
validators:
  default:
    - adapter: groups
      groups:
        - "ruby"
      allow: true
-  rsvs:
+  reservations:
    - adapter: groups
      groups:
        - "sysp"
        - "hpcsoft"
      allow: false
login: "ruby.osc.edu"
jobs:
  adapter: torque
  host: "ruby-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
-rsvs:
+reservations:
  adapter: torque_moab
  torque_host: "ruby-batch.osc.edu"
  torque_lib: "/opt/torque/lib64"
  torque_bin: "/opt/torque/bin"
  moab_host: "ruby-batch.osc.edu"
  moab_bin: "/opt/moab/bin"
  moab_homedir: "/var/spool/moab"
Note that this:
validators:
  default:
    - adapter: groups
      groups:
        - "ruby"
      allow: true
  reservations:
    - adapter: groups
      groups:
        - "sysp"
        - "hpcsoft"
      allow: false
can be done like this too:
validators:
  default:
    - adapter: groups
      groups: [ ruby ]
      allow: true
  reservations:
    - adapter: groups
      groups: [ sysp, hpcsoft ]
      allow: false
It would be more readable if instead of
validators:
  default:
    - adapter: groups
      groups: [ ruby ]
      allow: true
  reservations:
    - adapter: groups
      groups: [ sysp, hpcsoft ]
      allow: false
we did
validators:
  default:
    - adapter: groups_whitelist
      groups: [ ruby ]
  reservations:
    - adapter: groups_blacklist
      groups: [ sysp, hpcsoft ]
The config above essentially is shifting from describing "servers available" and the config to connect to them to describing "features" and the config to use those features.
So just providing conn info for moab and ganglia might make the most sense.
I thought these "features" would be properly defined in the Cluster object. So there will be a #jobs and a #metadata. Are we saying that all clusters will have a #moab method as well? This may lead to an explosion of methods with no proper documentation on all the features available. Maybe the Moab connection settings should be namespaced under a more generic convention like #native...
metadata:
  title: Ruby
  url: "https://www.osc.edu/supercomputing/computing/ruby"
validators:
  default:
    - adapter: groups
      groups:
        - "ruby"
      allow: true
  rsvs:
    - adapter: groups
      groups:
        - "sysp"
        - "hpcsoft"
      allow: false
login: "ruby.osc.edu"
jobs:
  adapter: torque
  host: "ruby-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
rsvs:
  adapter: torque_moab
  torque_host: "ruby-batch.osc.edu"
  torque_lib: "/opt/torque/lib64"
  torque_bin: "/opt/torque/bin"
  moab_host: "ruby-batch.osc.edu"
  moab_bin: "/opt/moab/bin"
  moab_homedir: "/var/spool/moab"
native:
  moab:
    host: "oak-batch.osc.edu"
    bin: "/opt/moab/bin"
    version: "9.0.1"
    moabhomedir: "/var/spool/moab"
  ganglia:
    host: "cts05.osc.edu"
    scheme: "https://"
    segments:
      - "gweb"
      - "graph.php"
    req_query:
      c: "Ruby"
    opt_query:
      h: "%{h}.ten.osc.edu"
    version: "3"
And as we introduce more generic library interfaces like ood_scheduler, then we pull it out of native and make it a "feature".
Let's use a separate section for vdi
This can work but may be confusing to both sysadmins and developers. They may both be expecting a #jobs section, as Quick is not so different from a regular batch server. The scaffolding that builds out the Job model would also have to know whether to use #jobs or #vdi depending on the cluster you intend to submit to. The job-status app would have to know about this as well, since it will use #jobs to get the status of all the jobs on the clusters.
I think it introduces too many complications just to specify a cluster that is very OSC-specific.
I like the idea of native. But instead of calling it native, let's call it custom. The other sections are predefined groups of connection information, and custom would contain custom groups of connection information.
Regarding VDI, so instead we do need a "cluster type".
metadata:
  title: Ruby
  url: "https://www.osc.edu/supercomputing/computing/ruby"
  type: vdi
But we could have the default type be "cluster" or "hpc" and do @cluster.type == :vdi or @cluster.type_vdi?
Actually, the problem with that is in the future we will probably support running VDI GUI apps by running processes locally.
I can make a custom section. Although we can talk further about the VDI issue in the group meeting today, as my concern revolves around the fact that only one app may need this information.
Or we omit custom and just put the custom connection info alongside the other configs, keyed per app.
For example:
metadata:
  title: Ruby
  url: "https://www.osc.edu/supercomputing/computing/ruby"
validators:
  default:
    - adapter: groups
      groups:
        - "ruby"
      allow: true
  rsvs:
    - adapter: groups
      groups:
        - "sysp"
        - "hpcsoft"
      allow: false
login: "ruby.osc.edu"
jobs:
  adapter: torque
  host: "ruby-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
rsvs:
  adapter: torque_moab
  torque_host: "ruby-batch.osc.edu"
  torque_lib: "/opt/torque/lib64"
  torque_bin: "/opt/torque/bin"
  moab_host: "ruby-batch.osc.edu"
  moab_bin: "/opt/moab/bin"
  moab_homedir: "/var/spool/moab"
systemstatusapp:
  torque_host: "ruby-batch.osc.edu"
  torque_lib: "/opt/torque/lib64"
  torque_bin: "/opt/torque/bin"
  moab_host: "ruby-batch.osc.edu"
  moab_bin: "/opt/moab/bin"
  moab_homedir: "/var/spool/moab"
That's just as easy.
VNCSim gets the app from the URI path /pun/sys/vncsim/:app_token (e.g., /pun/sys/vncsim/sys/bc_osc_desktop):
VNCApp::Application.routes.draw do
  scope ':app_token', constraints: { app_token: /((usr\/[^\/]+)|dev|sys)\/[^\/]+/ } do
    resources :sessions, only: [:index, :show, :create, :destroy]
  end
end
and sets the app in a model:
class App
  def self.from_token(token)
    ary = token.split('/')
    type = ary.first
    owner = ary[1] if ary.size == 3
    name = ary.last
    new(type: type, owner: owner, name: name)
  end

  def initialize(type:, owner: nil, name:)
    @type = type
    @owner = owner
    @name = name
  end

  ...
end
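For instance, tokens from the route above parse as (the usr example is hypothetical):

App.from_token("sys/bc_osc_desktop")
#=> type: "sys", owner: nil, name: "bc_osc_desktop"
App.from_token("usr/efranz/my_app")
#=> type: "usr", owner: "efranz", name: "my_app"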
Session models take this app object as an initialization parameter to define a session object:
class Session
  # The app this session is modeled around
  attr_accessor :app

  # Find all submitted jobs by user that are not
  # completed and have specified job name
  def self.all(app)
    _search(app, nil) do |q|
      q.select do |k, v|
        /^#{ENV['USER']}@/ =~ v[:Job_Owner] && /^#{ENV['APP_TOKEN']}\/#{app.token}\/(?!.*!$)/ =~ v[:Job_Name]
      end
    end.sort
  end

  # Find submitted session based on job id
  def self.find(app, id)
    _search(app, id.gsub('_', '.')) { |q| q }.first
  end

  # Sub-app hash
  def sub_app
    app.config.fetch(batch_type, {}).fetch(app_idx, nil)
  end

  # Name of this sub-app
  def name
    sub_app['name']
  end

  ...
end
Notice that an app can have multiple sub_apps defined in its app.yml:
title: 'Abaqus/CAE'
compute:
  - &abaqus
    name: 'Abaqus/CAE 6.14 (Software Rendering)'
    batch: 'oakley'
    node: &software_nodes
      - type: 'any'
        ppn: 12
        desc: >
          Choose any type of compute node. This reduces the wait time
          as there are no node specific requirements.
      - type: 'bigmem'
        ppn: 12
        desc: >
          This node has 192GB of available RAM. There are only 8 of
          these nodes on Oakley.
      - type: 'hugemem'
        ppn: 32
        desc: >
          This node has 1TB of available RAM as well as 32 cores.
          There is only 1 of these nodes on Oakley. A reservation may
          be required to use this node.
    pbs:
      envvars:
        ABAQUS_MODULE: 'abaqus/6.14'
        GPU_OFF: '1'
      resources: &resources
        software: 'abaqus+#{(5 * (nodes.to_i * node_ppn.to_i) ** 0.422).floor}'
  - <<: *abaqus
    name: 'Abaqus/CAE 6.14 (Hardware Rendering)'
    node: &hardware_nodes
      - title: 'vis'
        type: 'vis:gpus=1'
        ppn: 12
        desc: >
          This node may come with 1 to 2 Nvidia GPUs. Allows for 3D
          visualization software to run as well as CUDA computations.
    pbs:
      envvars:
        ABAQUS_MODULE: 'abaqus/6.14'
      resources: *resources
  - <<: *abaqus
    name: 'Abaqus/CAE 2016 (Software Rendering)'
    node: *software_nodes
    pbs:
      envvars:
        ABAQUS_MODULE: 'abaqus/2016'
        GPU_OFF: '1'
      resources: *resources
  - <<: *abaqus
    name: 'Abaqus/CAE 2016 (Hardware Rendering)'
    node: *hardware_nodes
    pbs:
      envvars:
        ABAQUS_MODULE: 'abaqus/2016'
      resources: *resources
Each sub_app is namespaced by batch_type (compute or shared) and app_idx (the index of the sub app in the array).
Now for a given batch_type and app_idx you have methods Session#name, Session#path, Session#batch, ... that are read from the app's app.yml that you see in each app repo.
In the views:

- batch_type defines the form to use as well as the display panel (e.g., do we request or show walltime)
- app_idx just configures the form selection boxes with options (e.g., what node types can the user choose)

A sub_app session can be submitted with:
class Session
  # Submit this model to the PBS batch
  def submit
    h = {
      PBS::ATTR[:N] => "#{ENV['APP_TOKEN']}/#{app.token}/#{name.parameterize}",
      PBS::ATTR[:o] => output_dir.join("$PBS_JOBID.output").to_s,
      PBS::ATTR[:j] => "oe",
      PBS::ATTR[:S] => "/bin/bash",
      PBS::ATTR[:m] => mail.to_i.zero? ? "n" : "b",
      PBS::ATTR[:init_work_dir] => output_dir
    }.merge headers.each_with_object({}) { |(k, v), h| h[k] = eval("%{#{v}}") }
    r = {
      nodes: "#{nodes}:ppn=#{node_ppn}#{node_type_pbs}",
      walltime: "#{hours}:00:00",
    }.merge resources.each_with_object({}) { |(k, v), h| h[k] = eval("%{#{v}}") }
    e = {
      vnc_batch_type: batch_type,
      vnc_app_idx: app_idx,
      vnc_node_idx: node_idx,
      ROOT: staged_dir
    }.merge envvars.each_with_object({}) { |(k, v), h| h[k] = eval("%{#{v}}") }

    # Add project account if specified
    h = h.merge(PBS::ATTR[:A] => account) unless account.blank?

    osc_session.submit headers: h, resources: r, envvars: e
    true
  rescue PBS::Error => e
    msg = "<b>Failed to submit batch job:</b><pre>#{e.message}</pre>"
    errors.add(:batch, msg)
    Rails.logger.error(msg)
    false
  end
end
and a sub_app session can be read from qstat by:
class Session
  # Set a session implicitly through a PBS batch status hash
  def _set_from_query(id, attribs)
    self.pbsid = id.to_s # 329083.oak-batch.osc.edu

    # Get attributes for this job
    self.status = attribs.fetch(PBS::ATTR[:state])                                  # R
    self.nodes = attribs.fetch(PBS::ATTR[:l]).fetch(:nodes).split(":")[0]           # 1:ppn=12
    self.hours = attribs.fetch(PBS::ATTR[:l]).fetch(:walltime).split(":")[0].to_i   # '01:00:00'
    self.created_time = attribs.fetch(PBS::ATTR[:ctime]).to_i                       # '103908239'
    self.time_left = attribs.fetch(:Walltime, {}).fetch(:Remaining, '0').to_i       # '3909'

    # Parse env vars for app info
    envvars = attribs.fetch(PBS::ATTR[:v]).split(",").inject({}) do |h, s|
      k, v = s.split("=")
      h[k] = v
      h
    end
    self.batch_type = envvars['vnc_batch_type']
    self.app_idx = envvars['vnc_app_idx']
    self.node_idx = envvars['vnc_node_idx']

    # Get number of cores after finding app
    self.cores = nodes.to_i * node_ppn

    # Get osc-vnc session for this job
    _get_osc_vnc_session
    self
  end
end
This is implemented in v2 of the cluster config.
When documenting the cluster config for OOD, I noticed the config seemed unnecessarily complex for what it was describing:
https://github.com/OSC/Open-OnDemand/tree/5ae6b125ed0a0b1e33b811e9f5bf93a28bff62c5#26---add-cluster-connection-config-files
For a basic install, a cluster file needs to provide 3 pieces of information:
An example of this currently is:
Also, the steps of adding support for submitting jobs to Slurm instead of Torque occur in (potentially) three places:
Essentially, the "type definition" of an OodCluster::Servers::Server is achieved by creating subclasses.

Two thoughts:
- Focus only on the domain of providing connection information configuration for each cluster. What we have is essentially a list of servers that the cluster provides and connection information for each, and then some metadata. Connection information is just a list of attributes (key-value pairs).
or
or even
- If we ignore the complexity in OodCluster::Servers::Ganglia, we find that OodCluster Server types add only 2 constraints:
Type definition might be better done through data instead of custom subclasses.