OSC / ood_appkit


Simpler cluster config #36

Closed ericfranz closed 7 years ago

ericfranz commented 7 years ago

When documenting the cluster config for OOD, I noticed the config seemed unnecessarily complex for what it was describing:

https://github.com/OSC/Open-OnDemand/tree/5ae6b125ed0a0b1e33b811e9f5bf93a28bff62c5#26---add-cluster-connection-config-files

For a basic install, a cluster file needs to provide 3 pieces of information:

  1. a title (metadata)
  2. the login host for SSH Access
  3. connection information for the server(s) for job management

An example of this currently is:

---
v1:
  title: "Oakley"
  cluster:
    type: "OodCluster::Cluster"
    data:
      servers:
        login:
          type: "OodCluster::Servers::Ssh"
          data:
            host: "oakely.osc.edu"
        resource_mgr:
          type: "OodCluster::Servers::Torque"
          data:
            host: "oak-batch.osc.edu"
            lib: "/opt/torque/lib64"
            bin: "/opt/torque/bin"
            version: "6.0.1"

Also, the steps of adding support for submitting jobs to Slurm instead of Torque occur in (potentially) four places:

  1. Add an OodCluster::Servers::Slurm class to the ood_cluster gem
  2. Add an OodJob::Adapters::Slurm class to the ood_job gem
  3. Update local configs to specify which classes to use
  4. ??? Update the code required for instantiating the right OodJob adapter instance (how we will do this has not yet been determined)

Essentially, the "type definition" of an OodCluster::Servers::Server is achieved by creating subclasses.

Two thoughts.

  1. Focus only on the domain of providing connection information configuration for each cluster. What we have is essentially a list of servers that the cluster provides and connection information for each, and then some metadata. Connection information is just a list of attributes (key value pairs).

    ---
    metadata:
      title: Oakley
    servers:
      login:
        host: "oakley.osc.edu"
      resource_mgr:
        type: torque
        host: "oak-batch.osc.edu"
        lib: "/opt/torque/lib64"
        bin: "/opt/torque/bin"
        version: "6.0.1"

    or

    ---
    metadata:
      title: Oakley
    servers:
      - login:
        host: "oakley.osc.edu"
      - resource_mgr:
        type: torque
        host: "oak-batch.osc.edu"
        lib: "/opt/torque/lib64"
        bin: "/opt/torque/bin"
        version: "6.0.1"

    or even

    ---
    metadata:
      title: Oakley
    login: "oak-batch.osc.edu"
    jobs:
      type: torque
      host: "oak-batch.osc.edu"
      lib: "/opt/torque/lib64"
      bin: "/opt/torque/bin"
      version: "6.0.1"
  2. If we ignore the complexity in OodCluster::Servers::Ganglia, we find that OodCluster Server types add only 2 constraints:

    1. what attributes are required
    2. the Ruby type to coerce the attribute values to (i.e. Pathname)

Type definition might be better done through data instead of custom subclasses.
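For illustration, a minimal sketch of what "through data" could mean (the names here are hypothetical, not the ood_cluster API): a server type becomes a hash of required attributes plus coercions, applied by a single builder:

require 'pathname'

# Hypothetical "type definitions" expressed as plain data instead of subclasses:
# each type lists its required attributes and how to coerce them.
SERVER_TYPES = {
  'torque' => {
    required: %w[host lib bin version],
    coerce:   { 'lib' => ->(v) { Pathname.new(v) }, 'bin' => ->(v) { Pathname.new(v) } }
  },
  'ssh' => { required: %w[host], coerce: {} }
}.freeze

# Validate and coerce a server entry taken straight from the YAML (a Hash of strings).
def build_server(attrs)
  type    = SERVER_TYPES.fetch(attrs.fetch('type', 'ssh'))
  missing = type[:required] - attrs.keys
  raise ArgumentError, "missing attributes: #{missing.join(', ')}" unless missing.empty?

  attrs.each_with_object({}) { |(k, v), h| h[k] = type[:coerce].fetch(k, ->(x) { x }).call(v) }
end

build_server('host' => 'oakley.osc.edu')['host']  # => "oakley.osc.edu"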

nickjer commented 7 years ago
  1. Focus only on the domain...

Your second option appears strange with a key and empty value (e.g., login: and resource_mgr:). This could be made less strange along the lines of:

---
metadata:
  title: Oakley
servers:
  - id: login
    host: "oakley.osc.edu"
  - id: resource_mgr
    type: torque
    host: "oak-batch.osc.edu"
    lib: "/opt/torque/lib64"
    bin: "/opt/torque/bin"
    version: "6.0.1"

I am fine with the first two options, although I am unsure whether type: is necessary, but that can be fleshed out more when we have a "fuller" picture of the final product.

ericfranz commented 7 years ago

Maybe the solution is to provide a feature based config.

See how Rails handles database connection information in http://guides.rubyonrails.org/configuring.html#configuring-a-database for some inspiration.

Without having a separate connection for each environment, the database.yml file would look like:

---
adapter: sqlite3
database: db/development.sqlite3
pool: 5
timeout: 5000

or

---
adapter: postgresql
encoding: unicode
database: blog_development
pool: 5

or

---
adapter: mysql2
encoding: utf8
database: blog_development
pool: 5
username: root
password:
socket: /tmp/mysql.sock

So in a cluster config, instead of having a config describe the servers available and the connection info to connect to them (i.e. these are the resource managers available, this is the scheduler available, etc.), perhaps it is appropriate to just specify a separate config block for each feature that requires connection info (such as jobs and reservations).

For example:

metadata:
  title: Oakley
login: "oakley.osc.edu"
jobs:
  adapter: torque
  host: "oak-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
  version: "6.0.1"
reservations:
  adapter: torque+moab
  torque:
    host: "oak-batch.osc.edu"
    lib: "/opt/torque/lib64"
    bin: "/opt/torque/bin"
    version: "6.0.1"
  moab:
    host: "oak-batch.osc.edu"
    bin: "/opt/moab/bin"
    version: "9.0.1"
    moabhomedir: "/var/spool/moab"

The drawbacks are that we lose the idea of a config describing a cluster and its resources, with multiple features enabled automatically based on what is available, and that the YAML will have duplicate connection information in multiple places. The benefits are:

  1. We move the responsibility of specifying the adapter to use, and of providing the arguments for that adapter, to the config itself.
  2. It now becomes possible to build a simple parser. Existing adapters in gems can have a "keyword" (lowercase), but adapter: could also specify a class name (capitalized first word); see the sketch after this list.
  3. Instead of "My Jobs" filtering out clusters for those that define a resource_mgr, it could filter out clusters that don't specify a jobs config. In fact, it would now be possible, given a config, to ask if a corresponding Adapter object exists for this config. "My Jobs" could filter out clusters that can't find a corresponding Adapter.
  4. We remove assumptions about what adapters require (i.e. job adapters require a resource_mgr)
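A rough sketch of that resolution rule (the helper is hypothetical; only the OodJob::Adapters::Torque constant comes from the existing gems):

# Resolve the adapter: value from a feature's config block; a capitalized
# value is treated as an explicit class name, a lowercase keyword maps onto
# a class under OodJob::Adapters.
def resolve_adapter(adapter)
  if adapter.match?(/\A[A-Z]/)
    Object.const_get(adapter)                                    # "OodJob::Adapters::Torque"
  else
    Object.const_get("OodJob::Adapters::#{adapter.capitalize}")  # "torque" => ...::Torque
  end
end

# resolve_adapter(config['jobs']['adapter']).new(...)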

We can of course just start with class names instead of using keywords:

metadata:
  title: Oakley
login: "oakley.osc.edu"
jobs:
  adapter: "OodJob::Adapters::Torque"
  host: "oak-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
  version: "6.0.1"
reservations:
  adapter: "OodReservations::Queries::TorqueMoab"
  torque:
    host: "oak-batch.osc.edu"
    lib: "/opt/torque/lib64"
    bin: "/opt/torque/bin"
    version: "6.0.1"
  moab:
    host: "oak-batch.osc.edu"
    bin: "/opt/moab/bin"
    version: "9.0.1"
    moabhomedir: "/var/spool/moab"

With this approach, we would turn the original:

---
v1:
  title: "Oakley"
  cluster:
    type: "OodCluster::Cluster"
    data:
      servers:
        login:
          type: "OodCluster::Servers::Ssh"
          data:
            host: "oakely.osc.edu"
        resource_mgr:
          type: "OodCluster::Servers::Torque"
          data:
            host: "oak-batch.osc.edu"
            lib: "/opt/torque/lib64"
            bin: "/opt/torque/bin"
            version: "6.0.1"

into

metadata:
  title: Oakley
login: "oakley.osc.edu"
jobs:
  adapter: "OodJob::Adapters::Torque"
  host: "oak-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
  version: "6.0.1"

I do prefer the keyword approach as default and would like to investigate exactly how Rails manages this configuration and associates the keywords with the gems or classes that get instantiated.

metadata:
  title: Oakley
login: "oakley.osc.edu"
jobs:
  adapter: torque
  host: "oak-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
  version: "6.0.1"
ericfranz commented 7 years ago

Login might be better as a list of key-value pairs too:

metadata:
  title: Oakley
login:
  host: "oakley.osc.edu"
jobs:
  adapter: torque
  host: "oak-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
  version: "6.0.1"
nickjer commented 7 years ago

I like this approach:

metadata:
  title: Oakley
login:
  host: "oakley.osc.edu"
jobs:
  class: "OodJob::Adapters::Torque"
  opts:
    host: "oak-batch.osc.edu"
    lib: "/opt/torque/lib64"
    bin: "/opt/torque/bin"
    version: "6.0.1"

That way OodAppkit doesn't need to know the implementation of the corresponding object. It could create the object as something along the lines of:

class ClusterDecorator ...
  ...

  # Exception raised if the configured jobs class cannot be found
  class MissingAdapter < StandardError; end

  def has_jobs?
    @config.has_key?('jobs')
  end

  def jobs(opts = {})
    return nil if !has_jobs? || !@config['jobs']['class']
    @config['jobs']['class'].constantize.new((@config['jobs']['opts'] || {}).merge opts)
  rescue NameError => e
    raise MissingAdapter, e.message
  end
end

Also we'd have to include things such as validation...

metadata:
  title: Oakley
login:
  host: "oakley.osc.edu"
reservations:
  class: "OodReservations::Queries::TorqueMoab"
  opts:
    torque:
      host: "oak-batch.osc.edu"
      lib: "/opt/torque/lib64"
      bin: "/opt/torque/bin"
      version: "6.0.1"
    moab:
      host: "oak-batch.osc.edu"
      bin: "/opt/moab/bin"
      version: "9.0.1"
      moabhomedir: "/var/spool/moab"
  validators:
    - class: "OodAppkit::Validators::Groups"
      opts:
        groups:
          - "sysp"
          - "hpcsoft"
        allow: false

One of the possibilities for the ClusterDecorator would then be:

class ClusterDecorator ...
  ...

  def reservations_valid?
    return false unless has_reservations?
    @config['reservations'].fetch('validators', []).all? do |v|
      v.fetch('class', 'OpenStruct').constantize.new(v['opts'] || {}).success?
    end
  end
end

Could definitely clean up the code, but that is one possible idea.

ericfranz commented 7 years ago

That way OodAppkit doesn't need to know the implementation of the corresponding object.

OodAppkit doesn't need to know regardless of whether we do "torque" or "OodJob::Adapters::Torque". A factory method on ood_job could handle this ("torque" to "OodJob::Adapters::Torque"). Actually, it might be better to just have OodAppkit provide an object that gives you access to this config information. There is no ClusterDecorator#adapter. Rather, ood_job_rails instantiates the adapter, and uses the config provided by OodAppkit.
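A hedged sketch of that split (adapter_for is a made-up name, not an existing ood_job method): OodAppkit only hands over the parsed jobs hash, and ood_job turns it into an adapter:

module OodJob
  # Hypothetical factory: build an adapter from the "jobs" section of a
  # cluster config; everything except "adapter" becomes adapter options.
  def self.adapter_for(jobs_config)
    keyword = jobs_config.fetch('adapter')                 # e.g. "torque"
    opts    = jobs_config.reject { |k, _| k == 'adapter' }
    Object.const_get("OodJob::Adapters::#{keyword.capitalize}").new(opts)
  end
end

# In ood_job_rails, roughly:
#   OodJob.adapter_for(cluster_config['jobs'])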

It's a better user experience to write a config like this:

metadata:
  title: Oakley
login:
  host: "oakley.osc.edu"
jobs:
  adapter: torque
  host: "oak-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
  version: "6.0.1"

as opposed to this:

metadata:
  title: Oakley
login:
  host: "oakley.osc.edu"
jobs:
  class: "OodJob::Adapters::Torque"
  opts:
    host: "oak-batch.osc.edu"
    lib: "/opt/torque/lib64"
    bin: "/opt/torque/bin"
    version: "6.0.1"

And my guess is the majority case will be "torque" or "slurm" not a custom adapter.

Also, using a completely flat hash does eliminate the option of having the keyword "adapter" be an argument to the adapter class. That's okay. There are no cases in our current code that require this, and for the few possible cases in the future that would want to specify an "adapter" argument, it's easy to be more specific and prefix the argument keyword with the type of adapter.

basilgohar commented 7 years ago

@nickjer I'm going to chime in here and say that I feel the flatter format without the class names in it looks more like what a configuration file should be. The class names smell too strongly of implementation details leaking into somewhere they don't belong, whereas the flat config format is more about specifying the information that is crucial.

If you think this merits further discussion, I'm happy if someone wants to schedule a meeting so all the merits of both sides can be discussed.

nickjer commented 7 years ago

There are a couple of issues with the flat design. The first is gleaned from the following statement:

Rather, ood_job_rails instantiates the adapter, and uses the config provided by OodAppkit.

This means that either the underlying library knows about OodAppkit and how it organizes the information needed to instantiate itself, or we create a second 'rails'-specific library for every underlying library (e.g., ood_job_rails). The latter option means that ood_reservations would have an ood_reservations_rails gem as well that requires ood_appkit.

So each resource would need a factory library that includes both the resource library and the configuration library used to generate the resource object from a configuration object.

resource_library (ood_job) <=> factory_library (ood_job_rails) <=> config_library (ood_appkit)

The other issue is that the flat design doesn't address user authorization for a given cluster resource. The examples being:

And my guess is the majority case will be "torque" or "slurm" not a custom adapter.

I am not confident enough to make that statement. I err on the side of caution and feel it is better to be flexible initially, and then, after we have more experience with the various centers' infrastructures, introduce generic torque and slurm keywords.

I am not opposed to the keywords torque and slurm, but I am not confident enough to disable support for class names and remove that flexibility. Also, I will have to look into how Rails handles the db keywords, as well as how it would handle a custom db adapter.

If you think this merits further discussion, I'm happy if someone wants to schedule a meeting so all the merits of both sides can be discussed.

I am fine with a deep-dive on this. Although I'll let @ericfranz schedule it if he feels it is also necessary.

ericfranz commented 7 years ago

@nickjer is this how validations work? Did I miss anything?

Two classes:

  1. OodAppkit::Validator - https://github.com/OSC/ood_appkit/blob/b91a54d59173f222b6fed1fc30be232c508d50b2/lib/ood_appkit/validator.rb
  2. OodAppkit::Validators::Groups - https://github.com/OSC/ood_appkit/blob/b91a54d59173f222b6fed1fc30be232c508d50b2/lib/ood_appkit/validators/groups.rb

They are used by specifying them in a cluster config:

validators:
  rsv_query:
    - type: "OodAppkit::Validators::Groups"
      data:
        groups:
          - "sysp"
          - "hpcsoft"
        allow: false

or

validators:
  cluster:
    - type: "OodAppkit::Validators::Groups"
      data:
        groups:
          - "ruby"
        allow: true
  rsv_query:
    - type: "OodAppkit::Validators::Groups"
      data:
        groups:
          - "sysp"
          - "hpcsoft"
        allow: false

In code:

https://github.com/OSC/ood_appkit/blob/b91a54d59173f222b6fed1fc30be232c508d50b2/lib/ood_appkit/cluster_decorator.rb

  1. Each cluster has one or more validators:

    # @param validators [Hash{#to_sym=>Array<Validator>}] hash of validators
    def initialize(cluster:, id:, title: "", url: "", validators: {}, **_)
  2. Validations occur when calling ClusterDecorator#valid?:

    # Whether the given method is valid (i.e., passes all supplied validators)
    # @param method [#to_sym] method to check if valid
    # @return [Boolean] whether this method is valid
    def valid?(method = :cluster)
      @validators.fetch(method.to_sym, []).all? { |v| v.success? }
    end
  3. the argument method in valid?(method = :cluster) acts as a "scope" on the validators to choose which validator array to use

Usage:

  1. Currently, in Active Jobs, My Jobs, and the Dashboard, we call valid? in an initializer to filter out cluster instances users don't have access to
  2. In vncsim, before doing a reservation query using the reservations gem, we call valid?(:rsv_query) to filter out cluster instances that should not be included in the reservation query. This way we can prevent users in certain groups from querying reservations, which causes problems.

To add a new filter, these are the steps:

  1. Create a new OodAppkit::Validator subclass. The class must be added to the Ruby load path of a Ruby Passenger app (whether it's part of the app code, added to the gem, or added some other way). This does not work with non-Ruby Passenger apps. A rough sketch follows this list.
  2. Specify this OodAppkit::Validator subclass in the config.
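For example, step 1 might look roughly like this. The class is hypothetical and, in the real gem, would subclass OodAppkit::Validator (see the links above); here it is shown as a plain class following the duck type used in this thread (new(data_hash) plus #success?):

# Hypothetical validator: only passes on certain days of the week.
class WeekdayValidator
  def initialize(opts = {})
    @days = Array(opts['days']).map(&:downcase)   # e.g. ["monday", "friday"]
  end

  # True if today is one of the allowed days.
  def success?
    @days.include?(Time.now.strftime('%A').downcase)
  end
end

# Step 2 would then reference it from the cluster config, e.g.:
#
#   validators:
#     cluster:
#       - type: "WeekdayValidator"
#         data:
#           days: ["monday", "friday"]
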
nickjer commented 7 years ago

@nickjer is this how validations work?

Yes.

Did I miss anything?

Nothing that stands out right off the top of my head.

ericfranz commented 7 years ago

This is how the config currently works. Notice we violate "what changes together should go together".

Here is an example config for OSC Ruby cluster:

---
v1:
  title: "Ruby"
  url: "https://www.osc.edu/supercomputing/computing/ruby"
  validators:
    cluster:
      - type: "OodAppkit::Validators::Groups"
        data:
          groups:
            - "ruby"
          allow: true
    rsv_query:
      - type: "OodAppkit::Validators::Groups"
        data:
          groups:
            - "sysp"
            - "hpcsoft"
          allow: false
  cluster:
    type: "OodCluster::Cluster"
    data:
      hpc_cluster: true
      servers:
        login:
          type: "OodCluster::Servers::Ssh"
          data:
            host: "ruby.osc.edu"
        resource_mgr:
          type: "OodCluster::Servers::Torque"
          data:
            host: "ruby-batch.osc.edu"
            lib: "/opt/torque/lib64"
            bin: "/opt/torque/bin"
            version: "6.0.1"
        scheduler:
          type: "OodCluster::Servers::Moab"
          data:
            host: "ruby-batch.osc.edu"
            bin: "/opt/moab/bin"
            version: "9.0.1"
            moabhomedir: "/var/spool/moab"
        ganglia:
          type: "OodCluster::Servers::Ganglia"
          data:
            host: "cts05.osc.edu"
            scheme: "https://"
            segments:
              - "gweb"
              - "graph.php"
            req_query:
              c: "Ruby"
            opt_query:
              h: "%{h}.ten.osc.edu"
            version: "3"
  1. v1: is to "version" the config, since right now the config is a moving target
  2. title and url attributes are set on ClusterDecorator
  3. validators are set on OodAppkit::ClusterDecorator and used for ClusterDecorator#valid?
  4. cluster provides the config for OodCluster::Cluster instantiation; we instantiate an OodCluster::Cluster instance as specified and pass data as arguments
  5. hpc_cluster is the one attribute that is set on OodCluster::Cluster itself: https://github.com/OSC/ood_cluster/blob/0eb898639a0de94146402ec7b2979b2eec9dd949/lib/ood_cluster/cluster.rb#L27-L29
  6. servers: is a "hash" of servers to create; OodCluster::Cluster has named servers: torque, moab, ssh, ganglia https://github.com/OSC/ood_cluster/tree/0eb898639a0de94146402ec7b2979b2eec9dd949/lib/ood_cluster/servers
  7. each OodCluster::Servers::Server subclass defines attributes that can be set
  8. ood_job, the dashboard, ood_reservations, and individual apps use various server instances, accessing them by keys
  9. valid server keys are defined in the yaml (ganglia, scheduler, resource_mgr, login) but not in the ood_cluster gem itself
  10. corresponding query methods to see if the servers exist on the OodCluster::Cluster are auto-generated, e.g. ganglia_server? and login_server?

Details:

  1. OodAppkit::ConfigParser https://github.com/OSC/ood_appkit/blob/b91a54d59173f222b6fed1fc30be232c508d50b2/lib/ood_appkit/config_parser.rb reads the yaml file and creates a ClusterDecorator object for each cluster config file, placing them in an OodAppkit::Clusters list.
  2. This is done during initialization and added to OodAppkit.clusters attribute: https://github.com/OSC/ood_appkit/blob/b91a54d59173f222b6fed1fc30be232c508d50b2/lib/ood_appkit/configuration.rb#L67-L76
  3. Since each Rails app uses ood_appkit, it has access to OodAppkit.clusters on load
  4. Apps use OodAppkit.clusters (which is our config object) to connect to and provide access to resources outside of the app
ericfranz commented 7 years ago

Examples:

Displaying shell access URLs in Dashboard

In ApplicationHelper:

def clusters
  OodAppkit::Clusters.new(OodAppkit.clusters.select(&:valid?).select(&:hpc_cluster?))
end

def login_clusters
  OodAppkit::Clusters.new(clusters.select(&:login_server?))
end

In ERB view:

<% elsif app.role == "shell" %>
  <%= nav_link("Shell Access", "terminal", OodAppkit.shell.url, target: "_blank") if login_clusters.count == 0 %>

  <% login_clusters.each do |c| %>
    <%= nav_link "#{c.title} Shell Access", "terminal", OodAppkit.shell.url(host: c.login_server.host), target: "_blank" %>
  <% end %>

Displaying submit hosts list in "My Jobs"

In initializer:

https://github.com/OSC/ood-myjobs/blob/e6fd87bde56168dca83403fbeeb85a19ac1b66c8/config/initializers/ood_appkit.rb

# config/initializers/ood_appkit.rb

OODClusters = OodAppkit.clusters.select do |c|
  c.valid? && c.hpc_cluster? && c.resource_mgr_server? && c.resource_mgr_server.is_a?(OodCluster::Servers::Torque)
end.each_with_object({}) { |c, h| h[c.id] = c }

# the controller will update status manually
OscMacheteRails.update_status_of_all_active_jobs_on_each_request = false

In the rest of the app, using OODClusters global: https://github.com/OSC/ood-myjobs/search?utf8=%E2%9C%93&q=OODClusters

# app/models/manifest.rb 

def default_host
  OODClusters.first ? OODClusters.first[0].to_s : ""
end

In app/views/workflows/_form.html.erb:

<%= f.select :batch_host, OODClusters.map { |key, val| [ "#{val.title} (#{val.resource_mgr_server.host})", key ] }, { label: "Batch Server:" }, { class: "selectpicker", id: "batch_host_select", required: true } %>

Instantiating ood_job instance in "My Jobs"

See ResourceMgrAdapter https://github.com/OSC/ood-myjobs/blob/04eca4613d7493333c7ec90ca2d5ecec1996fc7d/app/models/resource_mgr_adapter.rb#L54-L59:

Get cluster instance for cluster id:

def cluster_for_host_id(host)
  raise PBS::Error, "host nil" if host.nil?
  raise PBS::Error, "host is invalid value: #{host}" unless OODClusters.has_key?(host.to_sym)

  OODClusters[host.to_sym]
end

Using the cluster instance to instantiate an ood_job adapter instance:

def adapter
  OodJob::Adapters::Torque
end

def qsub(script_path, host: nil, depends_on: {}, account_string: nil)
  script_path = Pathname.new(script_path)
  raise OSC::Machete::Job::ScriptMissingError, "#{script_path} does not exist or cannot be read" unless script_path.file? && script_path.readable?

  cluster = cluster_for_host_id(host)
  script = OodJob::Script.new(content: script_path.read, accounting_id: account_string)
  adapter.new(cluster: cluster).submit(script: script, **depends_on)
rescue OodJob::Adapter::Error => e
  raise PBS::Error, e.message
end

Notice that we currently have the issue of hardcoding the adapter to use: OodJob::Adapters::Torque.

Instantiating ood_job instance in ood_job_rails

Note: the final version of ood_job_rails is not yet determined.

https://github.com/OSC/ood_job_rails/blob/a03167f40b2ef9ca6dc44db3022daa707324cab7/lib/ood_job_rails/adapter.rb#L16-L28

Clusters is set in the adapter initializer:

# OodJobRails::Adapter
def initialize(clusters: OodAppkit.clusters, default_script: OodJobRails.default_script)
  @clusters       = clusters
  @default_script = default_script.to_h
end

The default ood_job adapter to use is specified during app initialization:

https://github.com/OSC/ood_job_rails/blob/a03167f40b2ef9ca6dc44db3022daa707324cab7/lib/ood_job_rails/configuration.rb

def set_default_configuration
  self.adapter  = OodJob::Adapters::Torque # sets OodJobRails.adapter
  self.default_script = {}
end

# code to submit in OodJobRails::Adapter

def submit(cluster_id:, script:, after: [], afterok: [], afternotok: [], afterany: [], **_)
  cluster = clusters[cluster_id]
  script  = OodJob::Script.new default_script.merge(script.to_h)
  OodJobRails.adapter.new(cluster: cluster).submit(
    script:     script,
    after:      after,
    afterok:    afterok,
    afternotok: afternotok,
    afterany:   afterany
  )
rescue OodJob::Adapter::Error => e
  raise Error, e.message
end
ericfranz commented 7 years ago

How reservations work:

In vncsim, we filter out the clusters that have a validation scoped to :rsv_query that returns true for the user AND

https://github.com/AweSim-OSC/vncsim/blob/dda5260855402cc3bb23e49852c0eb0b0a241d21/app/controllers/sessions_controller.rb#L22-L26

OodReservations itself uses multiple Cluster servers (both Torque and Moab):

# instantiate a reservations query for a cluster
OodReservations::Query.build(cluster: c)

https://github.com/OSC/ood_reservations/blob/master/lib/ood_reservations/queries/torque_moab.rb#L16-L21


# in OodReservations::Query

def self.build(**kwargs)
  if Queries::TorqueMoab.match(**kwargs)
    Queries::TorqueMoab.new(**kwargs)
  else
    nil
  end
end

# in OodReservations::Queries::TorqueMoab
def self.match(cluster:, **_)
  cluster.resource_mgr_server? &&
    cluster.scheduler_server? &&
    cluster.resource_mgr_server.is_a?(OodCluster::Servers::Torque) &&
    cluster.scheduler_server.is_a?(OodCluster::Servers::Moab)
end
ericfranz commented 7 years ago

We discussed offline. This is the suggested approach:

  1. All relevant code that is spread across ood_cluster, ood_job, ood_reservations, and ood_appkit will be consolidated into a single gem, ood_core, which is not a Rails Engine.
  2. The reservations gem code will be moved to ood_core/reservations and specific adapters under ood_core/reservations/adapters
  3. The jobs gem code will be moved to ood_core/jobs and the specific adapters to ood_core/jobs/adapters
  4. OodCore::Cluster will not be accompanied by any other special classes like "OodCluster::Server" or subclasses, and instead will have these methods:
    1. #jobs - a hash of connection information, including the adapter type to load
    2. #jobs_adapter - instantiates an adapter (and requires the adapter code necessary) based on the connection information, passing in the connection information to the adapter it instantiates
    3. #jobs? - whether, given conn info, an adapter is available; also applies validations with contexts :cluster and :jobs
    4. #reservations - a hash of connection information, including the adapter type to load
    5. #reservations_adapter - instantiates an adapter (and requires the adapter code necessary) based on the connection information, passing in the connection information to the adapter it instantiates
    6. #reservations? - whether, given conn info, an adapter is available; also applies validations with contexts :cluster and :reservations
    7. #login - hash of information required for connection, e.g. { host: "oakley.osc.edu" }
    8. #login? - whether the required connection info is available
    9. #id - cluster id, like before, e.g. "oakley", "ruby"
    10. #metadata - ? or we just put this on the cluster itself, like #title etc.
    11. We could just keep these as simple hashes of conn info, or consider something that defines required types for safety purposes. Needs exploration.
  5. Adapters will be required at runtime when calling Cluster#jobs_adapter or Cluster#reservations_adapter. All jobs_adapters must be under OodJob::Adapters module.

Example:

def jobs_adapter
  # TODO: or return a NullAdapter object 
  return nil unless jobs?

  require "ood_job/adapters/#{jobs.type}" # slurm
  "OodJob::Adapters::#{jobs.type.classify}".constitize.build(jobs)
end

Need to decide whether Cluster#jobs, Cluster#login, Cluster#reservations return Struct, OpenStruct, Hash, or something else. Also, methods like Cluster#jobs? seem to require loading the code in order to determine that an adapter exists. Not sure about that.
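A small sketch of the trade-off in plain Ruby (none of this is decided ood_core API): a Hash keeps access explicit but stringly keyed, while an OpenStruct gives dot access and silently returns nil on typos, which is where a stricter Struct or typed object could help:

require 'ostruct'
require 'yaml'

conn = YAML.safe_load(<<~YML)
  adapter: torque
  host: oak-batch.osc.edu
  bin: /opt/torque/bin
YML

jobs_hash   = conn                  # Cluster#jobs returning a Hash
jobs_struct = OpenStruct.new(conn)  # Cluster#jobs returning an OpenStruct

jobs_hash['host']   # => "oak-batch.osc.edu"
jobs_struct.host    # => "oak-batch.osc.edu"
jobs_struct.hots    # => nil, silently -- the safety concern mentioned above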

ericfranz commented 7 years ago

This new design does these things:

  1. lets us use a simpler config that makes installation easier (config doesn't require class names in yaml)
  2. establishes who is responsible for instantiating adapters for the connection info provided
  3. removes the SRP violation by putting the configuration, the parser and config objects, and the adapters configured by those objects together in the same gem
ericfranz commented 7 years ago

Inspiration is from Rails itself:

https://github.com/rails/rails/blob/ecca24e2d76f647f342e6bdf8c68a693ff49ae9a/activerecord/lib/active_record/connection_adapters/sqlite3_adapter.rb#L14-L44 and https://github.com/rails/rails/blob/ecca24e2d76f647f342e6bdf8c68a693ff49ae9a/activerecord/lib/active_record/connection_adapters/connection_specification.rb#L170-L191

nickjer commented 7 years ago

So what does the final polished yaml file look like?

Is this it?

metadata:
  title: Ruby
  url: "https://www.osc.edu/supercomputing/computing/ruby"
validators:
  default:
    - adapter: groups
      groups:
        - "ruby"
      allow: true
  rsvs:
    - adapter: groups
      groups:
        - "sysp"
        - "hpcsoft"
      allow: false
login: "ruby.osc.edu"
jobs:
  adapter: torque
  host: "ruby-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
rsvs:
  adapter: torque_moab
  torque_host: "ruby-batch.osc.edu"
  torque_lib: "/opt/torque/lib64"
  torque_bin: "/opt/torque/bin"
  moab_host: "ruby-batch.osc.edu"
  moab_bin: "/opt/moab/bin"
  moab_homedir: "/var/spool/moab"

Note that Validators can probably be replaced with OodSupport::Acl as it pretty much does the same thing. But maybe for the future.

Also, what do we do with #hpc_cluster?. Currently the only apps that use this are the Dashboard, MyJobs, and SystemStatus apps, as they want to loop through all the clusters. Although I am not entirely sure the Dashboard even needs this, as it seems to only care about login servers. SystemStatus may not need it either, since it looks for clusters with Ganglia support. Maybe support for this should be removed, and the MyJobs app could just have a .env.local file with which a sysadmin can blacklist specific clusters.
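A hedged sketch of the .env.local idea (the OOD_MYJOBS_CLUSTER_BLACKLIST variable name is made up), reusing the OODClusters initializer pattern shown earlier in this thread:

# config/initializers/ood_appkit.rb (sketch)
blacklist = ENV.fetch('OOD_MYJOBS_CLUSTER_BLACKLIST', '').split(',').map(&:strip)

OODClusters = OodAppkit.clusters
  .reject { |c| blacklist.include?(c.id.to_s) }
  .each_with_object({}) { |c, h| h[c.id] = c }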

Another question, I noticed the SystemStatus app looks for a cluster with a scheduler. How will this be supported in the new format?

nickjer commented 7 years ago

Another question, I noticed the SystemStatus app looks for a cluster with a scheduler. How will this be supported in the new format?

Maybe we handle this later the same way we handled ood_job with a universal interface called ood_scheduler. But further down the road.

Until then, the SystemStatus app may have to parse out the moab settings in OodCluster#rsvs.

ericfranz commented 7 years ago

Maybe support for this should be removed, and the MyJobs app just have a .env.local file that a sysadmin can blacklist specific clusters.

Let's use a separate section for vdi. If we think of configuring features, then the quick cluster config may omit both the login and jobs sections but have a vdi section. The config would look like this:

---
metadata:
  title: "Quick"
vdi:
  adapter: torque
  host: "quick-batch.ten.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
  version: "6.0.2"
ericfranz commented 7 years ago

As for the SystemStatus app using the scheduler... the "generic version" of the data being used by SystemStatus has not yet been determined. It is a Moab- and Ganglia-specific app right now. The config above essentially shifts from describing "servers available" and the config to connect to them, to describing "features" and the config to use those features.

So just providing conn info for moab and ganglia might make the most sense.

I guess this is the challenge: if you are choosing a specific adapter to use, instead of letting the config indicate which adapter to use, how do you get connection information from the config? One way would be to optionally provide configs for specific servers like before.

Something like:

metadata:
  title: Ruby
  url: "https://www.osc.edu/supercomputing/computing/ruby"
validators:
  default:
    - adapter: groups
      groups:
        - "ruby"
      allow: true
  rsvs:
    - adapter: groups
      groups:
        - "sysp"
        - "hpcsoft"
      allow: false
login: "ruby.osc.edu"
jobs:
  adapter: torque
  host: "ruby-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
rsvs:
  adapter: torque_moab
  torque_host: "ruby-batch.osc.edu"
  torque_lib: "/opt/torque/lib64"
  torque_bin: "/opt/torque/bin"
  moab_host: "ruby-batch.osc.edu"
  moab_bin: "/opt/moab/bin"
  moab_homedir: "/var/spool/moab"

# conn info for specific servers, which apps written to use these servers can pull from
moab:
  host: "oak-batch.osc.edu"
  bin: "/opt/moab/bin"
  version: "9.0.1"
  moabhomedir: "/var/spool/moab"  
ganglia:
  host: "cts05.osc.edu"
  scheme: "https://"
  segments:
    - "gweb"
    - "graph.php"
  req_query:
    c: "Ruby"
  opt_query:
    h: "%{h}.ten.osc.edu"
  version: "3"
ericfranz commented 7 years ago

Also I'm not opposed to this:

rsvs:
  adapter: torque_moab
  torque: 
    host: "ruby-batch.osc.edu"
    lib: "/opt/torque/lib64"
    bin: "/opt/torque/bin"
  moab: 
    host: "ruby-batch.osc.edu"
    bin: "/opt/moab/bin"
    homedir: "/var/spool/moab"

in place of

rsvs:
  adapter: torque_moab
  torque_host: "ruby-batch.osc.edu"
  torque_lib: "/opt/torque/lib64"
  torque_bin: "/opt/torque/bin"
  moab_host: "ruby-batch.osc.edu"
  moab_bin: "/opt/moab/bin"
  moab_homedir: "/var/spool/moab"

I guess then we could define servers in the config below, and have them also be anchors, and then use the anchors in the feature sections above like rsvs and jobs.
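A sketch of the anchors idea (keys purely illustrative). One caveat: YAML anchors must be defined before the aliases that reference them, so the shared server definitions would have to come first in the file rather than below the feature sections. Aliases also have to be enabled when loading with newer Psych:

require 'yaml'

config = YAML.safe_load(<<~YML, aliases: true)   # aliases: true needs Psych >= 3.1
  servers:
    torque: &torque
      host: "ruby-batch.osc.edu"
      lib: "/opt/torque/lib64"
      bin: "/opt/torque/bin"
    moab: &moab
      host: "ruby-batch.osc.edu"
      bin: "/opt/moab/bin"
      homedir: "/var/spool/moab"
  jobs:
    adapter: torque
    <<: *torque
  rsvs:
    adapter: torque_moab
    torque: *torque
    moab: *moab
YML

config['jobs']['host']  # => "ruby-batch.osc.edu"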

ericfranz commented 7 years ago

But let's use the word reservations instead of the abbreviation rsvs:

metadata:
  title: Ruby
  url: "https://www.osc.edu/supercomputing/computing/ruby"
validators:
  default:
    - adapter: groups
      groups:
        - "ruby"
      allow: true
-  rsvs:
+  reservations:
    - adapter: groups
      groups:
        - "sysp"
        - "hpcsoft"
      allow: false
login: "ruby.osc.edu"
jobs:
  adapter: torque
  host: "ruby-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
-rsvs:
+reservations:
  adapter: torque_moab
  torque_host: "ruby-batch.osc.edu"
  torque_lib: "/opt/torque/lib64"
  torque_bin: "/opt/torque/bin"
  moab_host: "ruby-batch.osc.edu"
  moab_bin: "/opt/moab/bin"
  moab_homedir: "/var/spool/moab"
ericfranz commented 7 years ago

Note that this:

validators:
  default:
    - adapter: groups
      groups:
        - "ruby"
      allow: true
  reservations:
    - adapter: groups
      groups:
        - "sysp"
        - "hpcsoft"
      allow: false

can be done like this too:

validators:
  default:
    - adapter: groups
      groups: [ ruby ]
      allow: true
  reservations:
    - adapter: groups
      groups: [ sysp, hpcsoft ]
      allow: false

It would be more readable if instead of

validators:
  default:
    - adapter: groups
      groups: [ ruby ]
      allow: true
  reservations:
    - adapter: groups
      groups: [ sysp, hpcsoft ]
      allow: false

we did

validators:
  default:
    - adapter: groups_whitelist
      groups: [ ruby ]
  reservations:
    - adapter: groups_blacklist
      groups: [ sysp, hpcsoft ]
nickjer commented 7 years ago

The config above essentially is shifting from describing "servers available" and the config to connect to them to describing "features" and the config to use those features.

So just providing conn info for moab and ganglia might make the most sense.

I thought these "features" would be properly defined in the Cluster object. So there will be a #jobs and #metadata. Are we saying that all clusters will have a #moab method as well? This may lead to an explosion of methods with no proper documentation on all the features available. Maybe the Moab connection settings should be namespaced under a more generic convention like #native...

metadata:
  title: Ruby
  url: "https://www.osc.edu/supercomputing/computing/ruby"
validators:
  default:
    - adapter: groups
      groups:
        - "ruby"
      allow: true
  rsvs:
    - adapter: groups
      groups:
        - "sysp"
        - "hpcsoft"
      allow: false
login: "ruby.osc.edu"
jobs:
  adapter: torque
  host: "ruby-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
rsvs:
  adapter: torque_moab
  torque_host: "ruby-batch.osc.edu"
  torque_lib: "/opt/torque/lib64"
  torque_bin: "/opt/torque/bin"
  moab_host: "ruby-batch.osc.edu"
  moab_bin: "/opt/moab/bin"
  moab_homedir: "/var/spool/moab"
native:
  moab:
    host: "oak-batch.osc.edu"
    bin: "/opt/moab/bin"
    version: "9.0.1"
    moabhomedir: "/var/spool/moab"  
  ganglia:
    host: "cts05.osc.edu"
    scheme: "https://"
    segments:
      - "gweb"
      - "graph.php"
    req_query:
      c: "Ruby"
    opt_query:
      h: "%{h}.ten.osc.edu"
    version: "3"

And as we introduce more generic library interfaces like ood_scheduler, we pull it out of native and make it a "feature".

nickjer commented 7 years ago

Lets use a separate section for vdi

This can work but may be confusing to both sys admins and developers. They may both be expecting a #jobs as Quick is not so different from a regular batch server. The scaffolding that builds out the Job model would also have to know whether to use #jobs or #vdi depending on the cluster you intend to submit to. Also the job-status app would have to know about this as well since it will use #jobs to get the status of all the jobs on the clusters.

I think it introduces too many complications just to specify a cluster that is very OSC-specific.

ericfranz commented 7 years ago

I like the idea of native. But instead of calling it native, let's call it custom. The other sections are predefined groups of connection information, and custom would contain custom groups of connection information.
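A tiny sketch of what that could mean on the object side (hypothetical accessor, not a decided API): custom is just a pass-through hash, so new groups of connection info don't require new top-level methods:

class Cluster   # stand-in for the eventual OodCore::Cluster
  def initialize(config)
    @config = config
  end

  # Arbitrary, site-defined groups of connection info from the "custom" section.
  def custom
    @config.fetch('custom', {})
  end
end

# cluster.custom['ganglia']['host']  # => "cts05.osc.edu"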

ericfranz commented 7 years ago

Regarding VDI, it sounds like we instead do need a "cluster type".

metadata:
  title: Ruby
  url: "https://www.osc.edu/supercomputing/computing/ruby"
  type: vdi

But we could have the default type be "cluster" or "hpc" and do @cluster.type == :vdi or @cluster.type_vdi?

ericfranz commented 7 years ago

Actually, the problem with that is that in the future we will probably support running VDI GUI apps by running processes locally.

nickjer commented 7 years ago

I can make a custom section. We can talk further about the VDI issue in the group meeting today, though, as my concern revolves around the fact that only one app may need this information.

ericfranz commented 7 years ago

Or we omit custom and just put the custom sections alongside the other app configs.

For example:

metadata:
  title: Ruby
  url: "https://www.osc.edu/supercomputing/computing/ruby"
validators:
  default:
    - adapter: groups
      groups:
        - "ruby"
      allow: true
  rsvs:
    - adapter: groups
      groups:
        - "sysp"
        - "hpcsoft"
      allow: false
login: "ruby.osc.edu"
jobs:
  adapter: torque
  host: "ruby-batch.osc.edu"
  lib: "/opt/torque/lib64"
  bin: "/opt/torque/bin"
rsvs:
  adapter: torque_moab
  torque_host: "ruby-batch.osc.edu"
  torque_lib: "/opt/torque/lib64"
  torque_bin: "/opt/torque/bin"
  moab_host: "ruby-batch.osc.edu"
  moab_bin: "/opt/moab/bin"
  moab_homedir: "/var/spool/moab"
systemstatusapp:
  torque_host: "ruby-batch.osc.edu"
  torque_lib: "/opt/torque/lib64"
  torque_bin: "/opt/torque/bin"
  moab_host: "ruby-batch.osc.edu"
  moab_bin: "/opt/moab/bin"
  moab_homedir: "/var/spool/moab"

That's just as easy.

nickjer commented 7 years ago

VNCSim

gets app from uri path /pun/sys/vncsim/:app_token (e.g., /pun/sys/vncsim/sys/bc_osc_desktop):

VNCApp::Application.routes.draw do
  scope ':app_token', constraints: { app_token: /((usr\/[^\/]+)|dev|sys)\/[^\/]+/ } do
    resources :sessions, only: [:index, :show, :create, :destroy]
  end
end

sets app in model:

class App
  def self.from_token(token)
    ary   = token.split('/')
    type  = ary.first
    owner = ary[1] if ary.size == 3
    name  = ary.last
    new(type: type, owner: owner, name: name)
  end

  def initialize(type:, owner: nil, name:)
    @type  = type
    @owner = owner
    @name  = name
  end

  ...
end

session models take this app object as an initialization parameter to define a session object

class Session
  # The app this session is modeled around
  attr_accessor :app

  # Find all submitted jobs by user that are not
  # completed and have specified job name
  def self.all(app)
    _search(app, nil) do |q|
      q.select do |k, v|
        /^#{ENV['USER']}@/ =~ v[:Job_Owner] && /^#{ENV['APP_TOKEN']}\/#{app.token}\/(?!.*!$)/ =~ v[:Job_Name]
      end
    end.sort
  end

  # Find submitted session based on job id
  def self.find(app, id)
    _search(app, id.gsub('_', '.')) {|q| q}.first
  end

  # Sub-app hash
  def sub_app
    app.config.fetch(batch_type, {}).fetch(app_idx, nil)
  end

  # Name of this sub-app
  def name
    sub_app['name']
  end

  ...
end

Notice that an app can have multiple sub_apps defined in its app.yml:

title: 'Abaqus/CAE'
compute:
  - &abaqus
    name: 'Abaqus/CAE 6.14 (Software Rendering)'
    batch: 'oakley'
    node: &software_nodes
      - type: 'any'
        ppn: 12
        desc: >
          Choose any type of compute node. This reduces the wait time
          as there are no node specific requirements.
      - type: 'bigmem'
        ppn: 12
        desc: >
          This node has 192GB of available RAM. There are only 8 of
          these nodes on Oakley.
      - type: 'hugemem'
        ppn: 32
        desc: >
          This node has 1TB of available RAM as well as 32 cores.
          There is only 1 of these nodes on Oakley. A reservation may
          be required to use this node.
    pbs:
      envvars:
        ABAQUS_MODULE: 'abaqus/6.14'
        GPU_OFF: '1'
      resources: &resources
        software: 'abaqus+#{(5 * (nodes.to_i * node_ppn.to_i) ** 0.422).floor}'
  - <<: *abaqus
    name: 'Abaqus/CAE 6.14 (Hardware Rendering)'
    node: &hardware_nodes
      - title: 'vis'
        type: 'vis:gpus=1'
        ppn: 12
        desc: >
          This node may come with 1 to 2 Nvidia GPUs. Allows for 3D
          visualization software to run as well as CUDA computations.
    pbs:
      envvars:
        ABAQUS_MODULE: 'abaqus/6.14'
      resources: *resources
  - <<: *abaqus
    name: 'Abaqus/CAE 2016 (Software Rendering)'
    node: *software_nodes
    pbs:
      envvars:
        ABAQUS_MODULE: 'abaqus/2016'
        GPU_OFF: '1'
      resources: *resources
  - <<: *abaqus
    name: 'Abaqus/CAE 2016 (Hardware Rendering)'
    node: *hardware_nodes
    pbs:
      envvars:
        ABAQUS_MODULE: 'abaqus/2016'
      resources: *resources

Each sub_app is namespaced by batch_type (compute or shared) and app_idx (the index of the sub app in the array).

Now for a given batch_type and app_idx you have methods Session#name, Session#path, Session#batch, ... that are read from the app's app.yml that you see in each app repo.

In the views:

A sub_app session can be submitted with:

class Session
  # Submit this model to the PBS batch
  def submit
    h = {
      PBS::ATTR[:N] => "#{ENV['APP_TOKEN']}/#{app.token}/#{name.parameterize}",
      PBS::ATTR[:o] => output_dir.join("$PBS_JOBID.output").to_s,
      PBS::ATTR[:j] => "oe",
      PBS::ATTR[:S] => "/bin/bash",
      PBS::ATTR[:m] => mail.to_i.zero? ? "n" : "b",
      PBS::ATTR[:init_work_dir] => output_dir
    }.merge headers.each_with_object({}) {|(k, v), h| h[k] = eval("%{#{v}}") }
    r = {
      nodes: "#{nodes}:ppn=#{node_ppn}#{node_type_pbs}",
      walltime: "#{hours}:00:00",
    }.merge resources.each_with_object({}) {|(k, v), h| h[k] = eval("%{#{v}}") }
    e = {
      vnc_batch_type: batch_type,
      vnc_app_idx: app_idx,
      vnc_node_idx: node_idx,
      ROOT: staged_dir
    }.merge envvars.each_with_object({}) {|(k, v), h| h[k] = eval("%{#{v}}") }

    # Add project account if specified
    h = h.merge(PBS::ATTR[:A] => account) unless account.blank?

    osc_session.submit headers: h, resources: r, envvars: e
    true
  rescue PBS::Error => e
    msg = "<b>Failed to submit batch job:</b><pre>#{e.message}</pre>"
    errors.add(:batch, msg)
    Rails.logger.error(msg)
    false
  end
end

and a sub_app session can be read from qstat by:

class Session
  # Set a session implicitly through a PBS batch status hash
  def _set_from_query(id, attribs)
    self.pbsid = id.to_s                                    # 329083.oak-batch.osc.edu

    # Get attributes for this job
    self.status = attribs.fetch(PBS::ATTR[:state])                                # R
    self.nodes = attribs.fetch(PBS::ATTR[:l]).fetch(:nodes).split(":")[0]         # 1:ppn=12
    self.hours = attribs.fetch(PBS::ATTR[:l]).fetch(:walltime).split(":")[0].to_i # '01:00:00'
    self.created_time = attribs.fetch(PBS::ATTR[:ctime]).to_i                     # '103908239'
    self.time_left = attribs.fetch(:Walltime, {}).fetch(:Remaining, '0').to_i     # '3909'

    # Parse env vars for app info
    envvars = attribs.fetch(PBS::ATTR[:v]).split(",").inject({}) do |h,s|
      k,v = s.split("=")
      h[k] = v
      h
    end
    self.batch_type = envvars['vnc_batch_type']
    self.app_idx = envvars['vnc_app_idx']
    self.node_idx = envvars['vnc_node_idx']

    # Get number of cores after finding app
    self.cores = nodes.to_i * node_ppn

    # Get osc-vnc session for this job
    _get_osc_vnc_session

    self
  end
end
nickjer commented 7 years ago

This is implemented in v2 of the cluster config.