SGCI / sgci-resource-inventory

This contains all the computational resource entities
https://sgci-resource-inventory.readthedocs.io/en/latest/introduction.html
Apache License 2.0

Replace partitions with hardwareProfiles #13

Closed ericfranz closed 3 years ago

ericfranz commented 3 years ago

Fixes #14 #15 #16

Rename partitionDefinition to hardwareProfile and add submitArgs as a new property.

A partitionDefinition is a special case of a hardwareProfile in which the submission argument for that profile, using Slurm's sbatch, would be "--partition=name_of_partition". Making a more general hardwareProfile accommodates several use cases.

At OSC our Pitzer cluster has 7 node types. Instead of specifying a partition, users must use constraints to specify whether they want a 40 core node (installed 2018) or a 48 core node (installed 2020 as part of the Pitzer expansion). We can create a hardware profile like:

  "batchSystem": {
    "jobManager": "SLURM",
    "hardwareProfiles": [
      {
        "name": "48 core",
        "description": "Pitzer expansion standard compute node",
        "submitArgs": [
          "--constraint=48core"
        ]
      }
    ]
  }

When requesting a GPU on Pitzer, OSC users must use --gpus-per-node=1. We also give users the option to use --gres=vis, which starts an X server in the background. The hardwareProfile definition can accommodate these cases as well.

  "batchSystem": {
    "jobManager": "SLURM",
    "hardwareProfiles": [
      {
        "name": "48 core vis",
        "description": "Pitzer expansion standard compute node",
        "submitArgs": [
          "--constraint=48core",
          "--gres=vis"
        ],
        "computeQuotas": {
          "maxGPUsPerJob": 2
        }
      }
    ]
  }

If a submitArgs string contains a space, the Gateway may need to split it with a shell-aware strategy such as shlex.split, so that a quoted substring is not broken apart. The resulting arguments can then be passed safely to libraries like popen, which may expect a sequence of arguments and may handle shell escaping properly when given a sequence/array. Alternatively, each array item could be prefixed with the script directive appropriate to the batch scheduler (e.g. #SBATCH, #PBS, or #$).
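A minimal Python sketch of both options, for illustration only. The profile below is hypothetical: it packs two flags into one string and adds an invented quoted --comment value purely to show why a shell-aware split matters. The helper names flatten_submit_args and as_directives, and the script name job.sh, are assumptions, not part of the schema.

```python
import shlex

# Hypothetical profile; the quoted --comment value is invented for illustration.
profile = {
    "name": "48 core vis",
    "submitArgs": ["--constraint=48core --gres=vis", "--comment='vis session'"],
}

def flatten_submit_args(submit_args):
    """Split every submitArgs entry with shell rules into one flat argument list."""
    argv = []
    for entry in submit_args:
        # shlex.split keeps a quoted substring together as a single argument.
        argv.extend(shlex.split(entry))
    return argv

def as_directives(submit_args, prefix="#SBATCH"):
    """Alternative: render each entry as a batch script directive line."""
    return [f"{prefix} {entry}" for entry in submit_args]

# A sequence like this can be handed to subprocess.Popen without shell=True.
argv = ["sbatch", *flatten_submit_args(profile["submitArgs"]), "job.sh"]
```

Passing a list rather than a joined string leaves the quoting to the process library, which is the safety property the paragraph above is after.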

The submitArgs would differ based on the scheduler used (since arguments for qsub look different than arguments for sbatch). This may be a benefit or a drawback depending on your perspective.

One drawback is the added complexity to the original use case where the hardwareProfile is just a partition. In this case, you now have to add submitArgs with "--partition=name_of_partition".
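To gauge how much extra work that is, here is a rough Python sketch of a helper that wraps a plain partition name into the new shape. The function name partition_profile and the partition name "debug" are hypothetical, and the sketch assumes Slurm's --partition flag.

```python
def partition_profile(partition, description=""):
    """Build a hardwareProfile dict for the simple partition-only case (Slurm assumed)."""
    return {
        "name": partition,
        "description": description,
        # The partition now has to be spelled out as an explicit submit argument.
        "submitArgs": [f"--partition={partition}"],
    }

# "debug" is an invented partition name used only for illustration.
debug_profile = partition_profile("debug", "Short test jobs")
```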

A Gateway could use this information several ways when building a web form for job submission:

  1. present a dropdown of hardware profiles for users to choose from
  2. present fields such as nodes, gpus, memory and based on user settings, choose the appropriate profile

In both cases, the Gateway would have less responsibility for determining the variety of submission arguments required in addition to basic arguments like --nodes= and --ntasks-per-node=.
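The dropdown approach could look roughly like the Python sketch below. The inventory list mirrors the examples earlier in this description; the function names dropdown_choices and build_sbatch_args are hypothetical, and the Gateway only supplies the basic arguments itself.

```python
# Inventory fragment shaped like the hardwareProfiles examples above.
HARDWARE_PROFILES = [
    {
        "name": "48 core",
        "description": "Pitzer expansion standard compute node",
        "submitArgs": ["--constraint=48core"],
    },
    {
        "name": "48 core vis",
        "description": "Pitzer expansion standard compute node",
        "submitArgs": ["--constraint=48core", "--gres=vis"],
    },
]

def dropdown_choices(profiles):
    """Return (value, label) pairs for a web form select element."""
    return [(p["name"], f'{p["name"]}: {p["description"]}') for p in profiles]

def build_sbatch_args(profiles, profile_name, nodes, ntasks_per_node):
    """Combine the Gateway's basic arguments with the chosen profile's submitArgs."""
    by_name = {p["name"]: p for p in profiles}
    profile = by_name[profile_name]  # raises KeyError for an unknown profile
    return [
        f"--nodes={nodes}",
        f"--ntasks-per-node={ntasks_per_node}",
        *profile["submitArgs"],
    ]

args = build_sbatch_args(HARDWARE_PROFILES, "48 core vis", nodes=1, ntasks_per_node=48)
```

The Gateway code never needs to know what --constraint or --gres mean; it only appends whatever the selected profile lists.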

ericfranz commented 3 years ago

Another drawback: a hardware profile on Pitzer that refers to a standard compute node, omitting the 40core or 48core constraint, would actually encompass multiple node types with different CPU counts (the 2018 Pitzer nodes have 40 cores and the 2020 Pitzer nodes have 48 cores). In this case, if a user requests 48 cores they will definitely get a 2020 Pitzer node, but if the user requests 40 cores they may get either a 2018 or a 2020 Pitzer node, whichever is available sooner.

So to get the end result for the gateway, I might define cpuCount in nodeHardware (and even memory in node hardware) to be that of the 2020 Pitzer node, but it would technically be inaccurate.

Does that mean that nodeHardware should actually be an array instead of an object so it could accommodate multiple node types?

ericfranz commented 3 years ago

The last two commits (8590f6520c9f06cbdad0295e1bffc603b9d51774 213631c84f6f573041651646365ac6a558a1395c) may be slightly controversial.

Summary:

  1. port was already present in connectionDefinition so I just added host - this seems to serve the purpose of the discussed serviceHost and servicePort, and it's shorter (and we use just "host" in other places); if you feel that both should be renamed to serviceHost and servicePort that's fine with me
  2. proxyHost and proxyPort could still serve a purpose in case of accessing via bastion host
  3. useProxy seemed redundant - if proxyHost and proxyPort are not present, don't use it, if present, use it?
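The rule in point 3 can be sketched in Python as follows. The field names host, port, proxyHost, and proxyPort come from the summary above; the function name resolve_endpoint and the example hostnames and port are hypothetical, and treating a half-specified proxy as "no proxy" is an assumption of this sketch.

```python
def resolve_endpoint(connection):
    """Return (host, port, proxy); proxy is None unless both proxy fields are set."""
    proxy_host = connection.get("proxyHost")
    proxy_port = connection.get("proxyPort")
    # No separate useProxy flag: the presence of both fields is the signal.
    proxy = (proxy_host, proxy_port) if proxy_host and proxy_port else None
    return connection["host"], connection["port"], proxy

# Invented example values, not taken from any real inventory entry.
direct = resolve_endpoint({"host": "login.example.org", "port": 22})
via_bastion = resolve_endpoint(
    {
        "host": "login.example.org",
        "port": 22,
        "proxyHost": "bastion.example.org",
        "proxyPort": 22,
    }
)
```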