Opened by rob-baron 1 year ago
@rob-baron what are the next steps for this issue? I believe there has been communication with the XDMoD team on this issue; can we include some of that here?
@joachimweyl
So far the dependencies for Supremm have been configured, specifically:
According to the xdmod documentation (https://supremm.xdmod.org/10.0/supremm-configuration.html) there should be a SUPREMM option in the main menu of xdmod-setup. Unfortunately, we don't have that on ocp-staging, so either a requirement was not installed, I've missed a step, or something didn't install correctly.
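As a quick sanity check (assuming the RPM-based install described in the linked documentation), it is worth confirming that the xdmod-supremm module package actually made it into the image:

```
# Check whether the xdmod-supremm module RPM is present in the container.
rpm -qa | grep -i supremm
```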
I've contacted the xdmod team with all of the necessary information. I've checked in the phase 1 work, as it gets us closer to being able to run SUPREMM.
So the next step is to do whatever the xdmod team tells us to do.
So, what has happened is that SUPREMM was installed when the Docker image was built; however, it was not copied over to the PV that stores the modifications we made to circumvent the secure connection check (the route handles the secure connection).
As this only affects some of the installed files, we just need to copy the missing files over to the PVs, as sketched below.
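For reference, a rough sketch of what copying the missing files could look like; the namespace, pod name, and paths below are placeholders rather than the actual values used:

```
# Hypothetical sketch: the PV is assumed to be mounted at /mnt/xdmod-pv while the
# original image contents are still visible under /usr/share/xdmod inside the pod.
oc -n xdmod-staging exec xdmod-pod-0 -- \
    cp -rnv /usr/share/xdmod/. /mnt/xdmod-pv/usr/share/xdmod/

# Or copy an individual file from a local checkout into the pod's PV-backed path:
oc -n xdmod-staging cp etc/supremm/config.json xdmod-pod-0:/etc/supremm/config.json
```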
@rob-baron what PR is this work being done in?
What do you think the PR is going to do? What I'm trying to do is install SUPREMM without disturbing the data or the parts of the configuration directory (or the source directory) that we have already worked on getting to work.
I assumed that, to ensure this all eventually becomes part of the XDMoD install, it would be done in a PR so that it can be tracked and reviewed. If that is incorrect, please let me know.
@joachimweyl At this point it is. Eventually, I will remove the PV that overlays the source code directory in the production container, while retaining the possibility of using the PV for development purposes.
So far: 1) all components from the RPM were installed 2) xdmod-setup has SUPREMM in its main menu 3) the databases were set up (each one required additional grants for the xdmod user, similar to all of the other databases used by xdmod; the other database grants are handled by xdmod-init). A sketch of the kind of grants involved is below.
What is left to do: 4) finish the configuration 5) figure out the next steps to get this part working
6) finally, deploy to the infra cluster.
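A minimal sketch of the kind of extra grants involved, assuming the SUPREMM module's usual schema names (modw_supremm, modw_etl); the exact statements used may differ:

```
-- Hypothetical example: give the xdmod user access to the SUPREMM-related
-- schemas, mirroring the grants xdmod-init applies to the other XDMoD databases.
GRANT ALL PRIVILEGES ON modw_supremm.* TO 'xdmod'@'%';
GRANT ALL PRIVILEGES ON modw_etl.* TO 'xdmod'@'%';
FLUSH PRIVILEGES;
```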
Ran xdmod-setup to create the databases
Ensured that MongoDB was running
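For example, a quick way to confirm MongoDB is reachable (using the same host and root user as the mongosh session below):

```
# Ping the MongoDB instance; "mongo" is the service hostname used elsewhere here.
mongosh --host mongo --port 27017 -u root -p --eval 'db.adminCommand({ ping: 1 })'
```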
Added accounts to MongoDB (this is how the hpc-toolset tutorial https://github.com/ubccr/hpc-toolset-tutorial sets up xdmod):
```
mongosh mongo:27017 -u root -p
use supremm
db.createUser({user: 'xdmod', pwd: 'password', roles: [{role: 'readWrite', db: 'supremm'}]});
db.createUser({user: 'xdmod-ro', pwd: 'password', roles: [{role: 'read', db: 'supremm'}]});
```
Ran:
```
xdmod-ingestor --last-modified-start-date "2018-01-01 12:30:00"
aggregate_supremm.sh
acl-config
```
The command aggregate_supremm.sh produced the following error:
"Create index undefined returned MongoError: command createIndexes requires authentication"
Have fixed the aggregate_supremm.sh issue, so it now runs without generating any warnings or errors.
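For context, that MongoError usually just means the Mongo connection string used by the SUPREMM module carries no credentials; a hedged sketch of a URI that authenticates as the xdmod user created above (the actual config location and fix may differ):

```
# Hypothetical example: a MongoDB URI carrying the credentials created above,
# so commands such as createIndexes run authenticated.
mongodb://xdmod:password@mongo:27017/supremm?authSource=supremm
```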
The question has become why no data is flowing from the jobs tables to the supremm tables. I suspect it could be a combination of:
1) incorrect data 2) a SUPREMM configuration issue
I am in the process of checking the data.
First of all, the table that stores the job information (hpcdb_job) has an integer field for the number of CPUs, while the supremm table stores the number of CPUs as a floating-point value.
I'm not sure of the math that SUPREMM uses, but it could convert the integer from hpcdb_job to a floating-point value using the known size of a computational node, or the known size of the HPC cluster. In order to test this out in staging before production, we should:
shred and ingest data that has milli-CPU as its CPU unit; this way 0.128 CPUs becomes the integer 128 and can be stored in the field in the jobs table.
Configure supremm on staging
aggregate supremm (and see if data is populated).
I currently have a query open with the xdmod team on this, but this might answer the question before they get around to answering it.
27-JUL-2023 So far: 1) made a backup of the data directories from production 2) converted the HPC log files to milli-CPU 3) shredded and ingested the milli-CPU data and confirmed that more data is getting into the jobs table 4) started setting up SUPREMM.
Something that is worth trying, first in a staging environment, is to convert both the nCPU and the req_cpu values in the log files to millicores (multiply the fractional amount by 1000, then take the floor) and then see whether aggregate_supremm.sh is able to move the data from the Jobs realm to the SUPREMM realm; a sketch of that conversion follows.
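A minimal sketch of that conversion, assuming a hypothetical comma-separated log where nCPU and req_cpu sit in columns 5 and 6; the column positions and file names are placeholders, not the real log format:

```
# Hypothetical example: convert fractional CPU values (columns 5 and 6) to integer
# millicores by multiplying by 1000 and truncating, i.e. taking the floor.
awk -F',' 'BEGIN { OFS = "," } { $5 = int($5 * 1000); $6 = int($6 * 1000); print }' \
    jobs.csv > jobs-millicpu.csv
```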
From reading the documentation, I suspect that we need to configure SUPREMM to get the memory figures for HPC (OpenShift).