Opened by rob-baron 1 year ago
@rob-baron what are the next steps for this issue? I believe there has been communication with the XDMoD team on this issue; can we include some of that here?
@joachimweyl
So far the dependencies for Supremm have been configured, specifically:
According to the xdmod documentation (https://supremm.xdmod.org/10.0/supremm-configuration.html) there should be a SUPREMM option in the main menu of xdmod-setup. Unfortunately, we don't have that on ocp-staging, so either a requirement was not installed, I've missed a step, or something didn't install correctly.
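As a quick sanity check (assuming the RPM-based install described in the linked documentation), it is worth confirming that the xdmod-supremm module package actually made it into the image:

```
# Check whether the xdmod-supremm module RPM is present in the container.
rpm -qa | grep -i supremm
```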
I've contacted the xdmod team with all of the necessary information. I've checked in the phase 1 work, as it gets us closer to being able to run SUPREMM.
So the next step is to do whatever the xdmod team tells us to do.
So, what has happened is that SUPREMM was installed when the Docker image was built; however, it was not copied over to the PV that stores the modifications we made to circumvent the secure connection check (the route handles the secure connection).
As this only affects some of the installed files, we just need to copy the missing files over to the PVs, as sketched below.
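For reference, a rough sketch of what copying the missing files could look like; the namespace, pod name, and paths below are placeholders rather than the actual values used:

```
# Hypothetical sketch: the PV is assumed to be mounted at /mnt/xdmod-pv while the
# original image contents are still visible under /usr/share/xdmod inside the pod.
oc -n xdmod-staging exec xdmod-pod-0 -- \
    cp -rnv /usr/share/xdmod/. /mnt/xdmod-pv/usr/share/xdmod/

# Or copy an individual file from a local checkout into the pod's PV-backed path:
oc -n xdmod-staging cp etc/supremm/config.json xdmod-pod-0:/etc/supremm/config.json
```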
@rob-baron what PR is this work being done in?
What do you think the PR is going to do? What I'm trying to do is install SUPREMM without disturbing the data or the parts of the configuration directory (or the source directory) that we have already worked on getting to work.
I assumed that, to ensure this all eventually becomes part of the XDMoD install, it would be done in a PR so that it can be tracked and reviewed. If that is incorrect, please let me know.
@joachimweyl At this point it is. Eventually, I will remove the PV that overlays the source code directory in the production container, while retaining the possibility of using the PV for development purposes.
So far: 1) all components from the RPM were installed 2) xdmod-setup has SUPREMM in its main menu 3) the databases were set up (each one required additional grants for the xdmod user, similar to all of the other databases used by xdmod; the other database grants are handled by xdmod-init). A sketch of the kind of grants involved is below.
What is left to do: 4) finish the configuration 5) figure out the next steps to get this part working
6) finally, deploy to the infra cluster.
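A minimal sketch of the kind of extra grants involved, assuming the SUPREMM module's usual schema names (modw_supremm, modw_etl); the exact statements used may differ:

```
-- Hypothetical example: give the xdmod user access to the SUPREMM-related
-- schemas, mirroring the grants xdmod-init applies to the other XDMoD databases.
GRANT ALL PRIVILEGES ON modw_supremm.* TO 'xdmod'@'%';
GRANT ALL PRIVILEGES ON modw_etl.* TO 'xdmod'@'%';
FLUSH PRIVILEGES;
```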
Ran xdmod-setup to create the databases
Ensured that MongoDB was running
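For example, a quick way to confirm MongoDB is reachable (using the same host and root user as the mongosh session below):

```
# Ping the MongoDB instance; "mongo" is the service hostname used elsewhere here.
mongosh --host mongo --port 27017 -u root -p --eval 'db.adminCommand({ ping: 1 })'
```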
Added accounts to MongoDB (this is how the hpc-toolset tutorial https://github.com/ubccr/hpc-toolset-tutorial sets up xdmod):
```
mongosh mongo:27017 -u root -p
use supremm
db.createUser({user: 'xdmod', pwd: 'password', roles: [{role: 'readWrite', db: 'supremm'}]});
db.createUser({user: 'xdmod-ro', pwd: 'password', roles: [{role: 'read', db: 'supremm'}]});
```
Ran:
```
xdmod-ingestor --last-modified-start-date "2018-01-01 12:30:00"
aggregate_supremm.sh
acl-config
```
The command aggregate_supremm.sh produced the following error:
"Create index undefined returned MongoError: command createIndexes requires authentication"
Have fixed the aggregate_supremm.sh issue, so it now runs without generating any warnings or errors.
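For context, that MongoError usually just means the Mongo connection string used by the SUPREMM module carries no credentials; a hedged sketch of a URI that authenticates as the xdmod user created above (the actual config location and fix may differ):

```
# Hypothetical example: a MongoDB URI carrying the credentials created above,
# so commands such as createIndexes run authenticated.
mongodb://xdmod:password@mongo:27017/supremm?authSource=supremm
```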
The question has become why no data is flowing from the jobs tables to the supremm tables. I suspect it could be a combination of:
1) incorrect data 2) a SUPREMM configuration issue
I am in the process of checking the data.
First of all, the table that stores the job information (hpcdb_job) has an integer field for the number of CPUs, while the supremm table stores the number of CPUs as a floating-point value.
I'm not sure of the math that SUPREMM uses, but it could convert the integer from hpcdb_job to a floating-point value using the known size of a computational node, or the known size of the HPC cluster. In order to test this out in staging before production, we should:
shred and ingest data that has milli-CPU as its CPU unit; this way 0.128 CPUs becomes the integer 128 and can be stored in the field in the jobs table.
Configure supremm on staging
aggregate supremm (and see if data is populated).
I currently have a query open with the xdmod team on this, but this might answer the question before they get around to answering it.
27-JUL-2023 So far: 1) made a backup of the data directories from production 2) converted the HPC log files to milli-CPU 3) shredded and ingested the milli-CPU data and confirmed that more data is getting into the jobs table 4) started setting up SUPREMM.
Something that is worth trying, first in a staging environment, is to convert both the nCPU and the req_cpu values in the log files to millicores (multiply the fractional amount by 1000, then take the floor) and then see whether aggregate_supremm.sh is able to move the data from the Jobs realm to the SUPREMM realm; a sketch of that conversion follows.
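A minimal sketch of that conversion, assuming a hypothetical comma-separated log where nCPU and req_cpu sit in columns 5 and 6; the column positions and file names are placeholders, not the real log format:

```
# Hypothetical example: convert fractional CPU values (columns 5 and 6) to integer
# millicores by multiplying by 1000 and truncating, i.e. taking the floor.
awk -F',' 'BEGIN { OFS = "," } { $5 = int($5 * 1000); $6 = int($6 * 1000); print }' \
    jobs.csv > jobs-millicpu.csv
```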
From reading the documentation, I suspect that we need to configure SUPREMM to get the memory figures for HPC (OpenShift).