CCI-MOC / ops-issues


Manually shred/ingest missing OpenShift log files into XDMoD? #903

Closed: tzumainn closed this issue 1 year ago

tzumainn commented 1 year ago

@rob-baron it looks like the XDMoD OpenShift job hasn't been running properly for a while, because a URL update caused a certificate mismatch. I've generated log files going as far back as I can (April 18); what would be the best way to have them shredded into the XDMoD instance?

rob-baron commented 1 year ago

I think the shredder will shred everything the next time it runs. The ingestor needs to be run with an additional parameter, so I can try this manually after the cron job has run once (at least that is what I have done in the past).

tzumainn commented 1 year ago

Ah, I generated the log files locally; is there an ideal way to get them to where they need to be?

rob-baron commented 1 year ago

Now I understand.

You should copy your files into the /root/xdmod_data/nerc_openshift_prod directory of the "xdmod pod name" pod.
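
If it helps, oc cp can do the copy from your machine; the pod name below is a placeholder for the actual xdmod pod:

# copy a locally generated log file into the pod (pod name is illustrative)
oc cp 2023-04-18.log xdmod-pod-name:/root/xdmod_data/nerc_openshift_prod/2023-04-18.log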

Log into the pod as system (i.e. oc --as ... rsh "xdmod pod name").

Then run the following, changing the parameters of the shredder to those of your cron job:

cd /root/xdmod_data/nerc_openshift_prod
xdmod-shredder -f openstack -d /data -r kaizen <<<=== with the parameters from your cron job
xdmod-ingestor --last-modified-start-date "2018-01-01 12:30:00"

Re-ingesting data that has already been ingested shouldn't change anything.

tzumainn commented 1 year ago

Thanks! I'm running into errors when I run ingestion:

2023-05-05 16:21:42 [notice] Finished Processing 1 SQL statements
2023-05-05 16:21:43 [notice] (action: xdmod.cloud-state-pipeline.delete-session-records (ETL\Maintenance\ExecuteSql), start_time: 1683303700.7898, end_time: 1683303703.6955, elapsed_time: 2.90568)
2023-05-05 16:21:47 [notice] Altering table `modw_cloud`.`session_records`
2023-05-05 16:21:51 [error] {"message":"xdmod.cloud-state-pipeline.cloud-session-records (ETL\\Ingestor\\DatabaseIngestor): SQLSTATE[23000]: Integrity constraint violation: 1048 Column 'processorbucket_id' cannot be null Exception: 'SQLSTATE[23000]: Integrity constraint violation: 1048 Column 'processorbucket_id' cannot be null'"}
2023-05-05 16:21:52 [warning] Stopping ETL due to exception in xdmod.cloud-state-pipeline.cloud-session-records (ETL\Ingestor\DatabaseIngestor)
2023-05-05 16:21:53 [critical] Aggregation failed: xdmod.cloud-state-pipeline.cloud-session-records (ETL\Ingestor\DatabaseIngestor): SQLSTATE[23000]: Integrity constraint violation: 1048 Column 'processorbucket_id' cannot be null Exception: 'SQLSTATE[23000]: Integrity constraint violation: 1048 Column 'processorbucket_id' cannot be null' (stacktrace: #0 /usr/share/xdmod/classes/ETL/Ingestor/pdoIngestor.php(544): CCR\Loggable->logAndThrowException('SQLSTATE[23000]...', Array)
#1 /usr/share/xdmod/classes/ETL/Ingestor/pdoIngestor.php(459): ETL\Ingestor\pdoIngestor->singleDatabaseIngest()
#2 /usr/share/xdmod/classes/ETL/Ingestor/aIngestor.php(126): ETL\Ingestor\pdoIngestor->_execute()
#3 /usr/share/xdmod/classes/ETL/EtlOverseer.php(473): ETL\Ingestor\aIngestor->execute(Object(ETL\EtlOverseerOptions))
#4 /usr/share/xdmod/classes/ETL/EtlOverseer.php(435): ETL\EtlOverseer->_execute('xdmod.cloud-sta...', Object(ETL\Ingestor\DatabaseIngestor))
#5 /usr/share/xdmod/classes/ETL/Utilities.php(281): ETL\EtlOverseer->execute(Object(ETL\Configuration\EtlConfiguration))
#6 /usr/share/xdmod/classes/OpenXdmod/DataWarehouseInitializer.php(362): ETL\Utilities::runEtlPipeline(Array, Object(CCR\Logger), Array)
#7 /usr/bin/xdmod-ingestor(310): OpenXdmod\DataWarehouseInitializer->aggregateCloudData('2023-05-01 12:3...')
#8 /usr/bin/xdmod-ingestor(21): main()
#9 {main})

Any idea what might be causing that?

tzumainn commented 1 year ago

One other question - it looks like shredding the OpenShift log files is failing as well, with a date format issue:

2023-05-05 17:24:43 [critical] Failed to shred files: Failed to shred line 1 of file 2023-04-18.log "0|0|nerc-ocp-prod|||external-secrets-operator|external-secrets-operator||||2023-04-18T08:01:00|2023-04-18T08:01:00|2023-04-18T08:01:00|2023-04-18T19:59:59|0-11:58:59||COMPLETED|1|0.01|0.1|256.0|cpu=0.1,mem=256.0|cpu=0.1,mem=256.0|0-11:58:59||cluster-external-secrets-6c7f4767bf-cg86w": SQLSTATE[22007]: Invalid datetime format: 1366 Incorrect integer value: '' for column `mod_shredder`.`shredded_job_slurm`.`gid_number` at row 1 (stacktrace: #0 /usr/bin/xdmod-shredder(207): OpenXdmod\Shredder->shredFile('2023-04-18.log')

Do you know if the accepted date format has changed?

rob-baron commented 1 year ago

The ingestor is not aggregating the cloud side because we have some old flavors that are undefined in OpenStack, so I created stand-in flavors with 0 cpu and 0 mem; I should have made them 1 cpu and 1 mem. When I get done with the hierarchy, or at least closer, I will work harder at fixing that.

We haven't updated anything involving XDMoD. They have just released 10.2, and the upgrade is on my radar, but we will continue to be on 10.0 for a while. I have been less than thrilled with their error messages, as sometimes they are correct and sometimes they are not.

I am using the same date format that you are, so I suspect the format is not the issue. I suspect you are missing a column, as the next error message is 1366 Incorrect integer value: '' for column `mod_shredder`.`shredded_job_slurm`.`gid_number` at row 1.

You might throw on the --debug flag when shredding.
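
For example, reusing the shredder parameters from the earlier comment:

xdmod-shredder --debug -f openstack -d /data -r kaizen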

As for ingesting, try setting --aggregate=job, which should limit aggregation to the job realm and sidestep the cloud aggregation that is failing:

xdmod-ingestor --aggregate=job --last-modified-start-date 2018-01-01

tzumainn commented 1 year ago

Ah, yes - if I add a dummy gid_number and uid_number, it stops complaining. These must have been optional before, since I'd been leaving them blank; are these values you set on the OpenStack side? If not, I'll just put in dummy integers for now.
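
For anyone patching already-generated files by hand, an awk pass along these lines would fill the blanks; the field positions here are purely hypothetical, since the actual uid_number/gid_number columns depend on the Slurm-style format the shredder expects:

# replace empty uid/gid fields with a dummy 0 (field indices 4 and 5 are hypothetical)
awk 'BEGIN { FS = OFS = "|" } { if ($4 == "") $4 = 0; if ($5 == "") $5 = 0; print }' 2023-04-18.log > 2023-04-18.fixed.log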

tzumainn commented 1 year ago

Okay, this should be resolved by https://github.com/OCP-on-NERC/xdmod-openshift-scripts/pull/8

tzumainn commented 1 year ago

I've manually added logs from the start of May up to now. I guess I'll have to do this periodically until the PR I linked above is reviewed/merged/pushed into production!

tzumainn commented 1 year ago

@rob-baron thanks for the PR approval! With that merged, what needs to happen for it to make it into the infra cluster? Do the xdmod images need to be rebuilt through ArgoCD? I don't think I have the rights to do that.

rob-baron commented 1 year ago

I'll kick off the build.

tzumainn commented 1 year ago

Thanks! Do you know if the build went through? The openshift job failed with the same error, and ArgoCD says a bunch of things are out of sync.

rob-baron commented 1 year ago

That build failed; I will kick it off again.

nerc-ocp-infra was very slow yesterday and is slower today. I will let you know when the build is finished.

tzumainn commented 1 year ago

The openshift cron job is failing with:

Failed to pull image "xdmod-openstack": rpc error: code = Unknown desc = reading manifest latest in docker.io/library/xdmod-openstack: errors: denied: requested access to the resource is denied unauthorized: authentication required

However, I don't know if this might be because of the current infra cluster issues... ?

rob-baron commented 1 year ago

Nothing has changed since we kicked off the build process on that project.

I'm not sure what the format should be or why it is no longer shredding.

rob-baron commented 1 year ago

There is a latest version of xdmod-openstack, so we shouldn't be seeing the failed-to-pull-image error if the cluster were working.

What that error most likely means is that Kubernetes was unable to connect to the cluster's internal registry, which is where the image is stored. When I checked on it (the build process was failing to push to that registry), there were 3 images (previous builds) for each Dockerfile, so I know the image is there.

I've started the build process multiple times on nerc-ocp-infra and it doesn't run to completion, though it does build on GitHub and on ocp-staging (the MOC staging cluster). I think we will just have to wait until the cluster is back to being healthy before we update the xdmod project.

In the meantime, feel free to modify your scripts so that we can test on ocp-staging (https://console-openshift-console.apps.ocp-staging.massopen.cloud/k8s/ns/xdmod/build.openshift.io~v1~BuildConfig). It might be a good time to utilize the overlays, so that a staging overlay can handle the differences between ocp-staging:xdmod, nerc-ocp-infra:xdmod and nerc-ocp-infra:xdmod-staging.
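
For reference, applying a kustomize overlay is a single command; the path below assumes the layout of the xdmod-cntr repo:

# apply the nerc-ocp-infra overlay from a checkout of xdmod-cntr
oc apply -k k8s/overlays/nerc-ocp-infra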

joachimweyl commented 1 year ago

Infra issues are blocking this; the build process stops at authorization to push to the repo.

joachimweyl commented 1 year ago

@tzumainn now that the infra issues are resolved can we move forward with this?

tzumainn commented 1 year ago

Ah, I didn't realize they were resolved! I'll take a look.

tzumainn commented 1 year ago

I'm still getting the "ImagePullBackOff" error with new xdmod-openshift jobs - @rob-baron do you know if the xdmod images have been updated yet?

rob-baron commented 1 year ago

I kicked off the builds when I left yesterday. All three builds built and pushed to the internal registry.

The xdmod-openstack and ingestor cron jobs are completing, and seem to be working.

Might I suggest that you include the internal registry with your image, as in:

image: image-registry.openshift-image-registry.svc:5000/xdmod/moc-xdmod

as opposed to:

image: moc-xdmod

I do this in the overlays for the nerc infra cluster here:

https://github.com/CCI-MOC/xdmod-cntr/blob/main/k8s/overlays/nerc-ocp-infra/patches/cj-xdmod-openstack-shred.yaml

patching it via the kustomization file here:

https://github.com/CCI-MOC/xdmod-cntr/blob/main/k8s/overlays/nerc-ocp-infra/kustomization.yaml

I don't know why the nerc infra cluster goes from resolving to not resolving the internal registry. I don't have to include the internal registry in the image name on ocp-staging, though I sometimes do to be consistent with what I am deploying to the infra cluster.
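
When the registry reference is in doubt, a quick sanity check (assuming the xdmod namespace) is to list the image streams and confirm the builds actually landed in the internal registry:

oc -n xdmod get imagestreams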

tzumainn commented 1 year ago

Ah, thanks! The xdmod-openshift job already specifies the image as you suggest. However, it's still running into the Init:ImagePullBackOff error - any thoughts as to why?

rob-baron commented 1 year ago

However, the init container does not:

          initContainers:
          - command:
            - /app/xdmod-get-config.sh
            image: xdmod-openstack
            imagePullPolicy: IfNotPresent
            name: download-config
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /etc/xdmod
              name: vol-xdmod-conf
            - mountPath: /etc/openstack
              name: vol-clouds-yaml

Which appears to be the container that is failing to pull the image:

Rob@MacBook-Pro xdmod-pr % oc logs xdmod-openshift-prod-job-28064220-n5kf4 -c download-config
Error from server (BadRequest): container "download-config" in pod "xdmod-openshift-prod-job-28064220-n5kf4" is waiting to start: trying and failing to pull image
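
To surface the actual pull failure, describing the pod shows the image pull events (pod name copied from above):

oc describe pod xdmod-openshift-prod-job-28064220-n5kf4
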
tzumainn commented 1 year ago

Ah, gotcha! I've submitted a PR - https://github.com/CCI-MOC/xdmod-cntr/pull/176. If it works for you, could you merge, and I guess update the openshift job?

tzumainn commented 1 year ago

Thanks for merging the PR! I noticed that the openshift pod has the same error; does it need to be synced through argocd? I'd do it myself, but I don't have permissions.

rob-baron commented 1 year ago

Argo should deploy everything without needing any manual intervention, as these changes are just to the YAML.

I'll check on Argo to ensure that they go through.

I'm working through each of the pods that have similar errors.

tzumainn commented 1 year ago

Okay! The openshift job is still getting errors, unfortunately:

Failed to pull image "xdmod-openstack": rpc error: code = Unknown desc = reading manifest latest in docker.io/library/xdmod-openstack: errors: denied: requested access to the resource is denied unauthorized: authentication required

rob-baron commented 1 year ago

And here is why:

spec:  
  restartPolicy: Never  
  initContainers:    
    - resources: {}
...      
      image: xdmod-openstack

Apparently ArgoCD is having difficulty updating the cron job.

I deleted the cron job so that ArgoCD can replace it.
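
The deletion itself is a one-liner; the cron job name below is a placeholder, so substitute whatever oc -n xdmod get cronjobs reports:

# delete the stale cron job; ArgoCD will recreate it from the repo
oc -n xdmod delete cronjob xdmod-openshift-prod-job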

tzumainn commented 1 year ago

Thanks! The new job seems to be running with no problems, and I've manually shredded logs to cover the month of May. All of this data can be viewed in the XDMoD UI.

joachimweyl commented 1 year ago

@tzumainn when shall we expect to see the data show up in the XDMoD dashboard?

tzumainn commented 1 year ago

I just checked - it's there right now!

joachimweyl commented 1 year ago

I did not realize we could not backfill March 13th through May 1st. Thank you for clarifying.