StatCan / aaw

Documentation for the Advanced Analytics Workspace Platform
https://statcan.github.io/aaw/

Cannot schedule SAS notebook in ohsp namespace #1935

Closed Jose-Matsuda closed 5 months ago

Jose-Matsuda commented 5 months ago

Describe the bug

Users in the ohsp-pssb namespace are unable to schedule prob (Protected B) SAS notebooks. The pod events show:

 kubelet  (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[fdi-oha-inbox-eprotb-protected-b fdi-oha-outbox-eprotb-protected-b], unattached volumes=[test-bemrose-volume istio-podinfo fdi-oha-inbox-eprotb-protected-b aaw-unclassified-ro istiod-ca-cert workload-socket fdi-oha-eprotb-protected-b kube-api-access-jwjcn workload-certs istio-envoy istio-token fdi-oha-outbox-eprotb-protected-b credential-socket protb-nb istio-data aaw-protected-b]: timed out waiting for the condition
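
When an event like the one above is pasted from a terminal, the volume names are easy to misread because they wrap at hyphens. As a rough sketch (the function name `parse_mount_timeout` is made up for illustration and is not part of any AAW tooling), the two volume lists can be pulled out like this:

```python
import re

def parse_mount_timeout(event: str) -> dict:
    """Split a kubelet 'Unable to attach or mount volumes' event into its two lists."""
    # Undo terminal wrapping: a break right after a hyphen is inside a name,
    # any other break is treated as ordinary whitespace.
    flat = re.sub(r"-\s*\n\s*", "-", event)
    flat = re.sub(r"\s*\n\s*", " ", flat)
    result = {}
    for key in ("unmounted volumes", "unattached volumes"):
        match = re.search(re.escape(key) + r"=\[([^\]]*)\]", flat)
        result[key] = match.group(1).split() if match else []
    return result

# Shortened copy of the event from this issue, wrapped the way it was pasted.
sample = (
    "Unable to attach or mount volumes: unmounted volumes=[fdi-oha-inbox-\n"
    "eprotb-protected-b fdi-oha-outbox-eprotb-protected-b], unattached volumes=[istio-\n"
    "data aaw-protected-b]: timed out waiting for the condition"
)
print(parse_mount_timeout(sample)["unmounted volumes"])
```

The "unmounted volumes" list is the one to look at: here it contains only the FDI inbox and outbox volumes, which narrows the problem to those two mounts.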

Environment info

Namespace: ohsp-pssb

Notebook/server: various. It looks like every notebook that tries to mount the FDI volumes is affected. The events above were taken from test-bemrose-0.

Jose-Matsuda commented 5 months ago

Another error that I'm just now seeing is

 Warning  FailedMount  79s (x4 over 27m)  kubelet  MountVolume.MountDevice failed for volume "ohsp-pssb-fdi-protected-b-oha-outbox-eprotb" : rpc error: code = Internal desc = Mount failed with error: rpc error: code = Unknown desc = exit status 1
Error: failed to initialize new pipeline [failed to authenticate credentials for azstorage]
, output:
Please refer to http://aka.ms/blobmounterror for possible causes and solutions for mount errors.

Now investigating the logs of azure-blob-csi-system/csi-blob-node-zr8zn on the prob node.

Jose-Matsuda commented 5 months ago

More information from FDI. I raised the log message "storageSPNClientSecret is not empty, use it to access storage account(...), container(...transit)" with them, and their response was: "This transit container was not required by the client, therefore was never created."

They added: "Just to maybe state the obvious, not all use cases Storage have a "transit" container, with its "Inbox" and "Outbox"; only the ones for which we've been asked to implement an automated ingestion and/or extraction pipeline. When no automation is needed, the "transit" container/folders shouldn't be mounted on your side. We document it in the Jira, as shown here: [CODAS-2298] FDI - Oral Health Analytics (OHA) - Common Storage in DAS - Statistics Canada Jira B (statcan.ca)"
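
The rule FDI describes (mount inbox/outbox only when a transit container exists) can be sketched as below. This is an illustration only: the `FdiStorage` class and its `has_transit` flag are hypothetical, and the volume naming pattern is inferred from the names in the kubelet event above, not taken from the actual AAW configuration.

```python
from dataclasses import dataclass

@dataclass
class FdiStorage:
    name: str          # hypothetical short identifier, e.g. "oha"
    has_transit: bool  # True only if FDI set up an ingestion/extraction pipeline

def volume_names(storage: FdiStorage, suffix: str = "eprotb-protected-b") -> list[str]:
    """Return the volume names to mount for one FDI storage use case."""
    # The main container is always mounted.
    names = [f"fdi-{storage.name}-{suffix}"]
    # Inbox/outbox live in the transit container, so skip them when it was never created.
    if storage.has_transit:
        names.append(f"fdi-{storage.name}-inbox-{suffix}")
        names.append(f"fdi-{storage.name}-outbox-{suffix}")
    return names

print(volume_names(FdiStorage(name="oha", has_transit=False)))
```

With `has_transit=False`, only the main volume is listed, so the pod never waits on inbox/outbox mounts that cannot succeed.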

Jose-Matsuda commented 5 months ago

GitLab merge request (with details) that will hopefully resolve this: https://gitlab.k8s.cloud.statcan.ca/cloudnative/aaw/daaas-infrastructure/aaw-prod-cc-00/-/merge_requests/126

with pipeline here

Jose-Matsuda commented 5 months ago

That seems to have fixed it. It was a weird scenario, explained in the GitLab merge request: for a while the changes weren't applied and the client could work fine, but a week ago we applied all the changes, which pulled in the original changes as well.

Just removing the transit container mounts fixed it.