PermafrostDiscoveryGateway / viz-workflow

Processing workflow for visualization
Apache License 2.0

Troubleshoot connection issue between parsl and kubernetes cluster #36

Open julietcohen opened 3 months ago

julietcohen commented 3 months ago

Progress

The parsl and kubernetes viz workflow has been progressing nicely in the following ways:

Problems

A good sign

Print statements inserted at every stage of the script appear in the log file we specify when we run the python script (k8s_parsl.log in the example command given above), including the final statement "script complete".

When running the parsl and kubernetes workflow with a parsl app that neither ingests data files nor writes output files, and instead executes a simple iterative mathematical operation with print statements and sleep periods inserted, the output implies the script worked as expected. However, the pods still do not shut down afterward, and they still show the CrashLoopBackOff status.
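For reference, a test function of the kind described above might look like the following minimal sketch. The function name iterative_sum, its parameters, and the use of ThreadPoolExecutor are assumptions for illustration, not the code in the actual script; in the real workflow the function would be submitted as a parsl app rather than to a local thread pool.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def iterative_sum(task_id, n_steps=3, pause=0.01):
    """Hypothetical stand-in for the test app: add up 0..n_steps-1 with sleeps."""
    total = 0
    for step in range(n_steps):
        total += step
        print(f"task {task_id}: step {step}, running total {total}")
        time.sleep(pause)  # simulate work so each task takes a while
    return total


def run_all(n_tasks=4):
    """Run several tasks in parallel and collect their results in order."""
    # A local thread pool is used only so the sketch runs without a cluster.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(iterative_sum, i) for i in range(n_tasks)]
        return [f.result() for f in futures]


if __name__ == "__main__":
    print(run_all())
    print("script complete")
```

Because the function touches no files, any failure that remains after it returns (such as pods not shutting down) points at the execution environment rather than at data handling.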

Useful Commands

kubectl run -n pdgrun -i --tty --rm busybox --image=busybox -- sh initiates a pod and drops you into a shell inside it so you can poke around. The pod is named busybox because that is the name passed in the command. One way to "poke around" is to test whether the IP address and port you specify in the parsl config are reachable from inside the cluster. Example: telnet 128.111.85.174 54001
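The same reachability check can be sketched in Python, for cases where the pod has a Python interpreter available (the busybox image does not, so this assumes a different image, such as the workflow's own). The helper name port_is_open and the 3-second timeout are illustrative choices, not part of the workflow.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


if __name__ == "__main__":
    # Address and port taken from the telnet example above.
    print(port_is_open("128.111.85.174", 54001))
```

A False result only says the TCP connection could not be established from where the check ran; it does not distinguish a closed port from a firewall rule or a routing problem.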

Suggested next steps

Thank you to Matthew Brook and Matt Jones for all your help troubleshooting thus far!

julietcohen commented 3 months ago

This issue is a sub-task towards the ultimate goal of https://github.com/PermafrostDiscoveryGateway/viz-workflow/issues/1

julietcohen commented 3 months ago

I was able to resolve the CrashLoopBackOff issue with the published image version 0.1.4. The script parsl_workflow.py in this version was running a simple mathematical function in parallel instead of the viz workflow. Running kubectl get pods during the process showed that the pods were running.

k8s

The job lasted a while, as it should because there are sleep periods inserted into the function, and it was running in a tmux session. Throughout the process, the script successfully output the print messages we expect from the function. The print statements in the script itself, outside of the function, were also present, such as Script complete. The following is an example snippet saved to the log:

image

When I checked back on the process this morning, at the end of the log, I see something new:

image

I wonder why there is a KeyboardInterrupt message at the end. I also do not know how to interpret the error. I also checked the pods status again, and they are still Running.

Overall, the workflow has certainly made a big step forward. I think it is safe to say that the job is running successfully now, but there still seems to be something funky going on with how the pods shut down after the script completes.

julietcohen commented 3 weeks ago

A note on the last comment: the KeyboardInterrupt message at the end of the log was left over from a previous run, when I cancelled the command to run the script due to issues. I should have realized that the log is not replaced by each new run but appended to, as is the case with the normal viz run as well.
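The append behavior can be made explicit in code. The sketch below assumes the log is written through Python's logging module, which this thread does not confirm; the logger name viz_sketch and the helper make_logger are hypothetical. logging.FileHandler opens its file in append mode by default, so passing mode="w" is what gives a fresh log on each run.

```python
import logging


def make_logger(path, overwrite=False):
    """Return a logger writing to path; overwrite=True truncates the file first."""
    logger = logging.getLogger("viz_sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    # Drop handlers from earlier calls so messages are not written twice.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    mode = "w" if overwrite else "a"  # FileHandler's default mode is "a"
    handler = logging.FileHandler(path, mode=mode)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    make_logger("k8s_parsl.log", overwrite=True).info("script complete")
```

Keeping append mode and writing a clearly marked "run started" line at the top of each run is an alternative that preserves history while still making stale messages easy to spot.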

Since the other tickets regarding setting up a new user and an env in the container have been resolved, there may not be a "connection issue" between parsl and k8s anymore, but rather just some adjustments to be made to write the log and the output viz tilesets. The workflow is not running smoothly yet, but I will be able to pinpoint the smaller issue better now that we have the new user and env set up in the container.