julietcohen opened 3 months ago
This issue is a sub-task towards the ultimate goal of https://github.com/PermafrostDiscoveryGateway/viz-workflow/issues/1
I was able to resolve the `CrashLoopBackOff` issue with the published image version 0.1.4. The script `parsl_workflow.py` in this version was running a simple mathematical function in parallel instead of the viz workflow. Running `kubectl get pods` during the process showed that the pods were running.
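As a rough illustration of the kind of test workload described above (a simple mathematical function run in parallel, with sleep periods and progress prints), here is a parsl-free stand-in using Python's `concurrent.futures`; the function name and parameters are hypothetical, and the real script uses parsl apps instead:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def slow_square(x, delay=0.05):
    """Toy stand-in for a viz task: sleep briefly, then do trivial math."""
    time.sleep(delay)
    print(f"finished task {x}")
    return x * x

# Submit several tasks in parallel and collect their results.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(slow_square, i) for i in range(8)]
    results = sorted(f.result() for f in as_completed(futures))

print("Script complete")  # mirrors the final log line described in this issue
```

The sleeps make the job last a noticeable amount of time, just as described, so the run is easy to observe from a tmux session.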
The job lasted a while, as it should because there are sleep periods inserted into the function, and it was running in a tmux session. Throughout the process, the script successfully output the print messages we expect from the function. The print statements in the script itself, outside of the function, were also present, such as `Script complete`. An example snippet saved to the log is the following:
When I checked back on the process this morning, at the end of the log, I see something new:
I wonder why there is a `KeyboardInterrupt` message at the end. I also do not know how to interpret the error. I checked the pod statuses again as well, and they are still `Running`.
Overall, the workflow has certainly made a big step forward. I think it is safe to say that the job is running successfully now; the remaining oddity is how the pods shut down (or fail to) after the script completes.
A note on the last comment: the `KeyboardInterrupt` message at the end of the log was left over from a previous run, when I cancelled the command to run the script due to issues. I should have realized that the log is not replaced with each run, but rather appended to, as is the case with the normal viz run as well.
Since the other tickets regarding setting up a new user and an environment in the container have been resolved, there may not be a "connection issue" between parsl and k8s anymore; instead, some adjustments likely remain to be made to write the log and the output viz tilesets. The workflow is not running smoothly yet, but with the new user and environment set up in the container, I will be able to pinpoint the smaller issues more easily.
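Since whether the log accumulates across runs depends on the redirection operator, it may be worth noting the shell semantics: `>` truncates the target file on each run, while `>>` appends. A minimal sketch (the file name here is hypothetical):

```shell
# '>' truncates the target file each time it is used.
echo "run 1" > demo.log
echo "run 2" > demo.log    # demo.log now holds only "run 2"
# '>>' appends to whatever is already there.
echo "run 3" >> demo.log   # demo.log now holds "run 2" then "run 3"
cat demo.log
```

If stale messages like the leftover `KeyboardInterrupt` are confusing, truncating (or rotating) the log between runs avoids them.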
Progress

The parsl and kubernetes viz workflow has been progressing nicely in the following ways:

- `WORKDIR` is specified in the Dockerfile: `app/` for the `WORKDIR` and `/mnt/data` for the PV
- the line that copies `parsl_config.py` was moved lower in the Dockerfile, because that script is often updated with a new version number for the published image, and the pip install line was moved to right after copying over the `requirements.txt`
- a `runinfo` directory is created each run (in the dir of the python script on Datateam, not in the container or PV), which is a sign parsl is working behind the scenes to some degree
- the script is run with `python parsl_workflow.py > k8s_parsl.log 2>&1`
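The Dockerfile layer ordering described above can be sketched as follows; the base image and exact file names are assumptions, but the ordering principle is that files which change often are copied last so earlier cached layers (like the pip install) survive rebuilds:

```dockerfile
# Hypothetical sketch of the layer ordering described in this issue.
FROM python:3.9
WORKDIR /app

# Copy requirements first so the pip layer is cached across rebuilds.
COPY requirements.txt .
RUN pip install -r requirements.txt

# parsl_config.py changes often (image version bumps), so copy it late
# to avoid invalidating the cached pip layer above.
COPY parsl_config.py .
COPY parsl_workflow.py .
```

The PV (`/mnt/data`) is mounted at runtime by kubernetes rather than declared here.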
Problems

- viz output files are not writing to the specified persistent directory
- the pods are not shutting down by themselves
- when you check the pods with `kubectl get pods`, their status is `CrashLoopBackOff`
- checking the log of a specific pod with `kubectl logs {podname}` returns the print statement that we included in the workflow ("Worker started...") plus a vague syntax error

A good sign
Print statements inserted into the script at all stages are being printed to the log output we specify when we run the python script (in the example command given above, that is `k8s_parsl.log`), including the final statement "script complete".

When running the parsl and kubernetes workflow with a parsl app that neither ingests data files nor outputs files, and instead executes a simple iterative mathematical operation with print statements and sleep periods inserted, the output seems to imply the script worked as expected. However, the pods are still not shutting down afterwards, and the `CrashLoopBackOff` status persists.

Useful Commands
`kubectl run -n pdgrun -i --tty --rm busybox --image=busybox -- sh` initiates a pod and inserts you into it so you can poke around. This pod is named `busybox` by default. One way to "poke around" is to connect to the IP address and port you specify in the parsl config to see if it is open. Example: `telnet 128.111.85.174 54001`
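A scriptable variant of the same connectivity check can be done from Python; the helper name below is hypothetical, and the host/port would be whatever the parsl config advertises:

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False

# Example (same endpoint as the telnet check above):
# port_open("128.111.85.174", 54001)
```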
Suggested next steps

- in the parsl config, comment out the line we have been using thus far, `worker_init = 'echo "Worker started..."',`, to get more info in the pod logs
- in `parsl_config.py`, play around with using `address = address_by_route(),` vs `address = '128.111.85.174',`
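Putting those suggestions together, a hypothetical sketch of `parsl_config.py` might look like the following; the class and parameter names assume parsl's `HighThroughputExecutor`/`KubernetesProvider` API, and the image value is a placeholder:

```python
# Hypothetical sketch contrasting the two address options discussed above.
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import KubernetesProvider
from parsl.addresses import address_by_route

config = Config(
    executors=[
        HighThroughputExecutor(
            label="kube-htex",
            # Option A: let parsl discover a routable address for the workers.
            address=address_by_route(),
            # Option B: pin the address the pods should dial back to.
            # address="128.111.85.174",
            provider=KubernetesProvider(
                namespace="pdgrun",
                image="...",  # the published image, e.g. version 0.1.4
                # Commenting this out may surface more detail in the pod logs:
                # worker_init='echo "Worker started..."',
            ),
        )
    ]
)
```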
Thank you to Matthew Brook and Matt Jones for all your help troubleshooting thus far!