Closed by orfeas-k 1 month ago
Same results on latest/edge
Going through the aproxy logs and the artifacts' ketall logs, I think they are irrelevant to the failing jobs in training-integration, since the only suspicious entries there are the following:
2023-11-17T15:56:37Z aproxy.aproxy[8778]: 2023/11/17 15:56:37 ERROR failed to send HTTP response to connection src=10.1.56.201:39184 original_dst=169.254.169.254:80 host=169.254.169.254:80 error="write tcp 10.8.232.2:8443->10.1.56.201:39184: write: broken pipe"
2023-11-17T15:56:37Z aproxy.aproxy[8778]: 2023/11/17 15:56:37 ERROR failed to send HTTP response to connection src=10.1.56.201:39172 original_dst=169.254.169.254:80 host=169.254.169.254:80 error="write tcp 10.8.232.2:8443->10.1.56.201:39172: write: broken pipe"
2023-11-17T15:56:39Z aproxy.aproxy[8778]: 2023/11/17 15:56:39 INFO relay connection to http proxy src=10.1.56.201:54444 original_dst=151.101.64.223:443 host=pypi.org:443
2023-11-17T15:56:39Z aproxy.aproxy[8778]: 2023/11/17 15:56:39 INFO relay connection to http proxy src=10.1.56.201:53514 original_dst=199.232.53.55:443 host=files.pythonhosted.org:443
2023-11-17T15:56:42Z aproxy.aproxy[8778]: 2023/11/17 15:56:42 ERROR failed to preread host from connection src=10.1.56.199:56028 original_dst=169.254.169.254:80 error="failed to preread HTTP request: EOF"
In these logs, 10.8.232.2 is the HostIP shared by containers of many pods, and 10.1.56.201 is the PodIP of a container in the pod test-kubeflow-sth, which also carries the annotation cni.projectcalico.org/podIP: 10.1.56.201/32. 10.1.56.199 is the PodIP of the init container of the following pod:
name: ml-pipeline-visualizationserver-7b5889796d-zx4kr
namespace: test-kubeflow
Thus I don't think these errors relate to the reason the tfjob fails.
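(The IP-to-pod mapping above comes from cross-referencing the aproxy log against the pods in the namespace. A minimal sketch of that check, using the kubernetes Python client purely for illustration rather than the actual tooling used:)

from kubernetes import client, config

# Minimal sketch: list pods in the test namespace and print the IPs that
# appear in the aproxy log (HostIP and PodIP) next to each pod's name.
config.load_kube_config()
v1 = client.CoreV1Api()
for pod in v1.list_namespaced_pod("test-kubeflow").items:
    print(pod.metadata.name, "hostIP:", pod.status.host_ip, "podIP:", pod.status.pod_ip)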
Looking at the image's code, I don't see anything specific that would require an internet connection. My best guess at the moment is that a pod fails to talk to another pod spun up by the job.
After removing the test's teardown so we can see the state of the cluster when the job fails, I don't see any failed pods, only pods in a Running state:
test-kubeflow pod/pytorch-dist-mnist-gloo-master-0 1/1 Running 0 5m9s
test-kubeflow pod/pytorch-dist-mnist-gloo-worker-0 1/1 Running 0 5m7s
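Since the pods are Running rather than failed, the next thing worth checking is their container logs. A rough sketch of how that could be done with the kubernetes Python client (the pod and namespace names are the ones listed above; this is illustrative, not the UAT code):

from kubernetes import client, config

# Dump the logs of the job's master and worker pods to see where training stalls.
config.load_kube_config()
v1 = client.CoreV1Api()
for name in ["pytorch-dist-mnist-gloo-master-0", "pytorch-dist-mnist-gloo-worker-0"]:
    print(f"--- {name} ---")
    print(v1.read_namespaced_pod_log(name=name, namespace="test-kubeflow"))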
Reran the test after adding the env argument shown below to the Job's container, as @nishant-dash did in order to get Jobs working behind a proxy.
from kubernetes.client import V1Container, V1EnvVar
container = V1Container(
    name=<name>,
    image=<image>,
    args=<args-list>,
    env=[
        V1EnvVar(name="https_proxy", value="http://squid.internal:3128"),
        V1EnvVar(name="http_proxy", value="http://squid.internal:3128"),
        V1EnvVar(name="HTTPS_PROXY", value="http://squid.internal:3128"),
        V1EnvVar(name="HTTP_PROXY", value="http://squid.internal:3128"),
    ],
)
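(For context, a rough sketch of how such a proxy-enabled container typically ends up in the job's pod template; the wrapping below is an assumption for illustration, not the actual UAT code. Both lower- and upper-case variants are set because different tools only honour one spelling of the proxy variables.)

from kubernetes.client import V1ObjectMeta, V1PodSpec, V1PodTemplateSpec

# Assumed wrapping: the container above is referenced from the pod template
# that the PyTorchJob/TFJob replica specs point to.
template = V1PodTemplateSpec(
    metadata=V1ObjectMeta(name="pytorch-dist-mnist-gloo"),
    spec=V1PodSpec(containers=[container], restart_policy="OnFailure"),
)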
However, the Jobs failed with a different error:
tenacity.RetryError: RetryError[<Future at 0x7f533e25a5e0 state=finished returned bool>]
It's also notable that the job took 20 more minutes to complete this time, which could mean that it failed at a later stage (another job, maybe). At this point, we should schedule a call with IS in order to SSH into the instance and find the cause of the above failure.
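For reference, a minimal sketch of how a RetryError like the one above typically arises (assumed, not the actual test code): the test presumably polls a boolean condition with tenacity, and when the condition never becomes True before the stop condition is reached, tenacity raises a RetryError wrapping the last attempt's Future rather than a normal exception.

import tenacity

@tenacity.retry(
    retry=tenacity.retry_if_result(lambda ok: ok is False),  # keep retrying while the check returns False
    stop=tenacity.stop_after_attempt(3),
    wait=tenacity.wait_fixed(1),
)
def job_succeeded() -> bool:
    return False  # stands in for "the training job has not reached a successful state"

job_succeeded()  # raises tenacity.RetryError[<Future ... state=finished returned bool>]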
Thank you for reporting your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5734.
This message was autogenerated
Currently we are not planning to work with self-hosted runners.
Bug Description
Running the training-integration UAT notebook on a self-hosted runner fails with AssertionError: Job mnist was not successful.
Note that the same UAT succeeds on an EC2 instance.
To Reproduce
Run the Deploy CKF bundle and run UATs workflow with:
- tests-bundle/1.8 as bundle tests
- --file=releases/latest/edge/bundle.yaml as bundle source
- 1.25-strict/stable as microk8s-version
- 3.1/stable as juju-version
- feature-orfeas-lightkube-trustenv as the uats branch
Environment
- microk8s: 1.25-strict/stable
- Juju: 3.1/stable
- CKF: latest/edge
Relevant Log Output
Additional Context
No response