bundle_v2 integration tests fail because the runner runs out of disk space

misohu commented 6 months ago

Bug Description

When running CI, the bundle v2 integration test fails. After SSHing into the runner, we found that the main problem is insufficient disk space, which leaves pods stuck in the Pending state with the following event:

Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  3m13s  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
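
The event above says the only node carries the node.kubernetes.io/disk-pressure taint. As a rough sketch for confirming this from a script, the snippet below lists the nodes that have that taint. It assumes kubectl is on PATH and pointed at the MicroK8s cluster; the function names are hypothetical and not part of this repository.

```python
import json
import subprocess

# Taint key reported in the FailedScheduling event.
DISK_PRESSURE_TAINT = "node.kubernetes.io/disk-pressure"


def nodes_with_disk_pressure(node_list: dict) -> list:
    """Return the names of nodes that carry the disk-pressure taint."""
    tainted = []
    for node in node_list.get("items", []):
        # spec.taints is absent when a node has no taints at all.
        taints = node.get("spec", {}).get("taints") or []
        if any(t.get("key") == DISK_PRESSURE_TAINT for t in taints):
            tainted.append(node["metadata"]["name"])
    return tainted


def fetch_nodes() -> dict:
    """Read the node list as JSON (assumes kubectl is installed and configured)."""
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)
```

Calling nodes_with_disk_pressure(fetch_nodes()) on the runner would be expected to return the single node while the disk is nearly full.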

df command from inside the runner:

  runner@fv-az572-42:~/work/kfp-operators/kfp-operators$ df
Filesystem                  1K-blocks     Used Available Use% Mounted on
/dev/root                    76026616 74390372   1619860  99% /
devtmpfs                      8183156        0   8183156   0% /dev
tmpfs                         8187672        4   8187668   1% /dev/shm
tmpfs                         1637536     3212   1634324   1% /run
tmpfs                            5120        0      5120   0% /run/lock
tmpfs                         8187672        0   8187672   0% /sys/fs/cgroup
/dev/sdb15                     106858     6186    100673   6% /boot/efi
/dev/loop0                      65536    65536         0 100% /snap/core20/2182
/dev/loop1                      40064    40064         0 100% /snap/snapd/21184
/dev/loop2                      94080    94080         0 100% /snap/lxd/24061
/dev/sda1                    76829444 71714284   1166720  99% /mnt
tmpfs                         1637532        0   1637532   0% /run/user/1001
/dev/mapper/buildvg-buildlv  76661516   264724  76380408   1% /home/runner/work/kfp-operators/kfp-operators
/dev/loop5                     106496   106496         0 100% /snap/core/16928
/dev/loop6                      76032    76032         0 100% /snap/core22/1122
/dev/loop7                     152192   152192         0 100% /snap/lxd/27049
tmpfs                            1024        0      1024   0% /var/snap/lxd/common/ns
/dev/loop8                      93568    93568         0 100% /snap/juju/25751
/dev/loop9                        256      256         0 100% /snap/jq/6
/dev/loop10                     28032    28032         0 100% /snap/charm/712
/dev/loop11                     29312    29312         0 100% /snap/charmcraft/2453
/dev/loop12                      1536     1536         0 100% /snap/juju-bundle/25
/dev/loop13                     12544    12544         0 100% /snap/juju-crashdump/271
/dev/loop14                     57088    57088         0 100% /snap/core18/2812
/dev/loop15                    167552   167552         0 100% /snap/microk8s/6575
/dev/loop16                     12288    12288         0 100% /snap/kubectl/3206
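
Most of the 100% rows above are /dev/loop devices, which are read-only snap images and always show as full. A minimal sketch for filtering the df output down to the filesystems that matter is shown below; the function name and the 90% threshold are assumptions chosen for illustration.

```python
def full_filesystems(df_output: str, threshold: int = 90) -> list:
    """Return (mount point, use %) for non-loop filesystems at or above threshold."""
    results = []
    for line in df_output.splitlines():
        fields = line.split()
        # Data rows have six columns; the header splits into seven and is skipped.
        if len(fields) != 6:
            continue
        use = fields[4]
        if not use.endswith("%") or not use[:-1].isdigit():
            continue
        device, use_pct, mount = fields[0], int(use[:-1]), fields[5]
        # Snap loop devices are read-only squashfs images and always report 100%.
        if device.startswith("/dev/loop"):
            continue
        if use_pct >= threshold:
            results.append((mount, use_pct))
    return results
```

For the output above, only / and /mnt qualify, both at 99%.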

This issue is similar to https://github.com/canonical/bundle-kubeflow/issues/813.

To Reproduce

1. Open a PR from main against main to trigger CI.
2. Check the bundle v2 integration test.

Environment

GitHub Actions CI on the main branch

Relevant Log Output

Added 'kubeflow' model on microk8s/localhost with credential 'microk8s' for user 'admin'
bundle-integration-v2: install_deps> python -I -m pip install 'kfp<3.0,>=2.4' -r requirements-integration.txt
bundle-integration-v2: freeze> python -m pip freeze --all
bundle-integration-v2: aiohttp==3.8.5,aiosignal==1.3.1,anyio==4.0.0,asttokens==2.4.0,async-timeout==4.0.3,attrs==23.1.0,backcall==0.2.0,bcrypt==4.0.1,cachetools==5.3.1,certifi==2023.7.22,cffi==1.15.1,charset-normalizer==3.2.0,click==8.1.7,cryptography==41.0.3,decorator==5.1.1,docstring_parser==0.16,exceptiongroup==1.1.3,executing==1.2.0,frozenlist==1.4.0,google-api-core==2.18.0,google-auth==2.22.0,google-cloud-core==2.4.1,google-cloud-storage==2.11.0,google-crc32c==1.5.0,google-resumable-media==2.7.0,googleapis-common-protos==1.63.0,h11==0.14.0,httpcore==0.17.3,httpx==0.24.1,hvac==1.2.0,idna==3.4,importlib-resources==6.0.1,iniconfig==2.0.0,ipdb==0.13.13,ipython==8.12.2,jedi==0.19.0,Jinja2==3.1.2,jsonschema==4.17.3,juju==3.2.2,kfp==2.5.0,kfp-pipeline-spec==0.2.2,kfp-server-api==2.0.5,kubernetes==25.3.0,lightkube==0.14.0,lightkube-models==1.28.1.4,macaroonbakery==1.3.1,MarkupSafe==2.1.3,matplotlib-inline==0.1.6,multidict==6.0.4,mypy-extensions==1.0.0,oauthlib==3.2.2,packaging==23.1,paramiko==2.12.0,parso==0.8.3,pexpect==4.8.0,pickleshare==0.7.5,pip==24.0,pkgutil_resolve_name==1.3.10,pluggy==1.3.0,prompt-toolkit==3.0.39,proto-plus==1.23.0,protobuf==3.20.3,ptyprocess==0.7.0,pure-eval==0.2.2,pyasn1==0.5.0,pyasn1-modules==0.3.0,pycparser==2.21,Pygments==2.16.1,pyhcl==0.4.5,pymacaroons==0.13.0,PyNaCl==1.5.0,pyRFC3339==1.1,pyrsistent==0.19.3,pytest==7.4.2,pytest-asyncio==0.21.1,pytest-operator==0.29.0,python-dateutil==2.8.2,pytz==2023.3.post1,PyYAML==6.0.1,requests==2.31.0,requests-oauthlib==1.3.1,requests-toolbelt==0.10.1,rsa==4.9,setuptools==69.1.0,six==1.16.0,sniffio==1.3.0,stack-data==0.6.2,tabulate==0.9.0,tenacity==8.2.3,tomli==2.0.1,toposort==1.10,traitlets==5.9.0,typing-inspect==0.9.0,typing_extensions==4.7.1,urllib3==1.26.16,wcwidth==0.2.6,websocket-client==1.6.2,websockets==8.1,wheel==0.42.0,yarl==1.9.2,zipp==3.16.2
bundle-integration-v2: commands[0]> pytest -vv --tb=native -s --model kubeflow --bundle=./tests/integration/bundles/kfp_latest_edge.yaml.j2 --destructive-mode /home/runner/work/kfp-operators/kfp-operators/tests/integration/test_kfp_functional_v2.py
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.4.2, pluggy-1.3.0 -- /home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/bin/python
cachedir: .tox/bundle-integration-v2/.pytest_cache
rootdir: /home/runner/work/kfp-operators/kfp-operators
configfile: pyproject.toml
plugins: operator-0.29.0, asyncio-0.21.1, anyio-4.0.0
asyncio: mode=strict
collecting ... collected 6 items

tests/integration/test_kfp_functional_v2.py::test_build_and_deploy PASSED
tests/integration/test_kfp_functional_v2.py::test_upload_pipeline Forwarding from 127.0.0.1:8080 -> 3000
Forwarding from [::1]:8080 -> 3000
Handling connection for 8080
Handling connection for 8080
PASSED
tests/integration/test_kfp_functional_v2.py::test_create_and_monitor_run Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Experiment details: http://localhost:8080/#/experiments/details/0a41ffed-9690-49f3-ba6f-6e279fe24755
Experiment details: http://localhost:8080/#/experiments/details/0a41ffed-9690-49f3-ba6f-6e279fe24755
Run details: http://localhost:8080/#/runs/details/9e859b91-47fa-4fe9-a9f1-45d33207a1aa
FAILEDHandling connection for 8080

tests/integration/test_kfp_functional_v2.py::test_create_and_monitor_recurring_run Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Experiment details: http://localhost:8080/#/experiments/details/97d9298f-cdf4-42d0-a25b-d230c7b091fa
PASSEDHandling connection for 8080
Handling connection for 8080
Handling connection for 8080

tests/integration/test_kfp_functional_v2.py::test_apply_sample_viewer PASSED
tests/integration/test_kfp_functional_v2.py::test_viz_server_healthcheck PASSED

=================================== FAILURES ===================================
_________________________ test_create_and_monitor_run __________________________
Traceback (most recent call last):
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/runner.py", line 341, in from_call
    result: Optional[TResult] = func()
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/runner.py", line 262, in <lambda>
    lambda: ihook(item=item, **kwds), when=when, reraise=reraise
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_hooks.py", line 493, in __call__
    return self._hookexec(self.name, self._hookimpls, kwargs, firstresult)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_manager.py", line 115, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 152, in _multicall
    return outcome.get_result()
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_result.py", line 114, in get_result
    raise exc.with_traceback(exc.__traceback__)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 77, in _multicall
    res = hook_impl.function(*args)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/runner.py", line 177, in pytest_runtest_call
    raise e
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/runner.py", line 169, in pytest_runtest_call
    item.runtest()
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/python.py", line 1792, in runtest
    self.ihook.pytest_pyfunc_call(pyfuncitem=self)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_hooks.py", line 493, in __call__
    return self._hookexec(self.name, self._hookimpls, kwargs, firstresult)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_manager.py", line 115, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 152, in _multicall
    return outcome.get_result()
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_result.py", line 114, in get_result
    raise exc.with_traceback(exc.__traceback__)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pluggy/_callers.py", line 77, in _multicall
    res = hook_impl.function(*args)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/_pytest/python.py", line 194, in pytest_pyfunc_call
    result = testfunction(**testargs)
  File "/home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/pytest_asyncio/plugin.py", line 532, in inner
    _loop.run_until_complete(task)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/home/runner/work/kfp-operators/kfp-operators/tests/integration/test_kfp_functional_v2.py", line 154, in test_create_and_monitor_run
    assert monitor_response.state == "SUCCEEDED"
AssertionError: assert 'FAILED' == 'SUCCEEDED'
  - SUCCEEDED
  + FAILED
------------------------------ Captured log setup ------------------------------
INFO     root:client.py:470 Creating experiment test-experiment.
------------------------------ Captured log call -------------------------------
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
INFO     root:client.py:1379 Waiting for the job to complete...
=============================== warnings summary ===============================
tests/integration/test_kfp_functional_v2.py::test_upload_pipeline
  /home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/kfp/client/client.py:159: FutureWarning: This client only works with Kubeflow Pipeline v2.0.0-beta.2 and later versions.
    warnings.warn(

tests/integration/test_kfp_functional_v2.py: 37 warnings
  /home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/kfp_server_api/rest.py:47: DeprecationWarning: HTTPResponse.getheader() is deprecated and will be removed in urllib3 v2.1.0. Instead use HTTPResponse.headers.get(name, default).
    return self.urllib3_response.getheader(name, default)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/integration/test_kfp_functional_v2.py::test_create_and_monitor_run
============ 1 failed, 5 passed, 38 warnings in 1395.30s (0:23:15) =============
Task was destroyed but it is pending!
task: <Task pending name='Task_Pinger' coro=<Connection._pinger() running at /home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/juju/client/connection.py:619> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f173b6956a0>()]> cb=[gather.<locals>._done_callback() at /usr/lib/python3.8/asyncio/tasks.py:769, gather.<locals>._done_callback() at /usr/lib/python3.8/asyncio/tasks.py:769]>
Task was destroyed but it is pending!
task: <Task pending name='Task_Receiver' coro=<Connection._receiver() running at /home/runner/work/kfp-operators/kfp-operators/.tox/bundle-integration-v2/lib/python3.8/site-packages/juju/client/connection.py:570> wait_for=<Future finished result=None> cb=[gather.<locals>._done_callback() at /usr/lib/python3.8/asyncio/tasks.py:769, gather.<locals>._done_callback() at /usr/lib/python3.8/asyncio/tasks.py:769]>
Task was destroyed but it is pending!
task: <Task pending name='Task-12953' coro=<Event.wait() done, defined at /usr/lib/python3.8/asyncio/locks.py:296> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f173c05a7c0>()]> cb=[gather.<locals>._done_callback() at /usr/lib/python3.8/asyncio/tasks.py:769]>
bundle-integration-v2: exit 1 (1396.61 seconds) /home/runner/work/kfp-operators/kfp-operators> pytest -vv --tb=native -s --model kubeflow --bundle=./tests/integration/bundles/kfp_latest_edge.yaml.j2 --destructive-mode /home/runner/work/kfp-operators/kfp-operators/tests/integration/test_kfp_functional_v2.py pid=16930
  bundle-integration-v2: FAIL code 1 (1417.28=setup[20.67]+cmd[1396.61] seconds)
  evaluation failed :( (1417.62 seconds)
Error: Process completed with exit code 1.

Additional Context

No response

syncronize-issues-to-jira[bot] commented 6 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5553.

This message was autogenerated

ca-scribner commented 6 months ago

I think this has occurred because of at least two things:

1. The images used by KFP have grown over time (e.g. gcr.io/ml-pipeline/visualization-server:2.0.3 is ~5.2GB, whereas in the past it was closer to 4.4GB).
2. The easimon/maximize-build-space action no longer frees up as much space as it used to (see this comment).

It appears that because of (2), the 2.0.3 track here has just enough space to run the tests, and (1) and (2) combined mean that when we create a user profile, the runner runs out of space while deploying the visualization and artifact server pods in the user's namespace.

A possible solution is to switch to the jlumbroso/free-disk-space action, which, with default settings, leaves the runner with ~45GB free.
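
Whichever cleanup action is used, a pre-flight check could make this kind of failure show up immediately instead of after a 20-minute test run. The sketch below is only an illustration: the function names are hypothetical, and any minimum passed in (for example 30 GiB) is an assumed figure, not a measured requirement of these tests.

```python
import shutil

GIB = 1024 ** 3


def free_gib(path: str = "/") -> float:
    """Free space on the filesystem containing path, in GiB."""
    return shutil.disk_usage(path).free / GIB


def require_free_space(min_gib: float, path: str = "/") -> float:
    """Raise RuntimeError if less than min_gib is free; otherwise return the free GiB."""
    available = free_gib(path)
    if available < min_gib:
        raise RuntimeError(
            f"only {available:.1f} GiB free on {path}, need at least {min_gib} GiB"
        )
    return available
```

Calling require_free_space(30) at the start of the test session (the 30 is an assumption) would stop the job early with a clear message when the runner is short on space.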

kimwnasptd commented 6 months ago

Nice quick way to unblock us, @ca-scribner!

For the long-term solution, I propose that we go with self-hosted runners: https://github.com/canonical/kfp-operators/pull/428#issuecomment-2046961785

I tried playing around with those in https://github.com/canonical/kfp-operators/pull/415 and https://github.com/canonical/kfp-operators/pull/414. I'll clean those up and open a dedicated PR and issue for this, so that we can focus on all the changes we'll need to make.

I'll have them ready by the sprint so that we can sit down with the IS team and show them our blockers.

I'll add a comment here as well so that the lineage of the effort is tracked.
