AICoE / aicoe-ci

AICoE-CI using TektonCD pipelines and triggers

image build pipeline steps failing with OOMKilled #130

Closed by erikerlandson 2 years ago

erikerlandson commented 3 years ago

Describe the bug

Trying to run an image build: https://github.com/thoth-station/ray-ml-worker/issues/11

The build pipeline task is running out of memory:

terminated:
  containerID: 'cri-o://c3c717be8b69f79fb115c3db81bd3857d1c66166cab30b47b958c166c326f961'
  exitCode: 0
  finishedAt: '2021-08-18T13:48:40Z'
  reason: OOMKilled
  startedAt: '2021-08-18T13:43:52Z'
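
Note that the status above reports exitCode: 0 alongside reason: OOMKilled, so a check on the exit code alone would not flag this step as failed. The following is a minimal Python sketch of checking the termination reason instead. It assumes the step statuses have already been loaded into a list of dictionaries shaped like the excerpt; the function name `oom_killed_steps` and the step name "build" are hypothetical and do not come from the pipeline itself.

```python
def oom_killed_steps(step_statuses):
    """Return the names of steps whose container was terminated as OOMKilled."""
    killed = []
    for step in step_statuses:
        # A step that is still running or waiting has no "terminated" entry.
        terminated = step.get("terminated") or {}
        # Look at the reason, not the exit code: the excerpt shows exitCode 0.
        if terminated.get("reason") == "OOMKilled":
            killed.append(step.get("name", "<unnamed>"))
    return killed


# Status shaped like the excerpt above; the step name "build" is hypothetical.
steps = [
    {
        "name": "build",
        "terminated": {
            "exitCode": 0,
            "reason": "OOMKilled",
            "startedAt": "2021-08-18T13:43:52Z",
            "finishedAt": "2021-08-18T13:48:40Z",
        },
    },
]

print(oom_killed_steps(steps))
```

A check like this could run over the step statuses of a finished task run to separate memory failures from ordinary build errors.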

To Reproduce

Steps to reproduce the behavior:
1. Build the pipeline as described in the 'deliver image build' issue linked above.
2. The pipeline should fail in the image build step with OOMKilled.

Expected behavior

Builds requiring more memory should run to completion.

goern commented 3 years ago

/kind bug
/priority critical-urgent

@erikerlandson @harshad16 any updates on this?

goern commented 2 years ago

@erikerlandson @harshad16 any updates on this?

goern commented 2 years ago

@harshad16 can Gregory give pipelines a little bit more memory?

erikerlandson commented 2 years ago

Based on previous image builds in this space, I am guessing the build will need at least 8 GB of memory to run.

goern commented 2 years ago

ping? @harshad16

harshad16 commented 2 years ago

This was fixed in https://github.com/thoth-station/ray-ml-worker/pull/13, and the image is released: https://quay.io/repository/thoth-station/ray-ml-worker?tab=tags

erikerlandson commented 2 years ago

thanks @harshad16 !