[ML] Create a Docker build image using the latest PyTorch stable branch.

elastic / ml-cpp

Machine learning C++ code


[ML] Create a Docker build image using the latest PyTorch stable branch. #2685

Closed: edsavage closed this 1 month ago

edsavage commented 1 month ago

On Linux x86_64, provide the ability to create Docker build images that use code pulled from the latest PyTorch stable release branch (currently 2.3.1).

Upon successful build and push of such a Docker image, trigger a build of the ml-cpp code in it.

Once the ml-cpp build has succeeded, it should in turn trigger a pipeline in the QAF repo that runs a set of tests exercising the pytorch_inference executable.
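The chain above (image build, then ml-cpp build, then QA tests) maps onto Buildkite command steps followed by trigger steps. Below is a minimal sketch that assumes each stage emits its steps as Python dicts for a dynamic pipeline upload. The helper `step_then_trigger`, the placeholder commands, and the pipeline slugs `ml-cpp-build-placeholder` and `qa-pytorch-tests-placeholder` are all hypothetical. The QA pipeline's name is still TBD, and this is not the PR's actual code.

```python
from typing import Any, Mapping, Optional, Sequence


def step_then_trigger(
    label: str,
    key: str,
    commands: Sequence[str],
    next_pipeline: str,
    env: Optional[Mapping[str, str]] = None,
) -> list[dict[str, Any]]:
    """Return a Buildkite command step plus a trigger step that waits on it.

    Hypothetical helper. Because the trigger step declares depends_on,
    Buildkite only runs it once the command step has succeeded.
    """
    trigger: dict[str, Any] = {
        "label": f"Trigger {next_pipeline}",
        "trigger": next_pipeline,
        "depends_on": key,
    }
    if env:
        # Variables handed to the triggered build, e.g. which image to use.
        trigger["build"] = {"env": dict(env)}
    return [
        {"label": label, "key": key, "commands": list(commands)},
        trigger,
    ]


# Stage 1: build and push the PyTorch image, then start the ml-cpp build in it.
# Commands and pipeline slugs are placeholders, not the real ones.
IMAGE = "docker.elastic.co/ml-dev/ml-linux-dependency-build:pytorch_231"
image_stage = step_then_trigger(
    label="Build and push PyTorch build image",
    key="build_image",
    commands=[f"docker build -t {IMAGE} .", f"docker push {IMAGE}"],
    next_pipeline="ml-cpp-build-placeholder",
    env={"DOCKER_IMAGE": IMAGE},
)

# Stage 2: build ml-cpp, then hand over to the QA PyTorch test pipeline.
ml_cpp_stage = step_then_trigger(
    label="Build ml-cpp (linux x86_64)",
    key="build_ml_cpp",
    commands=["echo 'build ml-cpp here'"],
    next_pipeline="qa-pytorch-tests-placeholder",
)
```

This sketch uses `depends_on` rather than a separate `wait` step so that each trigger is tied explicitly to the step it follows. By default, a failure in that step stops the chain.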

To make this possible, a new pipeline, ml-cpp-pytorch-build, is required. It is defined in the catalog-info.yaml file and won't be created until catalog-info.yaml is merged to main and the Backstage magic does its thing.

Some tweaks have been made to our existing Buildkite framework so that existing code can be better reused; for example, a build step for Linux x86_64 alone can now be created dynamically.
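Dynamically creating only a Linux x86_64 build step could look roughly like the sketch below. It assumes the pipeline generator filters a platform/arch matrix using the `GITHUB_PR_COMMENT_VAR_PLATFORM` and `GITHUB_PR_COMMENT_VAR_ARCH` variables that appear in the manual test instructions, and it treats an unset variable as "no restriction". The target list, step keys, and echo command are illustrative assumptions, not the framework's real values.

```python
import json
import os
from typing import Any, Mapping

# Illustrative build matrix; the real framework's platform list may differ.
ALL_TARGETS = [
    ("linux", "x86_64"),
    ("linux", "aarch64"),
    ("macos", "aarch64"),
    ("windows", "x86_64"),
]


def selected_targets(env: Mapping[str, str]) -> list[tuple[str, str]]:
    """Restrict the matrix using the PR-comment variables, when they are set."""
    platform = env.get("GITHUB_PR_COMMENT_VAR_PLATFORM", "")
    arch = env.get("GITHUB_PR_COMMENT_VAR_ARCH", "")
    return [
        (p, a)
        for p, a in ALL_TARGETS
        if (not platform or p == platform) and (not arch or a == arch)
    ]


def build_step(platform: str, arch: str, env: Mapping[str, str]) -> dict[str, Any]:
    """A single Buildkite command step for one platform/arch pair."""
    step: dict[str, Any] = {
        "label": f"Build ml-cpp on {platform} {arch}",
        "key": f"build_{platform}_{arch}",
        "command": f"echo 'build {platform} {arch}'",  # placeholder command
    }
    if "DOCKER_IMAGE" in env:
        # Build inside the requested image, e.g. the PyTorch build image.
        step["env"] = {"DOCKER_IMAGE": env["DOCKER_IMAGE"]}
    return step


def generate_pipeline(env: Mapping[str, str]) -> str:
    """Serialize the selected steps as a Buildkite pipeline in JSON form."""
    steps = [build_step(p, a, env) for p, a in selected_targets(env)]
    return json.dumps({"steps": steps}, indent=2)


if __name__ == "__main__":
    # Typically piped into `buildkite-agent pipeline upload`.
    print(generate_pipeline(os.environ))
```

With the manual-test variables set, the script emits a single `build_linux_x86_64` step. With none set, it falls back to the full matrix.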

TBD: The name of the QA PyTorch testing pipeline is required.

cla-checker-service[bot] commented 1 month ago

💚 CLA has been signed

edsavage commented 1 month ago

To manually test the changes in this PR:

1. Go to https://buildkite.com/elastic/ml-cpp-pr-builds and click the New Build button.
2. Add a descriptive message in the Message field.
3. Add the ID of the latest commit in the Commit field.
4. Add the name of this PR branch, pytorch_latest_docker_build, in the Branch field.
5. Click on Options and, in the Environment Variables box, add:

DOCKER_IMAGE="docker.elastic.co/ml-dev/ml-linux-dependency-build:pytorch_231"
GITHUB_PR_COMMENT_VAR_ACTION="run_pytorch_tests"
GITHUB_PR_COMMENT_VAR_ARCH="x86_64"
GITHUB_PR_COMMENT_VAR_PLATFORM="linux"
GITHUB_PR_TRIGGER_COMMENT=""

6. Click on Create Build.
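The manual steps above could also be scripted against the Buildkite REST API's "create a build" endpoint (`POST /v2/organizations/{org}/pipelines/{pipeline}/builds`), which accepts `commit`, `branch`, `message` and `env` in a JSON body. This is a minimal sketch that only builds the request. It assumes an API token with build-creation scope, and the function name `new_build_request` is hypothetical. The organization and pipeline slugs come from the URL above, and the branch and environment variables from the steps above.

```python
import json
import urllib.request

ORG = "elastic"
PIPELINE = "ml-cpp-pr-builds"


def new_build_request(commit: str, message: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request equivalent to the manual steps."""
    body = {
        "commit": commit,
        "branch": "pytorch_latest_docker_build",
        "message": message,
        "env": {
            "DOCKER_IMAGE": "docker.elastic.co/ml-dev/ml-linux-dependency-build:pytorch_231",
            "GITHUB_PR_COMMENT_VAR_ACTION": "run_pytorch_tests",
            "GITHUB_PR_COMMENT_VAR_ARCH": "x86_64",
            "GITHUB_PR_COMMENT_VAR_PLATFORM": "linux",
            "GITHUB_PR_TRIGGER_COMMENT": "",
        },
    }
    return urllib.request.Request(
        f"https://api.buildkite.com/v2/organizations/{ORG}/pipelines/{PIPELINE}/builds",
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would actually create the build. Keeping construction separate from sending makes the payload easy to inspect first.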

edsavage commented 1 month ago

buildkite build this

edsavage commented 1 month ago

The PyTorch releases are branched off viable/strict. So, as I understand the intent of this PR, we should rebuild the Docker image nightly from this PyTorch branch to identify new issues before we start depending on a new PyTorch release, shouldn't we?

Thanks @valeriy42! Yes, I believe the viable/strict branch is the most appropriate one for us to build and test against. Originally we had been targeting main, but that is too volatile. Our friends in QA suggested the latest release branch (2.3.1), but as you've pointed out, that is pretty much static. So I think viable/strict is our "Goldilocks" branch: just right 🤞