Add Gemma Flex Template example

jrmccluskey commented 1 week ago

Description

Adds a Gemma Flex Template example and an e2e test running on Dataflow. This code example is similar to #11284, but uses a PyTorch model and deploys as a Flex Template. The e2e test will need model weights staged to GCS, like the streaming Gemma example.

Note: Before submitting a pull request, please open an issue for discussion if you are not associated with Google.

Checklist

[ ] I have followed Sample Guidelines from AUTHORING_GUIDE.MD
[ ] README is updated to include all relevant information
[ ] Tests pass: nox -s py-3.9 (see Test Environment Setup)
[ ] Lint pass: nox -s lint (see Test Environment Setup)
[ ] These samples need a new API enabled in testing projects to pass (let us know which ones)
[ ] These samples need a new/updated env vars in testing projects set to pass (let us know which ones)
[ ] This sample adds a new sample directory, and I updated the CODEOWNERS file with the codeowners for this sample
[ ] This sample adds a new Product API, and I updated the Blunderbuss issue/PR auto-assigner with the codeowners for this sample
[ ] Please merge this PR for me once it is approved

jrmccluskey commented 1 week ago

Not sure what happened on the Kokoro test run: the test target passed, but the test execution as a whole was killed right after the test passed.

jrmccluskey commented 1 week ago

Not sure what's happening here: the test passes, but the test session gets killed consistently. @engelke can you take a look?

jrmccluskey commented 6 days ago

@kweinmeister Still need an approving review from someone if you could take a look

jrmccluskey commented 5 days ago

If the complaint is with the model handler code, I don't think it's too much of a change to cut that code and link to the source instead.

glasnt commented 5 days ago

Debugging the tests: the output shows it's a timeout, but the tests are successful(?)

collecting ... collected 1 item

e2e_test.py::test_pipeline_dataflow PASSED                               [100%]

-- generated xml file: /workspace/dataflow/gemma-flex-template/sponge_log.xml --
======================== 1 passed in 3642.73s (1:00:42) ========================
nox > Session py-3.10 was successful.

err: signal: killed

The Kokoro config is set to a max of 60 min (config), and you've configured the test to have a 5400s (90 minute) timeout.

At a guess, while the image is created each time (~20 mins) and it takes time for the job to start (~20 mins), the success message isn't being received, and thus the system has 20 minutes of wait before it times out.

How long is this entire e2e test expected to take, and is the 5400s wait there intentionally? Something else will need to be updated for that decorator to be respected.
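
For reference, the numbers quoted in this comment can be put side by side. This is only a rough sketch: the constants are copied from the log and the figures above, and it assumes the tightest limit is the one that ends up killing the session, regardless of which layer enforces it.

```python
# Rough time budget for the e2e run, using the numbers quoted in this thread.
KOKORO_LIMIT_S = 60 * 60   # Kokoro build limit: 60 minutes
PYTEST_TIMEOUT_S = 5400    # timeout configured on the test itself
OBSERVED_RUN_S = 3642.73   # "1 passed in 3642.73s" from the log above


def fmt(seconds: float) -> str:
    """Format a duration in seconds as H:MM:SS."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def effective_limit(*limits_s: float) -> float:
    """Assumption: the tightest limit wins, whichever layer enforces it."""
    return min(limits_s)


limit = effective_limit(KOKORO_LIMIT_S, PYTEST_TIMEOUT_S)
overrun = OBSERVED_RUN_S - limit

print(f"pytest timeout:  {fmt(PYTEST_TIMEOUT_S)}")
print(f"effective limit: {fmt(limit)}")
print(f"observed run:    {fmt(OBSERVED_RUN_S)} (over the limit by {overrun:.0f}s)")
```

The test alone already runs past the 60-minute limit, before counting any session setup, which would be consistent with the session being killed just after the test reports PASSED.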

jrmccluskey commented 4 days ago

As for the e2e test timeout: in early testing, runs were hovering around the hour mark (some slightly under, some slightly over), so it definitely needs to be over an hour. Building the container + running the job as an invocation from a flex template takes substantial time, so we may need a little longer on the Kokoro timeout.

tvalentyn commented 4 days ago

Building the container + running the job as an invocation from a flex template takes substantial time.

You can significantly bring this down by not including the model and the GPU software in the flex template image. This is a scenario where having two separate images, one for the flex template launcher and one for the custom container, would be better. Care should be taken to build the images with the same set of dependencies, which can be accomplished with requirements files and/or constraint files.
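
One way to guard against the two images drifting apart might be a small check over their requirements files. The sketch below is an illustration only, not part of this PR: the helper names are made up, it handles only simple `name==version` pins, and the version numbers in the usage example are placeholders.

```python
def parse_pins(text: str) -> dict[str, str]:
    """Return {package: version} for every simple 'name==version' line."""
    pins = {}
    for raw in text.splitlines():
        # Drop comments and environment markers, then surrounding whitespace.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue  # skip blanks, pip options, and unpinned entries
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def mismatched_pins(a: str, b: str) -> dict[str, tuple[str, str]]:
    """Packages pinned in both inputs, but at different versions."""
    pins_a, pins_b = parse_pins(a), parse_pins(b)
    return {
        name: (pins_a[name], pins_b[name])
        for name in sorted(pins_a.keys() & pins_b.keys())
        if pins_a[name] != pins_b[name]
    }


# Placeholder contents standing in for the two images' requirements files.
launcher_reqs = "apache-beam[gcp]==2.0.0\ntorch==1.0.0\n"
worker_reqs = "apache-beam[gcp]==2.0.0\ntorch==1.1.0  # drifted\n"
print(mismatched_pins(launcher_reqs, worker_reqs))
```

Sharing a single constraints file between both builds would avoid the need for a check like this in the first place.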

tvalentyn commented 4 days ago

Also, we can speed up launch time by downloading the model into the SDK worker container from GCS during container startup, instead of shipping it inside the container. Currently, this could be done by using a custom entrypoint like https://github.com/liferoad/beamllm/blob/main/containers/ollama/entrypoint.sh. Eventually we will have a Beam API for that.
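
A minimal sketch of that idea, written in Python rather than as the shell script linked above, and not taken from it. Several things here are assumptions: the `MODEL_GCS_URI` and `MODEL_DIR` environment variables are hypothetical names, `google-cloud-storage` is assumed to be installed in the worker image, and the worker is assumed to use the stock Beam SDK launcher at `/opt/apache/beam/boot`.

```python
import os
import sys


def split_gcs_uri(uri: str) -> tuple[str, str]:
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket, _, prefix = uri[len("gs://"):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket in {uri!r}")
    return bucket, prefix


def local_path(blob_name: str, prefix: str, dest_dir: str) -> str:
    """Map a blob name under `prefix` to a file path under `dest_dir`."""
    relative = blob_name[len(prefix):].lstrip("/") or os.path.basename(blob_name)
    return os.path.join(dest_dir, relative)


def download_model(uri: str, dest_dir: str) -> None:
    """Copy every object under `uri` into `dest_dir`."""
    from google.cloud import storage  # lazy import; needs google-cloud-storage

    bucket_name, prefix = split_gcs_uri(uri)
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("/"):
            continue  # skip "directory" placeholder objects
        target = local_path(blob.name, prefix, dest_dir)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        blob.download_to_filename(target)


def main() -> None:
    # MODEL_GCS_URI and MODEL_DIR are hypothetical variable names.
    download_model(os.environ["MODEL_GCS_URI"], os.environ.get("MODEL_DIR", "/models"))
    # Hand over to the Beam SDK harness launcher (path is an assumption).
    boot = "/opt/apache/beam/boot"
    os.execv(boot, [boot, *sys.argv[1:]])
```

The container's entrypoint would call `main()`, so the weights are fetched once per worker start and the image itself stays small.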

tvalentyn commented 4 days ago

Including the model in the container will be less prone to runtime errors, but slower in the short term.

GoogleCloudPlatform / python-docs-samples

Add Gemma Flex Template example #11881
