aws / fmeval

Foundation Model Evaluations Library
http://aws.github.io/fmeval
Apache License 2.0

Unable to run in Docker container (unable to register worker with raylet) #183

Closed athewsey closed 4 months ago

athewsey commented 5 months ago

Hi team,

We're trying to build a containerized Streamlit app using fmeval, but evaluation is dying with:

INFO worker.py:1642 -- Started a local Ray instance.
core_worker.cc:203: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory

The Dockerfile is nothing fancy, just built on the python:3.10 base image:

FROM --platform=linux/amd64 python:3.10

WORKDIR /usr/src/app
COPY src/requirements.txt ./requirements.txt
RUN pip3 install --no-cache-dir -r requirements.txt
COPY src/* ./

EXPOSE 8501

HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health || exit 1

ENV AWS_DEFAULT_REGION=us-east-1

ENTRYPOINT ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]

Our installs are also minimal:

fmeval==0.3.0
# Explicit pandas pin for: https://github.com/ray-project/ray/issues/42572
pandas<2.2.0
streamlit==1.30.0

The app works fine locally (outside of Docker). Can anybody suggest what extra configs or dependencies are needed to run properly in Docker? I'm exploring whether switching to a rayproject/ray-based image helps, but it introduces some other initial errors, and it would be much better to know what the actual requirements are than to be tied to one base image.

keerthanvasist commented 4 months ago

Hey Alex!

What version of grpcio do you have installed? I had trouble with earlier versions (<1.60.0). We haven't seen this particular issue before, but I found this issue on Ray where some customers hit the same error; it's been non-reproducible so far.
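One way to answer the version question from inside the container is a short script like the one below. It is only a sketch: the helper name installed_versions is made up for illustration, and the package list is an assumption based on the requirements mentioned in this thread. It relies solely on the standard library's importlib.metadata.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to None. (Hypothetical helper,
    not part of fmeval or Ray.)
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Package list is an assumption taken from the requirements in this thread.
    for name, ver in installed_versions(["grpcio", "ray", "fmeval", "pandas"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running it both locally and inside the image (for example via docker run with the entrypoint overridden) should show whether the two environments resolved to different grpcio or ray versions.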

keerthanvasist commented 4 months ago

Is there a quick reproducible example you can share with me? I will try to use the data you've given me so far, and try to reproduce it.

keerthanvasist commented 4 months ago

Okay, I tried to build a container with the same project requirements as yours, using the following app.py:

import ray
import streamlit as st

if not ray.is_initialized():
    ray.init()

x = st.slider("Select a value")
st.write(x, "squared is", x * x)

I was able to build that container and run it, and I didn't run into the error you are seeing. Admittedly, I am running the container on my Mac, which is a linux/arm64/v8 machine, and on an x86 cloud machine. I notice you are using linux/amd64. This could be one of the reasons. Would you be able to retry on a different platform?

athewsey commented 4 months ago

Hey @keerthanvasist thanks for following up!

I spent some time poking around that same Ray issue and tried installing grpcio>=1.60.0,<2 but it didn't help.

Previously I was running --platform linux/amd64 as emulation (on my M2 Mac) to match our target environment (Amazon ECS linux/amd64) - but I saw from https://github.com/ray-project/ray/issues/25300 there might be issues with this and the good news is:

For now my takeaway is that it's the platform emulation that's not possible, so we'll just have to test around that. Thanks for your help on it!
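For anyone hitting the same error, a rough way to check for an architecture mismatch is to print the Docker-style platform string on the host and again inside the container, then compare the architecture part. This is a sketch under assumptions: docker_platform is a hypothetical helper, and the architecture mapping covers only the two architectures discussed in this thread.

```python
import platform

# Map common platform.machine() values to Docker architecture names.
# Only the architectures mentioned in this thread are covered (assumption).
_DOCKER_ARCH = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def docker_platform(system=None, machine=None):
    """Return an 'os/arch' string in Docker's naming style.

    Defaults to the current interpreter's OS and CPU architecture.
    Unknown architectures are passed through unchanged.
    """
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    return f"{system}/{_DOCKER_ARCH.get(machine, machine)}"


if __name__ == "__main__":
    print(docker_platform())
```

If the host reports an arm64 architecture while the container reports amd64, the container is running under emulation, which is the setup that failed here.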