[Bug] Evaluation can't stop running on self-host agenta

shhhhhh2001 commented 3 months ago

Describe the bug: Evaluations of any kind never finish running on self-hosted Agenta.

To Reproduce Steps to reproduce the behavior:

1. In the CLI, run "docker compose -f docker-compose.gh.yml up -d --pull always" to start Agenta in production mode.
2. Create any variant and any evaluator.
3. Start a new evaluation with any evaluator (e.g. the built-in Exact Match).
4. See the error: the evaluation never finishes.

Expected behavior: The evaluation should complete very quickly when there are only 3 data entries (e.g. about 3 s on the cloud).

Desktop (please complete the following information):

OS: Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-46-generic x86_64)
Browser: Firefox on Windows 10 (Ubuntu is the OS of the server where we deploy Agenta; I access it through ssh -L and use the UI on Windows 10)
Version: 126.0.1

Important Context: It worked well before you revised the code evaluator yesterday. All the variants and datasets in the screenshots are identical between the cloud and our host.

Additional Information:

Sometimes a 502 error shows up in the UI, but nothing appears in the agenta-backend container's logs. After I delete the evaluation job in the UI, it returns 204 and 500.

dosubot[bot] commented 3 months ago

The issue with evaluations not stopping on self-hosted Agenta could be related to the configuration of the Celery worker. The celery_worker service is responsible for handling background tasks, including evaluations. If the Celery worker is not properly configured or not running, evaluations may not be processed correctly.

To resolve this issue, ensure that the celery_worker service is correctly configured and running. Here is the relevant part of the docker-compose.gh.yml file for the celery_worker service:

celery_worker:
    container_name: celery-worker-1
    image: ghcr.io/agenta-ai/agenta-backend
    command: >
        celery -A agenta_backend.main.celery_app worker --concurrency=1 --loglevel=INFO
    environment:
        - MONGODB_URI=mongodb://username:password@mongo:27017
        - REDIS_URL=redis://redis:6379/0
        - CELERY_BROKER_URL=amqp://guest@rabbitmq//
        - CELERY_RESULT_BACKEND=redis://redis:6379/0
        - FEATURE_FLAG=oss
    volumes:
        - ./agenta-backend/agenta_backend:/app/agenta_backend
        - /var/run/docker.sock:/var/run/docker.sock
    depends_on:
        - mongo
        - rabbitmq
        - redis
    extra_hosts:
        - "host.docker.internal:host-gateway"
    networks:
        - agenta-network
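
As a quick sanity check on the environment section above, the following is a minimal Python sketch that verifies the connection settings the worker needs are present and non-empty. The helper names (parse_env_entries, missing_or_empty) and the choice of which variables count as required are assumptions made for illustration; they are not part of Agenta.

```python
# Variables treated as required here; this list is an assumption based on
# the compose excerpt above, not something Agenta itself enforces.
REQUIRED_VARS = (
    "MONGODB_URI",
    "REDIS_URL",
    "CELERY_BROKER_URL",
    "CELERY_RESULT_BACKEND",
)


def parse_env_entries(entries):
    """Turn compose-style 'KEY=value' strings into a dict."""
    env = {}
    for entry in entries:
        key, sep, value = entry.partition("=")
        if sep:  # skip entries that have no '=' at all
            env[key.strip()] = value.strip()
    return env


def missing_or_empty(env, required=REQUIRED_VARS):
    """Return the required variable names that are absent or blank."""
    return [name for name in required if not env.get(name)]


# The environment list copied from the celery_worker service above.
worker_env = parse_env_entries([
    "MONGODB_URI=mongodb://username:password@mongo:27017",
    "REDIS_URL=redis://redis:6379/0",
    "CELERY_BROKER_URL=amqp://guest@rabbitmq//",
    "CELERY_RESULT_BACKEND=redis://redis:6379/0",
    "FEATURE_FLAG=oss",
])

print(missing_or_empty(worker_env))  # prints []
```

An empty list only means the variables are set. It does not prove that the broker or result backend is actually reachable from inside the container.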

Ensure that the celery_worker service is up and running by checking the status of the Docker containers:

docker-compose -f docker-compose.gh.yml ps

If the celery_worker service is not running, you can start it with:

docker-compose -f docker-compose.gh.yml up -d celery_worker

Additionally, check the logs for the celery_worker service to identify any issues:

docker-compose -f docker-compose.gh.yml logs celery_worker
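
Worker logs can be long, so it may help to filter them before reading. Below is a minimal Python sketch that keeps only lines with a WARNING, ERROR, or CRITICAL level, plus traceback headers. It assumes Celery's default log prefix of the form "[timestamp: LEVEL/ProcessName]". The function name suspicious_lines is hypothetical, and the sample lines are made up to imitate that layout; they are not real Agenta output.

```python
import re

# Matches the level part of Celery's default prefix, e.g. "ERROR/MainProcess".
LEVEL_RE = re.compile(r"\b(WARNING|ERROR|CRITICAL)/")
TRACEBACK_HEADER = "Traceback (most recent call last)"


def suspicious_lines(log_text):
    """Return log lines that carry a WARNING+ level or start a traceback."""
    return [
        line
        for line in log_text.splitlines()
        if LEVEL_RE.search(line) or TRACEBACK_HEADER in line
    ]


# Made-up sample that only imitates the log layout.
sample = (
    "celery-worker-1  | [2024-01-01 00:00:00,000: INFO/MainProcess] placeholder info message\n"
    "celery-worker-1  | [2024-01-01 00:00:01,000: ERROR/MainProcess] placeholder error message\n"
    "celery-worker-1  | Traceback (most recent call last):\n"
)

for line in suspicious_lines(sample):
    print(line)
```

To use it on real output, save the result of the logs command above to a file and pass the file's contents to suspicious_lines.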

If the celery_worker service is running correctly and the issue persists, it might be related to the evaluation status management in the code. The runningStatuses constant in the agenta-web/src/services/evaluations/index.ts file defines the statuses that indicate an evaluation is still running. Ensure that the evaluation status is being updated correctly in the backend and that the frontend is correctly interpreting these statuses.

Moreover, you should also check the backend code, specifically in the agenta-backend/agenta_backend/routers/evaluation_router.py file, to ensure that the evaluation status is being fetched and updated correctly. The fetch_evaluation_status endpoint is responsible for returning the status of an evaluation. Make sure that this endpoint is functioning correctly and that the status is being updated in the database.
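
To tell an evaluation that is merely slow from one that is stuck, a status poll with a timeout can help. The Python sketch below is an illustration under stated assumptions: the status names in RUNNING_STATUSES are hypothetical stand-ins for whatever runningStatuses actually contains, and fetch_status is a caller-supplied function (for example, one that calls the status endpoint) rather than a real Agenta API.

```python
import time

# Hypothetical status names; check runningStatuses in the frontend code
# for the values your version actually uses.
RUNNING_STATUSES = {"INITIALIZED", "STARTED"}


def wait_until_done(fetch_status, timeout_s=60.0, interval_s=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns a non-running status.

    Raises TimeoutError if the status is still a running one once
    timeout_s seconds have elapsed.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status not in RUNNING_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"evaluation still {status!r} after {timeout_s} seconds"
            )
        sleep(interval_s)


# Usage with a fake status source that finishes on the third poll.
fake_statuses = iter(["STARTED", "STARTED", "FINISHED"])
result = wait_until_done(lambda: next(fake_statuses), sleep=lambda _: None)
print(result)  # prints FINISHED
```

If the poll times out while the worker logs stay silent, the task was probably never picked up by the worker, which points back at the broker and worker configuration rather than at the status code.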

Additionally, there might be issues with the CLI command for running evaluations. The run_evaluation_cli function in the agenta-cli/agenta/cli/evaluation_commands.py file uses hardcoded values for the app_name and host parameters. Ensure that these values are correctly set and that the CLI command is functioning as expected:

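import click

# `client` is assumed to be defined elsewhere in the module; it is not shown in this excerpt.

@click.group()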
def evaluation():
    """Commands for evaluations."""
    pass

@evaluation.command(name="run")
def run_evaluation_cli():
    """Run an evaluation."""
    try:
        client.run_evaluation(
            app_name="sss",
            host="http://localhost",
        )
    except Exception as ex:
        click.echo(click.style(f"Error while running evaluation: {ex}", fg="red"))
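
If the values are indeed hardcoded as in the excerpt above, one option is to resolve them from explicit arguments first and environment variables second. The Python sketch below shows that idea using only the standard library. The function name resolve_target and the variable names EVAL_APP_NAME and EVAL_HOST are hypothetical and chosen for illustration; they are not Agenta settings.

```python
import os


def resolve_target(app_name=None, host=None, env=None):
    """Pick the app name and host from arguments, then from the environment.

    EVAL_APP_NAME and EVAL_HOST are hypothetical variable names.
    """
    env = os.environ if env is None else env
    app_name = app_name or env.get("EVAL_APP_NAME")
    host = host or env.get("EVAL_HOST", "http://localhost")
    if not app_name:
        raise ValueError("app name is required (argument or EVAL_APP_NAME)")
    # Strip trailing slashes so later URL joins do not produce "//".
    return app_name, host.rstrip("/")


print(resolve_target("demo", env={}))  # prints ('demo', 'http://localhost')
```

The resolved pair could then be passed to client.run_evaluation in place of the literal "sss" and "http://localhost" values.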

By ensuring that the celery_worker service is properly configured and running, verifying the evaluation status management in the code, checking the backend endpoints, and ensuring the CLI command is correctly configured, you should be able to resolve the issue with evaluations not stopping on self-hosted Agenta [1][2][3][4].

shhhhhh2001 commented 3 months ago

mmabrouk commented 3 months ago

Hi @shhhhhh2001, we might have an issue in the gh docker compose file. We are looking into it. In the meantime, can you please run Agenta using the development mode flow? This should fix the issue:

git clone https://github.com/Agenta-AI/agenta.git
cd agenta
docker compose -f "docker-compose.yml" up -d --build

[Bug] Evaluation can't stop running on self-host agenta #1755