expectedparrot / edsl

Design, conduct and analyze results of AI-powered surveys and experiments. Simulate social science and market research with large numbers of AI agents and LLMs.
https://docs.expectedparrot.com
MIT License

Job status updates send inconsistent numbers #1317

Open onmyraedar opened 1 day ago

onmyraedar commented 1 day ago

A job generated the following status dict. The job is marked as completed because the completed and total interview counts are equal (both 0). However, the language-model progress still shows an outstanding interview.


{
    "overall_progress": {"completed": 0, "total": 0, "percent": 0},
    "language_model_progress": {
        "gpt-4o-mini": {"completed": 0, "total": 1, "percent": 0.0}
    },
    "statistics": {
        "Elapsed time": "0.0 sec.",
        "Total interviews requested": "0 ",
        "Completed interviews": "0 ",
        "Average time per interview": "0.00 sec.",
        "Estimated time remaining": "0.0 sec.",
        "Exceptions": "0 ",
        "Unfixed exceptions": "0 ",
        "Throughput": "0.00 interviews/sec.",
    },
    "status": "completed",
    "language_model_queues": {
        "gpt-4o-mini": {
            "language_model_name": "gpt-4o-mini",
            "requests_bucket": {
                "completed": 0,
                "requested": 0,
                "tokens_returned": 0,
                "target_rate": 10000.0,
                "current_rate": 0.0,
            },
            "tokens_bucket": {
                "completed": 0,
                "requested": 0,
                "tokens_returned": 0,
                "target_rate": 2000000.0,
                "current_rate": 0.0,
            },
        }
    },
    "job_id": "c91ad3bc-b526-4809-9c2f-3de68315cedd",
}

Do we have any idea why this is happening and how to fix it? The frontend progress bar stops checking for updates once status == "completed", so it gets stuck like this:

[screenshot of the stalled progress bar]
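
For reference, a minimal sketch of the kind of guard that would avoid this, assuming the status is currently derived from completed == total. This is hypothetical, not the actual edsl implementation, and the function name is made up:

```python
# Hypothetical sketch (not the actual edsl code): derive the job status in a
# way that does not report "completed" for a 0/0 job while per-model queues
# still have outstanding interviews.

def compute_status(overall: dict, per_model: dict) -> str:
    """`overall` is like {"completed": 0, "total": 0}; `per_model` maps
    model names to dicts of the same shape."""
    model_total = sum(p.get("total", 0) for p in per_model.values())
    model_completed = sum(p.get("completed", 0) for p in per_model.values())

    total = overall.get("total", 0)
    completed = overall.get("completed", 0)

    if total == 0 and model_total == 0:
        # Nothing has been registered yet: still initializing, not completed.
        return "initializing"
    if completed >= total and model_completed >= model_total:
        return "completed"
    return "running"


# With the numbers from the report above, this returns "running" (the
# per-model queue still shows 1 outstanding interview) rather than "completed".
compute_status(
    {"completed": 0, "total": 0},
    {"gpt-4o-mini": {"completed": 0, "total": 1}},
)
```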

johnjosephhorton commented 1 day ago

Thanks. Is this with hosted inference or a local run? And if local, was it run just now? I made some changes to job running that could be the culprit...

onmyraedar commented 1 day ago

This is with a remote job, running in my local dev environment on Docker.

I've only started seeing it today. We're on edsl 0.1.38.dev3, not the main branch, so I don't think the recent changes have been applied:

edsl = {allow-prereleases = true, version = "^0.1.35"}
onmyraedar commented 1 day ago

Worth noting that this numerical inconsistency only seems to happen at the very beginning of a job. Usually I access the progress bar from the remote inference dashboard, which takes a few seconds. Today I've been working on a clickable link for opening the progress bar from EDSL, which lets me see the job earlier, without that delay. That might be related to the issue.
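
On the frontend side, a possible workaround is to not trust an early "completed" while the overall total is still 0. A minimal sketch, assuming the poller gets the same status dict as above; `fetch_status`, the interval, and the timeout are hypothetical stand-ins for whatever the dashboard actually uses:

```python
import time


def poll_until_done(fetch_status, interval: float = 1.0, timeout: float = 600.0) -> dict:
    """Poll `fetch_status()` (any callable returning the status dict shown
    above) until the job is genuinely finished or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        total = status["overall_progress"]["total"]
        # Only trust "completed" once at least one interview has been
        # registered; a 0/0 job is treated as still initializing.
        if status["status"] == "completed" and total > 0:
            return status
        time.sleep(interval)
    raise TimeoutError("job never reported a non-empty completed state")
```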