BotBlake / pyTAB

Python Transcoding Acceleration Benchmark Client made for Jellyfin Hardware Survey
GNU General Public License v3.0

Incorrect results being reported. #20

Closed mcarlton00 closed 1 month ago

mcarlton00 commented 2 months ago

Description

Multiple commands run under each test, but results for only one of them are reported to the server.

Steps to Reproduce

  1. Run the thing

Expected Behavior

Each test should have exactly one result.

Actual Behavior

The tests reported by the server contain multiple commands. These commands are not equivalent, and appear to encode to different target formats (e.g. one encodes to h264 and the other to h265). The results section then contains only the results of the most recent run. This seems to completely defeat the purpose of the test: not only are two completely different capabilities being exercised under one test ID, but results are reported for only one of them.
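For illustration, here is a minimal sketch of the overwrite pattern (hypothetical names and structure, not pyTAB's actual code): when per-test results are stored keyed only by the test ID, the second command's results replace the first's.

```python
# Hypothetical reconstruction of the overwrite pattern; pyTAB's real
# internals may differ. Two commands share one test ID, and results
# are keyed by that ID alone.
results = {}

test_id = "d5c7f3fe-09ea-3572-f81a-7f33b3d75ab0"
commands = [
    {"codec": "h264", "max_streams": 5},  # first command: reached 5 workers
    {"codec": "h265", "max_streams": 1},  # second command: failed early
]

for cmd in commands:
    # Each iteration clobbers whatever the previous command stored.
    results[test_id] = {"max_streams": cmd["max_streams"]}

# Only the last command's result survives:
print(results[test_id])  # {'max_streams': 1}
```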

Additional Context

Live run output:

```
Running test: d5c7f3fe-09ea-3572-f81a-7f33b3d75ab0
> > > Current Device: amd
> > > > Workers: 1, Last Speed: -0.5
> > > > Workers: 6, Last Speed: 5.296
> > > > Scaling back to: 5, Last Speed: 0.8728617948717949
> > > > Scaleback success! Limit: False, Total Workers: 5, Speed: 1.0502419913419914
> > > > Failed: ['performance']
Running test: d5c7f3fe-09ea-3572-f81a-7f33b3d75ab0
> > > Current Device: amd
> > > > Workers: 1, Last Speed: -0.5
> > > > Failed: ['generic_ffmpeg_failure']
```

I added an extra debug statement to print the current test ID. You can clearly see that this test ran twice, indicating two "commands". The first one (presumably encoding to h264) reported a max stream value of 5. The second one (presumably encoding to h265) failed entirely.
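The debug statement was along these lines (a hedged sketch; the field names and loop structure are assumptions, not pyTAB's actual code):

```python
# Hypothetical sketch of the added debug line; field names and the
# surrounding loop are assumed, not taken from pyTAB.
test = {
    "id": "d5c7f3fe-09ea-3572-f81a-7f33b3d75ab0",
    "ffmpeg_commands": ["<h264 command>", "<h265 command>"],  # assumed field
}

for command in test["ffmpeg_commands"]:
    # Printed once per command, so a test with two commands shows the
    # same test ID twice in the live output.
    print(f"Running test: {test['id']}")
```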

output.json:

```json
{
    "id": "d5c7f3fe-09ea-3572-f81a-7f33b3d75ab0",
    "type": "amd",
    "selected_gpu": 0,
    "selected_cpu": null,
    "runs": [
        {
            "workers": 1,
            "frame": 900,
            "speed": 5.296,
            "time_s": 5.509,
            "rss_kb": 295244.0,
            "avgFPS": 158.8
        }
    ],
    "results": {
        "max_streams": 1,
        "failure_reasons": [
            "performance"
        ],
        "single_worker_speed": 5.296,
        "single_worker_rss_kb": 295244.0
    }
},
```

However, here in the output.json that gets uploaded to the server, my max streams are reported as 1. We've lost the max of 5 from the first run.

Possible Solution

IMO this is a design flaw of the system rather than an implementation-specific problem with pyTAB itself. Each test should do exactly one thing; I'm not sure why it is designed to do two unrelated things. Having multiple "commands" under the same test ID ends with results being overwritten, which invalidates the results and makes everything currently on the server suspect at best.
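If the multi-command design were kept, one client-side mitigation would be to key results by both the test ID and the command index, so a later command cannot clobber an earlier one. A sketch under those assumptions (hypothetical structure, not pyTAB's actual code):

```python
# Sketch: accumulate one result entry per (test_id, command_index)
# instead of one per test, so runs of different commands under the
# same test cannot overwrite each other. Hypothetical structure.
results: dict[tuple[str, int], dict] = {}

def record(test_id: str, command_index: int, result: dict) -> None:
    key = (test_id, command_index)
    if key in results:
        raise ValueError(f"duplicate result for {key}")
    results[key] = result

record("d5c7f3fe-09ea-3572-f81a-7f33b3d75ab0", 0,
       {"max_streams": 5, "failure_reasons": ["performance"]})
record("d5c7f3fe-09ea-3572-f81a-7f33b3d75ab0", 1,
       {"max_streams": 0, "failure_reasons": ["generic_ffmpeg_failure"]})

# Both commands' results are preserved:
for (tid, idx), res in results.items():
    print(tid, idx, res)
```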

BotBlake commented 1 month ago

Server-side changes were made to ensure each test gets its own Test-ID. This should solve the issue, but it requires further client testing to see whether any client code needs to be corrected.
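As a quick sanity check during that client testing, one could assert that the test list fetched from the server no longer reuses IDs. A sketch, with the payload shape assumed from the output.json excerpt above:

```python
from collections import Counter

# Hypothetical payload: a list of test definitions as fetched from the
# server; only the "id" field is taken from the excerpt above.
tests = [
    {"id": "aaaa-1111", "type": "amd"},
    {"id": "bbbb-2222", "type": "amd"},
]

# After the server-side change, every test ID should appear exactly once.
counts = Counter(t["id"] for t in tests)
duplicates = {tid: n for tid, n in counts.items() if n > 1}
assert not duplicates, f"duplicate test IDs found: {duplicates}"
print("all test IDs unique")
```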