chromeos / cros-codecs

BSD 3-Clause "New" or "Revised" License
30 stars 11 forks source link

workflows: Add LAVA tests workflow #48

Closed cazou closed 1 year ago

cazou commented 1 year ago

cros-codecs CI workflow:

The ci/config.yaml contains the test configurations. For each architecture (amd/intel), it specifies:

The github action workflow uses ci/config.yaml to get the lava device_type to run the test on. The test on lava is run from ci/test-cases/run_tests.py and uses ci/config.yaml to generate the fluster command to run.

About the lava test job

google-cla[bot] commented 1 year ago

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

dwlsalmeida commented 1 year ago

The segfault is likely a driver fault, however it's possibly linked to bad input being issued from cros-codecs. Hard to say without a backtrace and other ways to reproduce.

If this is run against a debug version of the driver with tracing enabled we can probably identify the cause. For https://github.com/intel/media-driver they have the MOS_TRACE macro and friends that can be enabled with a compile flag.

cazou commented 1 year ago

@dwlsalmeida I used the acer-cp514-2h-1160g7-volteer device type on lava staging. Maybe there are better options with more stable drivers ?

dwlsalmeida commented 1 year ago

@cazou without a stacktrace it's hard to tell :/

I know very little about CI, but can't you ask it to run ccdec with valgrind or something? If I can get any insight into what part of the code is causing this I can fix it.

Also, it's very weird that absolutely nothing in VP9 is passing, definitely something is up.

dwlsalmeida commented 1 year ago

Summarizing what I know about the segfaults, taking vp90-2-02-size-lf-1920x1080.webm for example:

The source stride and the destination stride do not match for Tiger Lake:

#1  0x00007efe0b81fd38 in DdiMedia_CopyPlane (dst=0x563eff3510d0 "", dstPitch=960, src=0x563eff71ece0 "", srcPitch=1920, height=540) at /home/detlev/Sources/goo0028/media-driver/media_driver/linux/common/ddi/media_libva.cpp:5173

This causes a segfault here in the driver, when copying the V plane.

static void DdiMedia_CopyPlane(
    uint8_t *dst,
    uint32_t dstPitch,
    uint8_t *src,
    uint32_t srcPitch,
    uint32_t height)
{
    uint32_t rowSize = std::min(dstPitch, srcPitch);
    for (int y = 0; y < height; y += 1)
    {
        memcpy(dst, src, rowSize);
        dst += dstPitch;
        src += srcPitch;
    }
}

Running this on my Ice Lake machine, I get srcPitch==dstPitch==960.

I have no idea why this returns 1920 for the V plane on Tiger Lake for presumably i420 content:

        DdiMedia_GetChromaPitchHeight(DdiMedia_MediaFormatToOsFormat(surface->format), surface->iPitch, surface->iHeight, &chromaPitch, &chromaHeight);

My hunch is that, somehow, this line gets executed:

            *chromaPitch = pitch;

Instead of

            *chromaPitch = MOS_ALIGN_CEIL(pitch, 2) / 2;

In

static VAStatus DdiMedia_GetChromaPitchHeight(
    uint32_t fourcc,
    uint32_t pitch,
    uint32_t height,
    uint32_t *chromaPitch,
    uint32_t *chromaHeight)
{
    DDI_CHK_NULL(chromaPitch, "nullptr chromaPitch", VA_STATUS_ERROR_INVALID_PARAMETER);
    DDI_CHK_NULL(chromaHeight, "nullptr chromaHeight", VA_STATUS_ERROR_INVALID_PARAMETER);

    switch(fourcc)
    {
        case VA_FOURCC_NV12:
        case VA_FOURCC_P010:
        case VA_FOURCC_P012:
        case VA_FOURCC_P016:
            *chromaHeight = MOS_ALIGN_CEIL(height, 2) / 2;
            *chromaPitch = pitch;
            break;
        case VA_FOURCC_I420:
        case VA_FOURCC_YV12:
            *chromaHeight = MOS_ALIGN_CEIL(height, 2) / 2;
            *chromaPitch = MOS_ALIGN_CEIL(pitch, 2) / 2;
            break;
        case VA_FOURCC_411P:
        case VA_FOURCC_422H:
        case VA_FOURCC_444P:
        case VA_FOURCC_RGBP:
            *chromaHeight = height;
            *chromaPitch = pitch;
            break;
        case VA_FOURCC_422V:
        case VA_FOURCC_IMC3:
            *chromaHeight = MOS_ALIGN_CEIL(height, 2) / 2;
            *chromaPitch = pitch;
            break;
        default:
            *chromaPitch = 0;
            *chromaHeight = 0;
    }

    return VA_STATUS_SUCCESS;
}

Notably, we stopped indicating the fourcc when creating the surfaces some time ago:

        /// Create new surfaces and add them to the pool, using `descriptors` as backing memory.
        pub(crate) fn add_surfaces(&mut self, descriptors: Vec<M>) -> Result<(), VaError> {
            let surfaces = self.display.create_surfaces(
                self.rt_format,
                // Let the hardware decide the best internal format - we will get the desired fourcc
                // when creating the image.
                None, <-----------
                self.coded_resolution.width,
                self.coded_resolution.height,
                self.usage_hint,
                descriptors,
            )?;

It may very well be that the bug is in between the surface's internal format and the memcpy.

@cazou can you confirm the VA-API version? (vainfo)

Gnurou commented 1 year ago

Thanks for this, this is looking good!

I think one feature we will need is the ability to keep track of which tests are expected to fail, since we cannot reach 100% due to bugs in drivers and sometimes cros-codecs itself. So the CI job should be successful as long as no test regressed.

cazou commented 1 year ago

Running this on my Ice Lake machine, I get srcPitch==dstPitch==960.

I have no idea why this returns 1920 for the V plane on Tiger Lake for presumably i420 content:

@dwlsalmeida To be fully correct, this happens on debian on both hp-x360-12b-ca0500na-n4000-octopus and acer-cp514-2h-1160g7-volteer device types on lava. I was not able to reproduce this on my laptop running Gentoo (media-driver 23.2.4). But I was able to reproduce this on my laptop within a docker running debian.

@cazou can you confirm the VA-API version? (vainfo)

Gentoo:

libva info: VA-API version 1.18.0
libva info: Trying to open /usr/lib64/va/drivers/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_18
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.18 (libva 2.18.2)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 23.2.4 ()

Debian:

libva info: VA-API version 1.17.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_17
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.17 (libva 2.12.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 23.1.1 ()

I edited the debian info, it was with one of my debug version, now it is the one from the debian package

CPU is 11th gen Intel i7-1185G7

cazou commented 1 year ago

It looks like an issue with the debian package. I tried on bullseye (oldstable), bookworm (stable) and trixe (unstable) and all have the same issue. But when I rebuild the driver from the git repo at the same version, it works correctly (no segfault). I will have a closer look at this debian package

cazou commented 1 year ago

I confirm that this is an issue with the way the package is built: Debian separates non-free libs from the other ones, so the intel-media-va-driver package is built with the -DENABLE_NONFREE_KERNELS=OFF flag. Apparently, with that flag, there is a bug that makes ccdec segfault.

A simple fix is to install intel-media-va-driver-non-free instead from the non-free apt source. We need to evaluate if that package can be installed on a rootfs that will be made publicly available.

With that package, the segfault issue is not a blocker anymore to continue implementing the CI.

dwlsalmeida commented 1 year ago

FWIW, on Arch, we all get the "non-free" stuff by default -> https://gitlab.archlinux.org/archlinux/packaging/packages/intel-media-driver/-/blob/main/PKGBUILD

Gnurou commented 1 year ago

@cazou, before we can merge this we will need you to fix the CLA situation - could you check https://github.com/chromeos/cros-codecs/pull/48/checks?check_run_id=16226713377 ?

cazou commented 1 year ago

@Gnurou @dwlsalmeida The failures are due to the missing action secrets on this project. We need to add, as secrets:

Gnurou commented 1 year ago

The failures are due to the missing action secrets on this project.

Ok, so can we prevent the LAVA tests from being run in the Github actions for now? We will enable them with the CL that adds the necessary tokens.

Besides the submit jobs seem to be failing for unrelated reasons:

Error: Unable to resolve action `dorny/test-reporter@v1.7`, unable to find version `v1.7`
cazou commented 1 year ago

The failures are due to the missing action secrets on this project.

Ok, so can we prevent the LAVA tests from being run in the Github actions for now? We will enable them with the CL that adds the necessary tokens.

They can, yes. But that's basically all there is in this PR. Can't we just merge this when the secrets are set ?

Besides the submit jobs seem to be failing for unrelated reasons:

Error: Unable to resolve action `dorny/test-reporter@v1.7`, unable to find version `v1.7`

Ha yes, I'll fix it. Upstream uses tags instead of a branch, so the whole version number is needed.

Gnurou commented 1 year ago

Hi @cazou,

They can, yes. But that's basically all there is in this PR. Can't we just merge this when the secrets are set ?

I'm fine with that but have to admit I am a bit oblivious as to what secrets we need to obtain, how, and where to set them (as they are "secrets" I suppose we shall not push them into this repository). Could you enlighten me a bit about that?

cazou commented 1 year ago

Hi @cazou,

They can, yes. But that's basically all there is in this PR. Can't we just merge this when the secrets are set ?

I'm fine with that but have to admit I am a bit oblivious as to what secrets we need to obtain, how, and where to set them (as they are "secrets" I suppose we shall not push them into this repository). Could you enlighten me a bit about that?

Hi @Gnurou !

We need to set those secrets: GH_BEARER: A Github token with read access to the Actions artifacts (This will be publicly visible on LAVA) LAVA_TOKEN: A LAVA (lava.collabora.dev) token to be able to submit jobs

I am currently in the process of getting a token from our LAVA people and we'll be able to give that to you for this repo. It will be stored on this repository as a secret. Only you can set or edit it and nobody can read it. it will be shown as *** in the logs (see https://github.com/cazou/cros-codecs/actions/runs/6510164313/job/17683262845#step:3:1)

The GH_BEARER secret must be a token that you generate for this repository, with the read permission. It will be used by LAVA to recover the built ccdec artifact to be tested.

They can be set in Settings > Secrets and Variables > Actions. There is a Add new repository secret button.

And a variable LAVA_USERNAME: The user name associated with the LAVA token. We'll provide that as well. Doesn't need to be a secret and can be added at the same place as Action Secrets, in the variables tab.

obbardc commented 1 year ago

@cazou Can you also make the LAVA_RPC2_URL (e.g. https://lava.collabora.dev/RPC2) a variable (non-secret), so that forks could use their own Lava instance ?

cazou commented 1 year ago

Updated the commit as https://github.com/kernelci/kernelci-core/pull/2051 was merged. It now uses the kernelci amd64 bullseye-gst-fluster rootfs

Gnurou commented 1 year ago

Attempting a merge now that the secrets are in - fingers crossed...