Syllo / nvtop

GPU & Accelerator process monitoring for AMD, Apple, Huawei, Intel, NVIDIA and Qualcomm
Other
8.06k stars 292 forks source link

Implement a lookup table to identify newer AMD GPUs #190

Closed paklui closed 1 year ago

paklui commented 1 year ago

We have implemented some changes to help identifying some of the newer AMD Instinct GPUs that we work on. The issue is that some of the newer GPUs were getting identified, because libdrm that we relied on to get the GPU names which usually comes from OS but most of the popular OS distros used today didn’t have the latest list. We implemented a look up table in case the GPU weren’t being identified for the AMD GPUs.

zhuyifei1999 commented 1 year ago

(Nitpick) I'm wondering, can these IDs be pre-parsed at, say, build time rather than at run time?

hliuca commented 1 year ago

Hi @zhuyifei1999, I checked the libdrm. It has been implemented the same way, as it reads amdgpu.ids at run time. The advantage is that it searches the database (IDs), and it supports all listed GPUs in the database. Also, the marketing name check is put in static info, which should be checked once. By putting the marketing name checking at run time, it allows us to compile once and supports all listed GPUs on a LinuxDistro. What do you think? Thanks.

zhuyifei1999 commented 1 year ago

It has been implemented the same way, as it reads amdgpu.ids at run time

I think the difference between libdrm and here is that libdrm reads a separate file whereas here they are hardcoded.

What I mean by pre-parsed is like, the code here

static const char * amdgpu_ids[] = {
    "1309,  00, AMD Radeon R7 Graphics",
    "130A,  00, AMD Radeon R6 Graphics",
    "130B,  00, AMD Radeon R4 Graphics",
    "130C,  00, AMD Radeon R7 Graphics",
    /* more entries */
};

static const char * parse_one_line(uint32_t asic_id, uint32_t pci_rev_id, const char *line)
{
    /* the massive parse function */
}

const char * amdgpu_parse_marketing_name(struct amdgpu_gpu_info *info)
{
    size_t len = 0;
    int line_num = 1;
    int i;
    int ntypes = sizeof(amdgpu_ids) / sizeof(amdgpu_ids[0]);
    const char *name = NULL;

    if (!info)
        return name;

    for (i = 0; i < ntypes; i++) {
        name = parse_one_line(info->asic_id, info->pci_rev_id, amdgpu_ids[i]);

        if (name)
            break;
    }

    return name;
}

is practically equivalent to:

struct amdgpu_id_struct {
    uint32_t did;
    uint32_t rid;
    char *name;
};

static const struct amdgpu_id_struct amdgpu_ids[] = {
    {0x1309, 0x00, "AMD Radeon R7 Graphics"},
    {0x130A, 0x00, "AMD Radeon R6 Graphics"},
    {0x130B, 0x00, "AMD Radeon R4 Graphics"},
    {0x130C, 0x00, "AMD Radeon R7 Graphics"},
    /* more entries */
};

const char *amdgpu_parse_marketing_name(struct amdgpu_gpu_info *info)
{
    int ntypes = sizeof(amdgpu_ids) / sizeof(amdgpu_ids[0]);
    int i;

    if (!info)
        return NULL;

    for (i = 0; i < ntypes; i++) {
        if (amdgpu_ids[i].did != info->asic_id)
            continue;

        if (amdgpu_ids[i].rid != info->pci_rev_id)
            continue;

        return amdgpu_ids[i].name;
    }

    return NULL;
}

You save up on all the string operations, memory allocations, and pointer arithmetic. This not only runs much faster, it also much simpler and makes it easier to reason that it won't cause memory corruption / leak.

The same thing can't be said for libdrm because it is parsing an external file at runtime, instead of parsing its internal hardcoded data.

hliuca commented 1 year ago

Hi @zhuyifei1999, thanks for the advice. We choose to put the hardware info in a header file, so when nvtop is distributed, we don't need to copy the database file, amdgpu.ids, which is also hardcoded and has to be updated after new GPU release. As a Linux Distro may choose old version libdrm, it may not get correct marketing name for latest GPUs. Some old Linux Distro may not update libdrm, and in this way, marketing name is always wrong for new GPUs. This pull request is hoping to by-pass Linux kernel and Distro to provide a much quicker way to support new GPUs.

If you look at the driver libdrm source code, it parses each line too, which has memory allocation, and string manipulations. If you have concern about memory corruption / leak, one easiest modification is to provide a char array as buffer to libdrm and this pull request. I would not worry about performance, as the marketing name query happens once for each GPU, and our computer can parse millions of lines per second. :-)

Thanks again and we will send another pull request after changes.

Syllo commented 1 year ago

Hello, Thanks @paklui and @amdsc21 for posting that after our offline discussion. Thank you @zhuyifei1999 for the feedback on the patch.