ROCm / ROCK-Kernel-Driver

AMDGPU Driver with KFD used by the ROCm project. Also contains the current Linux Kernel that matches this base driver

pp_force_state cannot work on my machine. #96

Closed andyzhanged closed 2 weeks ago

andyzhanged commented 4 years ago

Hi, my GPU is a Vega20, and my machine info is as follows:

zhanged@dcu:~$ lsb_release
No LSB modules are available.
zhanged@dcu:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.4 LTS
Release:        18.04
Codename:       bionic
zhanged@dcu:~$ cat /proc/version
Linux version 5.0.0-23-generic (buildd@lgw01-amd64-030) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #24~18.04.1-Ubuntu SMP Mon Jul 29 16:12:28 UTC 2019
zhanged@dcu:~$ dkms status
amdgpu, 3.3-19, 5.0.0-23-generic, x86_64: installed

When I try to set the power state through the sysfs file pp_force_state, it does not work. I debugged the code and found that amdgpu_dpm_get_pp_num_states(adev, &data) returns -22 (-EINVAL; maybe my card does not support it), but the code carries on instead of returning at that point. I think we should return there and report the failure to the user.

static ssize_t amdgpu_set_pp_force_state(struct device *dev,
        struct device_attribute *attr,
        const char *buf,
        size_t count)
{
    struct drm_device *ddev = dev_get_drvdata(dev);
    struct amdgpu_device *adev = ddev->dev_private;
    enum amd_pm_state_type state = 0;
    unsigned long idx;
    int ret;

    if (amdgpu_sriov_vf(adev) && !amdgpu_sriov_is_pp_one_vf(adev))
        return -EINVAL;

    if (strlen(buf) == 1)
        adev->pp_force_state_enabled = false;
    else if (is_support_sw_smu(adev))
        adev->pp_force_state_enabled = false;
    else if (adev->powerplay.pp_funcs->dispatch_tasks &&
            adev->powerplay.pp_funcs->get_pp_num_states) {
        struct pp_states_info data;

        ret = kstrtoul(buf, 0, &idx);
        if (ret || idx >= ARRAY_SIZE(data.states))
            return -EINVAL;

        idx = array_index_nospec(idx, ARRAY_SIZE(data.states));

        amdgpu_dpm_get_pp_num_states(adev, &data);
        state = data.states[idx];

        ret = pm_runtime_get_sync(ddev->dev);
        if (ret < 0)
            return ret;

        /* only set user selected power states */
        if (state != POWER_STATE_TYPE_INTERNAL_BOOT &&
            state != POWER_STATE_TYPE_DEFAULT) {
            amdgpu_dpm_dispatch_task(adev,
                    AMD_PP_TASK_ENABLE_USER_STATE, &state);
            adev->pp_force_state_enabled = true;
        }
        pm_runtime_mark_last_busy(ddev->dev);
        pm_runtime_put_autosuspend(ddev->dev);
    }

    return count;
}
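
To illustrate the suggestion above, here is a minimal user-space sketch of a store handler that checks the result of the state query before using the data. Everything prefixed with mock_ or MOCK_ is a hypothetical stand-in invented for this example; it is not the real amdgpu code and not necessarily how the driver was or should be patched.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the driver types (assumption, not kernel code). */
#define MOCK_MAX_STATES 16

struct mock_states_info {
    unsigned int nums;
    int states[MOCK_MAX_STATES];
};

/* Mimics the failing query: zero the output, then fail if no table exists. */
static int mock_get_pp_num_states(int have_table, struct mock_states_info *data)
{
    memset(data, 0, sizeof(*data));
    if (!have_table)
        return -EINVAL;
    data->nums = 2;
    data->states[0] = 1;
    data->states[1] = 3;
    return 0;
}

/* Store handler that propagates the query error instead of ignoring it. */
static long mock_set_force_state(int have_table, unsigned long idx, size_t count)
{
    struct mock_states_info data;
    int ret;

    if (idx >= MOCK_MAX_STATES)
        return -EINVAL;

    ret = mock_get_pp_num_states(have_table, &data);
    if (ret)
        return ret;             /* report the failure to the writer */
    if (idx >= data.nums)
        return -EINVAL;         /* index is beyond the reported states */

    /* ... dispatching data.states[idx] would happen here ... */
    return (long)count;
}
```

With this shape, a write on a device without a state table fails with -EINVAL instead of appearing to succeed, which is the behaviour I would expect from the sysfs file.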

fxkamd commented 4 years ago

Looks like powerplay is not enabled on your GPU. Can you post a full dmesg log?

andyzhanged commented 4 years ago

Powerplay is enabled. Other powerplay-related sysfs files (power_dpm_force_performance_level, pp_power_profile_mode, pp_dpm_dcefclk) work well. pp_dpm_get_pp_num_states returns early at the line "if (!hwmgr || !hwmgr->pm_en ||!hwmgr->ps)".

static int pp_dpm_get_pp_num_states(void *handle,
        struct pp_states_info *data)
{
    struct pp_hwmgr *hwmgr = handle;
    int i;

    memset(data, 0, sizeof(*data));

    if (!hwmgr || !hwmgr->pm_en ||!hwmgr->ps)
        return -EINVAL;

    mutex_lock(&hwmgr->smu_lock);

    data->nums = hwmgr->num_ps;

The reason is that psm_init_power_state_table returns early because hwmgr->hwmgr_func->get_num_of_pp_table_entries == NULL, so some members of hwmgr are never initialized.

int psm_init_power_state_table(struct pp_hwmgr *hwmgr)
{
    int result;
    unsigned int i;
    unsigned int table_entries;
    struct pp_power_state *state;
    int size;

    if (hwmgr->hwmgr_func->get_num_of_pp_table_entries == NULL)
        return 0;

    if (hwmgr->hwmgr_func->get_power_state_size == NULL)
        return 0;

    hwmgr->num_ps = table_entries = hwmgr->hwmgr_func->get_num_of_pp_table_entries(hwmgr);

    hwmgr->ps_size = size = hwmgr->hwmgr_func->get_power_state_size(hwmgr) +
                      sizeof(struct pp_power_state);

So my guess is that Vega20 is not supported yet.
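
The chain described above (init silently skipped, later query failing) can be sketched in user space as follows. All mock_ names are hypothetical stand-ins made up for this illustration, and the structure is only an assumption modelled on the two snippets quoted in this comment.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the hwmgr structures (assumption, not kernel code). */
struct mock_hwmgr_func {
    int (*get_num_of_pp_table_entries)(void);
};

struct mock_hwmgr {
    const struct mock_hwmgr_func *func;
    int pm_en;
    int num_ps;
    const int *ps;  /* stays NULL when table init is skipped */
};

static const int mock_table[] = { 1, 3 };

static int mock_count_entries(void)
{
    return 2;
}

/* Mirrors the early "return 0": a missing callback is not reported as an error. */
static int mock_init_power_state_table(struct mock_hwmgr *hw)
{
    if (hw->func->get_num_of_pp_table_entries == NULL)
        return 0;
    hw->num_ps = hw->func->get_num_of_pp_table_entries();
    hw->ps = mock_table;
    return 0;
}

/* Mirrors the later query: fails with -EINVAL because ps was never set. */
static int mock_query_num_states(const struct mock_hwmgr *hw, int *nums)
{
    *nums = 0;
    if (!hw || !hw->pm_en || !hw->ps)
        return -EINVAL;
    *nums = hw->num_ps;
    return 0;
}

/* Runs init followed by the query, with or without the table callback. */
static int mock_probe(int has_callback)
{
    struct mock_hwmgr_func func = { has_callback ? mock_count_entries : NULL };
    struct mock_hwmgr hw = { &func, 1, 0, NULL };
    int nums;

    mock_init_power_state_table(&hw);
    return mock_query_num_states(&hw, &nums);
}
```

In this sketch mock_probe(0) ends in -EINVAL even though init reported success, which matches what I see with pp_force_state on my card.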

ppanchad-amd commented 2 months ago

@andyzhanged Apologies for the lack of response. Can you please check if your issue still exists with the latest ROCm 6.2? If not, please close the ticket. Thanks!

ppanchad-amd commented 2 weeks ago

@andyzhanged Closing ticket. Please feel free to re-open the issue if you still encounter it with the latest ROCm. Thanks!