Aterfax / relax-intel-rmrr

Possible fail for HPE MicroServer Gen8. HPE Smart Array P222 Controller. Proxmox 7.3-3 host. TrueNAS SCALE VM. #32

Open mjmeans opened 1 year ago

mjmeans commented 1 year ago

HPE MicroServer Gen8. HPE Smart Array P222 Controller. Proxmox 7.3-3 host. TrueNAS-SCALE-22.12.0 VM.

Slightly modified GRUB line: GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on,relax_rmrr iommu=pt intremap=no_x2apic_optout mitigations=auto,nosmt l1tf=full,force"
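
To confirm that the host actually booted with these options, the running kernel's command line can be inspected. The following is a minimal sketch, assuming a Linux host; the function name has_relax_rmrr is made up for illustration. It only checks the string it is given, so it shows that the option was passed, not that the kernel is patched to honor it.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a kernel command line enables the
# Intel IOMMU with the relax_rmrr option (as in the GRUB line above).
has_relax_rmrr() {
    # $1: kernel command line, e.g. the contents of /proc/cmdline
    for arg in $1; do
        case "$arg" in
            intel_iommu=*)
                # Options are comma-separated, e.g. intel_iommu=on,relax_rmrr
                opts=",${arg#intel_iommu=},"
                case "$opts" in
                    *,relax_rmrr,*) return 0 ;;
                esac
                ;;
        esac
    done
    return 1
}

# Usage on the host (after update-grub and a reboot):
#   has_relax_rmrr "$(cat /proc/cmdline)" && echo "relax_rmrr was passed to the kernel"
#   dmesg | grep -i -e DMAR -e IOMMU    # review the IOMMU initialization messages
```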

The P222 configuration and drives were checked in the host using HPE's ssacli prior to installation of your debs. There are no other PCI device passthroughs on the system.

The PCI Device options used were Advanced, All Functions, ROM-Bar and PCI-Express. With these settings the VM boot showed:

SeaBIOS (version rel-1.16.0.0.gd239552ce722-prebuilt.qemu.org)
Machine UUID e0384dcf-f846-4e99-85e0-############
Slot ?? HP Smart Array P222 Controller       Initializing...  ///////

The '/' are actually spinning |\-/, etc. So, the VM has not frozen, but the initialization just didn't complete.

I then changed the PCI Device settings and removed ROM-Bar. The FreeNAS VM boots properly, drives appear, I can create a pool, etc. But I haven't tried to install ssacli into FreeNAS yet to make sure I can manage the array from within FreeNAS.

Is it expected that ROM-Bar should be disabled for HBAs and are there any other ramifications if it's left disabled since the default in Proxmox is to have it enabled?
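
For reference, the ROM-Bar checkbox in the web UI corresponds to the rombar flag of the VM's hostpciN option, so the same change can be made from the host shell. This is a sketch only: the VM ID 100 and the PCI address 0000:07:00 are placeholders (the real address comes from lspci), and the helper name hostpci_value is invented for illustration. Leaving off the function suffix (.0) matches the "All Functions" setting, and pcie=1 assumes the VM uses the q35 machine type.

```shell
#!/bin/sh
# Hypothetical helper: build the value for a Proxmox hostpciN option.
# $1: PCI address without a function suffix (passes all functions)
# $2: 1 to expose the option ROM to the guest (ROM-Bar), 0 to hide it
hostpci_value() {
    printf '%s,pcie=1,rombar=%s\n' "$1" "$2"
}

# Example (VM ID 100 and address 0000:07:00 are placeholders):
#   qm set 100 -hostpci0 "$(hostpci_value 0000:07:00 0)"
# which is equivalent to:
#   qm set 100 -hostpci0 0000:07:00,pcie=1,rombar=0
```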

accessiblepixel commented 1 year ago

I've got a P420 card passed through via Proxmox to a TrueNAS SCALE VM. To get it working I had to disable ROM-Bar (which stops the card from presenting itself to the VM's BIOS as a boot device). I've also found that it only works correctly if the TrueNAS VM is started after a cold boot of the host hardware; otherwise TrueNAS can't seem to access or see the controller properly.

Edit: And when passing through the card at that point it causes ILO to max out the fans because it thinks the card has failed - triggering an ILO reset makes it forget it ever saw the card.

djarbz commented 1 year ago

I am experiencing the same issue: the P420i only shows up in TrueNAS SCALE after a cold boot.

As for the Fan issue, I have the Silence of the Fans custom ILO firmware installed which allows me to adjust the fans manually. I then run this hookscript on VM boot.

#!/usr/bin/perl

# Example hook script for PVE guests (hookscript config option)
# You can set this via pct/qm with
# pct set <vmid> -hookscript <volume-id>
# qm set <vmid> -hookscript <volume-id>
# where <volume-id> has to be an executable file in the snippets folder
# of any storage with directories e.g.:
# qm set 100 -hookscript local:snippets/hookscript.pl

use strict;
use warnings;

print "GUEST HOOK: " . join(' ', @ARGV). "\n";

# First argument is the vmid

my $vmid = shift;

# Second argument is the phase

my $phase = shift;

if ($phase eq 'pre-start') {

    # First phase 'pre-start' will be executed before the guest
    # is started. Exiting with a code != 0 will abort the start

    print "$vmid is starting, doing preparations.\n";

    # print "preparations failed, aborting."
    # exit(1);

} elsif ($phase eq 'post-start') {

    # Second phase 'post-start' will be executed after the guest
    # successfully started.

    print "Delaying OCSD Reset\n";
    my $i = 15;
    while($i >= 1){
        print "OCSD reset in $i seconds\n";
        sleep(1);
        --$i;
    }

    print "Triggering OCSD reset for FAN fix.\n";
    my $username = $ENV{LOGNAME} || $ENV{USER} || getpwuid($<);
    print "Executing User: $username\n";
    system('ssh', 'ILOUSER@ILOHOST', '-oKexAlgorithms=diffie-hellman-group14-sha1', 'ocsd reinit');
    print "OCSD reset complete.\n";

    print "$vmid started successfully.\n";

} elsif ($phase eq 'pre-stop') {

    # Third phase 'pre-stop' will be executed before stopping the guest
    # via the API. Will not be executed if the guest is stopped from
    # within e.g., with a 'poweroff'

    print "$vmid will be stopped.\n";

} elsif ($phase eq 'post-stop') {

    # Last phase 'post-stop' will be executed after the guest stopped.
    # This should even be executed in case the guest crashes or stopped
    # unexpectedly.

    print "$vmid stopped. Doing cleanup.\n";

} else {
    die "got unknown phase '$phase'\n";
}

exit(0);

RavenLiquid commented 1 year ago

Can anyone share some more info on how they got it working? I spent a whole day trying to get it to work, but TrueNAS will not show the drives.

And here I get stuck. I can see TrueNAS trying to access the card, but I get an error in Proxmox (on the command line, not in the web UI or the VM task log) and in TrueNAS I see errors flashing by while booting.

Proxmox: "DMAR DMA READ NO_PASID" for the PCI address of the card and then "PTE Read Access is not set"

In Truenas: Something like: Adapter configuration update refused (IDBR 0x1).
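
Messages like the ones quoted above can be pulled out of the host's kernel log for closer reading. The sketch below is only an illustration: the helper name filter_dmar_faults is invented, the pattern is built from the fragments quoted in this comment rather than from the full kernel message format, and the device address in the usage note is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: keep only kernel log lines that look like IOMMU
# (DMAR) faults. Reads log text on stdin, writes matching lines to stdout.
filter_dmar_faults() {
    grep -i -E 'DMAR|PTE (Read|Write) access is not set'
}

# Usage on the Proxmox host, optionally narrowed to one device
# (07:00.0 is a placeholder address):
#   dmesg | filter_dmar_faults
#   dmesg | filter_dmar_faults | grep -F '07:00.0'
```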

And of course ILO dislikes it and the fans get ready for takeoff.

Is there a step I missed?

accessiblepixel commented 1 year ago

I can't provide any direct insight into the problem, since I had a G8 DL380 and a P420 card... However, I did use this card with TrueNAS.

It required the card being set into HBA/IT mode before it would pass through the disks to TrueNAS.

Once I started up the TrueNAS VM (with rom bar disabled, as you've figured out) it would cause iLO to freak out and ramp fans to 100%.

My workaround for that was to wait until TrueNAS had started up and then trigger an iLO reset (either from the web interface or via SSH). I'll attach the basis of the script I used so you can maybe hack something together yourself.

Although, one thing I will say, I had 6 disks attached to my p420 and sometimes it would end up causing zfs corruption, I think because of limited I/O having 6 disks trying to fight over one slot's bandwidth.

I no longer use my HP machine for TrueNAS (built a dedicated box for it, and use the motherboard's sata and 4 drives on a LSI-9211 card)...

Either way, good luck! :)

#!/bin/bash
################################################
# onResetSetFanSpeed.sh by jcx - GPLv3 Licence
# This for setting the fan profile on system boot for iLO4 HP Systems once you've figured out what you need to set.
# The examples listed here are for my G8 DL380, but you'll need change them for your system.
#
# Get the iLO4 Custom Firmware from: https://old.reddit.com/r/homelab/comments/hix44v/silence_of_the_fans_pt_2_hp_ilo_4_273_now_with/
# I've only tested with 2.73, but it should work with the modified versions up to 2.77
#
# To add to start up, add to root's crontab with
# @reboot sleep 30 && /bin/bash /path/to/onResetFanSpeed.sh >> /tmp/onResetFanSpeed.log
# ILO user needs 'Configure iLO Settings' privilege to be able to reset iLO
###############################################
###############################################
# Variables
###############################################
runtime=$(date)
PASSWORD="hunter2"
USERNAME="fans"
ILOIP="192.168.60.99"
SSHOPTS="-oKexAlgorithms=+diffie-hellman-group14-sha1"

###############################################
# Some defined standard fan speed PWM values
###############################################
FANSPEED10="26"
FANSPEED20="51"
FANSPEED25="64"
FANSPEED30="77"
FANSPEED35="90"
FANSPEED40="102"
FANSPEED45="112"
FANSPEED50="128"
FANSPEED60="153"
FANSPEED70="179"
FANSPEED80="204"
FANSPEED90="230"
FANSPEEDMAX="255"
SPACER="===================================================="
echo $SPACER
echo "= $runtime Setting HP Fan Curves To Defaults (On Boot)"
echo $SPACER
echo ""
##############################################
# Change these to your new fan curve
##############################################
echo $SPACER
echo "= Resetting iLO..."
echo $SPACER
        sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'reset /map1'
echo $SPACER
echo "= Waiting 90 seconds for iLO to restart..."
echo $SPACER
    sleep 30
echo "... waited 30 seconds..."
    sleep 30
echo "... waited 60 seconds..."
    sleep 30
echo "... waited 90 seconds..."
echo ""
echo $SPACER
echo "= Current fan information..."
echo $SPACER
    sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'fan info g'
echo $SPACER
echo "= Setting PWM thresholds..."
echo $SPACER
        sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'fan pid 34 lo 3500'
        sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'fan pid 35 lo 3500'
        sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'fan pid 36 lo 3500'
        sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'fan pid 54 lo 3500'
        sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'fan pid 55 lo 3500'
        sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'fan pid 53 lo 3500'
        sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'fan pid 50 lo 3500'
        sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'fan pid 51 lo 3500'
        sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'fan pid 37 lo 3500'
        sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'fan pid 56 lo 3500'
        sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'fan pid 57 lo 3500'
        sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'fan pid 52 lo 3500'

##############################################
# Set the minimum fan speeds here:
##############################################
echo $SPACER
echo "= Setting minimum fan speeds..."
echo $SPACER
        sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'fan p 0 min' $FANSPEED30
        sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'fan p 1 min' $FANSPEED30
        sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'fan p 2 min' $FANSPEED25
        sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'fan p 3 min' $FANSPEED25
        sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'fan p 4 min' $FANSPEED25
        sshpass -p $PASSWORD ssh $SSHOPTS $USERNAME@$ILOIP 'fan p 5 min' $FANSPEED25
finishtime=$(date)

echo $SPACER
echo "$finishtime - Finished setting fan settings."
echo $SPACER

RavenLiquid commented 1 year ago

@accessiblepixel How exactly did you set the mode? I can't find a way to change it. Is it in the option rom where you create the raid array? I also don't know what mode it is in at the moment, it doesn't seem to list it anywhere or is it under a different name?

accessiblepixel commented 1 year ago

I'm not sure about the P222 card, but for the P420 I used the ProLiant Service Pack and the instructions at https://hardforum.com/threads/hp-dl380p-gen8-p420i-controller-hbamode.1852528/post-1041477482

In short, go into the terminal in the Service Pack and run hpssacli controller slot=0 modify hbamode=on forced

But I'm not sure if the P222 is supported with this.

RavenLiquid commented 1 year ago

I managed to get it into HBA mode, but no dice with passthrough. I also tried both TrueNAS SCALE and CORE: no difference. I tried different kernels too: 5.15.39, 5.5.107 and the latest 6.2.11 supported here.

It is getting a little iffy with the older kernels as the patches don't mention proxmox 7.4.3 but 7.2, don't know if that is a big deal.

Truenas scale doesn't display much readable info about the P222 on boot, but using lspci I can see it is detected but no driver is loaded (it does display the kernel module as hpsa so, it does detect it).
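
The "detected but no driver loaded" state described here shows up in lspci -k output as a "Kernel modules:" line without a matching "Kernel driver in use:" line. A small sketch for checking that inside the guest follows; the helper name driver_in_use is invented for illustration and the device address in the usage note is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -k` output for one device on stdin and
# print the bound driver, or "none" if no driver is in use.
driver_in_use() {
    awk -F': ' '/Kernel driver in use/ { print $2; found = 1 }
                END { if (!found) print "none" }'
}

# Usage inside the guest (the address 01:00.0 is a placeholder):
#   lspci -nnk -s 01:00.0 | driver_in_use
#   dmesg | grep -i hpsa    # look for the reason the driver did not bind
```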

Out of desperation I set up Unraid in a VM: same deal. It displays the device with no driver loaded. It did display hpsa failed to enter simple mode, but other than some ESXi-related posts, searching for that did not yield more useful data.

I did notice that turning on ROM Bar no longer displays the initializing message now. Don't know if that is a good or bad sign. And proxmox now displays a DMA Write instead of read error (same message otherwise).

I bought the machine for this whole use case, so I'm getting pretty frustrated about the lack of information on the errors I keep seeing.

Edit: Been at this for the best part of a week now... No luck other than finding semi related issues with no solution that works.

Booted unraid instead of proxmox and tried it from that direction and still no luck. Had to enable relax_rmrr which was included by default to get the VM to boot and that got me to the same place as proxmox.

I did find a report of the exact thing I'm seeing in my VMs, but the person in question uses ESXi, and I don't see which part of the fix for his install I could apply to mine: https://forums.unraid.net/topic/112594-solved-not-able-to-see-disks-when-unraid-is-under-esxi/#comment-1025128

It seems my best option is to run either Unraid or Truenas scale bare metal and use that as hypervisor...

The other option is installing ESXi to check whether it works there, but that seems like a lot of effort just for checking (and I have run out of usable drives unless I give up my Proxmox install).