fifteenhex / crappy96corearmserverexperiments

Messing around with a 96 thunderx server for the lulz
GNU General Public License v3.0

Onboard SATA #1

Open alexarda opened 1 year ago

alexarda commented 1 year ago

Hi fifteenhex

Thanks for this, I managed to boot one of these boards with an ATX PSU by following this repo.

Did you happen to check whether your SATA ports worked directly from the board? I can't get them to work on mine, and I don't have the backplane to test if it's a direct connection issue.

fifteenhex commented 1 year ago

Hi,

Nice. I didn't expect anyone to actually find this useful.

SATA I haven't tried. I'm using an NVMe SSD but in a USB3 adapter thing. If I find some time over the weekend I'll try hooking up a SATA drive and see what happens.

alexarda commented 1 year ago

Please disregard this. It was a rookie error on my part. I attached another drive and it worked perfectly.

It's stupid because the other drive that didn't work is known good, and I switched out both power and data cable with no result.

But then I tried another drive and it was detected.

Apologies for wasting your time and thanks for documenting everything!

fifteenhex commented 1 year ago

Nice. I wonder what the difference is? My long term plan was to work out how to get an NVMe drive to work properly in the PCIe slot.

alexarda commented 1 year ago

Maybe a vendor thing? The non-working drive is an Intel - quite an old one too.

Interesting, an Optane NVMe drive also has issues. The functional vendors for me are Adata for SATA and Samsung for NVMe.

fifteenhex commented 1 year ago

> Maybe a vendor thing? The non-working drive is an Intel - quite an old one too.
>
> Interesting, an Optane NVMe drive also has issues. The functional vendors for me are Adata for SATA and Samsung for NVMe.

A Kioxia NVMe drive didn't work for me. It looks like it's working initially and then stops responding. So the machine boots but eventually goes crazy because it can't read/write data anymore.

I'll try to find another NVMe drive and give it a go. I assumed that the PCIe was just garbage. :)

alexarda commented 1 year ago

[Image: Screenshot from 2023-05-29 06-16-55]

If you're interested I managed to get CentOS Stream reasonably stable with an AMD GPU, which is nice.

no2chem commented 1 year ago

Hi there, thanks for all of this, it's been pretty useful.

I purchased a Gigabyte R270-T60 on eBay hoping to use it as a NAS. Your tips helped. I'd add that arm-smmu.disable_bypass=n is needed with new kernels.

Have you been able to get normal hard drives working? I can't seem to get some Seagate Exos drives working, at least with the backplane it came with.

ayakael commented 10 months ago

@no2chem You must've gotten that R270-T60 from the same seller as I did (erik something something). I gave up on SATA. Every one of my drives gave me some variation of "ata1: SATA link down" or "link online but 1 devices misclassified", so I put in an LSI HBA. Fortunately, the backplane even does SAS! That said, NVMe has worked as long as I do not set acpi=force like I've seen in a few other guides.
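For anyone comparing symptoms, here is a minimal Python sketch that scans kernel log text for the two messages quoted above and groups them by ATA port. The sample log lines are made up for illustration (real kernel lines usually carry extra detail after the text shown), and the function and pattern names are hypothetical.

```python
import re

# Only the two message fragments quoted in this thread are matched.
PATTERNS = {
    "link_down": re.compile(r"^(ata\d+): SATA link down"),
    "misclassified": re.compile(r"^(ata\d+): link online but \d+ devices misclassified"),
}


def classify_ata_lines(log_text):
    """Return {port: [problem, ...]} for lines matching the known fragments."""
    problems = {}
    for line in log_text.splitlines():
        # Drop a leading "[   12.345678] " style timestamp if one is present.
        line = re.sub(r"^\[\s*\d+\.\d+\]\s*", "", line.strip())
        for name, pattern in PATTERNS.items():
            match = pattern.match(line)
            if match:
                problems.setdefault(match.group(1), []).append(name)
    return problems


# Illustrative sample only, not captured from a real machine.
sample = """\
[    3.100000] ata1: SATA link down
[    3.200000] ata2: link online but 1 devices misclassified
[    3.300000] ata3: some unrelated message
"""

result = classify_ata_lines(sample)
print(result)
```

On a live system the input could come from saving the output of dmesg to a file and passing its contents to classify_ata_lines.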

no2chem commented 10 months ago

@ayakael Actually, I figured out the SATA problem: the issue appears to be that the SATA ports don't support spread spectrum. WD drives work fine, but Seagate drives require disabling spread spectrum via SeaTools. To do that, you need to plug the drive into another device that does support spread spectrum and turn the setting off there. The drives work reliably after disabling spread spectrum.

ayakael commented 10 months ago

@no2chem Many thanks, I'll try that out! By the way, when you do lspci, do you see any Crypto acceleration device on your server?

no2chem commented 10 months ago

I don't, is there a specific accelerator you're looking for? I see

lspci | grep accelerator

0000:00:09.0 Processing accelerators: Cavium, Inc. THUNDERX Random Number Generator (rev 09)
0000:00:09.1 Processing accelerators: Cavium, Inc. THUNDERX Random Number Generator virtual function (rev 09)
0000:03:00.0 Processing accelerators: Cavium, Inc. THUNDERX Zip Coprocessor (rev 09)
0000:04:00.0 Processing accelerators: Cavium, Inc. THUNDERX DFA (rev 09)
000a:00:09.0 Processing accelerators: Cavium, Inc. THUNDERX Random Number Generator (rev 09)
000a:00:09.1 Processing accelerators: Cavium, Inc. THUNDERX Random Number Generator virtual function (rev 09)
000a:03:00.0 Processing accelerators: Cavium, Inc. THUNDERX Zip Coprocessor (rev 09)
000a:04:00.0 Processing accelerators: Cavium, Inc. THUNDERX DFA (rev 09)
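As a rough way to answer the "is there a crypto device" question from a paste like this, the Python sketch below parses the eight lines above, counts devices per PCI domain, and looks for anything with "crypt" in its name. The name match is only a heuristic and an assumption on my part; the thread does not say what a ThunderX crypto engine would be called in lspci.

```python
from collections import Counter

# The eight lines from the lspci output above, verbatim.
LSPCI_OUTPUT = """\
0000:00:09.0 Processing accelerators: Cavium, Inc. THUNDERX Random Number Generator (rev 09)
0000:00:09.1 Processing accelerators: Cavium, Inc. THUNDERX Random Number Generator virtual function (rev 09)
0000:03:00.0 Processing accelerators: Cavium, Inc. THUNDERX Zip Coprocessor (rev 09)
0000:04:00.0 Processing accelerators: Cavium, Inc. THUNDERX DFA (rev 09)
000a:00:09.0 Processing accelerators: Cavium, Inc. THUNDERX Random Number Generator (rev 09)
000a:00:09.1 Processing accelerators: Cavium, Inc. THUNDERX Random Number Generator virtual function (rev 09)
000a:03:00.0 Processing accelerators: Cavium, Inc. THUNDERX Zip Coprocessor (rev 09)
000a:04:00.0 Processing accelerators: Cavium, Inc. THUNDERX DFA (rev 09)
"""


def parse_lspci(text):
    """Yield (pci_domain, device_name) for each non-empty lspci line."""
    for line in text.splitlines():
        if not line.strip():
            continue
        address, _, rest = line.partition(" ")
        domain = address.split(":")[0]
        # rest looks like "<class>: <vendor and device> (rev NN)"
        _, _, device = rest.partition(": ")
        device = device.rsplit(" (rev", 1)[0]
        yield domain, device


devices = list(parse_lspci(LSPCI_OUTPUT))
per_domain = Counter(domain for domain, _ in devices)

# Heuristic only: assumes a crypto engine would have "crypt" in its name.
crypto = [d for d in devices if "crypt" in d[1].lower()]

print(dict(per_domain))
print(crypto)
```

The two domains (0000 and 000a) each list the same four functions, and nothing in this paste matches the crypto heuristic.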

ayakael commented 10 months ago

Right, some ThunderX systems have crypto accelerators, but mine doesn't. I'm now trying to get an Intel QAT accelerator going, but I'm experiencing MSI-X related errors. Apparently, MSI-X is broken on my system, and I'm trying to debug what's wrong. The more I use this server, the more I realize that a bunch of stuff is just plain broken.

keixthb commented 1 month ago

> Hi fifteenhex
>
> Thanks for this, I managed to boot one of these boards with an ATX PSU by following this repo.
>
> Did you happen to check whether your SATA ports worked directly from the board? I can't get them to work on mine, and I don't have the backplane to test if its a direct connection issue.

Hey @alexarda, could you explain how you got this to boot from an ATX PSU? I just picked up one of these devices and was surprised it did not come with the standard connectors.

alexarda commented 1 month ago

@keixthb yes, absolutely. So the pinout can be found in another issue thread here.

I bought one of these adapters from ModDIY and repinned it.

I did just use the 5VSB from the ATX PSU rather than try to get 12VSB, but it seems to work...

keixthb commented 1 month ago

@alexarda OK, I just ordered one, thanks so much. I'll see if I can get it to work and post an update in the next few weeks or so.

keixthb commented 2 weeks ago

@alexarda How did you rewire the power cable, and did you use two of them on the machine? I wired mine following the 18-pin diagram and it didn't boot, but I only purchased one... I must be doing something wrong. This is where I'm at with the project:

[Image: power_cable]

[Image: node_with_fans]

...also, I designed a fan bracket for a 120mm fan, if anyone wants a copy of the .stl.

[Image: capg_bracket]

Thanks!

keixthb commented 2 weeks ago

@alexarda I tried connecting pin 10 (on the 18-pin) to pin 9 for the +5VSB, and the lights come on, but I don't get anything on the VGA out.

[Image: Screenshot 2024-10-08 at 8 42 56 AM]

alexarda commented 2 weeks ago

Hi @keixthb, sorry I've been a bit lax with this. This went into storage after I bought an Ampere system. I can dig it out of storage for you this weekend. From memory, serial is the best option to start tinkering. VGA works, but only much later in the boot process, and you might catch issues on serial that you wouldn't otherwise see.

keixthb commented 2 weeks ago

@alexarda No worries. I would appreciate that, thank you! Ideally, my goal is to install RHEL 9 on the system so I can test some of the Nvidia cards (CentOS is good too though). I know the Tesla line is sensitive with firmware, so I'd like to install the driver for a bunch of them and see which ones work and which ones don't.

alexarda commented 2 weeks ago

@keixthb this is my version:

[Image: IMG_4390]

It comes from 1 x 24-pin ATX and 1 x 8-pin EPS.

The pinout is all 12V and GND as in the other thread, EXCEPT that I use 5VSB from the ATX supply on pin 10.

It's a bit difficult to show broken out, but here you go:

[Image: IMG_4387]

[Image: IMG_4388]

alexarda commented 2 weeks ago

That single cable with the heat-shrink cover is a dumb error. It's meant to be a jumper from PS_ON to GND to switch on the ATX supply. I thought I could get the board's power button to work, so I depinned it and in so doing broke the pin.

I never got further, but when I need to power the board it goes to pin 11 of the front panel header.

[Image: Screenshot from 2024-10-11 10-22-41]

keixthb commented 2 weeks ago

@alexarda I got it to turn on with the cable, I'll work on the serial next. How did you get the CentOS 9 kernel installed? Did you use an aarch64 DVD ISO and boot from a USB stick?

alexarda commented 2 weeks ago

From memory, yes. This is from my notes at the time:

CentOS: Tested =

acpi=force --> install won't display correctly, fail

acpi=force pci=noaer pcie_aspm=off --> install won't display correctly, fail

acpi=force pci=noaer pcie_aspm=off modprobe.blacklist=ast --> failed at startx after install

acpi=force pci=noaer pcie_aspm=off console=ttyS0 --> install success!

After install changed to = acpi=force pci=noaer pcie_aspm=off modprobe.blacklist=ast amdgpu.dpm=0 --> Working!

Then: systemctl set-default graphical.target

Poweroff, insert AMDGPU
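To double-check that a booted system really has the post-install arguments from these notes, a small Python sketch like the one below could compare a kernel command line against the list. The argument list is copied from the "Working!" line above; the sample command line and the helper names are hypothetical, and the sketch does not change any bootloader configuration.

```python
# Arguments from the "After install changed to" line in the notes above.
WORKING_ARGS = [
    "acpi=force",
    "pci=noaer",
    "pcie_aspm=off",
    "modprobe.blacklist=ast",
    "amdgpu.dpm=0",
]


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()


def missing_args(cmdline, wanted=WORKING_ARGS):
    """Return the wanted arguments that are absent from cmdline, in order."""
    present = set(cmdline.split())
    return [arg for arg in wanted if arg not in present]


# Hypothetical example: the install-time arguments from the notes,
# before the post-install change was made.
sample_cmdline = "acpi=force pci=noaer pcie_aspm=off console=ttyS0"

missing = missing_args(sample_cmdline)
print(missing)
```

On the machine itself, missing_args(read_cmdline()) would report which of the listed arguments the current boot is lacking.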

keixthb commented 2 weeks ago

@alexarda Thank you! Okay, so you installed the Linux kernel first, then the GPU. I ordered a serial cable for the machine that should be here today. I'll try it with acpi=force, pci=noaer, pcie_aspm=off, modprobe.blacklist=ast and see what happens.

keixthb commented 2 weeks ago

So, using sudo minicom -D /dev/ttyUSB0, I configure with:

[Image: Screenshot 2024-10-13 at 2 48 38 PM]

Which yields the following when the Gigabyte server turns on:

[Image: Screenshot 2024-10-13 at 2 43 28 PM]

I downloaded the ARM64 (aarch64) version of CentOS Stream 9 here, and flashed a USB drive using balenaEtcher:

[Image: Screenshot 2024-10-13 at 2 57 21 PM]

I insert the USB stick into the machine as shown:

[Image: Screenshot 2024-10-13 at 3 27 08 PM]

And the result is the same on the serial:

[Image: Screenshot 2024-10-13 at 3 25 23 PM]

alexarda commented 2 weeks ago

This might be a hardware/firmware issue. Possibly RAM at that point. It should boot and drop you to an EFI shell.

I don't think I captured a boot log. But I might get a chance to do so later this week if it would help.

keixthb commented 1 week ago

@alexarda that would be great, thank you. I will work on it again and see if I can figure out what's going on at my end. I think I have some RAM sticks I can pull from another working machine to test with.

keixthb commented 1 week ago

@alexarda Can you send a picture of the memory you are using with your machine? I tried two separate sets and I get the same memory controller error.

alexarda commented 1 week ago

@keixthb boot log as mentioned: MT70-HD0_BootLog.txt

Note the very different BDK versions. On your screenshot, it looks like it stalls at the point in my log where we can see BMC IP: N/A, which is immediately followed by the RAM configuration/testing/training info.

My suspicion is still a RAM issue, but I wouldn't rule out some weird BMC thing too. These motherboards are a total mess...

The part numbers of my RAM modules are all M393A1G43DB0-CPB. That's in the log too, but here also for posterity.

keixthb commented 4 days ago

@alexarda Thanks, I just ordered the memory. I'll update once it arrives.