Wack0 / maciNTosh

PowerPC Windows NT ported to Power Macintosh systems
GNU General Public License v2.0

HAL crashes during initialisation on Cuda systems #8

Closed Wack0 closed 2 months ago

Wack0 commented 2 months ago

As reported by @JonObst here: https://github.com/Wack0/maciNTosh/issues/3#issuecomment-2227171714

Most likely an issue with PCI bus enumeration in the HAL.

Wack0 commented 2 months ago

Here is a test build that may fix the issue. If not, I'll make a build with all the PCI bus enumeration testing code in...

nt_arcfw_grackle_fwonly_test20240714_1248.zip

JonObst commented 2 months ago

Same problem with the most recent build (1536)

Wack0 commented 2 months ago

Hmm, not even displaying an additional message for detecting Yosemite?

I'll make another build with more debug logs as soon as possible.

JonObst commented 2 months ago

Right, just one dot then the freeze, nothing else displayed.

Wack0 commented 2 months ago

OK, here's a build with the PCI debug output present in the HAL:

nt_arcfw_grackle_fwonly_test20240714_1917.zip

JonObst commented 2 months ago

IMG_0147

This is still all I get. Just to confirm, I repartitioned the drive, ran setupldr, added the mass storage devices (2), selected the Mac type and video type, then it hangs here.

Wack0 commented 2 months ago

Looking further, I think the Cuda driver in the HAL may be incorrect. (well, it does things differently than what I thought)

I made some changes, does this work?

nt_arcfw_grackle_fwonly_test20240714_1952.zip

JonObst commented 2 months ago

Unfortunately it does the exact same as before.

Wack0 commented 2 months ago

Made another change, and also added a bunch of debug output to the cuda init:

nt_arcfw_grackle_fwonly_test20240714_2106.zip

JonObst commented 2 months ago

IMG_0154

well it’s showing the stop error now, progress!

JonObst commented 2 months ago

also I believe it made it further than it had been before the halt

Wack0 commented 2 months ago

yeah, now it is a pci enumeration issue.

Wack0 commented 2 months ago

this one has more PCI debug output, should hopefully say what's causing the machine check exception:

(also removed the cuda debug output because it's no longer needed)

nt_arcfw_grackle_fwonly_test20240714_2204.zip

Wack0 commented 2 months ago

Just noticed something in the linux sources, where on Yosemite it turns off master abort mode in the PCI-PCI bridge, so I added code to do the same thing in the HAL when finding a PCI-PCI bridge - maybe that'll help.

nt_arcfw_grackle_fwonly_test20240714_2237.zip

ActionRetro commented 2 months ago

Also trying on a B&W G3 - version nt_arcfw_grackle_fwonly_test20240714_2237.zip has the same result. IMG_8838

JonObst commented 2 months ago

Yep I get the same as Sean IMG_0156

Wack0 commented 2 months ago

huh, maybe a timing issue on the cuda init?

JonObst commented 2 months ago

I’d say your guess is as good as mine, but your guess is actually 100x better than mine.

joevt commented 2 months ago

When porting pciutils to Mac OS X for PowerPC Macs, I found that probing non-existent devices behind the built-in PCI-PCI bridge of a B&W G3 causes a machine check exception, so it now has a special case where it only probes PCI devices that exist in the device-tree. https://github.com/joevt/directhw/blob/a1d987fdea92ddf19cc877c4acef9b1ca2e06072/DirectHW/DirectHW.cpp#L1222-L1278

Would be nice if there was a flag to disable the machine check exception for this case. Doesn't Grackle have a flag for that? I have to check the MPC106 manual...

Reading from a non-existent device should just return -1 for every byte in the config space (not true for hidden Thunderbolt devices which return -1 only for vendor and device ID).
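The special case described above can be sketched roughly as follows. This is only an illustration, not the DirectHW or HAL code: `read_config32` is a hypothetical callback for a 32-bit config-space read, `known` is a hypothetical list of (bus, device, function) addresses taken from the device tree, and the IDs in the toy config space are made up.

```python
from typing import Callable, Iterable, List, Tuple

Bdf = Tuple[int, int, int]  # (bus, device, function)
ABSENT = 0xFFFFFFFF  # what a config read yields when no device responds


def probe_known_devices(
    known: Iterable[Bdf],
    read_config32: Callable[[int, int, int, int], int],
) -> List[Tuple[Bdf, int, int]]:
    """Probe only the addresses in `known`, never the whole bus.

    Returns ((bus, dev, func), vendor_id, device_id) for each device that answered.
    """
    found = []
    for bus, dev, func in known:
        # Offset 0x00: vendor ID in the low 16 bits, device ID in the high 16 bits.
        id_reg = read_config32(bus, dev, func, 0x00)
        if id_reg == ABSENT:
            continue  # listed but not responding; skip it
        found.append(((bus, dev, func), id_reg & 0xFFFF, id_reg >> 16))
    return found


# Toy config space standing in for real hardware (made-up IDs).
_fake_space = {(1, 0, 0): 0x12345678}


def fake_read(bus: int, dev: int, func: int, offset: int) -> int:
    if offset != 0x00:
        return ABSENT
    return _fake_space.get((bus, dev, func), ABSENT)
```

Calling `probe_known_devices([(1, 0, 0), (1, 2, 0)], fake_read)` only touches those two addresses, so slots that are not in the list are never read at all.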

Wack0 commented 2 months ago

Added some more short delays during cuda init:

nt_arcfw_grackle_fwonly_test20240715_1027.zip

Wack0 commented 2 months ago

@joevt According to the 21154 datasheet, the master abort mode (which linux specifically disables) is responsible for the behaviour you describe:

"Controls the 21154’s behavior when a master abort termination occurs in response to a transaction initiated by the 21154 on either the primary or secondary PCI interface. When 0: The 21154 asserts TRDY# on the initiator bus for delayed transactions, and FFFF FFFFh for read transactions. For posted write transactions, p_serr_l is not asserted. When 1: The 21154 returns a target abort on the initiator bus for delayed transactions. For posted write transactions, the 21154 asserts p_serr_l if the SERR# enable bit is set in the command register. Reset value: 0."

It appears that probe-slots in yosemite's OF does enable this bit.

joevt commented 2 months ago

> It appears that probe-slots in yosemite's OF does enable this bit.

Yes. It's a standard PCI-PCI bridge register but this only happens for the built-in DEC21154 PCI bridge - not any other bridges (including other DEC21154 bridges that you may add).

The bridge control register is set to 0x0326 (hard coded) after the slots are probed:
1: SERR# Enable
2: ISA Enable
5: Master-Abort Mode
8: Primary Discard Timer
9: Secondary Discard Timer

The bridge control register is set to 0x03a6 before Open Firmware:
7: Fast Back-to-Back Enable
... + all the other bits mentioned above.

Any other PCI bridges that are not the built-in DEC21154 PCI bridge get their bridge control registers set to 0 before probing and 4 (ISA Enable) after probing.
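To check the values quoted above against the bit numbers, here is a small decoding sketch. The bit names are only the ones listed in this comment; the function names are made up for illustration, and any bits not named here are simply ignored.

```python
# Bridge control register bits as named in the comment above (bit number -> name).
BRIDGE_CONTROL_BITS = {
    1: "SERR# Enable",
    2: "ISA Enable",
    5: "Master-Abort Mode",
    7: "Fast Back-to-Back Enable",
    8: "Primary Discard Timer",
    9: "Secondary Discard Timer",
}

MASTER_ABORT_MODE = 1 << 5


def decode_bridge_control(value: int) -> list:
    """Return the names of the known bits that are set in `value`, lowest bit first."""
    return [
        name
        for bit, name in sorted(BRIDGE_CONTROL_BITS.items())
        if value & (1 << bit)
    ]


def clear_master_abort_mode(value: int) -> int:
    """Return the 16-bit register value with Master-Abort Mode (bit 5) cleared."""
    return value & ~MASTER_ABORT_MODE & 0xFFFF


for reg in (0x0326, 0x03A6, 0x0004):
    print(f"{reg:#06x}: {', '.join(decode_bridge_control(reg))}")
```

With bit 5 cleared, 0x0326 becomes 0x0306, which is the kind of change the HAL would need to make to turn master abort mode off on the built-in bridge.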

JonObst commented 2 months ago

IMG_0157

latest build results ^^^

Wack0 commented 2 months ago

ok, is it definitely freezing there? boot should continue after that, either with INACCESSIBLE_BOOT_DEVICE or getting into text setup.

can you try not loading the general HID and storage driver? if it boots to a keyboard error, this would imply more issues in the Cuda driver...

JonObst commented 2 months ago

Fairly certain, the optical drive turned off as it has done in the past when the system freezes and I let it sit there for a couple of minutes with no progress - but I will recheck again after I return home from work this afternoon.

ActionRetro commented 2 months ago

Same result here. IMG_8851

Wack0 commented 2 months ago

@ActionRetro is that with not loading the general HID and storage driver?

ActionRetro commented 2 months ago

Oh no, I'll try that

Wack0 commented 2 months ago

if that boots to a keyboard error, then I know it's a HAL Cuda driver problem, otherwise something else (and probably related to the PCI IDE controller in some way, atapi.sys does load so it will try to use it)

ActionRetro commented 2 months ago

Ok, without HID and storage driver, I got: IMG_8852

Wack0 commented 2 months ago

OK, I meant with the mac i/o ide driver and not the "generic HID and storage" driver; but that actually helps! it shows me that atapi.sys initialised fine, so it definitely is a cuda driver problem.

In that case, I noticed the linux driver reads back IER after writing to it to disable all MCU interrupts, so I tried that (instead of a delay after setting it on init). Might not work, but I guess the linux driver specifically does that for a reason, so:

nt_arcfw_grackle_fwonly_test20240715_1729.zip
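The write-then-read-back pattern mentioned above can be modelled roughly like this. It is a toy simulation, not the HAL or Linux code: `FakeVia` is a made-up stand-in that only records accesses, and the 0x7F value and the enable/disable semantics follow the usual 6522-style VIA convention, which is an assumption here. A common reason for such a read-back is to make sure the write has actually reached the device before continuing; the thread does not say whether that is the reason in this case.

```python
class FakeVia:
    """Toy stand-in for a 6522-style VIA; records accesses instead of touching hardware."""

    def __init__(self):
        self.ier = 0x7F  # pretend every interrupt source starts enabled
        self.log = []

    def write_ier(self, value: int) -> None:
        self.log.append(("write", value))
        mask = value & 0x7F
        if value & 0x80:
            self.ier |= mask  # bit 7 set: enable the selected sources
        else:
            self.ier &= ~mask & 0x7F  # bit 7 clear: disable the selected sources

    def read_ier(self) -> int:
        self.log.append(("read", None))
        return self.ier  # only the enable mask is modelled here


def disable_all_interrupts(via: FakeVia) -> int:
    """Disable every interrupt source, then read IER back instead of delaying."""
    via.write_ier(0x7F)  # bit 7 clear, all source bits set: disable everything
    return via.read_ier()  # read-back follows the write immediately
```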

Wack0 commented 2 months ago

Given the previous build did not work (on a trayloading imac), I added some debug output for Cuda, hopefully the crash isn't what I think it is...

(I removed the PCI bus enumeration debug code, as it's not needed anymore)

nt_arcfw_grackle_fwonly_test20240715_1834.zip

TechandMusic462 commented 2 months ago

Debug output on the Trayloader:

PXL_20240715_174618944 MP

Wack0 commented 2 months ago

huh, wasn't expecting that output... so the problem is ADB commands, specifically.

this one will give slightly more debug output.

nt_arcfw_grackle_fwonly_test20240715_1856.zip

Wack0 commented 2 months ago

...I think I figured out what the problem was, how did this ever work, even under emulation??? (forgot to check if cuda was finished sending data, something I implemented in the ARC firmware even!)

I removed all the debug output, I can add it back if this still freezes.

nt_arcfw_grackle_fwonly_test20240715_1913.zip
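The kind of check described above (stop reading once the device has finished sending) might look something like this in outline. Everything here is hypothetical: `read_byte` and `still_sending` stand in for whatever the real driver uses to fetch a byte and to test whether Cuda has more data, and `FakeCuda` is a toy model, not the actual handshake.

```python
from collections import deque
from typing import Callable, List


def read_response(
    read_byte: Callable[[], int],
    still_sending: Callable[[], bool],
    max_bytes: int = 16,
) -> List[int]:
    """Read a response one byte at a time, stopping when the device is done.

    Assumes at least one byte is pending when called.
    """
    data = []
    while len(data) < max_bytes:
        data.append(read_byte())
        if not still_sending():
            break  # device has finished; do not wait for bytes that never come
    return data


class FakeCuda:
    """Toy device that hands out a fixed payload and then reports it is done."""

    def __init__(self, payload):
        self._queue = deque(payload)

    def read_byte(self) -> int:
        return self._queue.popleft()

    def still_sending(self) -> bool:
        return bool(self._queue)
```

The `max_bytes` cap is only a safety net for the sketch, so that a device which never signals completion cannot keep the loop running forever.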

TechandMusic462 commented 2 months ago

I am now in the installer, will continue the install and report back!

PXL_20240715_182411932 MP

Wack0 commented 2 months ago

I assume the keyboard works fine in text setup?

TechandMusic462 commented 2 months ago

Yes, I am currently at the formatting stage of the install.

Wack0 commented 2 months ago

In that case, this issue is fixed, I'll get a release together.