I don't actually know what the addresses printed in dmesg are, although it would make sense for them to be physical PCI addresses. However, I really doubt that the kernel logs the regions mapped by ioremap into dmesg. I found these addresses by running an MMIO-trace. If you do that, you can examine all the MAP commands, which correlate to the ioremap calls.
Anyway, I do believe that the hardcoded addresses in this region may be an issue for graphics cards that are not Pascal based. Unfortunately I do not have any other graphics card to test with. Do you have a graphics card that doesn't work with these addresses?
First of all, you seem to have much more knowledge about these things; if you have any good sources for learning more about this stuff, I would appreciate it if you could share them with me.
Secondly, I should probably clarify what I meant above. I tried your vgpu_unlock with a 1070 on an AMD 2950X system.
The vgpu_unlock_hook didn't work; it seemed like the physical memory region was different. When comparing the nvidia-debug-report you posted previously with my own logs, the physical PCI address range on my machine seemed to be different. So I changed
VGPU_UNLOCK_MAGIC_PHYS_BEG (0xf0029624)
to VGPU_UNLOCK_MAGIC_PHYS_BEG (0x4810029624)
(and the same for the key line) and then everything worked.
I don't know how or why this happened, but I think some logic to automatically determine this range would be nice.
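Something roughly like this is what I have in mind (just a sketch, untested; it assumes the hook code can get hold of the GPU's struct pci_dev, and that the 0x29624 offset into BAR 3 is the same on every system):

/*
 * Rough, untested sketch: derive the magic address from the BAR 3 base
 * reported by the kernel instead of hardcoding the full physical address.
 * The 0x29624 offset is just the difference between the current hardcoded
 * 0xf0029624 value and the 0xf0000000 range start; the key address could
 * be derived the same way. "pdev" is assumed to be the GPU's struct pci_dev.
 */
#include <linux/pci.h>

#define VGPU_UNLOCK_MAGIC_OFFSET 0x29624

static resource_size_t vgpu_unlock_magic_phys(struct pci_dev *pdev)
{
    /* BAR 3 is the region that maps the card's VRAM onto the PCI bus. */
    return pci_resource_start(pdev, 3) + VGPU_UNLOCK_MAGIC_OFFSET;
}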
You know what, that's pretty good information. I feel like it should be included in the README so that AMD users, more specifically AMD Zen users, can get help setting this software up on their machines. Great findings!
I might be having the same issue. I'm not sure how to check the logging, but it's not working on my system. Specs:
Looked through dmesg like @arki05 did and found this line, which is similar to his:
pci 0000:c1:00.0: reg 0x1c: [mem 0x18010000000-0x18011ffffff 64bit pref]
After changing the magic to VGPU_UNLOCK_MAGIC_PHYS_BEG (0x1800029624) it unfortunately still doesn't work.
I'm not sure if this is even the issue, but it's the only relevant problem I could find.
The fix only works if you are certain that it's a memory address range issue. Are you trying to emulate an A40? That card seems to be a fair bit different as well, but that shouldn't be a limiting factor, as we have seen with the GTX 1060 running this script.
Do you happen to have an Intel system to try this on? My Intel-based system didn't require an address change. I'm assuming the 10GB VRAM may require a slightly higher memory range, though.
I do have an Intel i3-9100F that I could spin up tomorrow if that's even going to work?
In vgpu_unlock_hooks.c there is a section for enabling logs. You have to change the 0 to a 1
/* Debug logs can be enabled here. */
#if 0
#define LOG(...) printk(__VA_ARGS__)
#else
#define LOG(...)
#endif
Try to enable logs and rebuild & reinstall the DKMS module. Reboot and post your dmesg / logs. That might help find the issue. You can add additional LOG() statements to vgpu_unlock_hooks.c if you are not sure whether some function is executed correctly / with which parameters it gets executed.
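For example, a line like this in whichever hook you suspect (the variable names here are just placeholders, use whatever the hook actually receives):

/* Hypothetical example: log the physical range a hooked remap call is asked to map. */
LOG(KERN_INFO "vgpu_unlock: remap phys=0x%llx size=0x%llx\n",
    (unsigned long long)phys_addr,
    (unsigned long long)size);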
Already enabled the logs, but I'm not sure where to find them. I can post the dmesg here tomorrow for sure
It is likely the same issue. The addresses printed in dmesg are the PCI BARs (Base Address Registers) set up by the kernel; for more information on how that works, see this Wikipedia article: PCI configuration space.
What we are interested in is BAR 3 (documented here), which maps the card's VRAM onto the PCI bus. From my understanding, some code is written into the card's VRAM using this mapping, then the card's Falcon microprocessor is used to execute that code, which generates the magic and key values, and those can then be read back by the driver.
Unfortunately I believe that the different generations of cards have different versions of the Falcon microprocessor, so the code used might not be the same. It is therefore also likely that the offset into BAR 3 will have to be different for the different generations of cards.
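For reference, the two working Pascal values mentioned in this thread differ from their BAR 3 bases by the same offset:

0xf0000000   + 0x29624 = 0xf0029624   (the default in the script)
0x4810000000 + 0x29624 = 0x4810029624 (arki05's Zen system)

So on Pascal, at least, the offset appears to stay constant and only the BAR base moves; other generations may well use a different offset.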
As far as I know vgpu_unlock has only been tested on Pascal (10-series) graphics cards.
If anyone is interested in providing additional log files for analysis, I would like MMIO-traces for the execution of nvidia-smi on different cards. Instructions for generating these logs can be found here. If your system contains sensitive information, you might want to filter out all accesses not related to the GPU PCI devices.
The RTX 30 series has resizable BAR support, so I would assume that this new generation of GPUs uses a greater memory space than previous cards. There should be a way to get a beginning and end value for that space, so would it work if you plugged those values into the script?
It would, if you knew the offset of the magic and key values, and those offsets were constant.
Alright, just ran the MMIO trace and the dmesg. The MMIO trace was apparently really difficult or something, because I don't think I got it to work properly. nvidia-smi just returned "No device found", while it worked normally in Ubuntu. There were some other quirks, as about every step of the guide went a bit differently. I've uploaded it anyway and hopefully it's still helpful.
dmesg_3080.log mmiotrace_3080.log
Will test my Intel system next
Unfortunately it doesn't look like the memory regions that I am interested in were accessed during the recording of that log file. This is likely related to nvidia-smi showing "No device found"; the device should still be listed even if MMIO-trace is running.
We can list NVIDIA devices found by mmiotrace (annotations and formatting added for readability):
$ grep -e "PCIDEV .*10de" mmiotrace_3080.log
PCIDEV c100 10de2206 7d fa000000 1800000000c 0 1801000000c 0 f001 fb000000 1000000 10000000 0 2000000 0 80 80000
PCIDEV c101 10de1aef 7c fb080000 0 0 0 0 0 0 4000 0 0 0 0 0 0 snd_hda_intel
^pciid ^bar0 ^bar1 ^bar2 ^bar3 ^len0 ^len1 ^len2 ^len3
The first device is the RTX 3080 GPU (PCI device id 0x2206), which we are interested in, and the second device is an audio device (probably for sound over HDMI), which is not interesting. We can see that there are three initialized BARs on the GPU: BAR0, BAR1 and BAR3.
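As a sanity check, the low nibble of the raw bar3 value holds the PCI flag bits (0xc here means a 64-bit, prefetchable memory BAR), so masking it off gives the base address:

0x1801000000c & ~0xf = 0x18010000000

which matches the reg 0x1c line from the dmesg quoted earlier.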
We can now look at all mapping commands:
Here we can see that BAR0 is mapped five times (id 1, 2, 3, 36 and 57), but BAR3 is never mapped. Unfortunately it is the values inside BAR3 that I am interested in.
Documentation for MMIO-trace, including the log file format can be found here.
Hmm, I'll give it a try again then. I've also tried running the 3080 on Intel, but no luck there either. The PCI id was 0000:01:00.0 instead of 0000:c1:00.0, but it didn't change a thing. It might not be a PCI address issue after all.
One odd thing I did notice was the GPU temperature and power usage being really high on both systems after applying the mod. The 3080 was about 60C after a while and sucked about 160W. This is definitely not normal and only happens using this script.
Screenshot:
The 3080 is rather new. I know there are implementations for the GA100 chip in drivers 450 and 460, but I don't know if they have added GA102 yet, which is what the RTX A6000 and the 3080 use. It might work in the future, though.
Speaking of 450, have you tried out the 450 driver, or is it no longer available for download from the Nvidia Enterprise portal?
The 460 driver supports both the RTX A6000 and the A40. The 450 doesn't; it only supports the A100. I've tried to install it, but it would just give me an error (which is obvious, I guess).
The 3080 was about 60C after a while and sucked about 160W. This is definitely not normal and only happens using this script.
I've noticed slightly higher idle wattages on mine too, but only 33 watts, which is not much. I believe this script was built mostly around Pascal, with not much testing done on newer generations like Ampere.
Usually if you are seeing much higher wattages, temps, and fan speeds, that likely means that the driver is unable to work properly with the graphics card. Notice that your GPU is sitting in the P0 high-performance power state despite idling. This isn't supposed to happen in normal operation and could mean that the script is preventing the driver from working as intended.
I'm no expert by any means, but I figure a modified version of the script focused on Ampere's far greater memory space usage and other quirks of the Ampere generation could be made, either separately or as part of the same script, one that only activates upon detection of an Ampere card PCI ID.
And one last thing, this is unrelated but @FIFARenderZ do you plan on purchasing a license for vGPU after your trial license expires for realtime usage? Or are you trying out this setup for tinkering purposes?
I have no idea why the GPU usage would be affected by the script. But the MMIO-trace is equally useful whether or not vgpu_unlock is used. So an MMIO-trace with an unmodified driver and Ampere GPU would be interesting.
Usually if you are seeing much higher wattages, temps, and fan speeds, that likely means that the driver is unable to work properly with the graphics card. Notice that your GPU is sitting in the P0 high-performance power state despite idling. This isn't supposed to happen in normal operation and could mean that the script is preventing the driver from working as intended.
That would make sense, although I haven't seen any other power state mentioned in the nvidia-smi output.
And one last thing, this is unrelated but @FIFARenderZ do you plan on purchasing a license for vGPU after your trial license expires for realtime usage? Or are you trying out this setup for tinkering purposes?
For tinkering purposes right now. Maybe we're able to do much more later 😉
I have no idea why the GPU usage would be affected by the script. But the MMIO-trace is equally useful whether or not vgpu_unlock is used. So an MMIO-trace with an unmodified driver and Ampere GPU would be interesting.
For sure, which is what I was trying to do. It didn't work out for some reason and I'll give it another try tomorrow
Tried another round of MMIO-tracing with no success. The driver works normally when booted, but once I start tracing (and disable & re-enable the driver) it just spits out "No device found". I checked the logs and they again contained no info about BAR3. @DualCoder Did you do anything different from the guide? Because I'm kind of at a loss right now.
PS: Maybe I'm doing something wrong, but every time I execute echo nop > /sys/kernel/debug/tracing/current_tracer it returns "Device or resource busy". I've followed both guides completely and tried it in both recovery mode and normal mode.
If anyone wants to join, https://discord.gg/mAz38ZBrjx
I have created one, if anyone wants to join, https://t.me/gpuhacking
We usually use the EEVBLOG forum to discuss this, but their data center caught fire. I joined your Telegram, but it would be nice if we all could have permission to post messages.
Does vgpu_unlock work with the RTX 3090?
In theory it could, but based on @FIFARenderZ's experience with the RTX 3080, it may or may not work out. This script works with older generations like Pascal and Turing, though. Also, the 3090 uses the GA102, which is the same chip as the 3080, so your chances of success are going to be about as high as everyone else's with a 3080...
Has been solved by dualcoder in dualcoder/vgpu_unlock@54d90cde
In the README you wrote "Physical PCI address range 0xf0000000-0xf1000000"; in my case the range turned out to be 0x4810000000-0x4811ffffff.
The address range can be found in dmesg in lines like this: pci 0000:0a:00.0: reg 0x1c: [mem 0x4810000000-0x4811ffffff 64bit pref]
This is probably due to Above 4G Decoding (not 100% sure, but it's my best guess).