CUDA SDK 11.6 and newer not supported (fully)

SveSop / nvcuda

Standalone version of nvcuda from Wine-Staging


Closed SveSop closed 9 months ago

SveSop commented 2 years ago

A part of the CUDA SDK is the "CUDA runtime API". This is an API used by various software such as OptiX. https://docs.nvidia.com/cuda/cuda-runtime-api/driver-vs-runtime-api.html#driver-vs-runtime-api

This is fairly straightforward to use, but it creates some rather nasty side effects for the nvcuda implementation, since apps are not compiled against the Linux version of CUDART.

This is worked around for various parts in the internal.c part of the nvcuda codebase, and has up until SDK 11.5 been rather trivial (for me) to continue supporting. However, as of SDK 11.6 (and now 11.7) it is failing miserably. I have been doing my best trying to figure out some stuff here, but i have come up short on that one. Seeing as the code uses void pointers to implement a "dump-the-needed-data-here" kind of function to get around the needs for this, it is not overly easy to figure out what/how the missing/wrong data is being relayed.

Sample code compiled with nvcc and visual studio in windows: (simple test.cu)

#include <iostream>
#include <cuda_runtime.h>  // pulled in implicitly by nvcc for .cu files, spelled out for clarity

int main() {
  int n = 0;
  // Ask the runtime API how many CUDA devices it can see.
  cudaError_t error = cudaGetDeviceCount(&n);
  if (error != cudaSuccess) {
      std::cerr << "Error:  " << cudaGetErrorName(error) << std::endl;
      std::cerr << "String: " << cudaGetErrorString(error) << std::endl;
  }
  std::cout << "Number of devices: " << n << std::endl;
  return 0;
}

This will use the CUDA runtime API to do various checks and, if they pass, report how many CUDA devices are available.

Result compiled with SDK 11.5:

0128:trace:nvcuda:Unknown7_func0_relay (11050, 0x62f2ad9c, 0x11fc28)
Number of devices: 1
0128:trace:nvcuda:Unknown1_func6_relay (0x14006d018, 0xb803f0)
0128:trace:nvcuda:DllMain (0x7f71288f0000, 0, 0x1)

Same code compiled with SDK 11.6:

0128:trace:nvcuda:Unknown7_func0_relay (11060, 0x62f2b1cb, 0x240370)
0128:trace:nvcuda:Unknown7_func0_relay (11061, 0x62f2b1cb, 0x240380)
0128:trace:nvcuda:Unknown7_func0_relay (11062, 0x62f2b1cb, 0x240390)
Error:  cudaErrorSoftwareValidityNotEstablished
String: integrity checks failed
Number of devices: 0
0128:trace:nvcuda:Unknown1_func6_relay (0x14006e018, 0xb80400)
0128:trace:nvcuda:DllMain (0x7f4da2c80000, 0, 0x1)

nvcuda source lines of interest:

https://github.com/SveSop/nvcuda/blob/devel/dlls/nvcuda/internal.c#L218-L233

and

https://github.com/SveSop/nvcuda/blob/devel/dlls/nvcuda/internal.c#L523-L527

static void* WINAPI Unknown7_func0_relay(int cudaVersion, void *param1, void *param2)

The first one, cudaVersion, i am fairly sure is just what i figured out: an int providing the CUDA SDK version (11050 = 11.5). Changing this has no effect. The second parameter, void *param1, is i think some data providing info TO the CUDA driver. Changing or otherwise fiddling with this ends up in various wine page faults. The third parameter, void *param2, does seem to be the result coming back from the driver/call. I have dabbled with this, trying various structs/char arrays/int arrays and so on, and i do get data that can be manipulated AND that actually causes the 11.5 compiled test to fail with the same cudaErrorSoftwareValidityNotEstablished as the 11.6 one. However, when it fails that way, it does NOT cause 3 calls with increasing "cudaVersion" like when it fails with the 11.6 version.

So - i am basically at a complete loss here.

Why would i want 11.6 and specifically 11.7 to work with nvcuda? Well, it seems the default OptiX SDK 7.5 is meant to be compiled against CUDA SDK 11.7. The samples i have tested so far that were compiled against CUDA 11.4 have worked, but things like IRAY or other software utilizing OptiX will probably be compiled "as-it-is-supposed-to" and thus will probably require a CUDA SDK 11.7 compliant driver and nvcuda.

Attaching pre-compiled version of the cuda test sample above. cuda_samples.zip

Saancreed commented 2 years ago

Oh dear this is hard.

Unknown7_func0_relay does indeed seem to be the source of our issues. The name of reported error, cudaErrorSoftwareValidityNotEstablished seems to suggest that this is some sort of DRM which makes reverse engineering this questionable, but let's say we want to understand what's happening there for the sake of compatibility/interoperability.

First, the cudaVersion parameter seems to be used to sneak in some extra parameter as its least significant decimal digit. I assume that newer CUDA runtime runs some extra integrity checks and expects a different answer for each one, and the one for 2 does not match whatever it expects, causing this failure. Here's a suggestion: if these checks are not shortcircuited then hacking our Unknown7_func0_relay to return a fake result when cudaVersion % 10 == 1 should cause the sample to fail one call earlier. Could you check what happens when we do that?

Second, the param1 parameter is probably not a pointer at all. Across multiple runs, I've noticed that it steadily increases over time, by about 1 every second passed. So, let's try something:

$ WINEDEBUG='-all,+nvcuda' wine cuda_minimal_115.exe |& fgrep Unknown7_func0_relay && date -u
0128:trace:nvcuda:Unknown7_func0_relay (11050, 0x62f2f438, 0x11fc28)
Tue Aug  9 23:56:40 UTC 2022

$ date -ud @$((0x62f2f438))
Tue Aug  9 23:56:40 UTC 2022

Aha! It appears to be just the current UNIX epoch timestamp, so this argument is probably of type closer to uint64_t rather than void*. Otherwise, there's nothing wrong here in particular.
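The two observations above (an extra value hidden in the last decimal digit of cudaVersion, and param1 being an epoch timestamp) can be sketched as a small decoder. This is only an illustration under stated assumptions: the helper names are hypothetical and not part of nvcuda, and the major/minor split assumes the usual CUDA encoding of 1000*major + 10*minor.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper type: the apparent components of the first argument. */
struct version_parts {
    int major;  /* 11 for 11062 */
    int minor;  /* 6 for 11062 */
    int check;  /* least significant decimal digit, 2 for 11062 */
};

/* Splits a value such as 11062 into major, minor and the trailing digit. */
static struct version_parts split_cuda_version(int cudaVersion)
{
    struct version_parts p;
    p.major = cudaVersion / 1000;
    p.minor = (cudaVersion % 1000) / 10;
    p.check = cudaVersion % 10;
    return p;
}

/* Treats value as seconds since the UNIX epoch and formats it as
 * "YYYY-MM-DD HH:MM:SS" in UTC. Returns 0 on success, -1 on failure. */
static int format_epoch_utc(uint64_t value, char *out, size_t out_len)
{
    time_t t = (time_t)value;
    struct tm *utc = gmtime(&t);

    if (utc == NULL)
        return -1;
    if (strftime(out, out_len, "%Y-%m-%d %H:%M:%S", utc) == 0)
        return -1;
    return 0;
}
```

Calling format_epoch_utc(0x62f2f438, buf, sizeof buf) gives the same date and time as the date -ud invocation above, and split_cuda_version(11062) yields 11, 6 and 2.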

Finally, the param2 parameter seems to stay the same across multiple runs of the same binary (e.g. 0x11fc28 for your cuda_minimal_115.exe attached above). This can mean one out of two things (there could be more but not as likely):

1. This is always some numeric 64–bit constant. Not too likely, because in such case this function would be able to return only whatever it has declared in its return value as result. Suggestion: log the value that Unknown7_orig->func0 returns to us before returning it to the application. I have a feeling this is actually going to be something like CUresult, so 0x0 on success and other values on failure, which would imply that param2 is not just some numeric constant.
2. It's a pointer to some global variable which is always mapped at the same memory address, in which the real result is returned. Let's take a look at pointers passed in there:

0128:trace:nvcuda:Unknown7_func0_relay (11060, 0x62f2b1cb, 0x240370)
0128:trace:nvcuda:Unknown7_func0_relay (11061, 0x62f2b1cb, 0x240380)
0128:trace:nvcuda:Unknown7_func0_relay (11062, 0x62f2b1cb, 0x240390)

(They are even exactly the same for 11.7 sample!)

So, we have three memory addresses, exactly 0x10 bytes apart, suggesting that it's either sequentially placed 16–byte arrays or structs, or something smaller but aligned to 16 bytes each. Suggestion: try making a simple test program that just loads libcuda.so, calls cuGetExportTable with Unknown7's ID and repeatedly calls this function, passing as the last argument an address of some large array initialized with some pattern (or just zeros) and observe how it changes. Is the result always the same for the same values of cudaVersion and param1? Only within a single run of the test program or across multiple runs too? How many bytes are written? Does it matter what is in the param2 array before the call? Does the order of calls matter?
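One way to answer the "how many bytes are written?" question is a pattern-fill probe. The sketch below is hedged: the function-pointer signature is only guessed from the trace lines in this thread, all names are hypothetical, and fake_func0 is a stand-in so the code runs without libcuda.so. In a real experiment the pointer would come from the table returned by cuGetExportTable.

```c
#include <stddef.h>
#include <string.h>

#define PROBE_SIZE 256
#define PROBE_FILL 0xA5

/* Signature guessed from the trace output; the real one is undocumented. */
typedef void *(*probe_target_fn)(int cudaVersion, void *param1, void *param2);

/* Fills a scratch buffer with a known pattern, calls target with it as
 * param2, and returns the index one past the last byte that changed
 * (0 means nothing was written). */
static size_t probe_bytes_written(probe_target_fn target, int cudaVersion,
                                  void *param1)
{
    unsigned char buf[PROBE_SIZE];
    size_t i, end = 0;

    memset(buf, PROBE_FILL, sizeof buf);
    target(cudaVersion, param1, buf);
    for (i = 0; i < sizeof buf; i++) {
        if (buf[i] != PROBE_FILL)
            end = i + 1;
    }
    return end;
}

/* Stand-in for the real function: pretends to write a 16-byte result. */
static void *fake_func0(int cudaVersion, void *param1, void *param2)
{
    (void)cudaVersion;
    (void)param1;
    memset(param2, 0x11, 16);
    return NULL;
}
```

A byte that the callee happens to set to the fill value goes unnoticed, so repeating the probe with a second fill pattern is safer. If the callee also reads param2 before writing, the fill pattern itself may change the outcome, which is one of the questions above.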

Unfortunately, my guess is that calls with …2 passed as cudaVersion perform some more thorough integrity check, most likely backed by some cryptographic signing/encryption algorithm, and something is different between Wine and Linux environments, causing the data mismatch. I'm not sure if diving into this is a good idea: if NVIDIA wants to make CUDA unusable in Wine, they have enough resources at their disposal to make our efforts utterly hopeless. Best we can do is ask NV people how we are supposed to proceed (unless we're not, in which case, well, there's not much we can do) and maybe we will get some guidance and/or some actual help with this hellscape.

SveSop commented 2 years ago

Thanks for the input. I will study this a bit more to try to understand.

I don't really think it is as sinister as NVIDIA purposely attempting to make nvcuda unusable in wine, cos there would be more ways to do that than f**ing with this internal function IMO 😄 So, i am inclined to give them (NV) the benefit of the doubt in that regard.

Using CUresult ret; and ret = Unknown7_orig->func0(cudaVersion, param1, param2); logging the before & after results it is clear that the first cudaVersion parameter is the same, and has absolutely no effect on the result. Using 11.5 test changing this to whatever "version" does not change the result, so afaik it is just some information.

Trying to TRACE any data from the param1 void is fruitless for me.. just causes exception errors. As to the param2 which i have experimented the most with it seems to be where some "comparable data" is residing. I have been able to put data in an int array, and it did show changes before the ret and differences after the ret. Changing the contents of these results caused 11.5 to give the same cudaErrorSoftwareValidityNotEstablished...

I had a unsigned int data[8] array that looked something like this BEFORE the call:

param2->data[0]: 0
param2->data[1]: 0
param2->data[2]: 32
param2->data[3]: 0
param2->data[4]: 23424434
param2->data[5]: 4432356
param2->data[6]: 87478382
param2->data[7]: 74748882

After the return:

param2->data[0]: 23424434
param2->data[1]: 4432356
param2->data[2]: 87478382
param2->data[3]: 74748882
param2->data[4]: 23424434
param2->data[5]: 4432356
param2->data[6]: 87478382
param2->data[7]: 74748882

(The numbers are just examples - but they are the same each run.. at work, so dont remember the actual numbers)

Now, if i changed param2->data[0] = 5555 before returning, the 11.5 sample would show the same error. However, changing BOTH param2->data[0] AND param2->data[4] so they are the same would "pass" this test. So to me it did seem like it does some sort of comparison of this.
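The behaviour described above is consistent with a plain comparison of the first 16 bytes against the following 16 bytes. A minimal sketch of that model, assuming this is really all the runtime does (which is not confirmed anywhere in this thread); the function name is hypothetical:

```c
#include <string.h>

#define DIGEST_LEN 16

/* Returns 1 when the 16 bytes at region equal the 16 bytes that follow
 * them, 0 otherwise. region must point to at least 32 readable bytes. */
static int halves_match(const void *region)
{
    const unsigned char *bytes = (const unsigned char *)region;

    return memcmp(bytes, bytes + DIGEST_LEN, DIGEST_LEN) == 0;
}
```

With the unsigned int data[8] layout from the dump above, changing only data[0] makes halves_match fail, while changing data[0] and data[4] to the same value makes it pass again, matching what was observed with the 11.5 sample.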

The unsigned int array would probably be completely wrong tho as you mention with the 16-byte array.

The weird thing is that when you run this with 11.7, it will NOT contain any data in param2->data[0-7], but after the ret, param2->data[0-3] will have the same values in them as the 11.5 result. Copying 0->4, 1->5 and so on, so that the values compare the same as when running 11.5, still fails tho.

As you can see from the logs when running the samples it will also gather loads of "parameter" data before running this function, and i fear it MAY be using something gathered there to figure out what values this return is supposed to have when running on >11.5. There is only 1 unimplemented call that is being run that i know of, but have been unable to implement as i don't think it can be implemented unless i make a switch-list containing most of the cu* calls in the lib... and that is no trivial task imo. This was introduced with SDK 11.4, but for all i know might not have been used before cuda runtime api 11.6 perhaps?

It could explain why the "compare values" of param2 are 0, i.e. it can't gather the needed data and perhaps fails filling the data struct before doing the call, and thus cudaErrorSoftwareValidityNotEstablished. I'll put up an "experimentation" branch with the changes i mean, so you can see it more clearly (i suck at explaining things i don't really know much about hehe).

Fun times... fun times indeed 😏

SveSop commented 2 years ago

I had not thought about that 16-byte array thing, cos i created a unsigned int data[8] array, and that i suppose is 32-byte - which might explain why the last 4 positions would contain the same as the "resulting ret" .. maybe with < 11.6, the comparison data is residing in an address 16 bytes > the struct, and thus just printing it would actually just print the data from memory...

They could have changed the "comparison data" to something completely different with > 11.5 i guess..

Anyway, i made a unsigned short data[8] array (supposedly 16 bytes i hope), and it yielded something completely different ofc.

https://github.com/SveSop/nvcuda/commit/9927cc037c5def72fbeb96feb714a5ce9c6619e3

11.5:

0128:trace:nvcuda:Unknown7_func0_relay (11050, 0x62f56bd4, 0x11fc28)
0128:trace:nvcuda:Unknown7_func0_relay Size of table: 16
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue0: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue1: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue2: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue3: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue4: 32
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue5: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue6: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue7: 0
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue0: 26460
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue1: 971
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue2: 6172
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue3: 13121
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue4: 53736
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue5: 8012
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue6: 33706
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue7: 36563
Number of devices: 1

11.7:

0128:trace:nvcuda:Unknown7_func0_relay (11070, 0x62f56bf2, 0x240370)
0128:trace:nvcuda:Unknown7_func0_relay Size of table: 16
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue0: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue1: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue2: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue3: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue4: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue5: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue6: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue7: 0
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue0: 26460
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue1: 971
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue2: 6172
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue3: 13121
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue4: 53736
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue5: 8012
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue6: 33706
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue7: 36563
0128:trace:nvcuda:Unknown7_func0_relay (11071, 0x62f56bf2, 0x240380)
0128:trace:nvcuda:Unknown7_func0_relay Size of table: 16
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue0: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue1: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue2: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue3: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue4: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue5: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue6: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue7: 0
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue0: 26460
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue1: 971
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue2: 6172
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue3: 6209
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue4: 53736
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue5: 8012
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue6: 33706
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue7: 36563
0128:trace:nvcuda:Unknown7_func0_relay (11072, 0x62f56bf2, 0x240390)
0128:trace:nvcuda:Unknown7_func0_relay Size of table: 16
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue0: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue1: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue2: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue3: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue4: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue5: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue6: 0
0128:trace:nvcuda:Unknown7_func0_relay BEFORE - Datavalue7: 0
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue0: 29662
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue1: 33375
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue2: 22854
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue3: 58016
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue4: 18706
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue5: 42217
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue6: 37125
0128:trace:nvcuda:Unknown7_func0_relay AFTER - Datavalue7: 35411
Error:  cudaErrorSoftwareValidityNotEstablished
String: integrity checks failed
Number of devices: 0

Saancreed commented 2 years ago

unsigned short data[8]

I'd just use unsigned char[16] or uint8_t[16].

The call for 11070 appears to return the same result in param2 as the one for 11050, so I assume this one is correct. Likewise, 11071 returns something that the application apparently finds correct, because it continues on, and only the third call returns something different from what the application expects.

Not sure what we can achieve in Wine with debuggers, but you could try placing a conditional breakpoint in Unknown7_func0_relay on cudaVersion % 10 == 2 and then enable WINEDEBUG=all (using Wine's taskmgr.exe, but you'd have to start the process with WINEDEBUG already set to something to see anything in the list of debug channels) and attach strace or ltrace to the process to see what kind of information both sides use to compute this value?
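To compare the logged results byte by byte, as the unsigned char[16] suggestion implies, the 16-bit values from the trace can be expanded back into bytes. This is a sketch with hypothetical helper names; it assumes the values were logged on a little-endian machine (x86-64), so the low byte of each value comes first in memory.

```c
#include <stddef.h>
#include <stdint.h>

#define WORDS 8

/* Expands 8 logged 16-bit values into 16 bytes, low byte first. */
static void words_to_bytes(const uint16_t words[WORDS],
                           uint8_t bytes[2 * WORDS])
{
    size_t i;

    for (i = 0; i < WORDS; i++) {
        bytes[2 * i]     = (uint8_t)(words[i] & 0xFF);
        bytes[2 * i + 1] = (uint8_t)(words[i] >> 8);
    }
}

/* Returns the number of differing bytes between two logged results.
 * If last_diff is not NULL and at least one byte differs, it receives
 * the index of the last differing byte. */
static size_t count_byte_diffs(const uint16_t a[WORDS],
                               const uint16_t b[WORDS], size_t *last_diff)
{
    uint8_t ba[2 * WORDS], bb[2 * WORDS];
    size_t i, n = 0;

    words_to_bytes(a, ba);
    words_to_bytes(b, bb);
    for (i = 0; i < 2 * WORDS; i++) {
        if (ba[i] != bb[i]) {
            n++;
            if (last_diff != NULL)
                *last_diff = i;
        }
    }
    return n;
}
```

Fed with the AFTER values logged for 11070 and 11071, only Datavalue3 differs (13121 = 0x3341 versus 6209 = 0x1841), so under the little-endian assumption the two results differ in a single byte, at index 7.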

SveSop commented 2 years ago

I did not get anything useful out of strace or ltrace, as this did not really output any more data. I dont know how to get the values of a table using that. Can see the calls but.. meh.

Spent lots of time (again) on various tablesizes and still same crap. Well.. maybe someone will be able to figure this out some day 😄

Saancreed commented 2 years ago

Yeah, it's just very unlikely we get any progress here unless NV literally tells us why this fails or someone takes a look at disassembled code which is… not exactly the best thing to do.

SveSop commented 2 years ago

Yeah, i think we are SOL on that one.. Found a few posts around 10 years ago on the NV dev forums where someone was trying to debug some cuda code he wrote using the runtime api, and found this completely undocumented "internal call" thing. No reply from NV that time tho.

So.. some speculation again.

This param2 thingy uses the same address every time, so logically it is some part of memory containing some struct used as a comparison. So when this "internal call" function happens, does it get data from libcuda.so? I mean.. why make an internal call to yourself to verify data in a struct already in your own memory space? Since the returned data seems to be the same (the 16 bytes at least) for 11.6/11.7 as for the others, that may be data from the driver. And even tho one can set the data in the next 16 bytes so it is the same, it still fails and does not behave like 11.3/4/5.

I noticed that 11.4 and 11.5 uses the same address, while 11.3 actually uses an address 96 bytes later... so the address space holding the data is somewhere else in memory for the 11.3 runtime (have not tested earlier) vs. 11.4 /11.5... The displayed values seem to be the same tho. Seems odd that suddenly the address should jump 1181512 bytes with the 11.6/11.7 lib tho, so i wonder what would reside in an address closer to that of 11.5....
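The size of the jump can be checked directly from the param2 values in the traces earlier in this thread (0x11fc28 for the 11.5 sample, 0x240370 for the first 11.6/11.7 call). A tiny sketch, with a hypothetical helper name:

```c
#include <stdint.h>

/* Signed distance in bytes from one logged address to another. */
static int64_t address_delta(uint64_t from, uint64_t to)
{
    return (int64_t)(to - from);
}
```

address_delta(0x11fc28, 0x240370) gives 1181512, and the three consecutive 11.6 pointers are 16 bytes apart each.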

Anyway, it seems rather hopeless to just try to read random addresses to see if there is some logical data there. Wish i had any 1337 skillz in disassembling 😏

SveSop commented 2 years ago

NVIDIA forum post i mentioned nvapi call of this function

I wonder if i could call the cuGetExportTable(const void **table, const CUuuid *id) directly in linux with the uuid {{0xD4, 0x08, 0x20, 0x55, 0xBD, 0xE6, 0x70, 0x4B, 0x8D, 0x34, 0xBA, 0x12, 0x3C, 0x66, 0xE1, 0xF2}} and see what this "table" contains.. and then maybe use this "table[0]" to get some data of sorts.. Hmm.

archlitchi commented 2 years ago

Hi, thanks for this great project! However, I doubt that Unknown7_func0_relay is the source of this problem, because I ran my test in a container with image nvcr.io/cuda:11.6 and host driver version 460.91.03 (i.e. docker run -it --gpus all nvcr.io/cuda:11.6 bash).

Once you have started this container, you will notice there are two libcuda.so in it. One is libcuda.so.460.91, which is mounted by nvidia-docker2; the other is libcuda.so.510, which is native to the image. If you run your example with libcuda.so->libcuda.so.510, it will return error 103 (cudaErrorSoftwareValidityNotEstablished). However, if you modify the env variable LD_LIBRARY_PATH so that libcuda.so->libcuda.so.460, it will return 0 (CUDA_SUCCESS). Regardless of which one is used, Unknown7_func0_relay(11062,xx,xx) is the last function invoked in both cases.

If a cuda11.6-compiled case can run normally on a low driver version (i.e. 460.93), maybe we can cheat the CUDA runtime into seeing us as a low-version libcuda.so in order to bypass the software validity check. To do so, maybe we need to behave the same as a low-version libcuda.so in other internal functions. It's still a hard job though :)

SveSop commented 2 years ago

Not really sure why it would run at all, since afaik 11.6 needs a 510+ driver to be supported. Calling Unknown7_func0 will just return 0 since i was not clever enough to set CUresult ret = CUDA_ERROR_UNKNOWN; to avoid such problems. You could try to compile nvcuda yourself with that instead, and it should return Error: cudaErrorUnknown since the function simply is not there. Like this: https://github.com/SveSop/nvcuda/commit/1d76359dacfac19b845fc8e48ee05709f51ee5c6 The reason this happens is that we create this function regardless of whether it actually exists in the driver. I would think running wine-staging with the default implementation of nvcuda would actually return cudaErrorUnknown instead of cudaErrorSoftwareValidityNotEstablished, since this is the default return when the UUID does not exist.

What were the last 10 lines of the log when you tested this? Could you post the logs from 11.5 and 11.6 as a comparison with this libcuda.so.460 trickery? 😄

The problem as i see it is if we return a code indicating "driver not supported" (dont remember the actual one on the top of my head) it still does not actually support something needed. If a CUDA app uses some functions from 11.6 it will simply not run, even if the demo/test files just fly through (talking live app here, and not arbitrary test). Kinda like running a game complaining about "required driver version XXX" and then just not running - or using software emulation of sorts. Does not solve the real issue.

archlitchi commented 2 years ago

Well, I guess I know how this software validation works. It has nothing to do with Unknown7_func0_relay. What it does is check the content of *table in wine_cuGetExportTable(const void **table, const CUuuid *id). You simply can't change the content of table; if you do, error 103 is returned. It does not check the content of table though, but table resides in the .ro section of libcuda.so, so a segmentation fault occurs if you try to mess with table. It seems a dead end. However, you can add write permission using mprotect(*table&0xfffffffff000,0x1000,PROT_READ|PROT_WRITE), and then change the content of the internal function table inside libcuda.so so that it points to your implementation. At least it worked for me. Let me know if you made it, then I will hide this thread, don't want nvidia to see this though :)
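The mprotect call quoted above hard-codes a 4 KiB page size through its mask. A slightly more general sketch of the same page-alignment step follows; the helper names are hypothetical, and this only shows how the containing page is found and made writable, not what to write into it.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Rounds addr down to the start of its page.
 * page_size must be a power of two. */
static uintptr_t page_base(uintptr_t addr, uintptr_t page_size)
{
    return addr & ~(page_size - 1);
}

/* Makes the page containing addr readable and writable.
 * Returns 0 on success, -1 on failure (errno is set by mprotect). */
static int make_page_writable(const void *addr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uintptr_t base;

    if (page_size <= 0)
        return -1;
    base = page_base((uintptr_t)addr, (uintptr_t)page_size);
    return mprotect((void *)base, (size_t)page_size, PROT_READ | PROT_WRITE);
}
```

A table that straddles a page boundary would need the following page changed as well, and the original protection should be restored afterwards if the surrounding code relies on it.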

SveSop commented 2 years ago

Well, it actually has to do with Unknown7_func0_relay, cos if you mess with the result of this table, it WILL fail with the same error on <11.6 as well. Up until 11.6 it gets 16 bytes of data that it compares to the next 16 bytes of data in memory. If you change anything in the first 16 bytes, or the next 16 bytes, it will throw the same error. However, this does not seem to work for 11.6 for some reason (see last comment below).

Looking at the data returned <11.6, it has 0 in the first 16 bytes, then 16 bytes of data. After the call, the call will return 16 bytes of data that is exactly the same as what resides in the next 16 bytes in memory - result = success. You see this when you run my demo sample compiled with cudaRT 11.5. (Table does not have to be char.. just easier to look at). The 16 bytes returned after doing the call is the same for all 11.x versions i tested, but the memory address of the data changes between 11.3 -> 11.4 (only a few bytes), then drastically between 11.5 and 11.6. And one of my theories could be then that this "huge" change in memory address for this function, probably is not the same for the windows vs. linux version of nvapi.

I would really like to know the actual struct of this function inside libcuda.so as i fear it may be changed somehow vs. <11.6 (hence the huge jump for the memory address), so if you have some sort of info for that, and you are willing to share we can hop onto discord or something if you feel that is better.

SveSop commented 1 year ago

@archlitchi I had actually completely forgotten about your post, but i do have some different theories about this. I am not really sure what you mean by not being able to change the content of *table, as this clearly GETS data after the func0_relay call in all instances - it however seems to compare the returned data with (possibly) a copy of this table containing whatever data it expects. (Might be what you mean.. that this *table is in memory (that cant be changed) and then compared with the data in *table that the return data gets.)

Anyway. The cudart library does seem to do the following: ANY of the cuda functions will run a series of cu calls to "gather data". It does not matter how many cuda* calls you run, it will only do this "check" once. E.g. the "DeviceQuery.exe" sample from the cuda-demo source will run things like cudaGetDeviceCount. Running this call will run a series of calls to nvcuda like cuGetProcAddress, cuDriverGetVersion, cuGetExportTable, cuDeviceGetAttribute and so on and so forth. Then, if i run cudaSetDevice, it will NOT run this "check" again. However, if i make a sample JUST running cudaSetDevice and nothing more, it will still run this check.

So.. It seems as ANY (although ofc not tested) of these calls actually run this "check" and probably save this someplace in memory for later use. If however something - unknown what - fails doing this check, the data will be freed, cos just skipping through a sample and output data from lets say cudaGetDeviceProperties(&deviceProp, dev); and deviceProp.totalGlobalMem will just contain garbage data even tho i can easily see the cuDeviceGetAttribute calls in nvcuda getting the correct data.

I have compared and fiddled with this a lot, and compared the cu* calls values with what i get in Windows and Linux vs nvcuda implementation to no avail.

Running CUDA <11.6 shows, as i posted above, 16 bytes of data in the next 16 bytes of memory compared to this *table, but from 11.6 and on, these 16 bytes are just empty. So i am actually starting to lean towards SOME function failing BEFORE this "compare table" is actually filled, regardless of the func0_relay call. As pointed out before, the 16 bytes returned in the *table from the func0_relay call are the same for all the 11.x and 12.x samples i have tested, so i can't think anything else than that this is SOME sort of data that is correct. The amount of various things i have tested with this table that check out when running 11.5 can't make me believe anything else, so i think it is SOME other thing that fails. The result being: all the gathered info from this "check" is flushed, INCLUDING the data that should have been in the 16 "compare bytes" in the table.

For all i know there could be another "internal" call that is not exported anywhere. It is still slightly a mystery to me other than i understand the gist of it how this func0_relay call works - it just gets a pointer to a memory address and calls something there that resides in memory rather than a call that is exported by the .dll (like nvapi kinda does with the memory addresses to the nvapi functions). This CAN mean it is a NEW one in 11.6+ that should have been called in a similar manner and needs to be found and "relayed" in the same way.

So far.. Best idea i have - not sure how to weed that out tho.

SveSop commented 1 year ago

So, this table that we build for Unknown7 looks kinda like a struct (to me, since i am a novice programmer), i am not sure how this would be much different? I mean, the *table that is being build contains:

size
func0
func1

The relay calls seem ok to me UNLESS some data in the struct needs to be changed, which i must admit i have experimented with extensively to no avail. The reason this mostly "just works (tm)" is due to the usage of void pointers. A void pointer just points to a memory address and can contain whatever struct/data is at that address, and this seems to work as it should IF the data does not need any manipulation.

Now, i have also tested whether this maybe would mean that the cudaRT api might be expecting a memory address kinda "only found in the windows nvcuda.dll" of sorts, but i tossed together a nvcuda proxy dll in visual studio the other day and replicated what the nvcuda implementation does to see what happens in windows, and the failed result was the same. The void pointer (address) in windows for the nvcuda.dll was ofc completely different than when running under wine, and should point to the correct one returned from the relay call to the real nvcuda.dll. So.. that option seems out.

The size value of the table is calculated and compared with what is returned from the cuGetExportTable result, so there are no more than 2 "calls" in the table, or else the sizes mismatch and one gets the error message of either "driver too old" or whatever. So the table itself seems to contain (as posted above) the size value and these two calls. How thorough would this "check" be tho? Actually checking if the table consists of textual correctness like void * (WINAPI *func0) vs some other? E.g. CUresult (WINAPI *AdapterInfo) or something of that nature? Afaik it would not mean anything for the table itself, as the names would probably just be mangled by the compiler anyway, right? (At least i have not been able to find anything useful when decompiling stuff...)
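The size/func0/func1 table and the relay idea can be sketched as below. This is a hedged illustration only: the struct layout and the function signature are assumptions taken from the description and trace lines in this thread, every name is hypothetical, and the WINAPI calling convention that the real relay needs on the application-facing side is left out so the sketch compiles anywhere.

```c
#include <stddef.h>
#include <stdio.h>

/* Signature guessed from the trace output; the real one is undocumented. */
typedef void *(*table_fn)(int cudaVersion, void *param1, void *param2);

/* Assumed layout: a size field followed by two function pointers. */
struct export_table {
    int size;        /* total size of the table in bytes */
    table_fn func0;
    table_fn func1;
};

/* Table obtained from the driver, remembered for the relay. */
static const struct export_table *orig_table;

/* Relay: log the arguments, then forward them untouched. */
static void *func0_relay(int cudaVersion, void *param1, void *param2)
{
    fprintf(stderr, "func0_relay (%d, %p, %p)\n", cudaVersion, param1, param2);
    return orig_table->func0(cudaVersion, param1, param2);
}

/* Builds the replacement table, but only when the original has the
 * expected size. Returns 0 on success, -1 on a size mismatch.
 * func1 is passed through unchanged only to keep the sketch short. */
static int wrap_table(const struct export_table *orig,
                      struct export_table *out)
{
    if (orig->size != (int)sizeof(*orig))
        return -1;
    orig_table = orig;
    out->size = orig->size;
    out->func0 = func0_relay;
    out->func1 = orig->func1;
    return 0;
}

/* Stand-in for a driver function so the sketch runs without libcuda.so:
 * stores cudaVersion into the int that param2 points at. */
static void *fake_driver_func(int cudaVersion, void *param1, void *param2)
{
    (void)param1;
    *(int *)param2 = cudaVersion;
    return NULL;
}
```

Because the relay only forwards void pointers, it never needs to know what the pointed-to data looks like, which is why this approach works as long as nothing in that data has to be translated between the Windows and Linux sides.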

SveSop commented 1 year ago

Doing quite a bit of debugging in windows, i can follow the ppExportTable addresses and get the int for the size, and the two addresses for the two calls. Ofc the two calling addresses contain completely different functions than this generated table (wish i could decompile that, but tossing out $350 something for IDA Pro aint in my budget atm), so i guess this IS the reason for this "check" of sorts. Why this suddenly had to happen with > 11.5 i dunno.. but well...

The part that irks me over anything is why i am completely unable to just relay the original function untouched, like ALL the other cuda functions one does not need to modify in any way. They are just returned, e.g. return pcuGetErrorName(error, pStr);, and do not need any changes to adapt to the linux ELF lib.

Using the nvcuda.dll proxy i made in windows and just relaying the call back with return pcuGetExportTable(ppExportTable, pExportTableId); works fine, and there are no issues running > CUDA 11.6 when doing that under windows.

Doing the SAME for nvcuda under wine will cause an Unhandled page fault on write access to .... Why would this happen under wine when there are no errors in windows i wonder?

SveSop commented 1 year ago

Right, so doing what @archlitchi mention above works fine in Windows (even if it is not really needed ref my last comment above). As long as the "orig_table" is the same when returning from the relay call, it "passes" whatever test needed and works.

Doing the same under wine, it does not.. it fails on the same Unhandled page fault as it does when just returning the original table without touching it...

0124:trace:nvcuda:Unknown7_func0_relay (12020, 0x650700dc, 0x153b70)
wine: Unhandled page fault on write access to 00000000650700DC at address 00007FC7F5F8E237 (thread 0124), starting debugger...

Notice anything strange? I wonder if it somehow tries to write something to the address 00000000650700DC (which aint really an address, so the fault is understandable)... Where does this "address" appear? Oh.. as "param1" in the relay call Unknown7_func0_relay (12020, 0x650700dc, 0x153b70)

Changing this parameter, or doing whatever to it does not do anything useful it seems, as the place this is apparently located at 00007FC7F5F8E237 is fairly regularly offset from the original func0 address by 647 bytes. (The actual address changes each run ofc, but the difference between func0 and this address seems to be around there)

So.. Why would the Linux library fail on "writing" something here when the windows version does not... It could be that this value is supposed to change to something needed - some sort of payload data to be returned i suppose. Why that would not be needed in windows is unknown to me, as is what it should contain.

It could be a wine issue i guess... Not really sure how to figure that out tho.

PS. Yes, i did try to unprotect memory around 00007FC7F5F8E237 as i am not entirely sure if the page fault is indicating that it tries to change data RESIDING at that address (ie. change the param VALUE), or if it is trying to access memory at the address that "param" SHOULD have pointed at?

SveSop commented 1 year ago

wine: Unhandled page fault on write access to 00000000650700DC at address 00007FC7F5F8E237 (thread 0124), starting debugger...

This seems to actually mean that there is a function or something at address 00007FC7F5F8E237 that tries to write to address 00000000650700DC. Running info map in winedbg shows this tidbit: 00000000009d0000 000000007fddffff free. So... that particular memory address should be "free". I guess this still does not make it "ok to write", since i assume "free" means it is not allocated, and that is the reason?

Anyway, making this address region PROT_READ|PROT_WRITE makes it continue again, but fails with another page fault on write access... so to just go overboard i did the same with quite some memory pages, and it ran without crashing, only to fail on the same error: Error: cudaErrorSoftwareValidityNotEstablished

So i kinda end up at the same spot again, which makes me believe that it is something else at play here. Maybe i am supposed to allocate memory for "param1" BEFORE returning the table to libcuda somehow? (Once again - did not have to do that in windows). I don't really have a clue how Wine does this addressing and loading when mixing ELF/PE in this way - so it is not entirely impossible it is related to some crud there that makes it impossible to solve like this in the first place.

PS. Yeah, all this posting and text is probably for my own notes, but i would not be mad if someone had more ideas here 😏

SveSop commented 9 months ago

Closing this as fixed with https://github.com/SveSop/nvcuda/commit/b1be06d200176aff838032707a9e1d350b4840f5 (although it could use more testing if anyone are interested)

en4bz commented 7 months ago

Table 7 Function 0 has been reverse engineered here: https://github.com/vosen/ZLUDA/blob/master/zluda_dark_api/src/lib.rs#L806

SveSop commented 7 months ago

Well, since i barely know minimal C code, rust is kinda like reading chinese for me i am afraid. You are ofc very welcome to provide some concept code in C that could be implemented vs. the hack that i am using now.

I will look at it tho, but at first glance, i have no clue 😢

913887524gsd commented 5 months ago

Well, since i barely know minimal C code, rust is kinda like reading chinese for me i am afraid. You are ofc very welcome to provide some concept code in C that could be implemented vs. the hack that i am using now.

I will look at it tho, but at first glance, i have no clue 😢

Hey, I have been doing some virtualization of the CUDA API recently and ran into the encryption problem from this issue. I have now translated the encryption part of ZLUDA into C code. It is not very complex, and the code is here: https://gist.github.com/913887524gsd/c3479d5b2b235edb1b17f71f3a7fe4f0. I'm not sure whether it works properly everywhere, but for now it is correct on Linux; it may not be portable to Windows DLLs. If you have time, you can take a look at my code. (*^▽^*)

SveSop commented 5 months ago

Very interesting indeed.. So, basically my: static void* WINAPI Unknown7_func0_relay(unsigned int cudaVersion, void *param1, void *param2)

is: CUresult encrypt(unsigned int runtimeVersion, time_t timestamp, __uint128_t *res) returning values from this and __encrypt function.

Way beyond my knowledge about most things, but certainly interesting.
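If the second argument really is a `time_t` passed by value, then the `0x650700dc` seen as "param1" in the earlier trace would be a Unix timestamp rather than a pointer, which would fit the page fault when something tried to write through it. This is only an interpretation, and the helper below (`format_param1`, a made-up name) is just a small way to check it by printing such a value as a UTC date.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Formats a value seen as "param1" in the trace as a UTC date string.
   Assumption: param1 is a Unix timestamp passed by value, not a pointer. */
static int format_param1(unsigned long long param1, char *buf, size_t len)
{
    time_t t = (time_t)param1;
    struct tm *utc = gmtime(&t);
    if (utc == NULL)
        return -1;
    return strftime(buf, len, "%Y-%m-%d %H:%M:%S", utc) > 0 ? 0 : -1;
}
```

Fed the value `0x650700dc`, this yields a date in September 2023, which looks much more like a timestamp than like a valid address.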

            .driverVersion = /* get driver version */,
            .runtimeVersion = runtimeVersion,
            .processID = (unsigned int)getpid(),
            .threadID = (unsigned int)pthread_self(),
            .exportTable1 = /* get export table address using `exportTable1UID` */,
            .exportTable2 = /* get export table address using `exportTable2UID` */,
            .funcPtr = /* the address of encrypt function, we can get it using exportTable2 */,

I suppose one can get the driverVersion from doing a call to cuDriverGetVersion maybe? Not sure about this .funcPtr address tho? Isn't this function "the encryption function"? 😄

At least something to look into for me. Feel free to explore this more. The CUDA samples from 11.6 and up all use this type of thing for Linux as well, but you will probably need to build one of the samples for yourself to test this. (Any sample that uses the cuda runtime WILL do this before doing anything else)

Uncertain how this works in practice, but the "skip-over-the-whole-crap" offset i used for this i got from debugging windows code in Visual Studio, and it worked the same for libcuda.so as it did for nvcuda.dll, so i think the NVIDIA codebase is not horribly far off internally. But.. well.. There could be differences behind the scenes in how things could be done ofc, but i would guess if this IS an actual "internal encryption scheme", and not some 3-rd party encryption library that they use, it could work in a similar manner.

913887524gsd commented 5 months ago

I suppose one can get the driverVersion from doing a call to cuDriverGetVersion maybe? Not sure about this .funcPtr address tho? Isn't this function "the encryption function"? 😄

You can get driverVersion by calling cuDriverGetVersion function as you guessed.
funcPtr is the function address you wish to expose in the export table. For example, if you expose Unknown7_func0_relay in the export table and then have Unknown7_func0_relay call encrypt, the value of funcPtr must be the address of Unknown7_func0_relay.

Uncertain how this works in practice, but the "skip-over-the-whole-crap" offset i used for this i got from debugging windows code in Visual Studio, and it worked the same for libcuda.so as it did for nvcuda.dll, so i think the NVIDIA codebase is not horribly far off internally. But.. well.. There could be differences behind the scenes in how things could be done ofc, but i would guess if this IS an actual "internal encryption scheme", and not some 3-rd party encryption library that they use, it could work in a similar manner.

This method of skipping the checking part of the runtime library is clever, at least I hadn't thought of it before. I checked the assembly code in libcuda.so using decompiling tools and found that this part of code is implemented internally rather than calling 3-rd party API. The execution flow is same as what ZLUDA implements.

SveSop commented 5 months ago

This method of skipping the checking part of the runtime library is clever, at least I hadn't thought of it before. I checked the assembly code in libcuda.so using decompiling tools and found that this part of code is implemented internally rather than calling 3-rd party API. The execution flow is same as what ZLUDA implements.

Oh... i came up with that a late afternoon... after only fiddling with that crap for many many months 😏

Hehe.. anyway, started to look into this and see if i can implement this method :) Would be awesome to have it "done right" than do it like a hack, so lets see how that will work. (Oh.. ChatGPT is actually nice for such things.. improving stuff and translating some cpp -> wine C code and whatnot)

Atleast it gave me something to fiddle with over trying to fix the two other issues i am working on.... Implementing CUFFT in cuda 12.3+ is one, and that seems weird aswell, but once again, it uses a couple more of those "unknown" internal thingys...

SveSop commented 5 months ago

@913887524gsd I am not really sure it works correctly.. could possibly be problems when running under wine anyway, but i am a bit mystified by this structure thing. I have renamed and fiddled a bit with it, so dont mind the struct-names, but you will recognize them. I have tried to replicate the 3 structs like this in my header:

struct EncryptInput1_st {
    int *driverVersion;
    unsigned int cudaVersion;
    unsigned int processID;
    unsigned int threadID;
    const void *exportTable1;
    const void *exportTable2;
    void *funcPtr;
    time_t timestamp;
};
typedef struct EncryptInput1_st EncryptInput1;

struct EncryptInput2_st {
    CUuuid uuid;
    int pciDomain;
    int pciBus;
    int pciDevice;
};
typedef struct EncryptInput2_st EncryptInput2;

struct EncryptInput_st {
    EncryptInput1 part1;
    CUresult (*get_count)(int *);
    CUresult (*get_part2)(int, EncryptInput2 *);
};
typedef struct EncryptInput_st EncryptInput;

Now accessing "EncryptInput1_st and EncryptInput2_st" by themselves is easy enough if done separately, but it is afaik supposed to be the return-struct res in the end right? So.. running functions get_count and get_part2 like this works just fine, but lets say i declare something like EncryptInput input, i can fill in and access stuff with input->part1.driverVersion just fine... But would it not be that i am supposed to have an "array" of various gpu data from EncryptInput2_st "stored" there as well?

As it is now, input->get_count will be a pointer to the FUNCTION actually getting the number of gpu's, and not the number of gpu's? Is that correct?

And input->get_part2 will only be a pointer.. I somehow kind of thought that i would have something in the lines of input->get_part2[0].pciBus and such info's stored there as an array of information in the case of multiple gpu's as this function seems to set up kind of?

static inline void encrypt_part4(unsigned char res[], EncryptInput *input)
{
    int count;
    LOGGER_ASSERT(input->get_count(&count) == CUDA_SUCCESS);
    for (int i = 0 ; i < count ; i++) {
        EncryptPart4Input part4;
        LOGGER_ASSERT(input->get_part4(i, &part4) == CUDA_SUCCESS);
        for (int j = 0 ; j < 28 ; j++)
            encrypt_hash_round1(res, ((unsigned char *)&part4)[j]);
    }
}

This works.. but it does not actually input any data other than the pointer address to "get_part2" in that struct? Is that intended? I may have misunderstood the concept and tried to kind of unwind this to actually do this, but still i get the error from the executable:

cudaGetDeviceCount returned 103
-> integrity checks failed
Result = FAIL

But as i said, that could have other reasons.. Just want to make sure i "get it"... Because it does not seem very logical to just build the struct containing function addresses to the "fake" nvcuda implementations for GETTING number of gpu's like this.

SveSop commented 5 months ago

Since i am not very good at explaining, this declaration will kinda of look like this:

    EncryptInput input = {
        .part1 = {
            .driverVersion = int *,
            .cudaVersion = unsigned int,
            .processID = unsigned int,
            .threadID = unsigned int,
            .exportTable1 = void *,
            .exportTable2 = void *,
            .funcPtr = void *,
            .timestamp = time_t,
        },
        .get_count = void *,
        .get_part2 = void *,
    };

Is that the intention? Vs something like this:

    EncryptInput input = {
        .part1 = {
            .driverVersion = int *,
            .cudaVersion = unsigned int,
            .processID = unsigned int,
            .threadID = unsigned int,
            .exportTable1 = void *,
            .exportTable2 = void *,
            .funcPtr = void *,
            .timestamp = time_t,
        },
        .get_count = int *,
        .get_part2 = gpuData[],
    };

PS. Yes i am aware that int * would be an address, and gpuData[] would as well, but not pointing to what it is supposed to point to in the first example...

913887524gsd commented 5 months ago

struct EncryptInput1_st {
    int *driverVersion;
    unsigned int cudaVersion;
    unsigned int processID;
    unsigned int threadID;
    const void *exportTable1;
    const void *exportTable2;
    void *funcPtr;
    time_t timestamp;
};
typedef struct EncryptInput1_st EncryptInput1;

This structure is wrong. In your example, EncryptInput1 and EncryptInput2 take part in the hash processing, so their memory layout must be the same as the one I declared, or it will disturb the hash. The type of driverVersion must be unsigned int, nothing else.

You must check the sizes of EncryptInput1 and EncryptInput2; they should be 48 and 28 bytes on a 64-bit architecture.
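A minimal sketch of such a size check in C11, assuming a 64-bit target with 8-byte pointers and an 8-byte `time_t`. `FakeUuid` is a 16-byte stand-in for `CUuuid` from `cuda.h`, and the struct names `Part1Good`, `Part1Bad` and `Part2` are made up for illustration. It also shows why declaring `driverVersion` as `int *` shifts the layout: the struct grows from 48 to 56 bytes.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <time.h>

/* 16-byte stand-in for CUuuid, so the sketch builds without cuda.h. */
typedef struct { char bytes[16]; } FakeUuid;

struct Part1Good {
    unsigned int driverVersion;
    unsigned int cudaVersion;
    unsigned int processID;
    unsigned int threadID;
    const void *exportTable1;
    const void *exportTable2;
    void *funcPtr;
    time_t timestamp;
};

struct Part1Bad {               /* same fields, but driverVersion is int * */
    int *driverVersion;
    unsigned int cudaVersion;
    unsigned int processID;
    unsigned int threadID;
    const void *exportTable1;
    const void *exportTable2;
    void *funcPtr;
    time_t timestamp;
};

struct Part2 {
    FakeUuid uuid;
    int pciDomain;
    int pciBus;
    int pciDevice;
};

#if defined(__LP64__) || defined(_WIN64)
static_assert(sizeof(struct Part1Good) == 48, "part 1 must be 48 bytes");
static_assert(sizeof(struct Part2) == 28, "part 2 must be 28 bytes");
static_assert(offsetof(struct Part1Good, exportTable1) == 16, "pointers start at 16");
static_assert(sizeof(struct Part1Bad) == 56, "int * adds 8 bytes incl. padding");
#endif
```

Because the checks run at compile time, a wrong field type fails the build instead of silently producing a different hash.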

struct EncryptInput_st {
    EncryptInput1 part1;
    CUresult (*get_count)(int *);
    CUresult (*get_part2)(int, EncryptInput2 *);
};
typedef struct EncryptInput_st EncryptInput;

In this structure I use a technique called a function pointer; it is like a virtual function in C++. You can implement it differently, but I think using a function pointer rather than an array pointer is better, because a function can do a lot of other work.^_^
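The function-pointer idea works in plain C too. Below is a small sketch with a fake two-device backend; all names (`DeviceInfo`, `DeviceSource`, `fake_get_count`, `fake_get_info`, `pci_bus_of`) and the PCI values are invented for illustration, and the real code would call the CUDA driver functions instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-device record, standing in for the part-2 struct. */
typedef struct { int pciDomain, pciBus, pciDevice; } DeviceInfo;

typedef int (*get_count_fn)(int *count);
typedef int (*get_info_fn)(int ordinal, DeviceInfo *out);

/* The struct stores HOW to fetch the data, not the data itself. */
typedef struct {
    get_count_fn get_count;
    get_info_fn get_info;
} DeviceSource;

/* Fake backend with two devices (made-up values). */
static const DeviceInfo fake_devices[2] = { {0, 4, 0}, {0, 9, 0} };

static int fake_get_count(int *count)
{
    *count = 2;
    return 0;
}

static int fake_get_info(int ordinal, DeviceInfo *out)
{
    if (ordinal < 0 || ordinal >= 2)
        return -1;
    *out = fake_devices[ordinal];
    return 0;
}

/* Returns the PCI bus of device `ordinal`, or -1 on error. */
static int pci_bus_of(const DeviceSource *src, int ordinal)
{
    DeviceInfo info;
    int count = 0;
    if (src->get_count(&count) != 0 || ordinal >= count)
        return -1;
    if (src->get_info(ordinal, &info) != 0)
        return -1;
    return info.pciBus;
}
```

With `DeviceSource s = { fake_get_count, fake_get_info };`, the call `pci_bus_of(&s, 1)` fetches the record for the second adapter on demand. The struct never holds an array of device data; the callee fills a temporary record each time it is asked.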

SveSop commented 5 months ago

I am not entirely sure it works the same when using C does it?

And how would you be able to lets say output the pciBus from the 2nd adapter using the EncryptInput struct directly?

Oh.. so the struct uses unsigned int rather than int * as the cuDriverGetVersion function uses then?

913887524gsd commented 5 months ago

Emmm... maybe there is a confusion between us.

The encryption part needs two essential structures: EncryptInput1 and EncryptInput2 in your code. These two structs take part in the hash process, so their memory layout must be the same as mine.

Your problem occurs in the memory layout of EncryptInput1.

struct EncryptInput1_st {
    unsigned int driverVersion; // 4, this type is unsigned int, use cuDriverGetVersion to get it
    unsigned int cudaVersion;   // 8
    unsigned int processID;     // 12, I don't know how the cuda runtime gets the pid on windows
    unsigned int threadID;      // 16, I also don't know how the cuda runtime gets this yet, maybe you should disassemble the driver dll
    const void *exportTable1;   // 24
    const void *exportTable2;   // 32
    void *funcPtr;              // 40
    time_t timestamp;           // 48
};
typedef struct EncryptInput1_st EncryptInput1;

I don't know how nvcuda.dll handles these unix objects like processID, threadID and timestamp. You'd better check them using decompile tools (also, it's such dirty work... export id -> export table -> dark function...)

The structure of EncryptInput is free, you can implement it as you wish. If you don't trust function pointers (I know they are tricky for many programmers), you can replace them with an array pointer. If you are curious about whether it will work, you can decompile and read the assembly code.

SveSop commented 5 months ago

Well.. i see that unsigned int driverVersion was used in your code, and have updated that. Since it is C code, i need to make a macro to do the "static_assert" thingy, since that does not fly out-of-the-box.. Anyway, in 64-bit code the size of the EncryptInput1_st is 48, and the EncryptInput2_st is 28.. so those two structs seem to be "the same". Since i am using "winegcc" and not "pure GNU gcc" in theory, it should be somewhat compiled windows'ish most of the time when it comes to memory layout and whatnot (in theory...)... And i use WINAPI calls and whatnot.

The thing i was wondering was really the "complete" struct.. but i am a slow learner, so bear with me 😄

Let me try to explain what i understand so far of these nvcuda "hidden" functions. Call cuGetExportTable(const void **table, const CUuuid *id); The return is a memory address to a table.

This table is then "faked" using the nvapi implementation. In the case of this encryption scheme table thing, this is a table that consists of this:

struct Encryption_table
{
    int size;
    CUresult (WINAPI *encrypt1)(unsigned int cudaVersion, time_t timestamp, __uint128_t *res);
    CUresult (WINAPI *encrypt2)(void *param0);
};

The size is the size of THIS table, and the two CUresult calls are 2 functions that i make... Using your encrypt.cpp example, CUresult (WINAPI *encrypt1) would be: CUresult encrypt(unsigned int runtimeVersion, time_t timestamp, __uint128_t *res)

Next thing that happens after cuGetExportTable returns to the running application, is that cudaruntime will call CUresult encrypt(unsigned int runtimeVersion, time_t timestamp, __uint128_t *res) , and here it is we have to do "the magic".

So.. the runtimeVersion and timestamp is given, and there is a address *res to the cudaruntime table. This needs to be filled in and encrypted and all that. THIS table as far as i get it, is what eventually gets returned after all this encryption thing happens by using this line in your source: https://gist.github.com/913887524gsd/c3479d5b2b235edb1b17f71f3a7fe4f0#file-encrypt-cpp-L172

This means that this new and encrypted table is verified inside the cudaruntime against some sort of hash-whatever, and fail if it is not the same.. So far i am sure we agree.

What i do NOT understand completely is the usage of the EncryptInput2_st table (EncryptPart4Input in your example). Where is the address to THAT particular table in the "main" struct?

struct EncryptInput {
    EncryptPart3Input part3;
    CUresult (*get_count)(int *);
    CUresult (*get_part4)(int, EncryptPart4Input *);
};

The address of part3 is CLEAR.. it is to "part3"... But where is the "part4"?

When using C, the 3 items that gets placed in this table will be like this:

struct EncryptInput {
    EncryptPart3Input part3; // Address to the EncryptPart3Input table
    CUresult (*get_count)(int *); // Address to the get_count function
    CUresult (*get_part4)(int, EncryptPart4Input *); // Address to the get_part4 function
};

cudaruntime are never going to CALL "my" get_count function by using this afaik is it? All the calling to the FUNCTIONS get_count and get_part4 is done using __encrypt +++ functions that are called depending on what is in cudaVersion value.. right?

If in C++ using this sort of method results in THIS:

struct EncryptInput {
    EncryptPart3Input part3; // Address to the EncryptPart3Input table
    CUresult (*get_count)(int *); // Address to the *count VALUE
    CUresult (*get_part4)(int, EncryptPart4Input *); // Address to the EncryptPart4Input table
};

THEN i would agree it would probably work fine 😄

PS. I am horrible at explaining myself in stuff i barely understand.. hehe.. sorry

913887524gsd commented 5 months ago

I guess you have missed two functions in my source:

static CUresult default_get_count(int *count)
{
    return cuDeviceGetCount(count);
}

static CUresult default_get_part4(int ordinal, EncryptPart4Input *input)
{
    int device;
    cuDeviceGet(&device, ordinal);
    cuDeviceGetUuid(&input->uuid, device);
    cuDeviceGetAttribute(&input->pciBus, CU_DEVICE_ATTRIBUTE_PCI_BUS_ID, device);
    cuDeviceGetAttribute(&input->pciDomain, CU_DEVICE_ATTRIBUTE_PCI_DOMAIN_ID, device);
    cuDeviceGetAttribute(&input->pciDevice, CU_DEVICE_ATTRIBUTE_PCI_DEVICE_ID, device);
    return CUDA_SUCCESS;
}

CUresult encrypt(unsigned int runtimeVersion, time_t timestamp, __uint128_t *res)
{
    EncryptInput input = {
        .part3 = {
            .driverVersion = /* get driver version */,
            .runtimeVersion = runtimeVersion,
            .processID = (unsigned int)getpid(),
            .threadID = (unsigned int)pthread_self(),
            .exportTable1 = /* get export table address using `exportTable1UID` */,
            .exportTable2 = /* get export table address using `exportTable2UID` */,
            .funcPtr = /* the address of encrypt function, we can get it using exportTable2 */,
            .timestamp = timestamp,
        },
        .get_count = default_get_count,
        .get_part4 = default_get_part4, 
    };
    return __encrypt(&input, res);
}

In my source, encrypt_part4 calls get_count, and you can see that get_count points to default_get_count, which in turn calls cuDeviceGetCount to get the number of devices. get_part4 works the same way.

SveSop commented 5 months ago

I have more than this, but i did not post all the code 😄 I have this as well:

CUresult get_count(int *count)
{
    return wine_cuDeviceGetCount(count);
}

CUresult get_part2(int ordinal, EncryptInput2 *part2)
{
    // NEEDS CHECKS!
    int device;
    wine_cuDeviceGet(&device, ordinal);
    wine_cuDeviceGetUuid(&part2->uuid, device);
    wine_cuDeviceGetAttribute(&part2->pciBus, CU_DEVICE_ATTRIBUTE_PCI_BUS_ID, device);
    wine_cuDeviceGetAttribute(&part2->pciDomain, CU_DEVICE_ATTRIBUTE_PCI_DOMAIN_ID, device);
    wine_cuDeviceGetAttribute(&part2->pciDevice, CU_DEVICE_ATTRIBUTE_PCI_DEVICE_ID, device);
    return CUDA_SUCCESS;
}

And i have all the part1-6 and whatnot, but the question was more or less how i am to interpret the function table that gets encrypted.. But i may be looking at this all wrong.. Maybe the only thing that gets returned in the end is the encryption of

struct EncryptPart4Input {
    CUuuid uuid;
    int pciDomain;
    int pciBus;
    int pciDevice;
};

And not the "whole" struct of

struct EncryptInput {
    EncryptPart3Input part3;
    CUresult (*get_count)(int *);
    CUresult (*get_part4)(int, EncryptPart4Input *);
};

Hmm..

913887524gsd commented 5 months ago

I have no idea why you get the wrong hash...

Maybe you can check it with the ZLUDA test: https://github.com/vosen/ZLUDA/blob/master/zluda_dark_api/src/lib.rs#L1150 ?

cudart_export_table is exportTable1
anti_zluda_export_table is exportTable2

The rest is almost the same as what I declared

SveSop commented 5 months ago

static CUresult __encrypt(EncryptInput *input, __uint128_t *res) // Table + *res (res is "untouched")
{
    if (input->part3.runtimeVersion % 10 < 2) { // Check what iteration of the runtime call.. eg. 12040-12041
        static unsigned char code[16] = {
            0x5c, 0x67, 0xcb, 0x03, 0x1c, 0x18, 0x41, 0x33,
            0xe8, 0xd1, 0x4c, 0x1f, 0xaa, 0x83, 0xd3, 0x8e
        };
        *res = *(__int128_t *)code; // Just modify the *res - EncryptInput struct = not used
        if (input->part3.runtimeVersion % 10 == 1) // If reiteration == 12041
            ((unsigned char *)res)[7] = 24; // Modify *res directly. EncryptInput struct = not used
    } else {
        unsigned char result[66] = {}; // Create empty char array
        unsigned char aux[16] = {}; // create empty char array
        encrypt_part1(aux);
        encrypt_part2(result, aux); // Modify result array with the help of aux array
        encrypt_part3(result, input); // use ONLY "part3" of the EncryptInput struct ie. driverVersion, processID+++
        encrypt_part4(result, input); // Only use "part4" struct... part3 struct is only used for calls ie. get_count and get_part4
        encrypt_part5(result); // Only modify array
        encrypt_part6(result); // Only modify array
        encrypt_part7(result, aux); // Only modify array
        encrypt_part5(result); // same
        encrypt_part6(result); // same
        *res = *(__uint128_t *)result; // return Array cast to -> __uint128_t *res
    }
    return CUDA_SUCCESS; // Actually return to cuda runtime library
}

So.. i think i have been going about this the wrong way. It only uses the "part3" of the struct once it seems, to encrypt some crap and add it to the char array.

This way, in a sense the whole:

struct EncryptInput {
    EncryptPart3Input part3;
    CUresult (*get_count)(int *);
    CUresult (*get_part4)(int, EncryptPart4Input *);
};

Is not really needed and only used for "convenience" it seems? Getting closer? 😄

913887524gsd commented 5 months ago

Oh, yes, only EncryptInput1 and EncryptInput2 in your code are strict. EncryptInput is free, it won't participate in hash process. o(￣︶￣)o

SveSop commented 5 months ago

Right.. uploaded the encryption branch : https://github.com/SveSop/nvcuda/tree/encryption

I have a sneaky suspicion that there are possibly a couple of wild-cards in these values:

What is known to probably be "right":
.driverVersion = 12040 // This is so far true.. although it is used as unsigned int rather than the int that the call uses
.cudaVersion = xxx // This comes from cudaruntime, so must* be correct
.timestamp = xxx // This also comes from cudaruntime, so i cant see this being "wrong"

What could be different:
.processID = xx // Needs to be investigated.. Dunno if processID is something different WINE sets to a process vs. getpid() that gets a "system process id".
.threadID = xxx // Same here.. maybe some WINE call to get this?
.exportTable1 = // This could be an issue.. is it the "fake" exportTable1, or the "real" one?
.exportTable2 = // Same.. is it supposed to be the "fake" one? Or the "real" one?
.funcPtr = // This one? Is it the REAL function from the original, or is the function pointer to the fake one (in my case, the one i am currently IN)

I suspect one or more of these are not "correct" and that is what fails the encryption... and I suppose it will throw off the whole thing if one of these is off by anything at all
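For the process and thread IDs, the two candidate sources can be put side by side. This is only a sketch: `GetCurrentProcessId` and `GetCurrentThreadId` are the ordinary Win32 calls and `getpid` / `pthread_self` the POSIX ones, but which pair the CUDA runtime actually expects to see in the hash under Wine is an open question in this thread. The wrapper names `current_process_id` and `current_thread_id` are made up.

```c
#include <assert.h>
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <pthread.h>
#include <unistd.h>
#endif

/* Process ID as seen by the platform the code is built for. */
static unsigned int current_process_id(void)
{
#ifdef _WIN32
    return (unsigned int)GetCurrentProcessId();
#else
    return (unsigned int)getpid();
#endif
}

/* Thread ID as seen by the platform the code is built for.
   On POSIX, pthread_t is opaque; truncating it mirrors the cast in the gist. */
static unsigned int current_thread_id(void)
{
#ifdef _WIN32
    return (unsigned int)GetCurrentThreadId();
#else
    return (unsigned int)(uintptr_t)pthread_self();
#endif
}
```

Logging both variants next to the values seen in a native Windows run would be one way to find out which of them the runtime hashes.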

SveSop commented 5 months ago

I was able to manipulate this "exportTable2" address by changing the stack when i was doing some debugging on windows, and if i changed it BACK to the "original" address, just before i returned, it went through. This was when i was fiddling with this hack to return "after" the check that i ended up doing... So i suspect that possibly "exportTable2" should point to the "fake" table that i use to intercept this call.

I will see if i manage to implement this thing in my windows relay library, and perhaps see if i can get it working there.. so its easier to compare what values i would have here and there...

SveSop commented 5 months ago

I did not get it to work with windows and my relay library either, so i must be doing something wrong i guess 😄

Since visual studio does not seem to have support for __uint128_t and __int128_t, how did you solve that? I used:

typedef struct uint128 {
    unsigned long long low;
    unsigned long long high;
} uint128_t;
typedef struct int128{
    long long low;
    long long high;
} int128_t;

And used uint128_t and int128_t , but that does not mean it is right...

PS. Looking at the memory addresses of the *res return, it DOES look "correct", but ofc, it will change every time due to the encryption... but it looks and behaves in a similar manner as the native nvapi.dll does (16 bytes of "data" x 3).

Also was not able to just do https://gist.github.com/913887524gsd/c3479d5b2b235edb1b17f71f3a7fe4f0#file-encrypt-cpp-L157 directly as that gave some conversion error using visual studio.. so ended up with:

            uint128_t value;
            std::memcpy(&value, code, sizeof(uint128_t));
            *res = value;

And that might not be correct either 😏
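A plain-C sketch of the same idea, assuming a little-endian machine and a two-field struct with no padding. The names `u128_parts`, `fixed_code` and `load_u128` are made up; the 16 bytes are the fixed code from the `__encrypt` function quoted earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for __uint128_t on compilers without a 128-bit integer type. */
typedef struct {
    uint64_t low;   /* bytes 0..7 on a little-endian machine */
    uint64_t high;  /* bytes 8..15 */
} u128_parts;

/* The fixed 16-byte code from the thread. */
static const unsigned char fixed_code[16] = {
    0x5c, 0x67, 0xcb, 0x03, 0x1c, 0x18, 0x41, 0x33,
    0xe8, 0xd1, 0x4c, 0x1f, 0xaa, 0x83, 0xd3, 0x8e
};

/* Copies 16 raw bytes into the struct without any pointer casts. */
static u128_parts load_u128(const unsigned char bytes[16])
{
    u128_parts v;
    memcpy(&v, bytes, sizeof v);
    return v;
}
```

Since the caller only ever looks at the 16 bytes behind `res`, a byte-for-byte copy like this should be equivalent to the cast in the gist, as long as the struct is exactly 16 bytes.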

913887524gsd commented 5 months ago

EncryptInput1 input1 = {
        .driverVersion = (unsigned int)version,
        .cudaVersion = cudaVersion,
        .processID = (unsigned int)getpid(),
        .threadID = (unsigned int)pthread_self(),
        .exportTable1 = (void *)&Unknown1_orig,
        .exportTable2 = (void *)&Encryption_orig,
        .funcPtr = Encryption_orig->encrypt1,
        .timestamp = timestamp,
    };

The values of exportTable1 and exportTable2 should be the fake ones.
funcPtr should be Encryption_encrypt1 in your source, because this is the interface you exposed in the export table Encryption_Impl.

Since visual studio does not seem to have support for __uint128_t and __int128_t, how did you solve that?

I don't know how Windows implements a 128-bit type... All current architectures are little-endian, so memcpy should also be ok I guess.

SveSop commented 5 months ago

In this part : https://gist.github.com/913887524gsd/c3479d5b2b235edb1b17f71f3a7fe4f0#file-encrypt-cpp-L160-L172 you do it a bit different than i understand it to be in the ZLUDA code here? https://github.com/vosen/ZLUDA/blob/master/zluda_dark_api/src/lib.rs#L874-L888 In the ZLUDA source it seems to be 2 different "hash rounds" being run, that does things slightly differently..

If you are interested, i can invite you to my repo where i have the non-working implementation of the nvcuda.dll relay lib i have, so you can see how i have tried to tie it together perhaps?

913887524gsd commented 5 months ago

In this part : https://gist.github.com/913887524gsd/c3479d5b2b235edb1b17f71f3a7fe4f0#file-encrypt-cpp-L160-L172 you do it a bit different than i understand it to be in the ZLUDA code here? https://github.com/vosen/ZLUDA/blob/master/zluda_dark_api/src/lib.rs#L874-L888 In the ZLUDA source it seems to be 2 different "hash rounds" being run, that does things slightly differently..

The version implemented by ZLUDA contains duplicated code; I replaced that part with an equivalent implementation. I can guarantee that my hash process is correct, you just need to dig into the input parameters.

SveSop commented 5 months ago

It seems the comment i made last night was not sent for some reason, so i try again 😄

Right.. so i ran the "test" that seems to be here: https://github.com/vosen/ZLUDA/blob/master/zluda_dark_api/src/lib.rs#L1150-L1174

I did it "manually" by just filling in the data, and let it run the encryption. I did this in my main function:

EncryptInput1 input1 = {
    12020,
    11082,
    0x0000000000004D08,
    0x0000000000002B78,
    (void*)0x00007FF8C80717F0,
    (void*)0x00007FF8C825E4B0,
    (void*)0x00007FF8C7DD0AD0,
    0x0000000064A365EE,
};
return encrypt(&input1, res);

And in the "get_part2" function, i did:

part2->uuid = { 0x67, 0x22, 0xCB, 0xCF, 0xC6, 0x61, 0xF2, 0x92, 0x74, 0xD6, 0xED, 0x23, 0x2A, 0x32, 0x13, 0x1C };
part2->pciBus = 4;
part2->pciDevice = 0;
part2->pciDomain = 0;
return CUDA_SUCCESS;

I think what i have input is the same values as the RUST "test" code, and i compared the output after the encryption to mine, and it was NOT the same. The RUST code compares this to: uint128_t 0xEAF1313342BFCD84A7C34628F214707A

Which for me i think when using the uint128_t implementation i mentioned above SHOULD have been: res.low = 0xA7C34628F214707A; res.high = 0xEAF1313342BFCD84;

But that is nowhere near what i get, and the even funnier part is that it changes each time i run.. so i have for sure missed something vital 😢 Using "static" input values like this should have produced the same uint128_t each time i believe.... I will go over the functions in my code again, because there must be something i either have missed, or the small changes i made (due to compilation error) maybe do something it should not...

913887524gsd commented 5 months ago

Please push up your test code? Maybe I can help find the problem inside. 😄

SveSop commented 5 months ago

I pushed some debugging text with comments here: https://github.com/SveSop/nvcuda/commit/a1e38e4ffe9fa233e1e8e41129f30b450a6e71ad

It seems there is something weird happening when it runs the encrypt_hash_round1 function.. I did this:

    TRACE("Part 3 Before: ");
    for (int i = 0; i < 66; i++) TRACE("%02X", res[i]);
    TRACE("\n");
    // This value is the same every run

    for (int i = 0 ; i < 48 ; i++)
        encrypt_hash_round1(res, ((unsigned char *)&input1)[i]);

    TRACE("Part 3 After: ");
    for (int i = 0;i < 66; i++) TRACE("%02X", res[i]);
    TRACE("\n");
    // This value changes each run after it is done with encrypt_hash_round1

The FIRST printout is the same every time, but the 2nd printout (after running the encrypt_hash_round1 function) changes every time i run it. The "hash output" of BOTH these prints should be the same on every run. Just so you do not misunderstand me, here is an example output:

Executable run1:

0124:trace:nvcuda:encrypt_part3 Part 3 Before: 8B219A49E86D1AEEF237F9B54A8C3C75C71EEE21CF298AE51383F4EC3304E2FDB02F09014FF7686D6946437EB62B21ED57A110860E60441E705F67D1EB67A13D003D
0124:trace:nvcuda:encrypt_part3 Part 3 After: CCDFF1CD9ADE6421D63209AFF02A8C74878808A38CD882762576C532AC788241F2272FB7B1DAAD9C336F44375A47C93C25ED028590B2B8CECE556DAD0DDE5A690069

Executable run2:

0124:trace:nvcuda:encrypt_part3 Part 3 Before: 8B219A49E86D1AEEF237F9B54A8C3C75C71EEE21CF298AE51383F4EC3304E2FDB02F09014FF7686D6946437EB62B21ED57A110860E60441E705F67D1EB67A13D003D
0124:trace:nvcuda:encrypt_part3 Part 3 After: 67873B7A90B178410D7AE5551239023C61EFE1E7755FA389D23F2C535F8B2B3B891C48CD81239578C60540962B3A52C43B1411293ED5807EB8416CB152B08AAB00AB

As you can see.. the "Before" value is the same for "run 1" and "run 2", but the "After" value is different.. given the same input values, the encryption should not "randomize" the encrypted output in the "After" value...

913887524gsd commented 5 months ago

Hmm, I see you have made a pointer mistake...

for (int i = 0 ; i < 48 ; i++)
        encrypt_hash_round1(res, ((unsigned char *)&input1)[i]);

input1 is a pointer to the struct to be hashed, but here you take the address of the pointer itself and hash the bytes found there instead of the struct.

for (int i = 0 ; i < 48 ; i++)
        encrypt_hash_round1(res, ((unsigned char *)input1)[i]);

This is the correct code.
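The difference between the two loops can be sketched in isolation. Everything named here (`Input`, `toy_round`, `hash_pointee`, `hash_pointer_variable`) is made up for illustration, and `toy_round` is a simple FNV-1a style mixer, not the real encrypt_hash_round1. Unlike the loop in the thread, the buggy variant only reads `sizeof(input)` bytes so that it stays inside the pointer object.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the struct that gets hashed
// (48 bytes on common 64-bit ABIs, no padding).
struct Input {
    std::uint32_t driver_version;
    std::uint32_t cuda_version;
    std::uint64_t rest[5];
};

// Toy FNV-1a style mixer; an assumption for illustration,
// NOT the real encrypt_hash_round1.
static void toy_round(std::uint64_t *state, unsigned char byte) {
    *state = (*state ^ byte) * 0x100000001b3ULL;
}

// Correct: hash the bytes of the struct that `input` points to.
std::uint64_t hash_pointee(const Input *input) {
    assert(input != nullptr);
    std::uint64_t state = 0xcbf29ce484222325ULL;
    const unsigned char *bytes = reinterpret_cast<const unsigned char *>(input);
    for (std::size_t i = 0; i < sizeof(Input); i++)
        toy_round(&state, bytes[i]);
    return state;
}

// Buggy: `&input` is the address of the pointer variable, so this hashes
// the bytes of the address value rather than the struct contents.
std::uint64_t hash_pointer_variable(const Input *input) {
    assert(input != nullptr);
    std::uint64_t state = 0xcbf29ce484222325ULL;
    const unsigned char *bytes = reinterpret_cast<const unsigned char *>(&input);
    for (std::size_t i = 0; i < sizeof(input); i++)
        toy_round(&state, bytes[i]);
    return state;
}
```

For two `Input` objects holding identical data, `hash_pointee` returns the same value for both, while `hash_pointer_variable` generally does not, because the objects live at different addresses. With address space layout randomization those addresses also differ between runs, which would be consistent with the changing "After" output above.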

SveSop commented 5 months ago

Indeed.. that made the "test" return correct values and no horrible changes there! Yay! 😄

How could i miss that.. yeah, i changed it from &input->part3 or whatnot to just the input struct, and did not think about that. Anyhow, it still does not work in Linux, so i will test this with windows now and see what comes up. There could still be some issues with this getpid stuff and whatnot.

SveSop commented 5 months ago

Well.. The "test" seems to run on windows as well, although there was some strangeness with how i had to implement uint128_t, methinks..

typedef struct uint128 {
    unsigned long long low;
    unsigned long long high;
} uint128_t;
typedef struct int128{
    long long low;
    long long high;
} int128_t;

I have to do this in windows, as it seems visual studio does not have a native 128-bit integer type...

This, in turn, meant i had to do the following in the "encrypt" function, or else the values in *res came out swapped (low held the high half and high held the low half) for some strange reason.

I did this:

uint128_t value;
memcpy(&value.high, &result[0], sizeof(uint64_t));
memcpy(&value.low, &result[sizeof(uint64_t)], sizeof(uint64_t));
*res = value;

Maybe it is not so easy to just "cast" a char array -> uint128_t on windows using visual studio?

Doing that made the result "correct" when using the static values.. so the encryption seems to be working now that the error you caught is fixed 😄

Running it "as it should" still did not make it correct, even though i made sure i use the addresses of the "fake" function tables for both table1 and the encryption table, as well as the function pointer to Encryption_encrypt0.

I'll see if i can do more comparisons between what the driver outputs vs the result i am getting... Ah well.. if it was easy.. and so on and so forth. Thanks for all your help so far and sorry if i am spamming questions 👍

SveSop commented 5 months ago

Well for windows it seems simpler to use this:

            uint128_t value;
            std::memcpy(&value, result, sizeof(uint128_t));
            *res = value;

And using this for the "test value":

static unsigned char code[16] = {
    0x7a, 0x70, 0x14, 0xf2, 0x28, 0x46, 0xc3, 0xa7,
    0x84, 0xcd, 0xbf, 0x42, 0x33, 0x31, 0xf1, 0xea
};

uint128_t constant_value;
std::memcpy(&constant_value, code, sizeof(uint128_t));

Using that did not "reverse" the values, so i suppose i should not have just "translated" the check value from Rust like i did, since visual studio does not have a native 128-bit type.
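A small sketch of why the single memcpy lines up with the Rust constant. It assumes a little-endian host (such as x86-64); `U128`, `from_bytes`, and `expected_bytes` are illustrative names, and only the byte values are taken from the thread.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Same field order as the struct in the thread: `low` comes first in memory.
struct U128 {
    std::uint64_t low;
    std::uint64_t high;
};
static_assert(sizeof(U128) == 16, "expected two packed 64-bit halves");

// Copy 16 raw bytes into the struct in one go.
U128 from_bytes(const unsigned char *bytes) {
    assert(bytes != nullptr);
    U128 value;
    std::memcpy(&value, bytes, sizeof(value));
    return value;
}

// The expected test result as raw bytes, copied from the thread.
static const unsigned char expected_bytes[16] = {
    0x7a, 0x70, 0x14, 0xf2, 0x28, 0x46, 0xc3, 0xa7,
    0x84, 0xcd, 0xbf, 0x42, 0x33, 0x31, 0xf1, 0xea
};
```

On a little-endian host, `from_bytes(expected_bytes)` gives `low == 0xA7C34628F214707A` and `high == 0xEAF1313342BFCD84`, which matches 0xEAF1313342BFCD84A7C34628F214707A read as one 128-bit number. Copying the first eight bytes into `high` instead swaps the two halves.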

So.. yeah, the test using the encryption method works flawlessly like that, but alas - no dice when it comes to actual usage with the values that were supposedly obtained 😢

PS. The reason i am fiddling with visual studio code instead of "only" wine, is that i like to get stuff to work where it 100% should work.. if i can replicate something where it is "known working", like in windows with an nvidia adapter, at least i do not start with the hardest part i guess.

SveSop commented 5 months ago

@913887524gsd That should sort it! 😄 Thanks a lot for all your work!

Not looked into it, but replacing the "linux" getpid() and pthread_self() with the windows ones seemed to fix it.. It could be some wine'ish implementation that makes it work.. dunno. Maybe the "WINE" thread is not the same as the "LINUX" thread? Going through "wineserver" layer or somecrap.

That and the somewhat iffy implementation of __uint128_t and the like made me choose to use memcpy when moving char arrays over to it. Without memcpy it worked with CUDA 12.x, but crashed with CUDA 11.6+ for some weird reason.

Anyway, seems to work now, and a lot more correct and elegant method (until they change the encryption scheme perhaps).

913887524gsd commented 5 months ago

Good Job!!!!!😄😄😄😄😄

Not looked into it, but replacing the "linux" getpid() and pthread_self() with the windows ones seemed to fix it.. It could be some wine'ish implementation that makes it work.. dunno. Maybe the "WINE" thread is not the same as the "LINUX" thread? Going through "wineserver" layer or somecrap.

So just replacing getpid and pthread_self with the NT versions did it... Interesting...

Anyway, seems to work now, and a lot more correct and elegant method (until they change the encryption scheme perhaps).

I guess they won't change the encryption, for the sake of backward compatibility. This is meaningful work!!! (btw, nijika is so cute XD)