Newer Nvidia GPUs with more than 4GB memory, report max 4GB memory under CUDA

Ageless93 commented 7 years ago

As reported on the BOINC forums, with newer Nvidia GPUs with more than 4 gigabyte video memory, BOINC reports they have a max of 4096MB for CUDA. OpenCL detects the memory correctly.

Checking https://github.com/BOINC/boinc/blob/master/client/gpu_nvidia.cpp, line 195, total memory is checked with an int. An int is 32bit, thus the maximum memory it can address is 4GB. In this case I think it needs a long long int here (referencing http://en.cppreference.com/w/cpp/language/types) to be able to detect over 4GB of memory (64bit).

AenBleidd commented 7 years ago

I think 'int' is the result of function call, it is the status. Function return 'Max Memory Value' in its first argument which is size_t. Are you using 64 bit or 32 bit BOINC?

ChristianBeer commented 7 years ago

This seems to be a "feature" of the CUDA SDK. Older versions always report the memory as 32bit integer. Exhibit 1, Exhibit 2.

I guess one has to experiment which CUDA SDK reports the correct amount and use this to built the Client with. The client itself uses a size_t to retrieve the value which than gets converted into a double. If we compile on a 64bit machine the size_t should be big enough.

AenBleidd commented 6 years ago

@RichardHaselgrove, you've been working with the similar staff while fixing #2706. Does the issue described here is still present?

RichardHaselgrove commented 6 years ago

And even longer than that. We had a problem when the first 2GB cards came out, with garbage sizes reported for a while. Berkeley couldn't run to the latest hardware, so Rom remoted into a machine I'd just built to debug and create 26f9e380d9f165c15bca8ccf2b6ef7ce2b50add0.

The 4 GB reporting limit still applies (and has been reported again in the context of the RTX 2080 release), but it would involve delving into the NVidia SDKs to find a usable 64bit implementation which could co-exist with our legacy 32-bit code and wouldn't lose our support for older cards.

Ageless93 commented 4 years ago

All right, I got a code fix in for this problem. The problem lies in the use of cuDeviceTotalMem which is a 32bit value. Using cuDeviceTotalMem_v2 will fix this for GPUs with more than 4GB. It needs further testing with older GPUs and in 32bit OS.

diff --git a/client/gpu_nvidia.cpp b/client/gpu_nvidia.cpp
index 5d6ddc2..73ba787 100644
--- a/client/gpu_nvidia.cpp
+++ b/client/gpu_nvidia.cpp
@@ -306,13 +306,19 @@ void* cudalib = NULL;
p_cuDeviceGet = (int(*)(int*, int)) dlsym( cudalib, "cuDeviceGet" );
p_cuDeviceGetAttribute = (int(*)(int*, int, int)) dlsym( cudalib, "cuDeviceGetAttribute" );
p_cuDeviceGetName = (int(*)(char*, int, int)) dlsym( cudalib, "cuDeviceGetName" );
- p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem" );
+ if (sizeof(size_t)<8 || !(p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem_v2")))
+ p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem" );
p_cuDeviceComputeCapability = (int(*)(int*, int*, int)) dlsym( cudalib, "cuDeviceComputeCapability" );
- p_cuCtxCreate = (int(*)(void**, unsigned int, unsigned int)) dlsym( cudalib, "cuCtxCreate" );
- p_cuCtxDestroy = (int(*)(void*)) dlsym( cudalib, "cuCtxDestroy" );
+ if (sizeof(void *)<8 ||
+ !(p_cuCtxCreate = (int(*)(void**, unsigned int, unsigned int)) dlsym( cudalib, "cuCtxCreate_v2")) ||
+ !(p_cuCtxDestroy = (int(*)(void*)) dlsym( cudalib, "cuCtxDestroy_v2"))) {
+ p_cuCtxCreate = (int(*)(void**, unsigned int, unsigned int)) dlsym( cudalib, "cuCtxCreate" );
+ p_cuCtxDestroy = (int(*)(void*)) dlsym( cudalib, "cuCtxDestroy" );
+ }
p_cuMemAlloc = (int(*)(unsigned int*, size_t)) dlsym( cudalib, "cuMemAlloc" );
p_cuMemFree = (int(*)(unsigned int)) dlsym( cudalib, "cuMemFree" );
- p_cuMemGetInfo = (int(*)(size_t*, size_t*)) dlsym( cudalib, "cuMemGetInfo" );
+ if (sizeof(size_t)<8 || !(p_cuMemGetInfo = (int(*)(size_t*, size_t*)) dlsym( cudalib, "cuMemGetInfo_v2" )))
+ p_cuMemGetInfo = (int(*)(size_t*, size_t*)) dlsym( cudalib, "cuMemGetInfo" );
#endif

if (!p_cuDriverGetVersion) {

Ageless93 commented 4 years ago

And the corresponding explanation:

Boinc uses a function named cuDeviceTotalMem() to get the memory size. That function is ancient and clamps the size to 32 bits even in modern versions of the cuda lib probably to maintain compatibility.

I checked what Special Sauce does to get the memory, but it was using a completely different api so doing that in boinc client would have required huge changes.

Then I snooped what symbol names the libcuda.so where that cuDeviceTotalMem comes from contains. There was a mysterious cuDeviceTotalMem_v2, so I tried what will happen if I make Boinc call that instead. It worked!

I made the change so that if the symbol is not found, then it uses the old version. So if the client is running in a system where the nvidia drivers are so ancient that the _v2 function doesn't exist, then it will just show wrong memory size instead of crashing.

This change is always active regardless of the result of the team check. Boinc does the GPU detection before the team check can be done, so making it conditional wasn't easy to do. But I don't think this is a problem because this isn't a 'cheat' but a legit bug fix.

The library has the old 32 version in it to stay binary compatible with the code compiled against the old version of the library. Any new 64 bit code normally linked with the library would be compiled using 64 bit header files that add that _v2 to the symbol name resolved from the library when the code calls the function and everything works correctly.

But boinc isn't using the library with normal linking. It wants to avoid dependency on Cuda development stuff, so it doesn't use any headers and accesses the library 'the hard way' by the running code finding the library file and extracting indvidual symbol names from it and casting them to function pointers to be used. If you do 'manual' linking this way, then it is your responsibility to handle the different function versions too. Boinc didn't do this, so the bug was entirely in their end.

Also my version is a hack that may fail or even crash if you compile the code on 32 bit environment. My code uses the old version only in case the new version doesn't exist in the library, but to make it compatible with 32 bit environment, another test is needed. Something like:

if (sizeof(size_t)<8 || !(p_cuDeviceTotalMem = (int()(size_t, int)) dlsym( cudalib, "cuDeviceTotalMem_v2"))) p_cuDeviceTotalMem = (int()(size_t, int)) dlsym( cudalib, "cuDeviceTotalMem" );

"sizeof(size_t)<8" is a constant expression evaluating to always true on a 32 bit systems, so the compiler will optimize the entire if test away leaving just the old line.

Also there are other functions using size_t in the code besides the two I modified. They might benefit form doing this same thing too.

The difference is that now I make the 'bitness test' before using the _v2 variants. I also added _v2 variants to cuCtxCreate and cuCtxDestroy and for them I do the test once and replace them both with _v2 or neither of them. Because I don't think it is a smart idea to create a contect with one version and release it with a different version.

cuMemAlloc and cuMemFree also have different versions for 64 bit but I didn't modify them because Boinc isn't actually using them. Just testing their existence. The function signatures for them in this boinc code are actually plain wrong so if it tried to actually use them, it would probably crash or do something unintended.

Ville-Saari commented 4 years ago

The above explanation may be a bit confusing as it was concatenated together from multiple different posts I wrote on Seti@Home GPUUG team forum.

AenBleidd commented 4 years ago

@Ville-Saari, if it's your code above, could you please make a proper PR? Also, it was said by @davidpanderson, that there will be no x86 client anymore, so all 32-bit checks could be removed, I believe.

RichardHaselgrove commented 4 years ago

I'm working on converting this code for both Windows and Linux. The original patch is giving me compiler warnings on VS 2013, but working. If anybody wants to re-code it in accordance with BOINC house style, please feel free.

The theory is simple: if _v2 versions of cuDeviceTotalMem and cuMemGetInfo exist in the active library, use them. Otherwise, keep the originals. I think the _v2 libraries became available around 2012, with CUDA 4.

Ville-Saari commented 4 years ago

Those 32 bit checks won't actually produce any code as they are constant expressions that make the compiier remove the test and just conditionally compile one of the resulting branches depending on what kind of architecture it is compiling for.

I think using those 32 bit versions of the calls without _v2 in their name when running on 64 bit system may even be dangerous. The fact that the total memory query even works at all with the 32 bit call relies on the code running on a little-endian system and the variable being explicitly initialized to 0, so the result ends up in the less significant half of the 64 bit integer. That function is meant to receive a pointer to a 32 bit value.

RichardHaselgrove commented 4 years ago

Yes, we had that problem when 4GB GPUs first came on to the market in 2012. @romw had to diagnose it for himself to make 2882d5bc2980047ca78260dadf8e752cac0bdb8e - we've been OK since then.

CharlieFenton commented 4 years ago

I added a workaround for this problem on Macs in 2013. See commits 631e236b08 and 4d74c5abbd (26 and 27 June 2013.) I hope you have found a better way to do this.

KeithMyers commented 2 years ago

I don't know why this is still hanging out there. I've been using Ville's corrected code for years now. One simple change to a client module and problem is solved.

Why haven't anybody simply merged the code?

AenBleidd commented 2 years ago

I don't know why this is still hanging out there. I've been using Ville's corrected code for years now. One simple change to a client module and problem is solved.

Why haven't anybody simply merged the code?

Probably because nobody have created a proper PR for this.

KeithMyers commented 2 years ago

Why can't you do that? You are a developer. Forget about Ville doing it. He left BOINC once Seti finished. No sign of him since then.

AenBleidd commented 2 years ago

Actually, he replied back two years ago: https://github.com/BOINC/boinc/issues/1773#issuecomment-602094890 Personally I don't want to commit code without permissions of its author(s).

KeithMyers commented 2 years ago

So since that is never going to happen, i.e. he had left no contact information to get ahold of him, this issue will never be fixed? So how to get around not having permission from the author? None? He gave permission to use his code for all GPUUG team members. Does that suffice?

AenBleidd commented 2 years ago

He gave permission to use his code for all GPUUG team members. Does that suffice?

If he gave his permissions and you have a proof of it - I can use that code and create a PR.

So how to get around not having permission from the author? None?

Usually you have to reimplement/fix the functionality without using the original code. Otherwise it's a violation of intellectual property rights.

KeithMyers commented 2 years ago

So I need to find an explicit permission message from Ville somewhere in all the threads he posted. I don't think he ever did that. He just created our Pandora client for the team members and implicitly gave permission to use his code.

KeithMyers commented 2 years ago

Vitalii, I have these forum messages as the closest I can come for an explicit permission case for Ville's code fix.

Hey Ville,

Can you post (or PM to me) the sections of code that you changed to implement the memory fix? I was talking to Richard about it, and he would like the information to make an official report on github to try to get it implemented to the official BOINC code. This is something that I think is beneficial to the community as a whole.

Thanks!

Here you are:

diff --git a/client/gpu_nvidia.cpp b/client/gpu_nvidia.cpp
index 5d6ddc2..b0689c9 100644
--- a/client/gpu_nvidia.cpp
+++ b/client/gpu_nvidia.cpp
@@ -306,13 +306,15 @@ void* cudalib = NULL;
     p_cuDeviceGet = (int(*)(int*, int)) dlsym( cudalib, "cuDeviceGet" );
     p_cuDeviceGetAttribute = (int(*)(int*, int, int)) dlsym( cudalib, "cuDeviceGetAttribute" );
     p_cuDeviceGetName = (int(*)(char*, int, int)) dlsym( cudalib, "cuDeviceGetName" );
-    p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem" );
+    if (!(p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem_v2")))
+       p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem" );
     p_cuDeviceComputeCapability = (int(*)(int*, int*, int)) dlsym( cudalib, "cuDeviceComputeCapability" );
     p_cuCtxCreate = (int(*)(void**, unsigned int, unsigned int)) dlsym( cudalib, "cuCtxCreate" );
     p_cuCtxDestroy = (int(*)(void*)) dlsym( cudalib, "cuCtxDestroy" );
     p_cuMemAlloc = (int(*)(unsigned int*, size_t)) dlsym( cudalib, "cuMemAlloc" );
     p_cuMemFree = (int(*)(unsigned int)) dlsym( cudalib, "cuMemFree" );
-    p_cuMemGetInfo = (int(*)(size_t*, size_t*)) dlsym( cudalib, "cuMemGetInfo" );
+    if (!(p_cuMemGetInfo = (int(*)(size_t*, size_t*)) dlsym( cudalib, "cuMemGetInfo_v2" )))       
+       p_cuMemGetInfo = (int(*)(size_t*, size_t*)) dlsym( cudalib, "cuMemGetInfo" );
 #endif

     if (!p_cuDriverGetVersion) {

And Juan's reply right after that.

Richard now has the code for the memory fix since Ville agrees to share.

RichardHaselgrove commented 2 years ago

If that's me, I'm suffering severe trainlag after returning from Portugal yesterday. Give me a clickable link for the earlier public-domain references, and for that code snippet, and I'll see if I can make a PR while waiting for my laundry to finish.

KeithMyers commented 2 years ago

Thank you David.

KeithMyers commented 2 years ago

I see that David's code is not correct yet. Doesn't build for Android or Linux. If you use the original Ville code, the client builds correctly and reports the correct amount of RAM for Linux hosts.

KeithMyers commented 2 years ago

Richard, Ville's and Juan's comments weren't in any public domain. They were in our private GPUUG team forum at Seti@home. I assume that you received the code via PM from Juan. Check you Seti PM records.

RichardHaselgrove commented 2 years ago

Same answer as I've just given in #4757. No, I didn't get any code from Juan: I understand that was a collective decision of GPUUG. He did allow me to download and run precompiled executables.

KeithMyers commented 2 years ago

OK, you only got the executables. So how did the code snippet end up here on #1773? I see Ville posted an explanation here of the problem and its solution, so I can assume he posted the code snippet.

Why doesn't that satisfy the requirements?

KeithMyers commented 2 years ago

OK, it looks like Jord got the code from Ville from the posts.

RichardHaselgrove commented 2 years ago

Checked my PMs. Yes, Juan declined to send me code (collective decision of GPUUG), but that was to do with GPU spoofing to break the 'max tasks per GPU' cache limit.

I was sent code to address this current issue, but that was by Ian&Steve C. - and very late in the lifetime of the SETI@Home project, when we were all getting very tired and irritable. The associated discussion thread in the forum was eventually hidden.

Ageless93 commented 2 years ago

Late last night I saw I was being mentioned, and I must say I didn't even remember any of this. So I looked around and found that I got a PM at Seti from Juan BFP with the code, and one following with a long explanation. I'll add that to this thread. It may explain the problems you guys talk about in the other ticket.

Ageless93 commented 2 years ago

As requested this is the related info of the dev of the fix.

Boinc uses a function named cuDeviceTotalMem() to get the memory size. That function is ancient and clamps the size to 32 bits even in modern versions of the cuda lib probably to maintain compatibility.

I checked what Special Sauce does to get the memory, but it was using a completely different api so doing that in boinc client would have required huge changes.

Then I snooped what symbol names the libcuda.so where that cuDeviceTotalMem comes from contains. There was a mysterious cuDeviceTotalMem_v2, so I tried what will happen if I make Boinc call that instead. It worked!

I made the change so that if the symbol is not found, then it uses the old version. So if the client is running in a system where the nvidia drivers are so ancient that the _v2 function doesn't exist, then it will just show wrong memory size instead of crashing.

This change is always active regardless of the result of the team check. Boinc does the GPU detection before the team check can be done, so making it conditional wasn't easy to do. But I don't think this is a problem because this isn't a 'cheat' but a legit bug fix.

The library has the old 32 version in it to stay binary compatible with the code compiled against the old version of the library. Any new 64 bit code normally linked with the library would be compiled using 64 bit header files that add that _v2 to the symbol name resolved from the library when the code calls the function and everything works correctly.

But boinc isn't using the library with normal linking. It wants to avoid dependency on Cuda development stuff, so it doesn't use any headers and accesses the library 'the hard way' by the running code finding the library file and extracting indvidual symbol names from it and casting them to function pointers to be used. If you do 'manual' linking this way, then it is your responsibility to handle the different function versions too. Boinc didn't do this, so the bug was entirely in their end.

Also my version is a hack that may fail or even crash if you compile the code on 32 bit environment. My code uses the old version only in case the new version doesn't exist in the library, but to make it compatible with 32 bit environment, another test is needed. Something like:

if (sizeof(size_t)<8 || !(p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem_v2"))) p_cuDeviceTotalMem = (int(*)(size_t*, int)) dlsym( cudalib, "cuDeviceTotalMem" );

"sizeof(size_t)<8" is a constant expression evaluating to always true on a 32 bit systems, so the compiler will optimize the entire if test away leaving just the old line.

Also there are other functions using size_t in the code besides the two I modified. They might benefit form doing this same thing too.

The difference is that now I make the 'bitness test' before using the _v2 variants. I also added _v2 variants to cuCtxCreate and cuCtxDestroy and for them I do the test once and replace them both with _v2 or neither of them. Because I don't think it is a smart idea to create a contect with one version and release it with a different version.

cuMemAlloc and cuMemFree also have different versions for 64 bit but I didn't modify them because Boinc isn't actually using them. Just testing their existence. The function signatures for them in this boinc code are actually plain wrong so if it tried to actually use them, it would probably crash or do something unintended.

Ageless93 commented 2 years ago

Found one more private message from Juan, which I read as permission to use the code:

We are happy to share our findings with the rest of the community. There are some others enhancements added to our client, some are very interesting. We could talk about them in the future.

RichardHaselgrove commented 2 years ago

Well, at one level, we don't need any permissions from anyone. The calls we're using are all documented in the CUDA Toolkit Documentation (currently at v11.7.0) - at section 6.5, Device Management, of the CUDA Driver API

Ageless93 commented 2 years ago

It's what Vitalii was asking for:

He gave permission to use his code for all GPUUG team members. Does that suffice?

If he gave his permissions and you have a proof of it - I can use that code and create a PR.

RichardHaselgrove commented 2 years ago

I think David - with #4757 - has somewhat pre-empted that request.

For the record, the majority of the _v2 variants aren't explicitly covered in the toolkit documentation, but are listed in https://forums.developer.nvidia.com/t/driverapi-problem-with-cudevicetotalmem/25585 (January 2012).

AenBleidd commented 2 years ago

@KeithMyers and @Ageless93 posted two different diffs. Both of them are looks similar to what is done by David in #4757 but the last one is not working properly as far as I see from the comments. I didn't check both diffs posted above in this thread, but I have a feeling that these two are just an old versions, and is not exactly the same that was used when building your customized client. So it would be nice if @Ville-Saari replies back in this thread and either create a proper PR or at least share the latest diff that was used to build your customized client.

Ageless93 commented 2 years ago

So it would be nice if @Ville-Saari replies back in this thread and either create a proper PR or at least share the latest diff that was used to build your customized client.

Yes, but I don't think it was his code, hence why he never made the PR. I think, going by what Juan posted to me (and what I quoted above for the second time, as I quoted it before I see on the 21th March 2020) it was his code.

AenBleidd commented 2 years ago

@Ageless93, ok, then it would be nice to reach to Juan somehow and kindly ask them to share their work.

davidpanderson commented 2 years ago

Is there a case where #4757 isn't working?

AenBleidd commented 2 years ago

@davidpanderson, according to this report it doesn't work on linux: https://github.com/BOINC/boinc/pull/4757#issuecomment-1146671322

IanSteveC commented 2 years ago

It was always Ville's code. he wrote it. Juan was just acting as a liaison until Ville reached out himself in here.

Both Juan and Ville have been MIA and unreachable for over a year. that's why neither of them have been here to contribute comments. we already tried reaching both of them.

BOINC / boinc

Newer Nvidia GPUs with more than 4GB memory, report max 4GB memory under CUDA #1773