Issues on OpenCL version on latest Macos

oscarbg commented 11 years ago

Hi I'm testing your OpenCL code on Macos 10.8.4 betas(but I suspect should be equal on 10.8.3) with 79xx for MAC and I get some errors on quicksort test (OpenCL_radixsort_benchmark_Apple_xcode4_debug) and perhaps for that also seems for that I get crashes on either OpenCL_gpu_rigidbody_pipeline2_Apple_xcode4_debug or OpenCL_gpu_rigidbody_pipeline2_Apple_xcode4_debug some variables get out of range values and samples crash.. So have you tested your codes on Apple+AMD GPUs or is a new bug either in new Macos builds or 79xx cards on Apple.. Also some OCL kernel compilation fails in Quadro Fermi for Mac saying something about PTX lowering error.

can you test and fix this bugs?

oscarbg commented 11 years ago

more details on nv crash ompiling kernel SubtractKernel ready. compiling kernel FillIntKernel ready. compiling kernel FillUnsignedIntKernel ready. compiling kernel FillInt2Kernel ready. Error in clBuildProgram, Line 792 in file /Users/oscar/Documents/experiments-master/build/xcode4/../../opencl/basic_initialize/btOpenCLUtils.cpp, Log: ptxas fatal : Internal error: overlapping offsets allocated to objects !!!

Assert solverSetupProg in line 130, file /Users/oscar/Documents/experiments-master/build/xcode4/../../dynamics/basic_demo/Stubs/Solver.cpp ^CTrace/BPT trap: 5

erwincoumans commented 11 years ago

Can you try latest trunk? Replace APPLE1 with APPLE in btRadixSort32.cpp

https://github.com/erwincoumans/experiments/commit/989219bced402a27b8e4f617f82332acb641dc8d

oscarbg commented 11 years ago

Ok radix sort correctly works now on ATI cards both 58xx series and 79xx series on MAC!.. but there is a BIG slowdown in radix sort pert see below (apron 13x on 5800 series and 20x on 7900 series) and sorting errors seems the same for both cards so still hope you can fix fast radix sort kernels to work on mac! ( I can help testing fixes)

error: INCORRECT: [0]: 0 != 1423 result[...0, 0, 2312, 2312, 2316, ...] reference[...1423, 2285, 2469, 2673, 3701, ...]

performance old way (incorrect) 58xx series , 48.200000 GPU ms, 87.018755 x10^6 elts/sec 79xx series , 19.200000 GPU ms, 218.453333 x10^6 elts/sec

correct way (__APPLE_1 to APPLE fix) 58xx , 645.100000 GPU ms, 6.501789 x10^6 elts/sec 79xx , 371.900000 GPU ms, 11.278042 x10^6 elts/sec

erwincoumans commented 11 years ago

The slow serial fallback works correct here on a 5870. Does the slow version give you incorrect results?

oscarbg commented 11 years ago

Sorry for not so good report.. Slow version is correct.. I only wanted to attach error output in fast code to show that error output is the same for 58xx and 79xx cards so probably fixing one fixes the other..

erwincoumans commented 11 years ago

Can you test the Test_BitonicSort in the bullet3 repository, and report the performance for the GPUs? See https://github.com/erwincoumans/bullet3

oscarbg commented 11 years ago

For sure, This program was compiled using the Unknown Vendor OpenCL SDK

Num Platforms = 1

Platform info for platform nr 0: CL_PLATFORM_VENDOR: Apple CL_PLATFORM_NAME: Apple CL_PLATFORM_VERSION: OpenCL 1.2 (Apr 2 2013 22:13:45) Num Devices = 2 m_deviceName = AMD Radeon HD 7970 Compute Engine compiling kernel kBitonicSortCellIdLocal ready. compiling kernel kBitonicSortCellIdLocal1 ready. compiling kernel kBitonicSortCellIdMergeGlobal ready. compiling kernel kBitonicSortCellIdMergeLocal ready. Error: [CL_INVALID_WORK_GROUP_SIZE] : OpenCL Error : clEnqueueNDRangeKernel failed: total work group size (512) is greater than the device can support (256) error Assertion failed: (0), function MyFatalBreakAPPLE, file /Users/oscar/bullet3/build/xcode4/../../src/Bullet3OpenCL/Initialize/b3OpenCLUtils.cpp, line 81.

oscarbg commented 11 years ago

Hi adding cl_device_id to info.dev and cl_int clerr=clGetKernelWorkGroupInfo (info.bitonicSortLocal1, info.dev, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &res, NULL); if((clerr==CL_SUCCESS)&&(res>0)) info.localSizeLimit=res; at start of bitonicSortNv seems to do the trick results:

79xx GPU sort took 120 ms Correct

58xx GPU sort took 258 ms Correct

erwincoumans commented 11 years ago

Radix sort of 8 million integers takes about 12ms under Windows on a 7970. It would be good to track down the issue, but I don't have the hardware and it is hard to debug under Mac OSX. Are you using a Mac Pro or a Hackintosh?

Do you want to help tracking down the issue? It is isolated to one kernel, SortAndScatterSortDataKernel in src\Bullet3OpenCL\ParallelPrimitives\kernels\RadixSort32Kernels.cl SortAndScatterSortDataKernelSerial is a working version, but it only uses a single thread in each compute unit/work group, hence the slow down. It would be best to use the bullet3 repository.

You could add debugging information (extra buffers) to see where things go wrong.

oscarbg commented 11 years ago

Yes I'm willing to track down the issue but as you say debugging on Mac at least for me it's difficult coming from VS world.. First I'm using hackintosh but was reclutant to say it as I assumed you perhaps would lose interest on debugging on not a real Mac.. That said I played some very easy tweaks to see if that would fix it (affects entire file so all kernels in it).. first seems I have to modify RadixSort32KernelsCL.h and not RadixSort32Kernels.cl it's right? that makes already more difficult since I can't have coloring in source code.. or do you modify .cl file and then remake project? tweaks I played were (almost all are stupid things but if it fixed would have been easy): as I was getting unsigned vs int warnings in compare instructions etc.. I played with both typedef int u32 and typedef unsigned int u32 and changing some ints variables to unsigned ints changed GROUP_LDS_BARRIER barrier(CLK_LOCAL_MEM_FENCE) to GROUP_LDS_BARRIER barrier(CLK_LOCAL_MEM_FENCE|CLK_GLOBAL_MEM_FENCE) *changed GROUP_MEM_FENCE from mem_fence to also a barrier(CLK_LOCAL_MEM_FENCE) and barrier(CLK_GLOBAL_MEM_FENCE|CLK_GLOBAL_MEM_FENCE) I will test the kernel you said but give me a few days to report results.. thanks

erwincoumans commented 11 years ago

The .cl kernels are converted to .h using a script, bullet3\build\stringify.bat or bullet3\build\stringify.sh You can force loading a .cl kernel using a boolean at creation of the kernel. This will ignoring the embedded .h file and the cached binary.

It is best to disable the caching only for the kernel you want to debug. This will force loading the kernel from disk from the .cl file.

For example in Bullet3OpenCL\ParallelPrimitives\b3RadixSort32CL.cpp:

bool disableBinaryCaching = true;
cl_program sortProg = b3OpenCLUtils::compileCLProgramFromString( ctx, device, kernelSource, &pErrNum,additionalMacros, RADIXSORT32_PATH,disableBinaryCaching );
b3Assert(sortProg);

You can also disable caching for ALL kernels using a global variable, but this will be slow when using many kernels: extern book gDebugForceLoadingFromSource; gDebugForceLoadingFromSource = true;

oscarbg commented 11 years ago

Ok, good news.. simple fix! and perf is 10% aprrox faster!

seems bug is in SortAndScatterKernel not in SortAndScatterSortDataKernel just change sort4Bits1 call to sort4Bits function seems less efficient as comment says // 4 scan, 4 exchange vs //2 scan, 2 exchange but anyway perf is even better: with sort4bits: , 24.769600 GPU ms, 338.665459 x10^6 elts/sec CORRECT with sort4bits1 27.061000 GPU ms, 309.988839 x10^6 elts/sec INCORRECT

is sort4bits faster than sort4bits1 in Windows and AMD/NV GPUs? why you weren't using it?

comments: 1) seeing my previous perf (incorrect) using sort4bits1 79xx series , 19.200000 GPU ms, 218.453333 x10^6 elts/sec and now incorrect is 27.061000 GPU ms, 309.988839 x10^6 elts/sec INCORRECT

have you changed number of elements in this sort test in experiments branch vs bullet3 branch? just for curiosity and sanity..

2) have to make any changes to other kernels for rigid body pipeline to work or it will take this kernels (i.e. change both RadixSort32KernelsCL.h and RadixSort32Kernels.cl)? 3) although correct version is faster would be good to explore or send a bug to apple.. one way to see if bug in frontend is to get Apple binary (which is portable between GPUs so must be LLVM IR) and compare vs AMD binary (which also contains LLVM IR) if IR are equal seems is a genuine backend bug and if not a fronted bug.. if it's a fronted bug would be interesting in massaging opencl kernel code for getting correct LLVM IR)

erwincoumans commented 11 years ago

Thanks a lot for tracking this down.

On my Macbook Retina the Test_OpenCL_RadixSort already worked before, with the change from sort4bits1 to sort4bits the performance goes down a bit:

sort4bits1: duration = 139.089005 , 141.469800 GPU ms, 59.296104 x10^6 elts/sec sort4bits: duration = 181.507899 GPU ms, 46.216214 x10^6 elts/sec

But it would be good to make this working for the other GPUs.

Does the GPU rigid body pipeline work with this change? (edit the .cl file and run the build/stringify.sh and recompile all)

erwincoumans commented 11 years ago

The latest Bullet 3.x GPU OpenCL pipeline is now under development in http://github.com/erwincoumans/bullet3

If the issue still exists, let's open a new issue in that repo. Thanks for your feedback and help!

erwincoumans / experiments

Issues on OpenCL version on latest Macos #11

Num Platforms = 1