opencl acceleration for csc and/or encoding

totaam commented 11 years ago

Issue migrated from trac ticket # 422

component: core | priority: major | resolution: fixed

2013-08-26 08:45:01: totaam created the issue

References:

OpenCL (wikipedia)

How to set up OpenCL in Linux

Fast RGB => YUV conversion in OpenCL

AMD SDK

pyopencl

GPGPU System for Intel Ivybridge GPUs

Intel SDK

Testing OpenCL Accelerated Handbrake with AMD's Trinity ..OpenCL/GPU acceleration for video scaling and color space conversion, and OpenCL/GPU acceleration of the lookahead function of the x264 encoding process.

Parallelization of the x264 encoder using OpenCL

totaam commented 11 years ago

2013-08-26 15:43:17: totaam uploaded file `add-csc-opencl.patch` (13.7 KiB)

stub opencl csc module

totaam commented 11 years ago

2013-08-27 17:26:14: totaam uploaded file `add-csc-opencl-v3.patch` (19.7 KiB)

minor tweaks

totaam commented 11 years ago

2013-08-27 17:33:36: totaam changed status from new to assigned

totaam commented 11 years ago

2013-08-27 17:33:36: totaam changed owner from antoine to totaam

totaam commented 11 years ago

2013-08-27 17:33:36: totaam commented

More kernels we may be able to use:

image_formats.cl from socles (GPL v3)

totaam commented 11 years ago

2013-08-28 08:05:10: totaam commented

Testing with plain x264 command line (running a couple of times to ensure the values are consistent - they are..):
OpenCL enabled:
$ time ./x264 --opencl  -o opencl.x264  video.mp4 
lavf [info]: 720x404p 0:1 @ 24000/1001 fps (vfr)
x264 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT
x264 [info]: OpenCL acceleration enabled with NVIDIA Corporation GeForce GTS 450 
x264 [info]: profile High, level 3.0
x264 [info]: frame I:364   Avg QP:15.09  size: 37254                           
x264 [info]: frame P:10936 Avg QP:20.31  size:  5108
x264 [info]: frame B:19868 Avg QP:23.11  size:   772
x264 [info]: consecutive B-frames: 10.2% 11.5%  8.4% 69.9%
x264 [info]: mb I  I16..4: 29.4% 17.4% 53.2%
x264 [info]: mb P  I16..4:  2.0%  2.6%  3.3%  P16..4: 11.9%  6.5%  4.6%  0.0%  0.0%    skip:69.2%
x264 [info]: mb B  I16..4:  0.1%  0.1%  0.2%  B16..8:  8.5%  2.2%  0.8%  direct: 0.7%  skip:87.4%  L0:48.4% L1:45.2% BI: 6.5%
x264 [info]: 8x8 transform intra:28.0% inter:27.9%
x264 [info]: coded y,uvDC,uvAC intra: 37.6% 57.8% 45.3% inter: 3.7% 4.7% 2.0%
x264 [info]: i16 v,h,dc,p: 64% 27%  8%  2%
x264 [info]: i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 19% 15% 58%  1%  1%  1%  1%  1%  2%
x264 [info]: i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 30% 22% 23%  4%  4%  4%  5%  4%  4%
x264 [info]: i8c dc,h,v,p: 51% 24% 22%  4%
x264 [info]: Weighted P-Frames: Y:1.1% UV:1.0%
x264 [info]: ref P L0: 64.6%  7.0% 17.6% 10.7%  0.1%
x264 [info]: ref B L0: 79.6% 17.1%  3.3%
x264 [info]: ref B L1: 95.0%  5.0%
x264 [info]: kb/s:521.59
encoded 31168 frames, 175.77 fps, 521.59 kb/s

real 2m57.650s
user 10m12.278s
sys 0m36.051s
Without OpenCL:
$ time ./x264 -o no-opencl.x264 video.mp4
lavf [info]: 720x404p 0:1 @ 24000/1001 fps (vfr)
x264 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT
x264 [info]: profile High, level 3.0
x264 [info]: frame I:373   Avg QP:16.18  size: 36484
x264 [info]: frame P:12582 Avg QP:20.97  size:  4720
x264 [info]: frame B:18213 Avg QP:23.12  size:   681
x264 [info]: consecutive B-frames: 17.9% 10.8%  5.7% 65.7%
x264 [info]: mb I  I16..4: 23.1% 24.5% 52.5%
x264 [info]: mb P  I16..4:  1.6%  2.4%  2.8%  P16..4: 11.8%  6.5%  4.5%  0.0%  0.0%    skip:70.5%
x264 [info]: mb B  I16..4:  0.1%  0.1%  0.2%  B16..8:  7.6%  1.9%  0.7%  direct: 0.6%  skip:88.8%  L0:47.1% L1:46.3% BI: 6.6%
x264 [info]: 8x8 transform intra:31.5% inter:27.4%
x264 [info]: coded y,uvDC,uvAC intra: 36.9% 56.1% 43.1% inter: 3.9% 4.9% 2.1%
x264 [info]: i16 v,h,dc,p: 61% 29%  8%  2%
x264 [info]: i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 20% 15% 58%  1%  1%  1%  1%  1%  2%
x264 [info]: i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 30% 22% 22%  4%  4%  4%  5%  4%  4%
x264 [info]: i8c dc,h,v,p: 51% 23% 22%  4%
x264 [info]: Weighted P-Frames: Y:0.8% UV:0.7%
x264 [info]: ref P L0: 64.9%  6.8% 17.7% 10.5%  0.0%
x264 [info]: ref B L0: 78.6% 18.1%  3.3%
x264 [info]: ref B L1: 95.4%  4.6%
x264 [info]: kb/s:525.55
encoded 31168 frames, 186.50 fps, 525.55 kb/s

real 2m47.235s
user 10m10.138s
sys 0m6.067s
Resulting files:
$ du -sk *opencl.x264
83404 no-opencl.x264
82776 opencl.x264
So this doesn't look like it makes much of a difference unfortunately (at least on my `GTS 450`), if anything it is a tad slower.
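To put numbers on "a tad slower", here is a small sketch that computes the relative change between the two runs, using only the fps, wall-clock and file size figures quoted above. The dictionary layout and the helper name are made up for this illustration.

```python
# Figures copied from the two x264 runs above (time in seconds, size in KiB).
runs = {
    "opencl":    {"fps": 175.77, "real_s": 2 * 60 + 57.650, "size_kb": 82776},
    "no-opencl": {"fps": 186.50, "real_s": 2 * 60 + 47.235, "size_kb": 83404},
}


def relative_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


# Compare the OpenCL run against the plain CPU run for each metric.
changes = {
    key: relative_change(runs["opencl"][key], runs["no-opencl"][key])
    for key in ("fps", "real_s", "size_kb")
}

for key, pct in changes.items():
    print("%-8s %+.1f%%" % (key, pct))
```

On these figures the OpenCL run comes out a little under 6% slower in fps and slightly over 6% longer in wall-clock time, with an output file that is less than 1% smaller.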

The one thing where this may still be useful is for motion detection, where we could increase the search diameter without incurring too much more CPU usage.

Enabling it looks simple enough, in `x264.h`:
int b_opencl; /* use OpenCL when available */
(assuming that x264 is built with opencl support)

totaam commented 11 years ago

2013-08-28 09:16:23: totaam edited the issue description

totaam commented 11 years ago

2013-08-28 09:16:23: totaam commented

For the record, this is what I had to do to get pyopencl to build on Fedora 19 with the nvidia SDK to avoid this error at import time:
ImportError: /usr/lib/python2.7/dist-packages/pyopencl/_cl.so: \
    symbol clRetainDevice, version OPENCL_1.2 not defined in file libOpenCL.so.1 with link time reference
The existing headers look like this:
$ ls -la /usr/include/CL
lrwxrwxrwx. 1 root root 32 Aug 28 12:39 /usr/include/CL -> /etc/alternatives/opencl-headers
Edit: Just downgrading the version of opencl-headers to 1.1 is enough.

Alternatively, we can move the headers to a version specific directory and add the OpenCL 1.1 headers:
cd /etc/alternatives/
mv opencl-headers opencl-headers-1.2
mkdir opencl-headers-1.1
ln -sf opencl-headers-1.1 opencl-headers
cd opencl-headers-1.1

wget http://www.khronos.org/registry/cl/api/1.1/cl_gl_ext.h
wget http://www.khronos.org/registry/cl/api/1.1/cl_ext.h
wget http://www.khronos.org/registry/cl/api/1.1/cl_gl.h
wget http://www.khronos.org/registry/cl/api/1.1/cl.h
wget http://www.khronos.org/registry/cl/api/1.1/cl_platform.h
wget http://www.khronos.org/registry/cl/api/1.1/opencl.h
Then we need to ensure pyopencl will be built against 1.1, so siteconf.py contains:
CL_PRETEND_VERSION = '1.1'

totaam commented 11 years ago

2013-08-28 09:19:16: totaam commented

Having installed freeocl, I now have 3 providers available:

$ LD_LIBRARY_PATH=/opt/cuda/lib64/ XPRA_SWSCALE_DEBUG=0 PYTHONPATH=. python ./tests/xpra/codecs/test_csc_opencl.py 
PyOpenCL OpenGL support: True
found 3 OpenCL platforms:
* FreeOCL (FreeOCL developers) - 1 devices:
 + CPU: AMD Phenom(tm) II X4 945 Processor (OpenCL 1.2 FreeOCL-0.3.6 / OpenCL C 1.2)
* NVIDIA CUDA (NVIDIA Corporation) - 1 devices:
 + GPU: GeForce GTS 450 (OpenCL 1.1 CUDA / OpenCL C 1.1 )
* Intel(R) OpenCL (Intel(R) Corporation) - 1 devices:
 + CPU: AMD Phenom(tm) II X4 945 Processor (OpenCL 1.2 (Build 67279) / OpenCL C 1.2 )
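A listing like the one above can be produced with a few PyOpenCL calls. This is only a sketch of the idea, not the code from test_csc_opencl.py: the function name `describe_platforms` is made up here, and the formatting merely imitates the output shown. It relies on `pyopencl.get_platforms()`, `Platform.get_devices()` and `device_type.to_string()`.

```python
def describe_platforms(platforms, type_name=str):
    """Format a list of OpenCL platforms and their devices as text lines.

    `platforms` is whatever pyopencl.get_platforms() returns; `type_name`
    converts a device type value to a readable name such as "GPU".
    """
    lines = ["found %i OpenCL platforms:" % len(platforms)]
    for platform in platforms:
        devices = platform.get_devices()
        lines.append("* %s (%s) - %i devices:" % (platform.name, platform.vendor, len(devices)))
        for device in devices:
            lines.append(" + %s: %s (%s / %s)" % (
                type_name(device.type), device.name,
                device.version, device.opencl_c_version))
    return lines


if __name__ == "__main__":
    try:
        import pyopencl as cl
        for line in describe_platforms(cl.get_platforms(), cl.device_type.to_string):
            print(line)
    except Exception as e:
        # pyopencl may be missing, or no ICD may be installed
        print("cannot query OpenCL platforms: %s" % e)
```

Keeping the formatting separate from the PyOpenCL import makes it possible to exercise the function with stand-in objects on a machine that has no OpenCL runtime.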

totaam commented 11 years ago

2013-08-28 16:55:46: totaam uploaded file `add-csc-opencl-v6.patch` (22.7 KiB)

works ok but only one format so far: YUV420P to RGB

totaam commented 11 years ago

2013-08-28 17:26:18: totaam commented

Please try the patch above and report on performance. You may need to adjust some env vars for finding the libraries in the cuda paths and for selecting the opencl platform/device:
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/cuda/lib64/
export PYTHONPATH=.
XPRA_OPENCL_DEVICE_TYPE=GPU python ./tests/xpra/codecs/test_csc_opencl.py
XPRA_OPENCL_DEVICE_TYPE=CPU python ./tests/xpra/codecs/test_csc_opencl.py 
Note: careful with LD_LIBRARY_PATH, putting cuda ahead of regular libraries can cause some serious problems (conflicts with libopencl versions for example).


Results deleted (those figures were wrong because of a bug)

The results aren't as bad as they look for nvidia:

cpu csc is already very fast since it is such a simple operation

hopefully the difference will be more noticeable when we add scaling

the gfx card is quite slow by modern standards (we'll see if faster ones help - not guaranteed it will make a huge difference here since the cost is mostly memory bandwidth)

most of the cpu time is spent copying buffers to and from the gfx card and on modern cpus that is slightly better than doing fpu or more general instruction decoding

Even then, I think there is room for improvement since we copy the pixels in and out and we may not need to (we just need a buffer interface).

Interestingly, the performance varies widely depending on the picture size.. will need to look into the worksize/localsize settings.
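One thing worth checking when tuning worksize/localsize: in OpenCL 1.x, when a local work size is given, the global work size has to be a multiple of it in every dimension, so picture sizes that do not divide evenly need padding (and the kernel then has to skip the out-of-range work items). The sketch below shows the usual rounding; the 16x16 local size is an arbitrary assumption, not a value taken from the patch, and this may or may not be what explains the size-dependent results.

```python
def round_up(value, multiple):
    """Round `value` up to the next multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


def global_work_size(width, height, local_size=(16, 16)):
    """Pad the picture dimensions so each is a multiple of the local size."""
    return (round_up(width, local_size[0]), round_up(height, local_size[1]))


# The two picture sizes mentioned in this ticket:
print(global_work_size(720, 404))
print(global_work_size(1920, 1080))
```

With a 16x16 local size, 720x404 pads to 720x416 and 1920x1080 pads to 1920x1088, so the amount of wasted work depends on the picture size.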

totaam commented 11 years ago

2013-08-28 17:27:30: totaam uploaded file `add-csc-opencl-v7.patch` (23.0 KiB)

updated patch - fix crash with swscale

totaam commented 11 years ago

2013-08-28 17:45:37: smo commented

Here are the results on Nvidia K1 (Nvidia) OpenCL

At 1920x1080 191 MPixels/s 223 MPixels/s 161 MPixels/s 184 MPixels/s 172 MPixels/s
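For reference, a MPixels/s figure is just the number of pixels processed divided by the elapsed time. The sketch below shows one way to measure it; the helper names and the `convert` callable are hypothetical, and the actual test script may time things differently (for example, it may or may not include buffer transfers).

```python
import time


def mpixels_per_second(width, height, frames, elapsed_s):
    """Throughput in millions of pixels per second."""
    return width * height * frames / elapsed_s / 1e6


def measure(convert, width, height, iterations=10):
    """Time `iterations` calls of convert(width, height), return MPixels/s."""
    start = time.perf_counter()
    for _ in range(iterations):
        convert(width, height)
    elapsed = time.perf_counter() - start
    return mpixels_per_second(width, height, iterations, elapsed)
```

As a sanity check, converting 100 frames of 1920x1080 in exactly one second corresponds to 207.36 MPixels/s, which is in the same range as the figures above.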

totaam commented 11 years ago

2013-08-29 17:18:43: totaam uploaded file `add-csc-opencl-v10.patch` (17.3 KiB)

working version with all yuv formats as input and both BGRX and RGBX as output

totaam commented 11 years ago

2013-08-29 17:22:05: totaam changed status from assigned to new

totaam commented 11 years ago

2013-08-29 17:22:05: totaam changed owner from totaam to smo

totaam commented 11 years ago

2013-08-29 17:22:05: totaam commented

Please re-run with patch v10 which fixes some important bugs.

I am afraid that I cannot commit it as-is because the OpenCL shared libraries we end up loading cause some serious problems:

Traceback (most recent call last):
  File "/usr/bin/xpra", line 6, in <module>
    sys.exit(xpra.scripts.main.main(__file__, sys.argv))
  File "/usr/lib64/python2.7/site-packages/xpra/scripts/main.py", line 432, in main
    return run_server(parser, options, mode, script_file, args)
  File "/usr/lib64/python2.7/site-packages/xpra/scripts/server.py", line 454, in run_server
    import gtk.gdk          #@Reimport
  File "/usr/lib64/python2.7/site-packages/gtk-2.0/gtk/__init__.py", line 40, in <module>
    from gtk import _gtk
ImportError: dlopen: cannot load any more object with static TLS

totaam commented 11 years ago

2013-08-30 15:04:00: antoine uploaded file `add-csc-opencl-v13.patch` (35.6 KiB)

updated patch with support for RGB to YUV444P (and more to come)

totaam commented 11 years ago

2013-08-31 06:17:48: antoine changed status from new to assigned

totaam commented 11 years ago

2013-08-31 06:17:48: antoine changed owner from smo to antoine

totaam commented 11 years ago

2013-08-31 06:17:48: antoine commented

Added support in r4247

According to Recommended 8-Bit YUV Formats for Video Rendering (section on "YUV Sampling"), MPEG2's subsampling code (BT.601) is more lazy than MPEG1's - but since OpenCL is so cheap to run (it is the memory transfers that cost us), I went for the MPEG1-like more exhaustive calculations instead (using an average of all source pixel values).
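To illustrate the difference, here is a plain Python sketch of 4:2:0 chroma subsampling done both ways: averaging all four source values of each 2x2 block (the "exhaustive" approach described above), and keeping a single sample per block. The second function is only shown for contrast and is an assumption about what a cheaper variant could look like; the real kernels are written in OpenCL C and are not reproduced here. Both functions assume even plane dimensions.

```python
def subsample_average(plane):
    """Halve a chroma plane in both directions, averaging each 2x2 block."""
    out = []
    for y in range(0, len(plane) - 1, 2):
        row = []
        for x in range(0, len(plane[y]) - 1, 2):
            total = (plane[y][x] + plane[y][x + 1] +
                     plane[y + 1][x] + plane[y + 1][x + 1])
            row.append((total + 2) // 4)    # rounded integer mean
        out.append(row)
    return out


def subsample_nearest(plane):
    """Cheaper variant: keep only the top-left sample of each 2x2 block."""
    return [row[::2] for row in plane[::2]]


u_plane = [
    [10, 20, 30, 40],
    [30, 40, 50, 60],
]
print(subsample_average(u_plane))
print(subsample_nearest(u_plane))
```

The averaging version reads four times as many source values per output sample, which costs little on a GPU where the memory transfers dominate.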

Still have to figure out the TLS issue before this can be of any use..

totaam commented 11 years ago

2013-09-04 12:51:04: antoine commented

Testing on a dual Xeon E5-2670 with dual NVidia K1s (more results on the wiki: /wiki/CSC), I found that the individual K1 GPU cores are actually slower than my GTS 450 and so using OpenCL with x264 actually makes it run slower (and I believe the CPU savings are not worth much either):
Without OpenCL:
encoded 3347 frames, 148.74 fps, 1853.13 kb/s

real 0m22.759s
user 6m40.754s
sys 0m7.133s
With OpenCL:
encoded 3347 frames, 89.80 fps, 1866.38 kb/s

real 0m46.335s
user 4m42.685s
sys 0m26.054s

totaam commented 11 years ago

2013-09-06 13:53:09: antoine changed status from assigned to closed

totaam commented 11 years ago

2013-09-06 13:53:09: antoine changed resolution from (none) to fixed

totaam commented 11 years ago

2013-09-06 13:53:09: antoine commented

The TLS issue has been solved in r4282 by only properly initializing csc_opencl (getting a context) after we have loaded GTK... which works around the problem rather than solving it properly.
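The workaround amounts to deferring the expensive initialization until first use, so that importing the module does not load the OpenCL libraries before GTK. A generic sketch of that pattern follows; `LazyContext` and the `factory` argument are names invented for this example and do not come from r4282.

```python
class LazyContext:
    """Create the context on first use instead of at import time."""

    def __init__(self, factory):
        self._factory = factory     # callable that builds the real context
        self._context = None

    def get(self):
        # Only the first call pays the initialization cost
        # (and triggers loading of the OpenCL shared libraries).
        if self._context is None:
            self._context = self._factory()
        return self._context
```

With PyOpenCL the factory could be something like `lambda: pyopencl.create_some_context(interactive=False)`, called only once the GTK imports have completed.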

OpenCL is now enabled (r4298) and working well so closing this ticket.

Note: we may still want some enhancements:

handle more modes with generated kernel byteswapping for channel modes not handled by the runtime library (easy)

handle scaling (big!)

debug kernel build errors with FreeOCL and pocl

totaam commented 11 years ago

2013-10-07 09:45:59: totaam commented

scaling was added in r4310

generating missing rgb modes was added in r4303

See also #437

totaam commented 10 years ago

2013-10-15 13:19:02: totaam commented

There were many more changes and tweaks (too many to list).

Note: the TLS issue is discussed here on the PyOpenCL mailing list. Looks like a PyOpenCL build issue - may need to revisit when testing with the Nvidia SDK which only supports OpenCL 1.1 ...

totaam commented 10 years ago

2013-10-18 04:45:45: totaam changed status from closed to reopened

totaam commented 10 years ago

2013-10-18 04:45:45: totaam changed resolution from fixed to (none)

totaam commented 10 years ago

2013-10-18 04:45:45: totaam commented

Just found that the AMD icd causes the client to get into a spin and waste CPU on a spinlock. Simply having the AMD icd in /etc/OpenCL/vendors is enough to trigger the problem, so OpenCL should probably be disabled by default to prevent this. What is really odd is that this only affects the client: the server will happily run with the AMD icd (you can force it to be used with: XPRA_FORCE_CSC_MODE=YUV420P XPRA_CSC_TYPE=opencl xpra start ...). We cannot do a runtime check, as calling any OpenCL API will cause the loader to dlopen the problematic library.. and we're toast.
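One check that does not go through the OpenCL API is to look at the ICD registry directly: each `.icd` file in /etc/OpenCL/vendors is a small text file naming the vendor library to load. The sketch below reads those files with plain file I/O. It is only an idea, not something xpra does, and the function names and the `blacklist` parameter are made up for the example.

```python
import os


def list_icd_files(vendors_dir="/etc/OpenCL/vendors"):
    """Map each .icd filename to the library it names, without touching OpenCL."""
    icds = {}
    if not os.path.isdir(vendors_dir):
        return icds
    for name in sorted(os.listdir(vendors_dir)):
        if not name.endswith(".icd"):
            continue
        with open(os.path.join(vendors_dir, name)) as f:
            icds[name] = f.read().strip()
    return icds


def has_blacklisted_icd(icds, blacklist=("amd",)):
    """True if any ICD filename or library name contains a blacklisted word."""
    return any(word in name.lower() or word in lib.lower()
               for name, lib in icds.items()
               for word in blacklist)


if __name__ == "__main__":
    found = list_icd_files()
    print(found)
    print("problematic ICD present:", has_blacklisted_icd(found))
```

Matching on file and library names is crude, so this would at best be a hint for choosing a safe default.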

Beware: one cannot strace the xpra client (the machine locks up - need ssh to come and kill the strace process)

Here's what strace has to say:

open("/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 10
read(10, "0-7\n", 8192)                 = 4
close(10)                               = 0
mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f78e9007000
mprotect(0x7f78e9007000, 4096, PROT_NONE) = 0
clone(Process 2797 attached
 <unfinished ...>
[pid  2797] set_robust_list(0x7f78e98079e0, 24 <unfinished ...>
[pid  2655] <... clone resumed> child_stack=0x7f78e9806fb0, \
    flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, \
    parent_tidptr=0x7f78e98079d0, tls=0x7f78e9807700, child_tidptr=0x7f78e98079d0) = 2797
[pid  2797] <... set_robust_list resumed> ) = 0
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff <unfinished ...>
[pid  2655] ioctl(9, 0x4008642a <unfinished ...>
[pid  2797] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2655] <... ioctl resumed> , 0x7fff7aabbb08) = 0
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff <unfinished ...>
[pid  2655] ioctl(9, 0xc03064a6 <unfinished ...>
[pid  2797] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid  2797] futex(0x347b040, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {0, 1000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

The futex call repeats forever and the xpra client process consumes >70% CPU doing absolutely nothing.

totaam commented 10 years ago

2013-11-11 09:59:47: totaam edited the issue description

totaam commented 10 years ago

2013-12-05 16:15:41: totaam commented

And another one for good measure, Intel this time, is doing an illegal memory access, caught with valgrind:

==27195== Invalid read of size 8
==27195==    at 0x118DDA1C: __intel_sse2_strrchr (in /opt/intel/opencl-1.2-3.0.67279/lib64/libtbb_preview.so.2)
==27195==    by 0x118C8531: tbb::internal::init_dl_data() (dynamic_link.cpp:290)
==27195==    by 0x118C8466: __sti__$E (dynamic_link.cpp:449)
==27195==    by 0x118E8001: ??? (in /opt/intel/opencl-1.2-3.0.67279/lib64/libtbb_preview.so.2)
==27195==    by 0x118C367A: ??? (in /opt/intel/opencl-1.2-3.0.67279/lib64/libtbb_preview.so.2)
==27195==    by 0x7FF000276: ???
==27195==    by 0x6E6F687479702E: ???
==27195==    by 0x6E69622F7273752E: ???
==27195==    by 0x746100617270782E: ???
==27195==    by 0x652D2D0068636173: ???
==27195==    by 0x3D676E69646F636D: ???
==27195==    by 0x6E2D2D0034363267: ???
==27195==  Address 0xec4c5d8 is 56 bytes inside a block of size 58 alloc'd
==27195==    at 0x4A06409: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==27195==    by 0x3452405C95: open_path (dl-load.c:2036)
==27195==    by 0x34524086DC: _dl_map_object (dl-load.c:2223)
==27195==    by 0x345240CAD1: openaux (dl-deps.c:63)
==27195==    by 0x345240F303: _dl_catch_error (dl-error.c:177)
==27195==    by 0x345240D1D1: _dl_map_object_deps (dl-deps.c:256)
==27195==    by 0x34524138BB: dl_open_worker (dl-open.c:265)
==27195==    by 0x345240F303: _dl_catch_error (dl-error.c:177)
==27195==    by 0x34524131EA: _dl_open (dl-open.c:656)
==27195==    by 0x3452C0102A: dlopen_doit (dlopen.c:66)
==27195==    by 0x345240F303: _dl_catch_error (dl-error.c:177)
==27195==    by 0x3452C0162C: _dlerror_run (dlerror.c:163)

totaam commented 10 years ago

2013-12-10 09:01:27: totaam changed status from reopened to new

totaam commented 10 years ago

2013-12-10 09:01:27: totaam changed owner from antoine to SmO

totaam commented 10 years ago

2013-12-10 09:01:27: totaam commented

I have added the most important setup and configuration information here: CSC, and the performance data now lives here: CSC

There are new SDKs available:

Intel SDK XE 2013 R2 - which I am unable to test on my AMD CPU, can you please check that it still runs OK and maybe add or update the performance data (/wiki/CSC/Performance)? Hopefully they will have fixed the invalid 64-bit memory access from comment:15 - if you have time, run the minimal opencl tests under valgrind.

AMD APP SDK v2.9 - and I can no longer reproduce the client problems.


Maybe this can be enabled by default server side?

I don't think we will ever bother using OpenCL or nvcuda (#384) for CSC on the client side, since we're better off using OpenGL for CSC, scaling and rendering (it is now stable enough to use).

totaam commented 10 years ago

2013-12-20 00:46:54: smo commented

I've tested the Intel, AMD and Nvidia OpenCL ICDs without problems. However, there is an issue with the AMD ICD which prevents Xorg from receiving a kill signal. Even just having this ICD available seems to be enough to trigger it.

I'm going to work from a clean install and try to find a set of instructions that includes all the above info to install the Intel + Nvidia ICDs on Fedora 20 to work with xpra.

totaam commented 10 years ago

2014-01-04 05:35:17: totaam commented

I've just hit this error:
clFinish failed: invalid command queue
After a computer suspend-resume, it seems that the context becomes invalid (must have been cleared from the GPU during suspend). r5110 fixes that.
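The general shape of such a fix is to treat the failure as "context lost", rebuild the context, and retry once. The sketch below shows that pattern with stand-in names: `ContextLost`, `run_with_context_retry` and the three callables are all invented here, and r5110 itself may be structured quite differently. With PyOpenCL the exception to catch would be one of its error classes rather than this placeholder.

```python
class ContextLost(Exception):
    """Stand-in for the error raised when the command queue is invalid."""


def run_with_context_retry(get_context, reset_context, operation):
    """Run operation(context); on a lost context, rebuild it and retry once."""
    try:
        return operation(get_context())
    except ContextLost:
        # e.g. "clFinish failed: invalid command queue" after suspend-resume
        reset_context()
        return operation(get_context())
```

Retrying only once keeps a genuinely broken device from turning into an endless loop: a second failure propagates to the caller.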


Quite likely to affect nvenc (added to #466) and csc_nvcuda (added to #384)

totaam commented 10 years ago

2014-01-09 00:29:31: smo commented

Trying to test with AMD OpenCL using HD 6870 GPU

Getting some strange output, is this normal?

using new OpenCL context
YUV420P to BGRX    at  1920x1080        : 90 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV420P to RGBX    at  1920x1080        : 128 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV422P to BGRX    at  1920x1080        : 113 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV422P to RGBX    at  1920x1080        : 131 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV444P to BGRX    at  1920x1080        : 141 MPixels/s
using new OpenCL context
using new OpenCL context
using new OpenCL context
using new OpenCL context
YUV444P to RGBX    at  1920x1080        : 112 MPixels/s

Seems to be starting many new contexts.

totaam commented 10 years ago

2014-01-09 00:59:47: smo commented

Tested a few suspend/resume with r5153 with an ATI HD6870 and no issue.

2014-01-08 17:55:44,912 PyOpenCL loaded, header version: 1.2, GL support: False
2014-01-08 17:55:44,913  using platform: AMD Accelerated Parallel Processing (Advanced Micro Devices, Inc.)
2014-01-08 17:55:44,913  using device: GPU: Barts (OpenCL 1.2 AMD-APP (1348.4) / OpenCL C 1.2 )

For more info

totaam commented 10 years ago

2014-01-09 02:08:07: totaam commented

From comment:20: that's odd, are you not seeing any `using new OpenCL context` after suspend/resume as I was? (I will try an intel chipset too) The patch opencl-forcewait.patch (attached to this ticket) makes it easier to hit the context problems: it adds a 10 second delay in the encoding so that we can more easily suspend a PC whilst the GPU context is active.

Also, the log from comment:19 is worrying: the context should not have changed during the same run and I don't see how it could. r5154 will tell us what has changed (the context or the "program"). If you still get multiple occurrences of `using new OpenCL context` during the test run, please run the test with XPRA_OPENGL_DEBUG=1 and post the lines preceding these ones; they should read something like: old program=(..), new program=(..) or old context=(..), new context=(..).

totaam commented 10 years ago

2014-01-09 02:21:04: totaam uploaded file `opencl-forcewait.patch` (0.5 KiB)

introduces a 10 second delay in the encoding to make it easier to suspend with a live context

totaam commented 10 years ago

2014-01-09 05:08:11: smo commented

For comment:20

init_context(..) channel order=RGBA, filter mode=NEAREST
init_context(..) kernel_function RGB_to_YUV422P: <pyopencl._cl.Kernel object at 0x3300628>
old program=<pyopencl.Program object at 0x2e21510>, new program=<pyopencl.Program object at 0x2e21510>
using new OpenCL context (program changed)
init_context(..) kernel source=

totaam commented 10 years ago

2014-01-09 06:12:38: totaam uploaded file `opencl-programcompare.patch` (0.9 KiB)

try to use the underlying int_ptr to compare opencl program instances

totaam commented 10 years ago

2014-01-09 06:16:46: totaam commented

What the? the programs are clearly the same... yet fail the comparison test.

Looks like the docs are wrong: pyopencl.Program: Instances of this class are hashable, and two instances of this class may be compared using “==” and ”!=”. (Hashability was added in version 2011.2.) (unless you are using an outdated version of PyOpenCL?)

Can you please try once more with opencl-programcompare.patch (attached to this ticket) to see if the spurious `using new OpenCL context` messages still occur? (and post your version of the PyOpenCL package) The easy alternative would be to remove the program test altogether: I have manually verified that we always re-initialize the programs when we re-initialize the device, so this would be safe for now. But this would make the code much more brittle.
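The idea behind comparing via `int_ptr` is to compare the underlying OpenCL handles rather than relying on the `==` operator of the Python wrapper objects. A minimal sketch, assuming both objects expose PyOpenCL's `int_ptr` attribute (available from PyOpenCL 2013.2); the helper name is made up and the attached patch may do this differently:

```python
def same_cl_object(a, b):
    """Compare two PyOpenCL wrappers by their underlying handle value."""
    if a is None or b is None:
        return a is b
    # int_ptr is the integer value of the wrapped OpenCL handle
    return a.int_ptr == b.int_ptr
```

Two distinct wrapper instances pointing at the same `cl_program` then compare as equal, which is what the context check needs.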

totaam commented 10 years ago

2014-01-09 15:34:35: smo commented

Odd, pyopencl seems to be installed 32 bit??

Using /usr/lib/python2.7/site-packages/pyopencl-2013.2-py2.7-linux-x86_64.egg. I installed this with easy_install -Z pyopencl. I may have to do it by hand, we'll see.

I applied your patch and they seem to be all gone now.

totaam commented 10 years ago

2014-01-09 15:42:46: totaam commented

OK, I'll try to produce a test case to report the bug to PyOpenCL, which I will have to ask you to test for me since I can't reproduce this weirdness. In the meantime, r5157 merges the workaround with a long comment explaining its purpose.

FYI: /usr/lib/python2.7/site-packages/ can contain both 32-bit and 64-bit extensions..

totaam commented 10 years ago

2014-01-09 15:46:16: smo commented

Thanks for the clarification. I'll update the performance chart with my numbers from this machine and a quick instruction set for being able to run it.

AMD drivers require some extra stuff like exporting COMPUTE=:0 so I assume you actually have to have an X server running?

That said, I think we've tried out opencl_csc on several platforms now and with several opencl ICDs.

totaam commented 10 years ago

2014-01-09 23:42:57: smo commented

Install AMD OpenCL on Fedora 20

I did this from a fresh install with LXDE

From a root terminal
yum group install "Development Tools"; yum install kernel-devel opencl-headers gcc-c++

cd /tmp
wget http://www2.ati.com/drivers/beta/amd-catalyst-13.11-betaV9.95-linux-x86.x86_64.zip

unzip amd-catalyst-13.11-betaV9.95-linux-x86.x86_64.zip
chmod +x Install-AMD-APP.sh; ./Install-AMD-APP.sh
I chose to do an express install. It may ask you to reboot; I chose to do this after I installed the AMD APP SDK.

Download AMD-APP-SDK-v2.9-lnx64.tgz from http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/
tar xfvz ../AMD-APP-SDK-v2.9-lnx64.tgz
./Install-AMD-App.sh
I rebooted after this install and proceeded to install pyopencl with easy_install:
easy_install -Z pyopencl
Started and tested xpra with this command line
COMPUTE=:0 XPRA_OPENCL_DEVICE_TYPE=GPU xpra --no-daemon --bind-tcp=0.0.0.0:1300 --start-child="xterm -fg white -bg black" start :13

totaam commented 10 years ago

2014-02-12 19:18:47: smo changed status from new to closed

totaam commented 10 years ago
