Fyusion-Open-Source / fyusenet

FyuseNet is an OpenGL(ES)-based library for running neural network inference on GPUs that support OpenGL or OpenGL/ES, which is the case for most desktop and mobile GPUs on the market.
https://fyusion.com
MIT License

Fyusenet works in OpenGL but not in GLES2 and EGL #9

Closed IssacXid closed 5 months ago

IssacXid commented 8 months ago

Good Afternoon,

I’ve been working with a target board with a PowerVR GPU. The target board only has OpenGLES and EGL and doesn’t have OpenGL / glx / glfw / GLUT / GLU / GLVND etc.

I’ve been able to run the sample ResNet provided in fyusenet successfully on an Ubuntu host system with OpenGL, glx and GLVND, but on the target board the output of the ResNet always comes out as the first class. I printed the intermediate layers to narrow down which layer is causing the issue, and it turned out to be the second-to-last layer, which is defined in a file called deepgemmlayer.cpp. To do this I used the enableIntermediateOutput() API in the fyusenet library and converted the raw binary intermediate feature maps to decimals. I think it has something to do with how the shader is executed by the GLES/EGL combination. I've tried changing the precision values in the .vert and .frag shader programs, assuming that the board hardware might not support all the operations required in the GEMM layer, but that also gives the same all-zero output. That assumption was more trial and error, though, since the feature maps of all previous layers are non-zero floating-point values, similar to those on the Ubuntu host.
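For readers who want to repeat this check, here is a minimal sketch of converting a raw dump to numbers and testing whether a layer's output is all zero. It assumes the dump is a flat array of 32-bit floats in host byte order, which the thread does not confirm; `DumpStats`, `summarize` and `loadRawFloats` are hypothetical helper names and not part of FyuseNet.

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Summary of one dumped feature map.
struct DumpStats {
    std::size_t count = 0;    // number of float values
    std::size_t nonzero = 0;  // values that are not exactly 0.0f
    float minVal = 0.0f;
    float maxVal = 0.0f;
};

// Computes count, non-zero count and value range of a feature map.
DumpStats summarize(const std::vector<float>& values) {
    DumpStats s;
    s.count = values.size();
    if (values.empty()) return s;
    s.minVal = s.maxVal = values[0];
    for (float v : values) {
        if (v != 0.0f) ++s.nonzero;
        if (v < s.minVal) s.minVal = v;
        if (v > s.maxVal) s.maxVal = v;
    }
    return s;
}

// ASSUMPTION: the file is a flat array of 32-bit floats in host byte order.
// Trailing bytes that do not fill a whole float are ignored.
std::vector<float> loadRawFloats(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<char> bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    std::vector<float> values(bytes.size() / sizeof(float));
    if (!values.empty()) {
        std::memcpy(values.data(), bytes.data(), values.size() * sizeof(float));
    }
    return values;
}
```

Running `summarize(loadRawFloats(path))` on each dumped layer, on both host and target, and comparing the `nonzero` counts would show the first layer where the two systems diverge.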

Can anyone help me find out why the shaders behave in such a way that the layer’s output becomes all zeros?

Thanks in advance for any help.

mtnwrw commented 6 months ago

Hi @IssacXid ,

Sorry for getting back to you so late. First of all, can you tell me what GPU and OS you are running on? On StackOverflow I saw that you mentioned PowerVR; which model exactly?

Also, I have a personal fork with a bunch of updates. Can you reproduce the same problem on that?

If that problem persists we can look a bit deeper and try to figure out what's going on.

IssacXid commented 6 months ago

Hey @mtnwrw,

Unfortunately, I don't have access to the target board right now. It's expected to be returned to me by Monday. Once I have the board back, I'll try reproducing the issue on your fork as well and let you know the outcome.

The target board has an IMG AXM-8-256 GPU, and the OS is Linux.

Thanks for your help.

mtnwrw commented 6 months ago

I got my hands on this over the weekend.

It has the IMG BXE-4-32 MC1 GPU, which should be quite a bit slower than the one that you are using. I am still working on getting the GPU drivers to work properly :-)

I saw that you mentioned GLES 2.0. I would like to note that FyuseNet requires GLES 3.0 to work. Though it should be possible to get it to work with GLES 2.0 (with some adaptations to the shaders), I would not recommend it. Given that only the point-primitive layer at the end causes problems, I assume you are running under GLES 3.0 (it would have failed way earlier otherwise). I already have an idea of what could be wrong and will test it here locally; otherwise I will provide instructions here (should be a one-line fix).

IssacXid commented 6 months ago

Hello, thanks for testing it out while I couldn't access the board. I got it running, and cloned and built the modified repository with some changes:

1) Set the flag BUILD_TESTS to OFF, as it otherwise fetches the gtest repository and we cannot connect the target board to the internet.
2) In CMakeLists.txt, replaced the argument OpenGL in find_package() with OpenGLES and EGL, for which I added FindOpenGLES.cmake and FindEGL.cmake to the cmake directory.
3) Changed some occurrences of GLES/gl.h to GLES2/gl2.h, GLES/glext.h to GLES2/gl2ext.h and GLES3/gl3ext.h to GLES3/gl2ext.h, as per the files available on the target board.

Before getting to the point of reproducing the deepgemmlayer.cpp issue, I'm trying to debug another error that I didn't face with the original fyusenet repository:

  - GL_OVR_multiview2
terminate called after throwing an instance of 'fyusion::opengl::GLException'
  what():  /home/root/github_issue/fyusenet/fyusenet/base/buffermanager.cpp:724 [fyusion::fyusenet::BufferManager::Textn
Detailed error: Cannot parameterize texture (err=0x502)

Aborted

Do you know why this is occurring? I'll try to debug it in the meantime.

mtnwrw commented 6 months ago

Hey @IssacXid ,

So to your point 2, that is actually a good idea. Would you mind setting up a PR into my fork with the changes in CMakeLists.txt and the cmake modules?

Point 3 puzzles me a little bit. It is possible that different Linux distros handle things differently, but Debian and Ubuntu put the headers into /usr/include/GLES. They do, however, also have copies in the other folders you mentioned. Perhaps I should do some homework on which pattern is more common among different distros and settle for a common theme here.

The error (0x502) that you see there is a long-standing nuisance in my codebase which has to do with GLES and the support of RGB textures (as write targets). Don't get me started on that :-)

I set up a PR here which fixes that issue as well as another issue that I discovered when testing on the RISC-V board.
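For reference, 0x502 is the value of GL_INVALID_OPERATION in the OpenGL ES headers. A small sketch for turning glGetError() codes into readable names follows; the constants are redeclared so that it compiles without GL headers, and `glErrorName` is a hypothetical helper, not a FyuseNet function.

```cpp
#include <string>

// Error codes returned by glGetError(), with the values used in the
// OpenGL ES headers. Redeclared here so the sketch needs no GL headers.
constexpr unsigned int kGlNoError                     = 0x0000;
constexpr unsigned int kGlInvalidEnum                 = 0x0500;
constexpr unsigned int kGlInvalidValue                = 0x0501;
constexpr unsigned int kGlInvalidOperation            = 0x0502;
constexpr unsigned int kGlOutOfMemory                 = 0x0505;
constexpr unsigned int kGlInvalidFramebufferOperation = 0x0506;

// Hypothetical helper: maps an error code to its symbolic name.
std::string glErrorName(unsigned int err) {
    switch (err) {
        case kGlNoError:                     return "GL_NO_ERROR";
        case kGlInvalidEnum:                 return "GL_INVALID_ENUM";
        case kGlInvalidValue:                return "GL_INVALID_VALUE";
        case kGlInvalidOperation:            return "GL_INVALID_OPERATION";
        case kGlOutOfMemory:                 return "GL_OUT_OF_MEMORY";
        case kGlInvalidFramebufferOperation: return "GL_INVALID_FRAMEBUFFER_OPERATION";
        default:                             return "unknown GL error";
    }
}
```

With this, the `err=0x502` in the exception message above reads as GL_INVALID_OPERATION, meaning the texture call was rejected as not allowed in the current state or for the given combination of parameters.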

Your initial problem (the zero output in the GEMM layer) was already fixed in my fork, it was a simple line in the vertex shaders:

gl_PointSize = 1.0;

Not all drivers seem to use the default (1.0) when not explicitly instructed to.
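To make the fix concrete, here is a minimal sketch of a point-rendering vertex shader that writes the point size explicitly, embedded as a C++ string the way shader sources are often shipped. The shader is a hypothetical example and not taken from FyuseNet; the attribute name `attributes0` and the helper `setsPointSize` are assumptions made for illustration. Since drivers may differ in what they do when gl_PointSize is left unwritten, setting it in every shader that draws points is the portable choice.

```cpp
#include <string>

// Hypothetical GLSL ES 3.00 vertex shader for rendering GL_POINTS.
// The attribute name is an assumption, not FyuseNet's actual layout.
const std::string kPointVertexShader = R"glsl(#version 300 es
precision highp float;
in vec4 attributes0;
void main() {
    gl_Position = vec4(attributes0.xy, 0.0, 1.0);
    // Set the size explicitly instead of relying on a driver default.
    gl_PointSize = 1.0;
}
)glsl";

// Crude check for auditing shader sources: reports whether the source
// mentions gl_PointSize at all (it does not parse the GLSL).
bool setsPointSize(const std::string& source) {
    return source.find("gl_PointSize") != std::string::npos;
}
```

A check like `setsPointSize` could be run over all vertex shaders used by point-primitive layers to spot ones that still rely on the driver's behavior.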

But there was another bug that only occurred on weaker GPUs, which yields wrong results in convolution layers with larger kernels. That is also addressed and fixed in the PR above.

I could run the ResNet50 sample on the RISC-V board. The performance was not overwhelming (around 1800ms), but the GPU on that board is on the low end of the spectrum for PowerVR archs. So I hope that you get a bit more performance out of it.

Also, the drivers that are available for my board do not seem to be really mature. For example, I was not able to get any sane results when using 16-bit FP textures; I had to set the HIGH_PRECISION flag in the CMakeLists.txt to ON to make it work.

Anyway, let me know if this solves your issue.

P.S.: Can you post the output of the GLInfo part of your GPU (the lengthy output at the beginning when using debug builds)?

IssacXid commented 6 months ago

Sure thing! 🚀 PR is up! Also, regarding gl_PointSize = 1.0, can you point me to the exact files for that tweak? I'll try to make it work for the original fyusenet if I can get the ResNet to run.

This is the GLInfo log for the target board:

 GL version: 3.2
GLSL version: 3.20 build 1.18@6276027
GPU vendor: Imagination Technologies
GPU renderer: PowerVR A-Series AXM-8-256
Caps:
  GL_MAX_TEXTURE_SIZE: 16384
  GL_MAX_VERTEX_ATTRIBS: 16
  GL_MAX_VERTEX_UNIFORM_VECTORS: 1024
  GL_MAX_VARYING_VECTORS: 15
  GL_MAX_VERTEX_OUTPUT_COMPONENTS: 64
  GL_MAX_COMBINED_TEXTURE_IMAGE_UNITS: 144
  GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS: 24
  GL_MAX_TEXTURE_IMAGE_UNITS: 24
  GL_MAX_FRAGMENT_UNIFORM_VECTORS: 1024
  GL_MAX_FRAGMENT_INPUT_COMPONENTS: 60
  GL_MAX_COLOR_ATTACHMENTS: 8
  GL_MAX_DRAW_BUFFERS: 8
  GL_MAX_FRAGMENT_UNIFORM_COMPONENTS: 4096
  GL_MAX_VERTEX_UNIFORM_COMPONENTS: 4096
  (I) GL_FRAGMENT_LOW: [15 15]
  (I) GL_FRAGMENT_MEDIUM: [15 15]
  (I) GL_FRAGMENT_HIGH: [31 31]
  (F) GL_FRAGMENT_LOW: [1 1] 8
  (F) GL_FRAGMENT_MEDIUM: [14 14] 10
  (F) GL_FRAGMENT_HIGH: [127 127] 23
  GL_MAX_UNIFORM_BUFFER_BINDINGS: 72
  GL_MAX_VERTEX_UNIFORM_BLOCKS: 12
  GL_MAX_FRAGMENT_UNIFORM_BLOCKS: 12
  GL_MAX_COMBINED_UNIFORM_BLOCKS: 72
  GL_MAX_UNIFORM_BLOCK_SIZE: 134217728
  GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT: 128
  GL_MIN_PROGRAM_TEXEL_OFFSET: -8
  GL_MAX_PROGRAM_TEXEL_OFFSET: 7
  GL_MAX_COMBINED_VERTEX_UNIFORM_COPONENTS: 402657280
  GL_MAX_COMBINED_FRAGMENT_UNIFORM_COMPONENTS: 402657280
  GL_MAX_COMPUTE_IMAGE_UNIFORMS: 24
  GL_MAX_COMPUTE_SHADER_STORAGE_BLOCKS: 35
  GL_MAX_COMPUTE_SHADER_STORAGE_BLOCKS: 24
  GL_MAX_COMPUTE_SHADER_STORAGE_BLOCKS: 0
  GL_MAX_COMPUTE_UNIFORM_BLOCKS: 12
  GL_MAX_COMPUTE_UNIFORM_COMPONENTS: 1024
  GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS: 1024
  GL_MAX_COMPUTE_WORK_GROUP_SIZE: 1024 1024 1024
  GL_MAX_COMPUTE_WORK_GROUP_COUNT: 65535 65535 65535
GL_MAX_TEXTURE_BUFFER_SIZE: 65536
Extensions:
  - GL_ANDROID_extension_pack_es31a
  - GL_APPLE_texture_format_BGRA8888
  - GL_ARM_shader_framebuffer_fetch_depth_stencil
  - GL_EXT_blend_minmax
  - GL_EXT_buffer_storage
  - GL_EXT_clear_texture
  - GL_EXT_clip_control
  - GL_EXT_color_buffer_float
  - GL_EXT_color_buffer_half_float
  - GL_EXT_compressed_ETC1_RGB8_sub_texture
  - GL_EXT_conservative_depth
  - GL_EXT_copy_image
  - GL_EXT_discard_framebuffer
  - GL_EXT_draw_buffers
  - GL_EXT_draw_buffers_indexed
  - GL_EXT_draw_elements_base_vertex
  - GL_EXT_EGL_image_array
  - GL_EXT_float_blend
  - GL_EXT_geometry_point_size
  - GL_EXT_geometry_shader
  - GL_EXT_gpu_shader5
  - GL_EXT_memory_object
  - GL_EXT_memory_object_fd
  - GL_EXT_multi_draw_arrays
  - GL_EXT_multi_draw_indirect
  - GL_EXT_multisampled_render_to_texture
  - GL_EXT_multisampled_render_to_texture2
  - GL_EXT_occlusion_query_boolean
  - GL_EXT_polygon_offset_clamp
  - GL_EXT_primitive_bounding_box
  - GL_EXT_pvrtc_sRGB
  - GL_EXT_read_format_bgra
  - GL_EXT_robustness
  - GL_EXT_separate_shader_objects
  - GL_EXT_shader_framebuffer_fetch
  - GL_EXT_shader_group_vote
  - GL_EXT_shader_implicit_conversions
  - GL_EXT_shader_io_blocks
  - GL_EXT_shader_non_constant_global_initializers
  - GL_EXT_shader_pixel_local_storage
  - GL_EXT_shader_pixel_local_storage2
  - GL_EXT_shader_texture_lod
  - GL_EXT_shadow_samplers
  - GL_EXT_sparse_texture
  - GL_EXT_sRGB
  - GL_EXT_sRGB_write_control
  - GL_EXT_tessellation_point_size
  - GL_EXT_tessellation_shader
  - GL_EXT_texture_border_clamp
  - GL_EXT_texture_buffer
  - GL_EXT_texture_cube_map_array
  - GL_EXT_texture_filter_anisotropic
  - GL_EXT_texture_format_BGRA8888
  - GL_EXT_texture_format_sRGB_override
  - GL_EXT_texture_norm16
  - GL_EXT_texture_rg
  - GL_EXT_texture_shadow_lod
  - GL_EXT_texture_sRGB_decode
  - GL_EXT_texture_sRGB_R8
  - GL_EXT_texture_sRGB_RG8
  - GL_EXT_texture_type_2_10_10_10_REV
  - GL_EXT_unpack_subimage
  - GL_EXT_YUV_target
  - GL_EXT_texture_storage_compression
  - GL_IMG_framebuffer_downsample
  - GL_IMG_multisampled_render_to_texture
  - GL_IMG_program_binary
  - GL_IMG_read_format
  - GL_IMG_shader_binary
  - GL_IMG_texture_compression_pvrtc
  - GL_IMG_texture_compression_pvrtc2
  - GL_IMG_texture_format_BGRA8888
  - GL_IMG_texture_npot
  - GL_KHR_blend_equation_advanced
  - GL_KHR_blend_equation_advanced_coherent
  - GL_KHR_debug
  - GL_KHR_robustness
  - GL_KHR_texture_compression_astc_ldr
  - GL_OES_compressed_ETC1_RGB8_texture
  - GL_OES_depth24
  - GL_OES_depth_texture
  - GL_OES_depth_texture_cube_map
  - GL_OES_draw_buffers_indexed
  - GL_OES_draw_elements_base_vertex
  - GL_OES_EGL_image
  - GL_OES_EGL_image_external
  - GL_OES_EGL_image_external_essl3
  - GL_OES_EGL_sync
  - GL_OES_element_index_uint
  - GL_OES_fragment_precision_high
  - GL_OES_geometry_point_size
  - GL_OES_geometry_shader
  - GL_OES_get_program_binary
  - GL_OES_gpu_shader5
  - GL_OES_mapbuffer
  - GL_OES_packed_depth_stencil
  - GL_OES_required_internalformat
  - GL_OES_rgb8_rgba8
  - GL_OES_sample_shading
  - GL_OES_sample_variables
  - GL_OES_shader_image_atomic
  - GL_OES_shader_io_blocks
  - GL_OES_shader_multisample_interpolation
  - GL_OES_standard_derivatives
  - GL_OES_surfaceless_context
  - GL_OES_tessellation_point_size
  - GL_OES_tessellation_shader
  - GL_OES_texture_3D
  - GL_OES_texture_border_clamp
  - GL_OES_texture_buffer
  - GL_OES_texture_cube_map_array
  - GL_OES_texture_float
  - GL_OES_texture_half_float
  - GL_OES_texture_npot
  - GL_OES_texture_stencil8
  - GL_OES_texture_storage_multisample_2d_array
  - GL_OES_vertex_array_object
  - GL_OES_vertex_half_float
  - GL_OES_viewport_array
  - GL_OVR_multiview
  - GL_OVR_multiview2

IssacXid commented 6 months ago

Sorry, I clicked the close issue button by mistake.

1) Regarding the change of GLES3/gl3ext.h to GLES3/gl2ext.h: I had to do it because gl3ext.h is not present on the target board. Later, I also confirmed this from a link, which states:

For GLES3 there is no gl3ext.h header as defined by Khronos[1]
Even though MESA is supplying an empty one, compiling gst-plugins-bad fails
against OpenGL drivers that are not mesa-based and adhere to the Khronos
definition to only ship gl2ext.h, even for GLES3 

2) Can you tell me if there is any quick fix I can try for the 0x502 error in the meantime?
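On the header question in point 1, one way to cope with both layouts is to let the preprocessor pick whichever extension header is installed. The following is a sketch under assumptions: it relies on `__has_include` (standard since C++17, also available as an extension in GCC and Clang), it uses GLES2/gl2ext.h as the fallback path, and the macro `SKETCH_GLES_EXT_HEADER` and function `glesExtensionHeader` are hypothetical names. It is not how FyuseNet currently selects its headers.

```cpp
// Pick the GLES extension header that the platform actually ships.
#if defined(__has_include)
#  if __has_include(<GLES3/gl3.h>)
#    include <GLES3/gl3.h>            // core GLES 3.x API and GL types
#    if __has_include(<GLES3/gl3ext.h>)
#      include <GLES3/gl3ext.h>       // present on some distributions
#      define SKETCH_GLES_EXT_HEADER "GLES3/gl3ext.h"
#    elif __has_include(<GLES2/gl2ext.h>)
#      include <GLES2/gl2ext.h>       // fallback when gl3ext.h is absent
#      define SKETCH_GLES_EXT_HEADER "GLES2/gl2ext.h"
#    endif
#  endif
#endif

// No GLES headers found (for example on a desktop-only build host).
#ifndef SKETCH_GLES_EXT_HEADER
#  define SKETCH_GLES_EXT_HEADER "none"
#endif

// Reports which extension header was selected at compile time.
const char* glesExtensionHeader() { return SKETCH_GLES_EXT_HEADER; }
```

Printing `glesExtensionHeader()` once at startup makes it easy to see which path was taken on a given board, without editing include lines per target.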

IssacXid commented 6 months ago

I made the changes in the original fyusenet repo by adding gl_PointSize = 1.0 to the files deepconv1x1_tiled.vert, deepdefault.frag and deepdefault.vert. The deepgemmlayer is now giving non-zero outputs, but the ResNet gives the same wrong class for all the examples I ran: it now comes out as the class with index 936. Any suggestions for a fix?

mtnwrw commented 6 months ago

That fix would be in my fork :-)

I highly recommend going with my fork; it has a ton of updates compared to the original. The reason why those updates are not merged into the upstream repo is an unclear situation regarding licenses for the Llama-type LLMs (which my fork can run). I am quite sure that once those issues have been resolved, it will be merged upstream. Also, I am the only one developing FyuseNet, so you won't miss anything from the upstream repo for now :-)

IssacXid commented 6 months ago

Hello @mtnwrw , Thanks for informing! After building on the github branch: Fixes for deep convolutions on weak GPUs and EGL, I got this error when running sample networks:

fyusenet/gl/egl/glcontext_egl.cpp:201 [virtual void fyusion::opengl::GLContext::init()] threw GLException
Detailed error: Cannot initialize EGL extensions

Any workaround/suggestion?

Maybe attaching all the symbols of libEGL.so might help:


nm -D  /usr/lib/libEGL.so
                 U IMGeglBindAPI
                 U IMGeglBindTexImage
                 U IMGeglChooseConfig
                 U IMGeglCopyBuffers
                 U IMGeglCreateContext
                 U IMGeglCreatePbufferFromClientBuffer
                 U IMGeglCreatePbufferSurface
                 U IMGeglCreatePixmapSurface
                 U IMGeglCreateWindowSurface
                 U IMGeglDestroyContext
                 U IMGeglDestroySurface
                 U IMGeglGetConfigAttrib
                 U IMGeglGetConfigs
                 U IMGeglGetCurrentContext
                 U IMGeglGetCurrentDisplay
                 U IMGeglGetCurrentSurface
                 U IMGeglGetDisplay
                 U IMGeglGetError
                 U IMGeglGetProcAddress
                 U IMGeglInitialize
                 U IMGeglMakeCurrent
                 U IMGeglQueryAPI
                 U IMGeglQueryContext
                 U IMGeglQueryString
                 U IMGeglQuerySurface
                 U IMGeglReleaseTexImage
                 U IMGeglReleaseThread
                 U IMGeglSurfaceAttrib
                 U IMGeglSwapBuffers
                 U IMGeglSwapInterval
                 U IMGeglTerminate
                 U IMGeglWaitClient
                 U IMGeglWaitGL
                 U IMGeglWaitNative
                 w _ITM_deregisterTMCloneTable
                 w _ITM_registerTMCloneTable
                 w __cxa_finalize
                 w __gmon_start__
0000000000001740 T eglBindAPI
0000000000001710 T eglBindTexImage
00000000000015d0 T eglChooseConfig
00000000000016e0 T eglCopyBuffers
0000000000001640 T eglCreateContext
0000000000001730 T eglCreatePbufferFromClientBuffer
0000000000001610 T eglCreatePbufferSurface
0000000000001600 T eglCreatePixmapSurface
00000000000015f0 T eglCreateWindowSurface
0000000000001650 T eglDestroyContext
0000000000001620 T eglDestroySurface
00000000000015e0 T eglGetConfigAttrib
00000000000015c0 T eglGetConfigs
0000000000001670 T eglGetCurrentContext
0000000000001690 T eglGetCurrentDisplay
0000000000001680 T eglGetCurrentSurface
0000000000001570 T eglGetDisplay
0000000000001560 T eglGetError
00000000000015b0 T eglGetProcAddress
0000000000001580 T eglInitialize
0000000000001660 T eglMakeCurrent
0000000000001750 T eglQueryAPI
00000000000016a0 T eglQueryContext
00000000000015a0 T eglQueryString
0000000000001630 T eglQuerySurface
0000000000001720 T eglReleaseTexImage
0000000000001770 T eglReleaseThread
0000000000001700 T eglSurfaceAttrib
00000000000016d0 T eglSwapBuffers
00000000000016f0 T eglSwapInterval
0000000000001590 T eglTerminate
0000000000001760 T eglWaitClient
00000000000016b0 T eglWaitGL
00000000000016c0 T eglWaitNative

mtnwrw commented 6 months ago

Seems that your board offers none of the extensions to query multiple devices. I just added fallback code for that case which goes with the default display instead.
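The fallback described here can be sketched as a small decision function. This is an illustration under assumptions, not the code in the fork: the loader is a stand-in for eglGetProcAddress() so that the sketch compiles without EGL headers, `ProcLoader`, `DisplayPath` and `chooseDisplayPath` are hypothetical names, and it assumes device enumeration goes through eglQueryDevicesEXT and eglGetPlatformDisplayEXT, which are the entry points of the EGL_EXT_device_enumeration and EGL_EXT_platform_base extensions.

```cpp
#include <functional>

// Stand-in for eglGetProcAddress(): returns a non-null pointer when the
// named entry point is available. ASSUMPTION: injected for illustration.
using ProcLoader = std::function<void*(const char*)>;

enum class DisplayPath {
    DeviceEnumeration,  // enumerate EGL devices and pick one
    DefaultDisplay      // plain eglGetDisplay(EGL_DEFAULT_DISPLAY)
};

// Uses device enumeration only if both extension entry points resolve,
// otherwise falls back to the default display.
DisplayPath chooseDisplayPath(const ProcLoader& load) {
    void* queryDevices    = load("eglQueryDevicesEXT");
    void* platformDisplay = load("eglGetPlatformDisplayEXT");
    if (queryDevices != nullptr && platformDisplay != nullptr) {
        return DisplayPath::DeviceEnumeration;
    }
    return DisplayPath::DefaultDisplay;
}
```

On a board like the one in this thread, where the loader returns null for the extension functions, the function selects `DefaultDisplay` instead of failing during initialization.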

IssacXid commented 6 months ago

Yeah, I checked on the symbol 'eglGetProcAddress' in the libEGL.so, and it pointed to IMGeglGetProcAddress in libIMGegl.so and it also had the symbol IMGeglGetProcAddress.

objdump -D /usr/lib/libIMGegl.so | grep IMGeglGetProcAddress
0000000000010c70 <IMGeglGetProcAddress@@Base>:
   10c84:       b4000900        cbz     x0, 10da4 <IMGeglGetProcAddress@@Base+0x134>
   10cac:       b40007b5        cbz     x21, 10da0 <IMGeglGetProcAddress@@Base+0x130>
   10cd0:       14000004        b       10ce0 <IMGeglGetProcAddress@@Base+0x70>
   10d00:       340003a0        cbz     w0, 10d74 <IMGeglGetProcAddress@@Base+0x104>
   10d04:       54fffe81        b.ne    10cd4 <IMGeglGetProcAddress@@Base+0x64>  // b.any
   10d14:       54000820        b.eq    10e18 <IMGeglGetProcAddress@@Base+0x1a8>  // b.none
   10d18:       54000128        b.hi    10d3c <IMGeglGetProcAddress@@Base+0xcc>  // b.pmore
   10d1c:       34000700        cbz     w0, 10dfc <IMGeglGetProcAddress@@Base+0x18c>
   10d40:       54000501        b.ne    10de0 <IMGeglGetProcAddress@@Base+0x170>  // b.any
   10d4c:       34000340        cbz     w0, 10db4 <IMGeglGetProcAddress@@Base+0x144>
   10dc0:       54fffc81        b.ne    10d50 <IMGeglGetProcAddress@@Base+0xe0>  // b.any
   10e04:       34000160        cbz     w0, 10e30 <IMGeglGetProcAddress@@Base+0x1c0>
   10e0c:       b4000121        cbz     x1, 10e30 <IMGeglGetProcAddress@@Base+0x1c0>
   10e14:       17ffffd1        b       10d58 <IMGeglGetProcAddress@@Base+0xe8>
   10e20:       34000120        cbz     w0, 10e44 <IMGeglGetProcAddress@@Base+0x1d4>
   10e2c:       17ffffcb        b       10d58 <IMGeglGetProcAddress@@Base+0xe8>
   10e40:       17ffffda        b       10da8 <IMGeglGetProcAddress@@Base+0x138>
   10e50:       54fffea1        b.ne    10e24 <IMGeglGetProcAddress@@Base+0x1b4>  // b.any
   10e54:       17ffffdc        b       10dc4 <IMGeglGetProcAddress@@Base+0x154>

Any way of validating it won't work this way?

mtnwrw commented 6 months ago

As long as your board only advertises one EGL device anyway, there will be no difference in the outcome. There are quite a lot of boards and also desktop devices that offer more than one EGL device. In many cases the "other" devices are software-emulated, for example Mesa EGL. Some laptops with "Optimus" have NVIDIA, Intel and Mesa devices etc.

IssacXid commented 6 months ago

It worked!! Thanks for adding the patch. Btw, why did you remove the inference timing pipeline?

mtnwrw commented 6 months ago

You mean in the sample apps? Yeah, I forgot to remove those in the original release. The engine itself gathers more detailed timings; I merged a PR which makes that interface available as part of the API. You might want to look into that, or just add those timing lines back to the samples, whatever is easier. What kind of timings do you get on your device, btw?

P.S.: I merged your PR but it needs a few amendments, will do those later and add you as reviewer.

IssacXid commented 6 months ago

For the resnet, it's coming out to ~540 ms. After enabling Engine timing, I'll share that too.

IssacXid commented 6 months ago

Got the engine timings (but the inference time increased a lot, to ~1700 ms, because of writing the results to file): Total=84804 microsec

2 3954
3 3450
4 1665
5 427
6 1182
7 1381
8 971
9 847
10 333
11 758
12 692
13 679
14 343
15 664
16 666
17 908
18 1060
19 1092
20 1428
21 750
22 304
23 993
24 1144
25 626
26 277
27 614
28 1080
29 591
30 280
31 624
32 1176
33 688
34 859
35 1399
36 1896
37 1558
38 385
39 1495
40 2072
41 1584
42 297
43 1200
44 1625
45 1558
46 306
47 1090
48 1158
49 1282
50 303
51 1220
52 1371
53 1251
54 283
55 1041
56 1550
57 1612
58 1472
59 1794
60 2472
61 2015
62 386
63 2661
64 2126
65 1658
66 366
67 1389
68 1614
69 1267
70 1351
71 2026
72 2026
73 430
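
A listing like this is easier to read with a small summary. The sketch below parses such lines and reports the total and the slowest entry. It assumes that each line holds a layer number followed by a time in microseconds, which the log suggests but does not state; `TimingSummary` and `summarizeTimings` are hypothetical names, not part of the FyuseNet API.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// Aggregate over a per-layer timing listing.
struct TimingSummary {
    std::size_t layers = 0;           // number of parsed entries
    std::uint64_t totalMicros = 0;    // sum of all entries
    int slowestLayer = -1;            // layer number of the largest entry
    std::uint64_t slowestMicros = 0;  // value of the largest entry
};

// ASSUMPTION: input is whitespace-separated "<layer> <microseconds>" pairs.
// Parsing stops at the first token that does not fit this pattern.
TimingSummary summarizeTimings(const std::string& text) {
    TimingSummary s;
    std::istringstream in(text);
    int layer = 0;
    std::uint64_t micros = 0;
    while (in >> layer >> micros) {
        ++s.layers;
        s.totalMicros += micros;
        if (micros > s.slowestMicros) {
            s.slowestMicros = micros;
            s.slowestLayer = layer;
        }
    }
    return s;
}
```

Fed with the first three lines of the listing ("2 3954", "3 3450", "4 1665"), it reports three entries, a total of 9069 microseconds, and layer 2 as the slowest.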

mtnwrw commented 6 months ago

I suspect that the texture up/download might add a bit to the 540ms, which seems a bit slow. I think I might add a sample benchmark that uses asynchronous texture handling to get a bit closer to the actual GPU timings.

mtnwrw commented 6 months ago

@IssacXid If you have some time, please check if that works out for you and add a review:

PR #10

Also, I added a ResNet50 benchmark standalone executable. I suggest you run it like this:

resnet_bench --sync --warmup 5 -r 5 -w <weights> <input_img>

That should give a rough estimate of what the GPU itself is able to do on that net.

mtnwrw commented 5 months ago

PR is merged. Closing ticket.