doe300 / VC4C

Compiler for the VC4CL OpenCL implementation
MIT License
118 stars 37 forks source link

Removes mutex locks from most memory accesses #120

Open doe300 opened 6 years ago

doe300 commented 6 years ago

See #113

Effects (test_emulator, whole PR):

Instructions:       45160 to 49793 (+10%)
Cycles:             659247 to 641681 (-2%)
Mutex waits:        282551 to 0 (-100%)
Mutex locks:        18272 to 11 (-99%)
Total time (in ms): 56722 to 58101 (+2%)

Tests:

To be fixed:

clpeak results:

Platform: OpenCL for the Raspberry Pi VideoCore IV GPU
  Device: VideoCore IV GPU
    Driver version  : 0.4 (Linux ARM)
    Compute units   : 1
    Clock frequency : 300 MHz

    Global memory bandwidth (GBPS)
      float   : 0.07
      float2  : 0.12
      float4  : 0.22
      float8  : 0.35
      float16 : 0.51

    Single-precision compute (GFLOPS)
      float   : 0.26
      float2  : 0.74
      float4  : 1.68
      float8  : 2.83
      float16 : 4.53

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer         : 1.13
      enqueueReadBuffer          : 0.18
      enqueueMapBuffer(for read) : 5036.31
        memcpy from mapped ptr   : 0.18
      enqueueUnmap(after write)  : 4511.52
        memcpy to mapped ptr     : 0.82

Compared to the last results (from Nov. 2017), this is:

nomaddo commented 5 years ago

I am not sure why we can remove all of mutex locks. In my view, getting mutex locks was MUST to avoid resource conflict of multi-QPU VC4 program.

You mean, the responsibility to avoid the resource conflicts is moved to the users?

doe300 commented 5 years ago

I am not sure why we can remove all of mutex locks.

I am actually also not 100% sure, but I did not meet any test so far, which failed with the mutices removed, even when I explicitly forced race-conditions between the QPUs.

In my view, getting mutex locks was MUST to avoid resource conflict of multi-QPU VC4 program.

I thought the same, but I might have been wrong.

My old theory/understanding of the VPM: There is 1 VPM unit which has the cache and manages DAM access. More specific, there is also only 1 VPM configuration (across all QPUs) active at all times, which forces the QPUs to synchronize access to not overwrite the configuration currently used by another QPU.

My new theory: Still 1 VPM unit, but a separate configuration per QPU (similar to TMU), allowing (similar to a "normal" cache for a CPU) all cores to access the cache at the same time. DMA accesses are either also parallel or a single queue which is shared between the QPUs, but this does not change anything from the usage point of view. Having parallel access allows us to access separate VPM areas (and RAM regions) unsynchronized, only forcing us to lock when we access VPM areas/RAM regions where other QPUs might access at the same time.

You mean, the responsibility to avoid the resource conflicts is moved to the users?

No, VC4C would still synchronize, where required (i.e. when accessing memory regions), but most accesses to the VPM is restructured so the QPUs do not access the same VPM ares and therefore do not need to lock.

nomaddo commented 5 years ago

Sorry to be late. I finished my vacation. From now, I can response quickly.

I tried some program to check your opinion is correct. At least, example/sgemm.py in py-videocore has a problem if mutex operations are removed. I comment-outed mutex_acquire() and mutex_release() and executed it. Without any modification, this program successfully exited. With my modification (described the above), the program got execution timeout.

pi@nomaddo-pi3:~/py-videocore/examples$ sudo python sgemm.py 
==== sgemm example (96x363 times 363x3072) ====
threads: 12
numpy: 0.1021 sec, 2.1067 Gflops
GPU: 0.0332 sec, 6.4721 Gflops
minimum absolute error: 0.0000e+00
maximum absolute error: 7.3559e+01
minimum relative error: 0.0000e+00
maximum relative error: 1.4603e+02
pi@nomaddo-pi3:~/py-videocore/examples$ sudo python sgemm.py 
Traceback (most recent call last):
  File "sgemm.py", line 565, in <module>
    main()
  File "sgemm.py", line 541, in main
    uniforms=uniforms
  File "/usr/local/lib/python2.7/dist-packages/videocore/driver.py", line 186, in execute
    raise DriverError('QPU execution timeout')
videocore.driver.DriverError: QPU execution timeout

From the expement, I think this modification is a dangerous