dropbox / lepton

Lepton is a tool and file format for losslessly compressing JPEGs by an average of 22%.
https://blogs.dropbox.com/tech/2016/07/lepton-image-compression-saving-22-losslessly-from-images-at-15mbs/
Apache License 2.0

add __ARM_NEON support #157

Open m6w6 opened 2 years ago

m6w6 commented 2 years ago

The performance gain is not huge yet: most of the instructions are so far just literal translations of the existing code and have not been iterated on.
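To illustrate what a "literal translation" can look like, here is a minimal sketch rather than code from this PR. It assumes the usual pattern of mapping one SSE2 intrinsic to its closest NEON counterpart (`_mm_add_epi16` to `vaddq_s16`) behind a common wrapper. The names `vec8s`, `load8`, `store8`, `add8` and `add_rows` are hypothetical and do not come from lepton.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__ARM_NEON)
#include <arm_neon.h>
// NEON path: eight signed 16-bit lanes in one 128-bit register.
typedef int16x8_t vec8s;
static inline vec8s load8(const std::int16_t* p) { return vld1q_s16(p); }
static inline void store8(std::int16_t* p, vec8s v) { vst1q_s16(p, v); }
static inline vec8s add8(vec8s a, vec8s b) { return vaddq_s16(a, b); }
#elif defined(__SSE2__)
#include <emmintrin.h>
// SSE2 path: the code a literal translation would start from.
typedef __m128i vec8s;
static inline vec8s load8(const std::int16_t* p) {
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
}
static inline void store8(std::int16_t* p, vec8s v) {
    _mm_storeu_si128(reinterpret_cast<__m128i*>(p), v);
}
static inline vec8s add8(vec8s a, vec8s b) { return _mm_add_epi16(a, b); }
#else
// Scalar fallback with the same wrapping behaviour as the SIMD paths.
struct vec8s { std::int16_t lane[8]; };
static inline vec8s load8(const std::int16_t* p) {
    vec8s v;
    for (int i = 0; i < 8; ++i) v.lane[i] = p[i];
    return v;
}
static inline void store8(std::int16_t* p, vec8s v) {
    for (int i = 0; i < 8; ++i) p[i] = v.lane[i];
}
static inline vec8s add8(vec8s a, vec8s b) {
    vec8s r;
    for (int i = 0; i < 8; ++i) {
        r.lane[i] = static_cast<std::int16_t>(
            static_cast<std::uint16_t>(a.lane[i]) +
            static_cast<std::uint16_t>(b.lane[i]));
    }
    return r;
}
#endif

// Hypothetical helper: add two rows of eight 16-bit values lane by lane.
static inline void add_rows(const std::int16_t* a, const std::int16_t* b,
                            std::int16_t* out) {
    assert(a && b && out);
    store8(out, add8(load8(a), load8(b)));
}
```

A one-to-one mapping like this keeps results identical to the original path, but it leaves NEON-specific instructions unused, which is in line with the modest gains reported below.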

lepton{,-scalar} -benchmark results on an M1 mini (2020):

lepton-scalar                                                                   lepton (neon)                                                            
BENCHMARK: 16 trials                                                            BENCHMARK: 16 trials                                                            
 217.37ms ( 90.87Mbit/s) : Verified encode                                       212.32ms ( 93.04Mbit/s) : Verified encode                                      
 156.15ms (126.50Mbit/s) : Unverified encode                                     156.51ms (126.21Mbit/s) : Unverified encode                                    
  63.80ms (309.60Mbit/s) : decode                                                 61.57ms (320.82Mbit/s) : decode                                               
 609.13ms ( 32.43Mbit/s) : Single threaded Verified encode                       600.13ms ( 32.92Mbit/s) : Single threaded Verified encode                      
 328.37ms ( 60.15Mbit/s) : Single threaded Unverified encode                     324.12ms ( 60.94Mbit/s) : Single threaded Unverified encode                    
 283.65ms ( 69.64Mbit/s) : Single threaded decode                                277.37ms ( 71.22Mbit/s) : Single threaded decode                               
 304.66ms ( 64.84Mbit/s) : Loaded 2 Verified encode                              292.47ms ( 67.54Mbit/s) : Loaded 2 Verified encode                             
 203.42ms ( 97.10Mbit/s) : Loaded 2 Unverified encode                            196.53ms (100.51Mbit/s) : Loaded 2 Unverified encode                           
 107.45ms (183.84Mbit/s) : Loaded 2 decode                                       110.84ms (178.21Mbit/s) : Loaded 2 decode                                      
 491.56ms ( 40.18Mbit/s) : Loaded 4 Verified encode                              488.95ms ( 40.40Mbit/s) : Loaded 4 Verified encode                             
 288.03ms ( 68.58Mbit/s) : Loaded 4 Unverified encode                            286.09ms ( 69.04Mbit/s) : Loaded 4 Unverified encode                           
 223.50ms ( 88.38Mbit/s) : Loaded 4 decode                                       213.59ms ( 92.48Mbit/s) : Loaded 4 decode                                      
 669.68ms ( 29.50Mbit/s) : Loaded 6 Verified encode                              690.89ms ( 28.59Mbit/s) : Loaded 6 Verified encode                             
 394.21ms ( 50.11Mbit/s) : Loaded 6 Unverified encode                            393.49ms ( 50.20Mbit/s) : Loaded 6 Unverified encode                           
 322.71ms ( 61.21Mbit/s) : Loaded 6 decode                                       313.99ms ( 62.91Mbit/s) : Loaded 6 decode                                      
 935.54ms ( 21.11Mbit/s) : Loaded 8 Verified encode                              890.26ms ( 22.19Mbit/s) : Loaded 8 Verified encode                             
 498.29ms ( 39.64Mbit/s) : Loaded 8 Unverified encode                            498.52ms ( 39.62Mbit/s) : Loaded 8 Unverified encode                           
 445.81ms ( 44.31Mbit/s) : Loaded 8 decode                                       417.14ms ( 47.35Mbit/s) : Loaded 8 decode                                      
1425.39ms ( 13.86Mbit/s) : Loaded 12 Verified encode                            1377.82ms ( 14.34Mbit/s) : Loaded 12 Verified encode                            
 770.91ms ( 25.62Mbit/s) : Loaded 12 Unverified encode                           767.00ms ( 25.75Mbit/s) : Loaded 12 Unverified encode                          
 663.58ms ( 29.77Mbit/s) : Loaded 12 decode                                      638.00ms ( 30.96Mbit/s) : Loaded 12 decode                                     
1853.21ms ( 10.66Mbit/s) : Loaded 16 Verified encode                            2211.84ms (  8.93Mbit/s) : Loaded 16 Verified encode                            
1042.54ms ( 18.95Mbit/s) : Loaded 16 Unverified encode                          1036.09ms ( 19.07Mbit/s) : Loaded 16 Unverified encode                          
 918.81ms ( 21.50Mbit/s) : Loaded 16 decode                                      921.06ms ( 21.45Mbit/s) : Loaded 16 decode                                     
Backfill verified encode bandwidth 162.22 Mbit/s [12 threads]                   Backfill verified encode bandwidth 170.17 Mbit/s [8 threads]                    
Backfill unverified encode bandwidth 301.64 Mbit/s [8 threads]                  Backfill unverified encode bandwidth 315.18 Mbit/s [8 threads]                  
Backfill decode bandwidth 353.71 Mbit/s [6 threads]                             Backfill decode bandwidth 371.61 Mbit/s [6 threads]  

~I also noticed that some "legacy" tests are failing, thus the "WIP/Draft" status of this PR.~

EDIT: typos; resolved

m6w6 commented 2 years ago

~Also, while 100% restoring e.g. hq.jpg, its .lep vastly differs from that of lepton-scalar and is about 1k bigger.~

EDIT: resolved

m6w6 commented 2 years ago

Benchmark on a C6g.4xlarge with clang-10 and ARCH_FLAGS=-mcpu=neoverse-n1:

lepton-scalar                                                           lepton (neon)
BENCHMARK: 16 trials                                                    BENCHMARK: 16 trials
 333.65ms ( 59.20Mbit/s) : Verified encode                               329.43ms ( 59.96Mbit/s) : Verified encode
 258.18ms ( 76.51Mbit/s) : Unverified encode                             256.96ms ( 76.87Mbit/s) : Unverified encode
  75.78ms (260.66Mbit/s) : decode                                         72.83ms (271.24Mbit/s) : decode
1118.17ms ( 17.67Mbit/s) : Single threaded Verified encode              1072.64ms ( 18.42Mbit/s) : Single threaded Verified encode
 630.89ms ( 31.31Mbit/s) : Single threaded Unverified encode             611.47ms ( 32.30Mbit/s) : Single threaded Unverified encode
 488.95ms ( 40.40Mbit/s) : Single threaded decode                        459.32ms ( 43.01Mbit/s) : Single threaded decode
 340.60ms ( 58.00Mbit/s) : Loaded 2 Verified encode                      336.53ms ( 58.70Mbit/s) : Loaded 2 Verified encode
 261.09ms ( 75.66Mbit/s) : Loaded 2 Unverified encode                    261.56ms ( 75.52Mbit/s) : Loaded 2 Unverified encode
  78.10ms (252.91Mbit/s) : Loaded 2 decode                                78.74ms (250.88Mbit/s) : Loaded 2 decode
 435.00ms ( 45.41Mbit/s) : Loaded 4 Verified encode                      417.49ms ( 47.31Mbit/s) : Loaded 4 Verified encode
 311.89ms ( 63.33Mbit/s) : Loaded 4 Unverified encode                    308.04ms ( 64.13Mbit/s) : Loaded 4 Unverified encode
 131.23ms (150.52Mbit/s) : Loaded 4 decode                               131.77ms (149.91Mbit/s) : Loaded 4 decode
 560.87ms ( 35.22Mbit/s) : Loaded 6 Verified encode                      539.63ms ( 36.60Mbit/s) : Loaded 6 Verified encode
 366.90ms ( 53.84Mbit/s) : Loaded 6 Unverified encode                    357.37ms ( 55.27Mbit/s) : Loaded 6 Unverified encode
 220.63ms ( 89.53Mbit/s) : Loaded 6 decode                               214.97ms ( 91.89Mbit/s) : Loaded 6 decode
 748.58ms ( 26.39Mbit/s) : Loaded 8 Verified encode                      725.59ms ( 27.22Mbit/s) : Loaded 8 Verified encode
 455.86ms ( 43.33Mbit/s) : Loaded 8 Unverified encode                    449.56ms ( 43.94Mbit/s) : Loaded 8 Unverified encode
 269.47ms ( 73.30Mbit/s) : Loaded 8 decode                               270.47ms ( 73.03Mbit/s) : Loaded 8 decode
 947.95ms ( 20.84Mbit/s) : Loaded 12 Verified encode                     951.36ms ( 20.76Mbit/s) : Loaded 12 Verified encode
 547.49ms ( 36.08Mbit/s) : Loaded 12 Unverified encode                   529.14ms ( 37.33Mbit/s) : Loaded 12 Unverified encode
 347.81ms ( 56.79Mbit/s) : Loaded 12 decode                              357.71ms ( 55.22Mbit/s) : Loaded 12 decode
1191.98ms ( 16.57Mbit/s) : Loaded 16 Verified encode                    1135.81ms ( 17.39Mbit/s) : Loaded 16 Verified encode
 648.67ms ( 30.45Mbit/s) : Loaded 16 Unverified encode                   654.03ms ( 30.20Mbit/s) : Loaded 16 Unverified encode
 527.66ms ( 37.44Mbit/s) : Loaded 16 decode                              496.93ms ( 39.75Mbit/s) : Loaded 16 decode
Backfill verified encode bandwidth 262.20 Mbit/s [16 threads]           Backfill verified encode bandwidth 272.78 Mbit/s [16 threads]
Backfill unverified encode bandwidth 474.70 Mbit/s [16 threads]         Backfill unverified encode bandwidth 490.32 Mbit/s [16 threads]
Backfill decode bandwidth 597.99 Mbit/s [12 threads]                    Backfill decode bandwidth 633.80 Mbit/s [12 threads]

AGSaidi commented 1 year ago

Please use an ‘isb’, not a ‘dmb’. See this commit as an example: https://github.com/haproxy/haproxy/commit/1e237d037b3a45ec92d1dfa80dfd2c6bd7fc3af9
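For context, the code under review is not shown in this thread, so the following is only a sketch of the kind of change being asked for. It assumes the instruction sits in a spin-wait hint standing in for x86's `_mm_pause()`. The names `cpu_relax` and `spin_until_set` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>  // _mm_pause
#endif

// Hypothetical spin-wait hint (not taken from the PR).
static inline void cpu_relax() {
#if defined(__aarch64__)
    // ISB is used here only for the short stall it causes.
    // A DMB would be a memory barrier, which a pure pause hint does not need.
    __asm__ __volatile__("isb" ::: "memory");
#elif defined(__x86_64__) || defined(__i386__)
    _mm_pause();
#else
    std::this_thread::yield();
#endif
}

// Poll `flag` up to `max_spins` times, relaxing between polls.
// Returns true if the flag was observed set.
static inline bool spin_until_set(const std::atomic<bool>& flag,
                                  std::uint64_t max_spins) {
    assert(max_spins > 0);
    for (std::uint64_t i = 0; i < max_spins; ++i) {
        if (flag.load(std::memory_order_acquire)) {
            return true;
        }
        cpu_relax();
    }
    return flag.load(std::memory_order_acquire);
}
```

The acquire load on the flag provides the ordering the waiter needs, so the relax step only has to slow the loop down. A waiting thread would call `spin_until_set(ready, limit)` while another thread stores `true` to `ready` with release ordering.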

sebpop commented 1 year ago

Looks good to me.