OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org

Use SSE2 optimizations on ARM Neon for warping, pansharpening, gridding, dithering, RPC, PNG, GTI #11239

Open rouault opened 6 days ago

rouault commented 6 days ago

(on top of https://github.com/OSGeo/gdal/pull/11237)

Comparing https://github.com/OSGeo/gdal/actions/runs/11766759419/job/32774680289?pr=11237 (before) and https://github.com/rouault/gdal/actions/runs/11766932147/job/32775064275 (this PR) shows, on Apple Silicon:

before:

Name (time in us)                                                             Min                       Max                      Mean                 StdDev                    Median                     IQR            Outliers          OPS            Rounds  Iterations
test_gdalwarp[cubic-1]                                             1,393,559.0000 (>1000.0)  1,408,203.0000 (>1000.0)  1,402,279.4000 (>1000.0)   5,568.3731 (753.32)   1,402,622.0000 (>1000.0)    6,975.2500 (>1000.0)       2;0       0.7131 (0.00)          5           1
test_gdalwarp[cubic-ALL_CPUS]                                      1,393,650.0000 (>1000.0)  1,600,824.0000 (>1000.0)  1,455,685.8000 (>1000.0)  85,820.3716 (>1000.0)  1,413,421.0000 (>1000.0)   98,352.0000 (>1000.0)       1;0       0.6870 (0.00)          5           1

this PR:

Name (time in us)                                                             Min                       Max                      Mean                 StdDev                    Median                     IQR            Outliers          OPS            Rounds  Iterations
test_gdalwarp[cubic-1]                                             1,294,533.0000 (>1000.0)  1,332,336.0000 (>1000.0)  1,311,688.6000 (>1000.0)  14,026.4810 (>1000.0)  1,313,216.0000 (>1000.0)  17,055.7500 (>1000.0)       2;0       0.7624 (0.00)          5           1
test_gdalwarp[cubic-ALL_CPUS]                                      1,271,232.0000 (>1000.0)  1,287,036.0000 (>1000.0)  1,280,295.6000 (>1000.0)   6,870.0957 (>1000.0)  1,282,034.0000 (>1000.0)  12,030.0000 (>1000.0)       1;0       0.7811 (0.00)          5           1

So the mean execution time goes from 1,402,279.4000 down to 1,311,688.6000 us for the single-threaded use case, about a 7% improvement in throughput (OPS 0.7131 to 0.7624), and from 1,455,685.8000 down to 1,280,295.6000 us in the multithreaded code path, about a 13% improvement (OPS 0.6870 to 0.7811). Note: those timings might not be very reliable, since they come from 2 separate VM executions, but they at least show some improvement.
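
For reference, the percentages can be re-derived from the Mean columns of the two tables above. The sketch below is only an illustration: the function and variable names are made up for this comment and are not part of GDAL or its benchmark suite. It reports both the relative reduction in mean time and the relative gain in throughput, since the two differ slightly.

```python
# Mean timings in microseconds, copied from the "before" and "this PR" tables.
# The dictionary layout and names are hypothetical, chosen for this example only.
MEANS_US = {
    "cubic-1": (1_402_279.4, 1_311_688.6),
    "cubic-ALL_CPUS": (1_455_685.8, 1_280_295.6),
}


def time_reduction(before: float, after: float) -> float:
    """Fraction of the original mean time that was saved."""
    return (before - after) / before


def throughput_gain(before: float, after: float) -> float:
    """Relative increase in operations per second (inverse of mean time)."""
    return before / after - 1.0


def summarize(means=MEANS_US):
    """Return {benchmark name: (time reduction, throughput gain)}."""
    return {
        name: (time_reduction(b, a), throughput_gain(b, a))
        for name, (b, a) in means.items()
    }


if __name__ == "__main__":
    for name, (reduction, gain) in summarize().items():
        print(f"{name}: {reduction:.1%} less time, {gain:.1%} more throughput")
```

With these inputs the mean time drops by roughly 6.5% (single-threaded) and 12% (ALL_CPUS), which corresponds to roughly 6.9% and 13.7% more throughput, in line with the OPS column.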