OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org

Enable ARM Neon optimizations in gcore/ using sse2neon.h #11202

Closed rouault closed 6 days ago

rouault commented 1 week ago

On top of PR #11199

This uses the sse2neon.h header (MIT licensed) from https://github.com/DLTcollab/sse2neon, which translates Intel SSEx intrinsics to ARM Neon ones.

This accelerates GDALCopyWords(), overview/resampled RasterIO() and gdal_minmax_element.hpp
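To illustrate the general idea (this is a minimal sketch, not the code in gdal_minmax_element.hpp): a routine written once against SSE2 intrinsics can be compiled on arm64 by including sse2neon.h instead of emmintrin.h. The helper name `sketch_min_index_u8` and the macro `SKETCH_HAVE_SSE2` are hypothetical, and the sketch assumes sse2neon.h is on the include path when building for ARM; otherwise it falls back to a scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // native SSE2 on x86
#define SKETCH_HAVE_SSE2 1
#elif defined(__ARM_NEON) && defined(__has_include)
#if __has_include("sse2neon.h")
#include "sse2neon.h"  // maps the _mm_* calls below to Neon
#define SKETCH_HAVE_SSE2 1
#endif
#endif

// Hypothetical helper (not GDAL's API): index of the first minimum
// in buf[0..n). Returns 0 when n == 0.
inline size_t sketch_min_index_u8(const uint8_t *buf, size_t n)
{
    assert(buf != nullptr || n == 0);
    if (n == 0)
        return 0;
    uint8_t minVal = buf[0];
    size_t i = 0;
#ifdef SKETCH_HAVE_SSE2
    if (n >= 16)
    {
        // Keep 16 running minima, one per byte lane.
        __m128i vmin =
            _mm_loadu_si128(reinterpret_cast<const __m128i *>(buf));
        for (i = 16; i + 16 <= n; i += 16)
        {
            vmin = _mm_min_epu8(
                vmin,
                _mm_loadu_si128(reinterpret_cast<const __m128i *>(buf + i)));
        }
        // Reduce the 16 lanes to a single scalar minimum.
        alignas(16) uint8_t lanes[16];
        _mm_store_si128(reinterpret_cast<__m128i *>(lanes), vmin);
        minVal = lanes[0];
        for (int k = 1; k < 16; ++k)
        {
            if (lanes[k] < minVal)
                minVal = lanes[k];
        }
    }
#endif
    // Scalar tail (or the whole buffer when no SIMD path is compiled in).
    for (; i < n; ++i)
    {
        if (buf[i] < minVal)
            minVal = buf[i];
    }
    // Second pass: locate the first occurrence of the minimum value.
    for (size_t j = 0; j < n; ++j)
    {
        if (buf[j] == minVal)
            return j;
    }
    return 0;  // not reached for n > 0
}
```

The two-pass structure (vectorized search for the value, then a scan for its position) is chosen here for brevity; the actual header may locate the index differently.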

On the arm64 OSX GitHub worker, this gives very substantial speed-ups in gdal_minmax_element.hpp: ~30x in the uint8 case, ~7x in the float case and ~3x in the double case.

uint8:
min at idx 279762 (optimized)
-> elapsed=308625
min at idx 279762 (using std::min_element)
-> elapsed=10565709
min at idx 189424 (nodata case, optimized)
-> elapsed=387667
min at idx 189424 (nodata case, using std::min_element with nodata aware comparison)
-> elapsed=6962000
--------------------
int8:
min at idx 112 (optimized)
-> elapsed=384500
min at idx 112 (using std::min_element)
-> elapsed=10741333
min at idx 112 (nodata case, optimized)
-> elapsed=315875
min at idx 112 (nodata case, using std::min_element with nodata aware comparison)
-> elapsed=9452958
--------------------
uint16:
min at idx 13240 (optimized)
-> elapsed=765416
min at idx 13240 (using std::min_element)
-> elapsed=10305500
min at idx 939179 (nodata case, optimized)
-> elapsed=622167
min at idx 939179 (nodata case, using std::min_element with nodata aware comparison)
-> elapsed=6823417
--------------------
int16:
min at idx 6018988 (optimized)
-> elapsed=523541
min at idx 6018988 (using std::min_element)
-> elapsed=24098500
min at idx 6018988 (nodata case, optimized)
-> elapsed=540083
min at idx 6018988 (nodata case, using std::min_element with nodata aware comparison)
-> elapsed=7321292
--------------------
uint32:
min at idx 41566 (optimized)
-> elapsed=1151584
min at idx 41566 (using std::min_element)
-> elapsed=9062583
min at idx 30452 (nodata case, optimized)
-> elapsed=1355584
min at idx 30452 (nodata case, using std::min_element with nodata aware comparison)
-> elapsed=7326292
--------------------
int32:
min at idx 7211003 (optimized)
-> elapsed=1680292
min at idx 7211003 (using std::min_element)
-> elapsed=9694292
min at idx 7211003 (nodata case, optimized)
-> elapsed=1094458
min at idx 7211003 (nodata case, using std::min_element with nodata aware comparison)
-> elapsed=8249917
--------------------
float (*with* NaN):
min at idx 38692 (optimized)
-> elapsed=1041709
min at idx 38692 (using std::min_element with NaN aware comparison)
-> elapsed=10099958
min at idx 38692 (nodata case, optimized)
-> elapsed=1416250
min at idx 38692 (nodata case, using std::min_element with nodata aware and NaN aware comparison)
-> elapsed=17642458
--------------------
float (without NaN):
min at idx 8056959 (optimized)
-> elapsed=1207667
min at idx 8056959 (using std::min_element)
-> elapsed=13740459
min at idx 8056959 (nodata case, optimized)
-> elapsed=2653459
min at idx 8056959 (nodata case, using std::min_element with nodata aware comparison)
-> elapsed=9368625
--------------------
double (*with* NaN):
min at idx 9172351 (optimized)
-> elapsed=3680542
min at idx 9172351 (using std::min_element with NaN aware comparison)
-> elapsed=12326666
min at idx 9172351 (nodata case, optimized)
-> elapsed=3552875
min at idx 9172351 (nodata case, using std::min_element with nodata aware and NaN aware comparison)
-> elapsed=17734000
--------------------
double (without NaN):
min at idx 5511672 (optimized)
-> elapsed=1971709
min at idx 5511672 (using std::min_element)
-> elapsed=14964833
min at idx 5511672 (nodata case, optimized)
-> elapsed=3211833
min at idx 5511672 (nodata case, using std::min_element with nodata aware comparison)
-> elapsed=7257792
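
For reference, the baseline labelled "std::min_element with nodata aware comparison" in the output above could be written roughly as follows. This is an assumed reconstruction for illustration, not the benchmark's actual code; the name `sketch_min_index_nodata` is hypothetical. The comparator orders the nodata value after every valid value, so it is only selected when the buffer contains nothing else.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical baseline: index of the first minimum among values that
// differ from `nodata`. Returns 0 if n == 0 or every value is nodata.
template <class T>
size_t sketch_min_index_nodata(const T *buf, size_t n, T nodata)
{
    assert(buf != nullptr || n == 0);
    const T *it = std::min_element(
        buf, buf + n,
        [nodata](T a, T b)
        {
            if (b == nodata)
                return a != nodata;       // any valid value beats nodata
            return a != nodata && a < b;  // nodata never beats a valid value
        });
    return static_cast<size_t>(it - buf);
}
```

The NaN-aware variants reported for float and double would need an additional check in the comparator, which this sketch omits.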
coveralls commented 1 week ago

Coverage Status

coverage: 69.473% (+0.05%) from 69.428% when pulling 0ef39fcfac948f75bfe09e95207fb483abc2de2c on rouault:gdal_minmax_element_sse2neon into f5eedd2a7e7622bffabcbd30d92c2aeb6d9526ec on OSGeo:master.

pjonsson commented 1 week ago

I guess GDALCopyWords() is a well-known hotspot, so that has its own microbenchmark. Beyond these known hotspots, is there any documentation on how to profile GDAL?

rouault commented 1 week ago

> Beyond these known hotspots, is there any documentation on how to profile GDAL?

Nothing specific. Use the usual tools for profiling any C/C++ software: gprof, sysprof, valgrind --tool=cachegrind, Intel VTune (proprietary), etc. My favorite one is more modest: find a processing job that runs for at least one minute, run it under gdb, interrupt it regularly with ctrl+c and display the stack trace. This is quite efficient at exposing hot spots.