The PR removes the _full_jacobian from the base stepper.
I had to split the parameter_transport operator into two functions to make it work with the jacobian_validation test.
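Performance improves slightly, by roughly 1-5% in wall time per benchmark (the unsync propagation/8 entry of the PR run, a single iteration at about 33.8 s, looks like an outlier):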
Main
--------------------------------------------------------------------------------------
Benchmark                              Time             CPU  Iterations UserCounters...
--------------------------------------------------------------------------------------
CUDA unsync propagation/8        2139250 ns      2139123 ns         266 TracksPropagated=29.9188k/s
CUDA unsync propagation/16       3162398 ns      3162309 ns         219 TracksPropagated=80.9535k/s
CUDA unsync propagation/32       3989859 ns      3989710 ns         174 TracksPropagated=256.66k/s
CUDA unsync propagation/64       5023091 ns      5022813 ns         138 TracksPropagated=815.479k/s
CUDA unsync propagation/128     11542654 ns     11541598 ns          60 TracksPropagated=1.41956M/s
CUDA unsync propagation/256     40331589 ns     40330407 ns          17 TracksPropagated=1.62498M/s
CUDA sync propagation/8          2103464 ns      2103411 ns         329 TracksPropagated=30.4268k/s
CUDA sync propagation/16         3188313 ns      3188082 ns         219 TracksPropagated=80.2991k/s
CUDA sync propagation/32         4018884 ns      4018790 ns         174 TracksPropagated=254.803k/s
CUDA sync propagation/64         5077063 ns      5076936 ns         135 TracksPropagated=806.786k/s
CUDA sync propagation/128       11621492 ns     11621185 ns          59 TracksPropagated=1.40984M/s
CUDA sync propagation/256       40868906 ns     40867001 ns          17 TracksPropagated=1.60364M/s
PR
--------------------------------------------------------------------------------------
Benchmark                              Time             CPU  Iterations UserCounters...
--------------------------------------------------------------------------------------
CUDA unsync propagation/8    33818441294 ns  33815892218 ns           1 TracksPropagated=1.8926/s
CUDA unsync propagation/16       3093877 ns      3093738 ns         184 TracksPropagated=82.7478k/s
CUDA unsync propagation/32       3919884 ns      3919738 ns         177 TracksPropagated=261.242k/s
CUDA unsync propagation/64       4791991 ns      4791790 ns         142 TracksPropagated=854.795k/s
CUDA unsync propagation/128     11155984 ns     11155452 ns          63 TracksPropagated=1.4687M/s
CUDA unsync propagation/256     39361889 ns     39359983 ns          18 TracksPropagated=1.66504M/s
CUDA sync propagation/8          2065692 ns      2065608 ns         335 TracksPropagated=30.9836k/s
CUDA sync propagation/16         3111323 ns      3111089 ns         224 TracksPropagated=82.2863k/s
CUDA sync propagation/32         3969601 ns      3969425 ns         179 TracksPropagated=257.972k/s
CUDA sync propagation/64         4978605 ns      4978415 ns         139 TracksPropagated=822.752k/s
CUDA sync propagation/128       11225573 ns     11225494 ns          61 TracksPropagated=1.45953M/s
CUDA sync propagation/256       39701109 ns     39699952 ns          18 TracksPropagated=1.65078M/s
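
For a quick read of the two tables, the relative change in wall time can be computed per benchmark. The sketch below is only an illustration: the dictionary keys and the helper name relative_change are made up for this example, the times are copied from the "Time" column above, and the unsync propagation/8 pair is left out because the PR run recorded a single iteration at about 33.8 s, which is not comparable with the Main value.

```python
# Wall-clock times in ns, copied from the "Time" column of the two tables.
# "unsync/8" is omitted on purpose: the PR run has one iteration at ~33.8 s.
main_ns = {
    "unsync/16": 3162398, "unsync/32": 3989859, "unsync/64": 5023091,
    "unsync/128": 11542654, "unsync/256": 40331589,
    "sync/8": 2103464, "sync/16": 3188313, "sync/32": 4018884,
    "sync/64": 5077063, "sync/128": 11621492, "sync/256": 40868906,
}
pr_ns = {
    "unsync/16": 3093877, "unsync/32": 3919884, "unsync/64": 4791991,
    "unsync/128": 11155984, "unsync/256": 39361889,
    "sync/8": 2065692, "sync/16": 3111323, "sync/32": 3969601,
    "sync/64": 4978605, "sync/128": 11225573, "sync/256": 39701109,
}


def relative_change(main, pr):
    """Percent change in time from Main to PR; negative means the PR is faster."""
    return {
        name: 100.0 * (pr[name] - main[name]) / main[name]
        for name in main
        if name in pr
    }


changes = relative_change(main_ns, pr_ns)
for name, pct in changes.items():
    print(f"{name:<12s} {pct:+.2f} %")
```

Every value printed is negative, i.e. the PR run is faster than Main for each of the compared benchmarks.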