Closed. anantmittal closed this pull request 4 months ago.
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (63547fc) 83.54% compared to head (ec44c38) 88.91%.
This PR seems to make a difference in terms of parallelization. @leewujung, would you be able to go through the changes and see if this is what you're looking for in this case? Thank you!
I did some simple benchmarking with the test file ek80/D20170912-T234910.raw. The test expands the data from 271.2 MB to 5.3 GB and runs the compress_pulse function that has been modified in this PR. When run in parallel with dask, we do see a speedup compared to not: without dask the run took 58.1 s, and with dask it took 24.3 s. Looking at the dask dashboard, the computation happens in parallel.
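For context, compress_pulse performs matched filtering (pulse compression) of the received signal against the transmit replica. A minimal standalone sketch of that operation on a simple 1-D signal (function name and shapes are illustrative, not echopype's actual API):

```python
import numpy as np
from scipy import signal

def pulse_compress(backscatter, replica):
    # Matched filter: convolve the received signal with the
    # conjugated, time-reversed transmit replica.
    kernel = np.flipud(np.conj(replica))
    return signal.convolve(backscatter, kernel, mode="full")

# Toy example: a linear chirp embedded in zeros compresses to a sharp peak.
t = np.linspace(0, 1, 200)
replica = signal.chirp(t, f0=5, f1=50, t1=1)
received = np.concatenate([np.zeros(300), replica, np.zeros(300)])
compressed = pulse_compress(received, replica)
```

The peak of the compressed output lands where the replica's energy aligns with itself, which is what makes the benchmark above a convolution-dominated workload.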
For reproducibility, you can see my run in the following gist: https://gist.github.com/lsetiawan/20c09c3eaa76508a741e31ff28a83402
I tested this using a not-very-large dataset (test_data/ek80/D20170912-T234910.raw, ~95 MB), and found that forcing the backscatter_r/i variables to be dask arrays didn't consistently improve speed compared to straight in-memory computation.
For a small dataset, due to dask's overhead for creating the task graph, it's probably better to load things into memory first and then do the computation. When I tried this with the original data, sure enough, the dask array was slower than plain in-memory computation.
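For the small-data case, the lazy variables can be materialized up front so the computation skips the task-graph overhead entirely. A minimal sketch with a plain dask array (in echopype the dask-backed xarray variables would be pulled into memory the same way, via .compute()):

```python
import numpy as np
import dask.array as da

# Small dask-backed array: at this size the graph/scheduler overhead
# can outweigh any parallel speedup.
lazy = da.arange(1_000, chunks=100)

# Materialize into a plain numpy array, then compute in memory.
in_mem = lazy.compute()
```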
The approach suggested in this PR works for both in-memory and dask arrays, which is really powerful since when the data is big, we can parallelize the convolution! I think this satisfies the issue in https://github.com/OSOceanAcoustics/echopype/issues/1164. Further optimization can probably happen, but it's a good first step 😄
@leewujung is this good to go?
Yep, I think this is good now. Thanks for doing the benchmarking tests!
Explores #1164