deeplycloudy / glmtools

GOES-R Geostationary Lightning Mapper Tools
BSD 3-Clause "New" or "Revised" License

speed of GLM grid generation for ABI CONUS fixed grid #68

Open · jlc248 opened this issue 4 years ago

jlc248 commented 4 years ago

I'm working on the master branch, creating GLM gridded files for the GOES East ABI CONUS fixed grid (shape=(2500,1500)).

I could be misremembering, but the speed seems a bit slow. It takes about 20 sec to process 1 min of data (three L2 GLM files). I'm processing 1 min at a time. make_GLM_grids.py also seems to gobble up all of the available CPU threads on a machine (40 in my case).

Does that cadence seem about right to you? When I created a lot of GLM data for the CONUS fixed grid two years ago, I thought it was much faster, but I could be wrong. If there is a slowdown, is it mainly due to the computation of min_flash_area, or something else? Is there perhaps a more efficient way to process a large batch of data at once?

Lastly, is there a way to limit the number of CPU threads that make_GLM_grids.py uses? Or is that not possible/recommended because of slower performance?
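For what it's worth, the only generic workaround I can think of is to cap the BLAS/OpenMP thread pools before the numerical imports happen, which would only help if that's where the threads come from rather than from glmtools' own multiprocessing pools. A rough sketch (the limit of 4 is arbitrary):

```python
# Cap the thread pools used by NumPy's BLAS/OpenMP backends. This only
# helps if the extra threads come from those libraries rather than from
# glmtools' own multiprocessing pools. The limit of 4 is arbitrary.
import os

for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS",
            "MKL_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ[var] = "4"

# The numerical imports have to come *after* the variables are set,
# because the pools are sized when the libraries are first loaded.
import numpy as np  # noqa: E402
```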

I understand it's a lot of processing to unravel all of the parent<->child relationships, so if this is just the way it is, that's totally fine. I'm just curious. And it's plenty fast for realtime processing.

deeplycloudy commented 4 years ago

@jlc248 looking at the creation timestamps on the CONUS grids I create for Unidata's THREDDS, that's pretty close to what I'm getting. There was a performance regression related to .min() in pandas that caused me some problems, too; it is as yet unfixed upstream.

Regarding threads, I thought I had turned off the parallel processing for both the polygon clipping and another spot where I had laid the groundwork for some tiling. If you search for "pool" in the source, that should cover all the locations where there could be parallelism, but the process does show 600% CPU on the AWS instance I'm running right now, so maybe there's something to fix there…
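One quick way to check whether those extra threads are coming from the native BLAS/OpenMP pools rather than from a multiprocessing.Pool is threadpoolctl, which isn't part of glmtools; a minimal sketch:

```python
# List the native BLAS/OpenMP thread pools loaded into the process, to see
# whether the extra CPU usage comes from them rather than from any
# multiprocessing.Pool usage inside glmtools.
import numpy as np  # import first so the BLAS backend is actually loaded
from threadpoolctl import threadpool_info

for pool in threadpool_info():
    print(pool["internal_api"], pool["num_threads"], pool["filepath"])
```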

zxdawn commented 3 years ago

@deeplycloudy I suppose we could test the speed of dask_groupby(); that should improve things a lot.
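A simple harness for that comparison might look like the sketch below: it times a pandas grouped min against a plain-NumPy sort/reduceat baseline on synthetic data, and a dask_groupby()-based reduction could be dropped into the same loop. The array names are made up for illustration and are not glmtools variables.

```python
# Timing harness for grouped-min strategies on synthetic flash/event data.
# 'ids' and 'vals' are illustrative stand-ins, not glmtools variables; a
# dask_groupby-based reduction could be timed the same way.
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ids = rng.integers(0, 50_000, 1_000_000)  # parent (flash) id per event
vals = rng.random(ids.size)               # per-event quantity, e.g. area


def pandas_min(ids, vals):
    # Current-style grouped min via pandas groupby.
    return pd.Series(vals).groupby(ids).min().to_numpy()


def numpy_min(ids, vals):
    # Sort by group id, then reduce each contiguous run with minimum.reduceat.
    order = np.argsort(ids, kind="stable")
    s_ids, s_vals = ids[order], vals[order]
    starts = np.flatnonzero(np.r_[True, s_ids[1:] != s_ids[:-1]])
    return np.minimum.reduceat(s_vals, starts)


for name, func in [("pandas groupby.min", pandas_min),
                   ("numpy reduceat", numpy_min)]:
    t0 = time.perf_counter()
    out = func(ids, vals)
    print(f"{name}: {time.perf_counter() - t0:.3f} s, {out.size} groups")
```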