SDOML / SDOMLv2

MIT License

Advice for using cutouts of images #5

Open JulioHC00 opened 1 year ago

JulioHC00 commented 1 year ago

Hi! This isn't really an issue with the repository, but I'd really appreciate some advice on how to approach my objective, as I'm not used to dealing with such large amounts of data.

I need to download cutouts of the full-Sun images that correspond to SHARP patches. I have the bounding boxes stored in a database such that each is identified by a timestamp (corresponding to one of the images) and by one active region (so there is potentially more than one bounding box per timestamp). Each bounding box is defined by x_min, x_max, y_min, y_max, so it can be used directly on the image. My problem is that there are on the order of millions of cutouts, and with 3 components of the magnetograms this scales up very quickly. So far, I've been processing several chunks of each year's data at the same time and then storing each cutout in an individual file, but that seems quite inefficient. Any guidance on how this could be better approached would be greatly appreciated.

PaulJWright commented 1 year ago

I would probably suggest dask or zarr, as we use here. I would also check out this notebook from @wtbarnes: https://gist.github.com/wtbarnes/8c1e8e8e39414784fa24cca3e697dfff
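To make the chunking idea concrete, here is a minimal sketch (not the SDOML loader itself) of grouping bounding boxes by the time chunk they fall in, so that each chunk is read once and all of its cutouts are cropped in memory. A NumPy array stands in for the Zarr array, which accepts the same basic slicing. The function names, the `(t, x_min, x_max, y_min, y_max)` tuple layout, the `(time, y, x)` axis order, exclusive upper bounds, and the `chunk_len` value are all assumptions for illustration and should be checked against the real store.

```python
from collections import defaultdict

import numpy as np


def group_by_time_chunk(boxes, chunk_len):
    """Map chunk number -> boxes whose time index falls inside that chunk."""
    groups = defaultdict(list)
    for box in boxes:
        groups[box[0] // chunk_len].append(box)
    return groups


def extract_cutouts(images, boxes, chunk_len):
    """Read each time chunk once, then crop every box of that chunk in memory."""
    cutouts = {}
    n_t = images.shape[0]
    for chunk_id, chunk_boxes in sorted(group_by_time_chunk(boxes, chunk_len).items()):
        start = chunk_id * chunk_len
        stop = min(start + chunk_len, n_t)
        # One read per chunk; with Zarr this is where the I/O would happen.
        block = np.asarray(images[start:stop])
        for t, x_min, x_max, y_min, y_max in chunk_boxes:
            # Assumed convention: rows are y, columns are x, upper bounds exclusive.
            cutouts[(t, x_min, x_max, y_min, y_max)] = block[
                t - start, y_min:y_max, x_min:x_max
            ].copy()
    return cutouts


# Toy data: 8 frames of 16x16 pixels, hypothetical chunk length of 4 frames.
images = np.arange(8 * 16 * 16, dtype=np.float32).reshape(8, 16, 16)
boxes = [(0, 2, 6, 1, 5), (1, 0, 4, 0, 4), (5, 3, 9, 2, 7), (5, 10, 14, 8, 12)]
cutouts = extract_cutouts(images, boxes, chunk_len=4)
```

With this layout the number of reads scales with the number of chunks touched rather than with the number of cutouts, which matters when several active regions share a timestamp.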

JulioHC00 commented 1 year ago

Thanks! I've given it a try and at the moment it takes ~2-3 min per harpnum to process. Maybe this is as fast as I can get it to go, but it doesn't feel right. For example, for harp 104 the indices span from 9846 to 10684, which runs from around 2010-07-31 to 2010-08-07 (about 7 days). I know that the data is stored in chunks, and the way I process it doesn't exploit these chunks. I've put the code that I wrote in a gist; if at any point you have some time to have a look at it, I'd really appreciate any suggestions you can make. Though I understand if you can't help with this, and that's perfectly fine!

download_cutouts.py
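Since the time indices of a single HARP are contiguous (9846 to 10684 for harp 104), one option worth trying is to read the union of its bounding boxes over the whole index range in a single slice and then crop each frame in memory. The sketch below only illustrates that idea and is not taken from the gist; `cutouts_for_harp`, the dictionary layout, the `(time, y, x)` axis order, and exclusive upper bounds are assumptions, and a NumPy array again stands in for the Zarr array.

```python
import numpy as np


def cutouts_for_harp(images, boxes):
    """Crop all cutouts of ONE HARP from a single read.

    boxes: dict mapping time index -> (x_min, x_max, y_min, y_max).
    """
    times = sorted(boxes)
    t0, t1 = times[0], times[-1] + 1
    # Union bounding box over the lifetime of the region.
    x0 = min(b[0] for b in boxes.values())
    x1 = max(b[1] for b in boxes.values())
    y0 = min(b[2] for b in boxes.values())
    y1 = max(b[3] for b in boxes.values())
    # Single read covering the whole time range and the union box.
    block = np.asarray(images[t0:t1, y0:y1, x0:x1])
    return {
        t: block[t - t0, y_min - y0:y_max - y0, x_min - x0:x_max - x0].copy()
        for t, (x_min, x_max, y_min, y_max) in boxes.items()
    }


# Toy data: a region drifting to the right over four frames.
images = np.arange(12 * 32 * 32, dtype=np.float32).reshape(12, 32, 32)
harp_boxes = {
    3: (2, 10, 5, 12),
    4: (4, 12, 5, 12),
    5: (6, 14, 6, 13),
    6: (8, 16, 6, 13),
}
harp_cutouts = cutouts_for_harp(images, harp_boxes)
```

The trade-off is that a region crossing much of the disk over a week has a wide union box, so this reads more pixels than strictly needed and holds them in memory at once; splitting the time range into shorter windows, ideally aligned with the store's chunk boundaries, would limit that.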

PaulJWright commented 1 year ago

Okay, I'll see if I can find time to look at this over the weekend or next week. Let me know if you come up with a solution!