SlideRuleEarth / sliderule

Server and client framework for on-demand science data processing in the cloud
https://slideruleearth.io

Number of max resources not updated by `set_max_resource`? #344

Open rhugonnet opened 12 months ago

rhugonnet commented 12 months ago

Hey @jpswinski,

I'm running the following code adapted from the first notebook:

import sliderule
from sliderule import icesat2, earthdata

# Set max resources
earthdata.set_max_resources(5000)

# Configure ICESat-2 API
icesat2.init("slideruleearth.io")

# Specify region of interest from shapefile
poly_fn = '/home/atom/data/inventory_products/RGI/00_rgi60_neighb_renamed/11_rgi60_CentralEurope/region_11_rgi60_CentralEurope.shp'
region = sliderule.toregion(poly_fn)["poly"]
region

parms = {
    "poly": region,
    "srt": icesat2.SRT_LAND,
    "cnf": icesat2.CNF_SURFACE_HIGH,
    "ats": 20.0,
    "cnt": 10,
    "len": 200.0,
    "res": 100.0,
    "maxi": 1
}

# Request ATL06 Data
gdf = icesat2.atl06p(parms)

which fails with the default max resources of 300 instead of the 5000 that I set:

Exceeded maximum requested resources: 2875 (current max is 300)
Consider using earthdata.set_max_resources to set a higher limit.

Am I doing anything wrong?

The region is just a big shapefile with all glacier polygons in the European Alps, which I guess gets converted into a convex hull of all dissolved features? (I couldn't find info on this in the docs; adding it to #343.) Here it is, to reproduce the behaviour on your side!

region
Out[3]: 
[{'lon': 19.81509691800005, 'lat': 42.44601641700007},
 {'lon': 19.815423874000032, 'lat': 42.44609679300004},
 {'lon': 19.81568442400004, 'lat': 42.446187217000045},
 {'lon': 19.81701517500005, 'lat': 42.44702111400005},
 {'lon': 19.817565753000054, 'lat': 42.44739285400004},
 {'lon': 19.817779399000074, 'lat': 42.447603843000024},
 {'lon': 19.817863416000023, 'lat': 42.44784497200004},
 {'lon': 19.817847872000073, 'lat': 42.448015772000076},
 {'lon': 19.809863662000055, 'lat': 42.46858198700005},
 {'lon': 19.069049540000037, 'lat': 43.11633126900006},
 {'lon': 13.61632439300007, 'lat': 47.49373761100003},
 {'lon': 13.615554032000034, 'lat': 47.49403870300006},
 {'lon': 13.614783662000036, 'lat': 47.49433979100007},
 {'lon': 13.612028267000028, 'lat': 47.49472089900007},
 {'lon': 13.599418152000055, 'lat': 47.49630852300004},
 {'lon': 12.866649521000056, 'lat': 47.57680294200003},
 {'lon': 10.996501614000067, 'lat': 47.427295654000034},
 {'lon': 10.993139714000051, 'lat': 47.42686264500003},
 {'lon': 8.998158185000023, 'lat': 47.00959166800004},
 {'lon': 8.470754617000068, 'lat': 46.861506761000044},
 {'lon': 8.469970131000025, 'lat': 46.86123316100003},
 {'lon': 7.214600805000032, 'lat': 46.330964840000036},
 {'lon': -0.2889102479999792, 'lat': 42.839258089000054},
 {'lon': -0.2889514309999299, 'lat': 42.83919681800006},
 {'lon': -0.260864349999963, 'lat': 42.78326050900006},
 {'lon': -0.25987791299996843, 'lat': 42.782043721000036},
 {'lon': -0.259824981999941, 'lat': 42.78199718700006},
 {'lon': -0.2596992089999617, 'lat': 42.78191555300003},
 {'lon': -0.058874394999975266, 'lat': 42.69404030700008},
 {'lon': -0.05825715699995726, 'lat': 42.69377428400003},
 {'lon': -0.05774544599995579, 'lat': 42.69360450600004},
 {'lon': -0.05769566299994722, 'lat': 42.69359179000003},
 {'lon': 0.03582901400005767, 'lat': 42.67745517700007},
 {'lon': 0.041947870000058174, 'lat': 42.67685429200003},
 {'lon': 0.6660746110000559, 'lat': 42.62564946500004},
 {'lon': 0.6661575190000235, 'lat': 42.62564394900005},
 {'lon': 13.567087100000037, 'lat': 42.469923307000045},
 {'lon': 19.81509691800005, 'lat': 42.44601641700007}]
rhugonnet commented 12 months ago

Also, any advice on how I should set up large-scale requests? (all glaciers worldwide) What max resources should I aim for per request for the best performance? (also added a point in #343 regarding the unit of max resources)

(That should give me guidance on how to split my requests in smaller bits! :smile:)

jpswinski commented 12 months ago

@rhugonnet The issue is that the call to icesat2.init resets the max resources back to the default. The reason is historical: when we first started and ICESat-2 was our only mission, the icesat2.init function initialized all of the parameters that could be configured for the client. So if you look at the argument list for that function:

https://github.com/ICESat2-SlideRule/sliderule/blob/52cfa45ddfd6b8f83a031e89fd153779e9238f60/clients/python/sliderule/icesat2.py#L278

you can see that the max_resources parameter has a default value that gets applied if it isn't provided. This is the only init function that behaves this way. Since then, the SWOT and GEDI init functions (along with those for any other missions we add in the future) do not have this argument. We kept it this way to minimize the changes needed for people who had scripts that already used this function.
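
To illustrate why the order matters, here is a simplified sketch of the pattern (illustration only, not the actual sliderule source; the names and values just mirror the behaviour described above):

# Illustration only (not the actual sliderule source): a keyword argument with a
# default value silently re-applies that default whenever the caller omits it.
DEFAULT_MAX_RESOURCES = 300
_max_resources = DEFAULT_MAX_RESOURCES

def set_max_resources(n):
    global _max_resources
    _max_resources = n

def init(url, max_resources=DEFAULT_MAX_RESOURCES):
    # init configures everything for the client, so it (re)applies
    # max_resources too, overwriting whatever was set before this call
    set_max_resources(max_resources)

set_max_resources(5000)        # the script sets 5000 first...
init("slideruleearth.io")      # ...then init resets it to the default
print(_max_resources)          # prints 300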

So in your script above, if you flip the order of those lines so that icesat2.init is called before earthdata.set_max_resources:

# Configure ICESat-2 API
icesat2.init("slideruleearth.io")

# Set max resources
earthdata.set_max_resources(5000)

then that should take care of the problem. Alternatively, you could also pass max_resources to the icesat2.init call:

# Configure ICESat-2 API
icesat2.init("slideruleearth.io", max_resources=5000)
jpswinski commented 12 months ago

As for using the shapefile - yes, in the code you have, the convex hull is being generated and used to subset to the area of interest. If you want to preserve the features of the shapefile, you need to use the "raster" option in the request parameters.

# Specify region of interest from shapefile
poly_fn = '/home/atom/data/inventory_products/RGI/00_rgi60_neighb_renamed/11_rgi60_CentralEurope/region_11_rgi60_CentralEurope.shp'
region = sliderule.toregion(poly_fn) # NOTE REMOVED ["poly"]

parms = {
    "poly": region["poly"],
    "raster": region["raster"],  # ADD THIS LINE HERE
    "srt": icesat2.SRT_LAND,
    "cnf": icesat2.CNF_SURFACE_HIGH,
    "ats": 20.0,
    "cnt": 10,
    "len": 200.0,
    "res": 100.0,
    "maxi": 1
}

This will send a geojson representation of the shapefile to the servers, where it is burned into a raster and used as an inclusion mask. The sliderule.toregion(poly_fn) call takes a cellsize parameter which specifies the pixel size of that raster; it is in degrees and defaults to 0.01.
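
For example, to burn a coarser inclusion-mask raster (the 0.05 value below is just illustrative):

# Build the region with 0.05-degree raster pixels instead of the 0.01-degree default
region = sliderule.toregion(poly_fn, cellsize=0.05)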

This functionality has not gotten a lot of use and still needs some work, so please give us feedback as you go on how we can make it better. One thing we already know about (and it is on our short list to work on) is that using the shapefile this way (i.e. burning a raster for an inclusion mask) slows down the subsetting substantially, so please expect significantly longer runs. We have some ideas on how to make this faster but haven't had the time to implement them yet; you using this functionality is good motivation to get to it.

jpswinski commented 12 months ago

Lastly, for large processing runs like this, you should definitely use one of the private clusters; I'd recommend the UW private cluster. The reason is twofold: (1) a request this large will consume all the resources of the public cluster and make it unavailable to others while it runs, and (2) with a private cluster we can scale the number of nodes up much higher and give you a much faster response.

If you haven't done so already, you can create an account on https://ps.slideruleearth.io to get started. Then here is a link to our write up on how to use a private cluster: https://slideruleearth.io/web/rtd/user_guide/Private-Clusters.html.
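
For reference, once the account and organization are set up, pointing the client at a private cluster looks roughly like the sketch below; the organization name is hypothetical and the exact keywords are described in the guide above, so treat this as an assumption rather than the definitive call:

# Sketch only: connect to a (hypothetical) private cluster organization "uw";
# check the Private-Clusters guide for the exact arguments your client version supports.
icesat2.init("slideruleearth.io", organization="uw")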

If you have any questions, please let me know.

rhugonnet commented 12 months ago

Thanks a lot for all the info @jpswinski! :smiley:

Moving forward with this.

I'm adding the init order, the two ways of setting max_resources, and the polygon behaviour to #343 as reminders to clarify them in the docs for other users, if they aren't already there! (I'll probably have missed where some of this is described!)

rhugonnet commented 11 months ago

Also: to find "outdated examples" that fail during CI, we could activate doctests for the SlideRule Python client :smile: For example, putting these lines into a pyproject.toml in clients/python/ makes pytest run them:

[tool.pytest.ini_options]
addopts = "--doctest-modules"
testpaths = [
    "tests",
    "sliderule"
] 
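
With that configuration, any example embedded in a docstring is executed as a test. A hypothetical docstring (not one from the current codebase) would look like this:

import math

def to_degrees(value_radians):
    """Convert an angle from radians to degrees.

    >>> round(to_degrees(math.pi), 1)
    180.0
    """
    return math.degrees(value_radians)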

More generally, the pyproject.toml file is now also recommended for setuptools instead of setup.py, as definitions for multiple tools can be shared in the same file. See for example: https://www.reddit.com/r/learnpython/comments/yqq551/pyprojecttoml_setupcfg_setuppy_whats_the/. Usage example here: https://github.com/pypa/sampleproject/blob/main/pyproject.toml. Most projects keep their setup.py/setup.cfg for backwards compatibility.

What do you think @jpswinski?