AusClimateService / plotting_maps

Standardising hazard maps for ACS
4 stars 1 forks source link

acs_plotting_maps module slow to load #12

Closed xenct closed 1 week ago

xenct commented 2 months ago

importing acs_plotting_maps is very slow. Investigate what part is slow and see if it is avoidable. I suspect that reading the shapefiles of many different Australian states/LGAs/regions may be a significant part of the slow import. If that is the case, maybe we can avoid loading the shapefiles that we don't use.

xenct commented 2 months ago

Each shapefile takes about 5 seconds to read, even if we don't end up using it (see profiler below) Need to find a way to only read and load the data if we need it. commenting out the unused shapefiles (5 of 7) decreases the runtime to about 10sec from about 40 sec

%%prun -s cumtime
import acs_plotting_maps

image

xenct commented 2 months ago

There doesn't appear to be a lazy load option. I will replace the dictionary with a function, something like "get_region" which will read the list of regions you give it. It should list the valid shapefile in its docstring or error messages.

xenct commented 1 month ago

Removing regions_dict from the from acs_plotting_maps import line will avoid automatically loading regions_dict data but it will also change the workflow for the users. For minimal changes to the workflow, maybe suggest replacing:

from acs_plotting_maps import plot_acs_hazard, regions_dict, cmap_dict, tick_dict

with:

from acs_plotting_maps import plot_acs_hazard, load_region, cmap_dict, tick_dict

regions_dict = {}
regions_dict.update({name: load_region(name)})

This means the user can control which regions are loaded and they can avoid loading unnecessary geopandas dataframes.

Or we could manage it using deprecation warnings eg https://github.com/hckjck/python-deprecation

xenct commented 1 month ago

Improve speed by 4 times. import regions_dict imports an empty dictionary only load shapefiles that you will use you can use any shapefiles you like, "load_regions" makes it easier to load shapefiles in /g/data/ia39/aus-ref-clim-data-nci/shapefiles/data

image

stellema commented 2 weeks ago

It might be more efficient to use a class to manage the shapefiles? I added an example where the shapefile is only loaded when called and it's stored if it is called again.

import geopandas as gpd
from shapely.geometry import box

class RegionShapefiles:
    """Load and return a shapefile based on its name."""

    def __init__(self, path, shapefiles):
        """Create an instance of the RegionShapefiles class.

        Parameters
        ----------
        path : str
            The path to the shapefiles directory.
        shapefiles : list
            A list of shapefile names to load.
        """
        self.path = path
        self.shapefiles = shapefiles
        self.regions_dict = {}

    def __call__(self, name):
        """Retrieve the shapefile for the given region name.

        Parameters
        ----------
        name : str
            The name of the region to retrieve.

        Returns
        -------
        GeoDataFrame or GeoSeries
            The shapefile data for the specified region.
        """
        if name not in self.regions_dict:
            if name.startswith("not_"):
                # This mask is a rectangular box around the maximum extent of the region
                # with a buffer of 10 degrees on every side,
                # with the regions cut out so only the outside is hidden.
                base_name = name[4:]  # Remove 'not_' prefix
                base_region = self(base_name)
                base_region.crs = crs
                not_region = gpd.GeoSeries(
                    data=[
                        box(
                            *box(*base_region.total_bounds).buffer(20).bounds
                        ).difference(base_region["geometry"].values[0])
                    ]
                )
                self.regions_dict[name] = not_region
            else:
                if name in self.shapefiles:
                    self.regions_dict[name] = gpd.read_file(
                        f"{self.path}/{name}/{name}.shp"
                    )
                else:
                    raise ValueError(f"Shapefile for region '{name}' not found.")
        return self.regions_dict[name]

# Define the path and shapefiles
# These will be used for state boundaries, LGAs, NRM, etc
shapefiles_path = "/g/data/ia39/aus-ref-clim-data-nci/shapefiles/data"
shapefiles = [
    "aus_local_gov",
    "aus_states_territories",
    "australia",
    "nrm_regions",
    "river_regions",
    "ncra_regions",
]

# Create an instance of the RegionShapefiles class
region = RegionShapefiles(shapefiles_path, shapefiles)

Example usage: region("australia") or region("not_australia")

Edit: fixed base_name

xenct commented 2 weeks ago

This is really helpful and cool ✨ I think this is exactly the behaviour I wanted. Thanks @stellema! I'll review your PR and include it.

stellema commented 2 weeks ago

I'm glad! The version in #27 works more like a dictionary (i.e., can be accessed with square brackets), so the rest of the code doesn't require updating. I also made the "not_{region}" part only work for not_australia, because it probably won't work for the other regions.