The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
We have been limping along with a manual, brute-force "import everything all the time" approach to ensuring that all our modules are imported. This is getting unwieldy, and it seems like it may be time to understand how imports really work and do them correctly and dynamically.
Current Situation
Any time we add a new module to PUDL, we add it to an exhaustive list of absolute imports in the top-level `__init__.py`.
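For illustration, the pattern looks something like this (a sketch; the module names below are just representative):

```python
# pudl/__init__.py -- the current brute-force approach (module names illustrative).
# Every module in the package is imported eagerly, just to make sure it gets loaded.
import pudl.extract.ferc1
import pudl.extract.eia860
import pudl.transform.ferc1
import pudl.transform.eia860
import pudl.metadata.classes
# ...and so on, one line per module in the package.
```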
Problems
Pylance and other linters see the absolute imports in the top-level `__init__.py` file and flag them as "unused," since they aren't referenced anywhere else in that module. Pylance then removes the first module from the list!
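A conventional workaround (sketched here, not something we currently do) is to mark the imports as intentional re-exports, which Pyright/Pylance recognize via the redundant `as` alias or an explicit `__all__`:

```python
# pudl/__init__.py -- signaling intentional re-exports to linters (sketch).
# The redundant `x as x` alias is the convention type checkers treat as a re-export.
from pudl import metadata as metadata
from pudl import transform as transform

# Alternatively (or additionally), declare the public names explicitly:
__all__ = ["metadata", "transform"]
```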
When we compile a subpackage-level dictionary in that subpackage's `__init__.py` file out of module-level dictionaries defined in various modules within the package, the subpackage-level dictionary isn't updated by autoreload when those modules change.
`import pudl` is slow (roughly 5-10 seconds), which is annoying at the CLI and when restarting notebooks.
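One possible mitigation (a sketch, not something we've implemented) is PEP 562's module-level `__getattr__`, which defers submodule imports until first access:

```python
# pudl/__init__.py -- lazy submodule loading via PEP 562 (submodule names illustrative).
import importlib

_SUBMODULES = {"extract", "transform", "load", "metadata", "analysis"}

def __getattr__(name: str):
    """Import submodules on first attribute access, not at `import pudl` time."""
    if name in _SUBMODULES:
        return importlib.import_module(f".{name}", __name__)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```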
We aren't being particularly careful about which functions and classes we present to users.
In some places we have a fairly deep module hierarchy, which is annoying to navigate, and some modules have generic names that are fine for organizing code but don't present a good UI for using it. E.g., the Pydantic classes defined within `pudl.metadata.classes` might be more ergonomically accessed directly within the `pudl.metadata` namespace, even while it would be nice to split the individual classes out into their own modules within a `pudl.metadata.classes` subpackage, since the `classes.py` module is way too big right now.
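Re-exports in the subpackage's `__init__.py` could give users the flat namespace while the code stays split up (a sketch; the class names are assumptions):

```python
# pudl/metadata/__init__.py -- flattening the public namespace (class names assumed).
# Users can write `pudl.metadata.Resource` even though the class lives deeper down.
from pudl.metadata.classes import Field, Package, Resource

__all__ = ["Field", "Package", "Resource"]
```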
We have some infrequently used analyses that pull in complicated dependencies, and it would be nice if we didn't need to import that stuff all the time. This includes GIS libraries (`pygeos`, `fiona`, `shapely`, `geopandas`...), plotting / dataviz (`matplotlib`), and Datasette. Ideally those dependencies would be optional extras, and the modules referring to them wouldn't be imported unless the extras have been installed.
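A common pattern for this (a sketch; the module path and extra name are hypothetical) is to guard the import and only fail when the functionality is actually used:

```python
# pudl/analysis/spatial.py -- guarding an optional GIS dependency (hypothetical module).
try:
    import geopandas as gpd
except ImportError:  # the optional extra isn't installed
    gpd = None

def plants_to_gdf(df):
    """Turn a plants dataframe into a GeoDataFrame, if geopandas is available."""
    if gpd is None:
        raise ImportError(
            "geopandas is required for spatial analysis; install the optional "
            "extra, e.g. `pip install pudl[gis]` (extra name is hypothetical)."
        )
    return gpd.GeoDataFrame(
        df, geometry=gpd.points_from_xy(df.longitude, df.latitude)
    )
```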
It's easy to forget to add your new module to the top-level `__init__.py`, and it would be better for that registration to happen locally, in the subpackage you're actually adding the module to.
Restarting the IPython kernel completely while working on data processing code can be very inconvenient, since sometimes you are working with an intermediate data product that took several minutes to compute, so we really benefit from the `%autoreload` magic and would like for it to work.
When autoreload attempts to run after changes to a module in which Pydantic classes are defined, it fails (because the intra-class magic of autoreload and Pydantic clash?). This seems to result in modules that should have been reloaded not getting reloaded, requiring manual re-imports to freshen things up.
Resources
Real Python on reloading modules: "Also, be aware that reload() has some caveats. In particular, variables referring to objects within a module are not re-bound to new objects when that module is reloaded. See the documentation for more details."
The `importlib.reload()` docs: "Other references to the old objects (such as names external to the module) are not rebound to refer to the new objects and must be updated in each namespace where they occur if that is desired."
Research Notes
On `__init__.py` variables not always reloading
The problem here seems to be defining a variable in `__init__.py` and assigning it something that is defined elsewhere (in one of the modules imported from within the subpackage). When one of those modules is changed, the objects in its namespace get updated: if there was a dictionary at `params.ferc1.PARAMS` and you added something to it, then it's changed. But the dictionary cobbled together from that and other module constants at `params.ALL_PARAMS` inside `__init__.py` doesn't get updated, because that `__init__.py` module wasn't itself changed, and the system can't go chasing down every value inside every variable that was ever set based on a constant in a changed module.
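Concretely, the pattern that goes stale looks like this (following the `params.ferc1.PARAMS` example; the `eia` module is illustrative):

```python
# pudl/transform/params/__init__.py -- the pattern that goes stale under autoreload.
from pudl.transform.params import eia, ferc1

# Evaluated exactly once, at import time. If ferc1.PARAMS later changes and
# ferc1 is autoreloaded, ALL_PARAMS still holds the values captured here.
ALL_PARAMS = {**ferc1.PARAMS, **eia.PARAMS}
```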
One could forcibly reload all the modules in the subpackage to make sure you always have the updated versions of them... but those `importlib.reload()` statements also won't get executed, since the module they live in isn't itself changing and getting autoreloaded.
Defining a function in `__init__.py` that reads the variables directly from the modules, compiles them into a dictionary, and returns it does work, though, since the function doesn't store the values, just the procedure.
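A sketch of that function-based approach:

```python
# pudl/transform/params/__init__.py -- compile on demand instead of at import time.
from pudl.transform.params import eia, ferc1  # `eia` is illustrative

def all_params() -> dict:
    """Build the combined params dict from the live module namespaces.

    The attribute lookups happen at call time, so autoreload's in-place
    updates to the modules are always picked up.
    """
    return {**ferc1.PARAMS, **eia.PARAMS}
```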
Okay but why are we doing this at all?
We don't want to pass in the transform params as an argument, because we need Dagster to be able to take just the table name as its input.
We could store each table transformer's parameters inside the transformer class itself, but some of those parameter dictionaries are very large, and for readability we'd rather keep them out of the class definitions.
Instead, we currently keep a reference to the transform params in the parent abstract base class, so that any child class can look up its transformation parameters based on its table ID.
It's fine for the child class definitions to depend on the abstract base class.
It's fine for the abstract base class to depend on a bunch of constants.
We can't have the abstract base class depend on anything defined in the modules with the child classes, since that would create a circular import.
So the dependency chain is: concrete classes depend on the ABC, and the ABC depends on constants/params.
We can run this dependency through `__init__.py`, or we can skip that step entirely and just have `pudl.transform.params.classes` be the place where the several dataset-specific dictionaries of parameters are compiled, since that's where the big dictionary of constants is going to be used anyway. In fact, it could be compiled inside the ABC itself... (see the sketch below).
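A minimal sketch of that arrangement (class and table names are illustrative):

```python
# pudl/transform/classes.py -- compiling params right where the ABC uses them (sketch).
from abc import ABC

from pudl.transform.params import eia, ferc1  # dataset modules are illustrative

# Compiled where it's consumed; no __init__.py indirection required.
TRANSFORM_PARAMS: dict = {**ferc1.PARAMS, **eia.PARAMS}

class AbstractTableTransformer(ABC):
    """Parent class: knows how to find transform params given a table ID."""

    table_id: str

    @classmethod
    def params(cls) -> dict:
        return TRANSFORM_PARAMS[cls.table_id]

class FuelFerc1Transformer(AbstractTableTransformer):
    """Concrete child: depends only on the ABC, which depends only on params."""

    table_id = "fuel_ferc1"
```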