Closed zaneselvans closed 7 months ago
I'm currently getting an error in the conversion of our settings classes into Dagster configs, I think because the taxonomy URL is part of our settings. That element is typed as `Any` in the conversion right now, so I'm not sure why Dagster still treats it as a URL:

```
TypeError: Object of type Url is not JSON serializable
```
I was trying to look at converting our settings classes over to being Dagster Configs directly, but it seems like there are only two high-level Dagster Configs that we're constructing: one for all of the FERC to SQLite conversions, and one for all the rest of the datasets. Is that what we want? Would that mean we need to consolidate all of our existing ETL settings into a single class?
It seems like we just have to make all of the `AnyHttpUrl` attributes into strings to be compatible with the Dagster configs.
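One way to do that while keeping URL validation might look like the following. This is a minimal sketch, not PUDL's actual code: the class and field names (`TaxonomySettings`, `taxonomy_url`) are hypothetical, and it assumes the Pydantic 1.x `validator`/`parse_obj_as` API that the project is currently pinned to:

```python
import json

from pydantic import AnyHttpUrl, BaseModel, parse_obj_as, validator


class TaxonomySettings(BaseModel):
    """Hypothetical settings class: store the URL as a plain str so it JSON-serializes."""

    taxonomy_url: str  # was AnyHttpUrl

    @validator("taxonomy_url")
    def _check_url(cls, value: str) -> str:
        # Still validate that the string parses as an http(s) URL,
        # but keep the stored value a plain str.
        parse_obj_as(AnyHttpUrl, value)
        return value


settings = TaxonomySettings(taxonomy_url="https://example.com/taxonomy")
serialized = json.dumps(settings.dict())  # no longer raises TypeError
```

This keeps the validation error messages from Pydantic while making the serialized settings plain JSON types that Dagster can handle.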
The current setup I created definitely feels awkward and isn't usable with Pydantic 2.0 or the new Dagster `Config` based on Pydantic. I made a `DatasetsSettings` object as a Resource so any asset can access it. I also wrote the `create_dagster_config()` function to convert the `DatasetsSettings` class into a legacy Dagster run configuration so we could configure the `dataset_settings` resource at run time. While a bit janky, it felt like the only way to translate our existing ETL configuration into the Dagster world.
I think it makes sense to convert our Pydantic settings classes directly into Dagster configs now that Dagster configs can be Pydantic classes. There's no need to pass legacy configuration to a Pydantic resource just to do validation that Dagster's legacy config system didn't support.
I think we can do this using nested Dagster configurations. Something like this:
```python
from dagster import asset, Config, RunConfig


class EpaCemsSettings(Config):
    years: list[int]
    states: list[str]


class DatasetsSettings(Config):
    eia: EiaSettings  # defined similarly to EpaCemsSettings
    epacems: EpaCemsSettings
    ...


@asset
def extract_eia860(config: DatasetsSettings):
    years = config.eia.eia860.years
    ...
```
I wonder if it still makes sense to have all of our dataset configs live in one `DatasetsSettings` class so that assets can access all dataset configurations. I think we did this originally because some existing functionality expected all of the data source config. I'll take a look.
Ah, but if we don't keep a configurable Resource around, the syntax for configuring the assets gets weird:
```yaml
assets:
  extract_eia860:
    config:
      years: [2020, 2021]
  glue:
    config:
      years: [2020, 2021]
  ...
  # every asset that depends on dataset configuration
```
Ok, I finally got through all of the new `ConfigurableResource` and `Config` documentation. I also went through all of our assets that depend on the `dataset_settings` resource. Most of the assets depend on config specific to the data source they're processing. For example, `extract_ferc714` uses `dataset_settings.ferc714` to figure out which years to process. However, there is configuration information that needs to be shared among multiple assets. For example, `dataset_settings.eia.eia860.eia860m` is referenced in multiple assets, so it doesn't make sense to make it asset-specific configuration.
I think we should keep all of our dataset configurations in a single resource. It seems like we can create nested `ConfigurableResource`s. I was able to create a toy example of what this might look like in a Dagster question I posted and then quickly answered myself 🤦
I need to play around with this a bit more but I must eat dinner now :)
Okay so... a `Config` is a per-asset setting, while a `Resource` or `ConfigurableResource` is globally accessible through the context? And because we have a few cases in which there's ETL setting information that pertains to one dataset but needs to be accessed by multiple assets, we need to use a `ConfigurableResource`?
Do we actually want these to be configurable at run time in Launchpad? Or should they always be read from the settings YAML files?
I've found that when I touch configurations in Launchpad, the changes stick around and get messed up, with extra copies of individual years and other weirdness. Don't we pretty much always want to run the DAG based on either the fast or the full ETL settings? Having a GUI that each of us could twiddle, ending up running totally different configurations that aren't linked back to what's in the repo, seems like a recipe for confusion.
We can't switch over to using Pydantic 2 until Dagster supports it but I wanted a place to collect tasks for once it's supported.