catalyst-cooperative / pudl-scrapers

Scrapers used to acquire snapshots of raw data inputs for versioned archiving and replicable analysis.
MIT License

Combine archiver/scraper repos into a single repo #50

Closed · zschira closed this issue 1 year ago

zschira commented 1 year ago

Background

To simplify the archiving/scraping processes, and to enable their automation, we should combine the archiver and scraper repos. The two repos are already tightly bound: the archiver looks for data downloaded by the scraper and creates Zenodo archives containing that data. Combining them will formalize this dependency and make it easier to add and maintain datasets.

Tasks

zaneselvans commented 1 year ago

A minor thing that would be nice to change here is also getting rid of the hard-coded ~/Downloads/pudl_scrapers/scraped path where the data is dropped off and picked up, since it's very OS / user specific. Especially if these are meant to run automatically on GitHub Actions, creating archives as they go, the downloaded files could probably just go in a temporary directory and get deleted at the end of the archiving process, as in the sketch below.
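A rough sketch of that flow, with `scrape` and `archive_to_zenodo` as hypothetical stand-ins for whatever the real entry points end up being:

```python
import tempfile
from pathlib import Path

def scrape(dataset: str, output_dir: Path) -> None:
    """Hypothetical stand-in for the scraper entry point."""
    (output_dir / f"{dataset}.zip").write_bytes(b"")  # pretend download

def archive_to_zenodo(dataset: str, src_dir: Path) -> None:
    """Hypothetical stand-in for the archiver entry point."""
    print(f"archiving {[p.name for p in src_dir.iterdir()]} for {dataset}")

def scrape_and_archive(dataset: str) -> None:
    # Download into a throwaway directory instead of the hard-coded
    # ~/Downloads/pudl_scrapers/scraped path; everything under it is
    # deleted automatically once the archive has been created.
    with tempfile.TemporaryDirectory() as tmp_dir:
        download_dir = Path(tmp_dir)
        scrape(dataset, output_dir=download_dir)
        archive_to_zenodo(dataset, src_dir=download_dir)

scrape_and_archive("ferc2")
```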

zschira commented 1 year ago

Combining the two repos offers an opportunity to simplify some of the inter-repo metadata dependencies. These dependencies can make it cumbersome to integrate new data sources, since it's often difficult to understand what needs to be updated and where it all lives. While refactoring to combine these repos, it should be a priority to consolidate as much of this information as possible and make it easier to update. Below are the main friction points I've identified, with possible solutions for each:

Datasource Metadata

Currently the archiver repo depends on the DataSource metadata and class implemented in PUDL. Ultimately we don't want any of our other tools depending on PUDL, so this metadata should be moved out of PUDL. Allowing our metadata classes to be used by other projects has also been brought up in catalyst-cooperative/pudl#1522. As outlined in that issue, the metadata classes are fairly tightly bound to PUDL at the moment, and it will take a decent amount of refactoring to pull them out. A sketch of what a standalone model might look like follows.
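For reference, a minimal sketch of a standalone datasource model, assuming Pydantic remains the modeling layer; the field names here are illustrative, not the actual `pudl.metadata.classes.DataSource` schema:

```python
from pydantic import BaseModel

class DataSource(BaseModel):
    """Illustrative stand-in; not the real pudl.metadata.classes.DataSource."""
    name: str                                 # e.g. "ferc2"
    title: str                                # human-readable name
    path: str                                 # upstream URL the scraper hits
    working_partitions: dict[str, list[int]]  # e.g. {"years": [1991, ...]}
    license_name: str = "us-govt"             # placeholder default

ferc2 = DataSource(
    name="ferc2",
    title="FERC Form 2",
    path="https://example.com/ferc2",  # placeholder URL
    working_partitions={"years": list(range(1991, 2022))},
)
```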

Zenodo Archive Format

Both PUDL and the scraper must understand the structure of the archives at some level. The format of the archives is somewhat standard, but there are many anomalies. For example, the FERC 2 DBF data comprises 2 partitions per year for 1991-1999 but only a single partition per year after that, and all of the FERC archives will contain both DBF and XBRL data. The scrapers need this information to download the data and create archives, and PUDL needs to understand how to parse the archives (done via the Datastore). I've considered removing the Datastore and encapsulating all of this information/logic in one place, along the lines of the sketch below.
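To make the FERC 2 example concrete, here's one way such an anomaly could be encoded once, in a single shared place, so the scraper and the Datastore read the same rules; this is purely illustrative, not how either repo currently models partitions:

```python
def ferc2_dbf_partitions(year: int) -> list[dict]:
    """Return partition descriptors for a given FERC 2 DBF year.

    Illustrative only: encodes the 2-partitions-per-year anomaly for
    1991-1999 described above, and a single partition afterwards.
    """
    if 1991 <= year <= 1999:
        return [{"year": year, "part": part} for part in (1, 2)]
    return [{"year": year}]

assert len(ferc2_dbf_partitions(1995)) == 2
assert len(ferc2_dbf_partitions(2005)) == 1
```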

Proposed solutions

Option 1: Gradual refactor

Perhaps the most obvious solution would be to combine the repos with minimal refactoring right now, but with the intention of fully disentangling the archiver/scraper from PUDL later. This would mean maintaining the dependency on PUDL for the time being, while we continue working to extract the metadata classes from PUDL.

Option 2: Combine archiver/scraper with PUDL

My second proposed solution is to actually combine the archiver/scraper and PUDL into one mono-repo. This seems somewhat counterintuitive given that we are trying to reduce inter-repo dependencies, but simply formalizing those dependencies could be a clean solution. It would give the scraper/archivers immediate access to all of the required PUDL metadata, while also potentially making it possible to integrate the scraping/archiving process into nightly builds and other PUDL automation. The biggest obvious drawback is adding more code to the already large PUDL codebase, but with good module organization, that might not be a big deal. I also think moving away from PUDL as a library makes this more feasible, as we can control our dependencies better.

zaneselvans commented 1 year ago

The mono-repo option makes me kind of nervous. Having everything wrapped up in the main PUDL repo seems like a setup for lots of entanglement between the different parts, and I'm wondering if that's unavoidable, or if there's a meaningful way to split up these concerns.

Both the scrapers/archivers and PUDL need to be able to access the metadata describing the data sources, but the metadata itself is almost just a static collection of Pydantic data models. So one reasonable arrangement seems like it would be to have a simple pudl-metadata repo that just stores that information, which both the archivers and PUDL would depend on, along the lines of the sketch below.
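A minimal sketch of that arrangement, with the hypothetical pudl-metadata package reduced to a single file; the registry and lookup function are invented for illustration:

```python
from pydantic import BaseModel

# --- contents of the hypothetical shared pudl_metadata package ---
class DataSource(BaseModel):
    name: str
    title: str

SOURCES: dict[str, DataSource] = {
    "ferc2": DataSource(name="ferc2", title="FERC Form 2"),
}

def get_source(name: str) -> DataSource:
    """Lookup shared by both the archivers and PUDL (hypothetical)."""
    return SOURCES[name]

# The archiver and PUDL would then use the same import and lookup,
# with no dependency of one on the other:
print(get_source("ferc2").title)
```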

It also really feels like this must be a solved problem: storing blobs of unstructured or semi-structured data, such that particular blobs can be addressed based on some set of key-value pairs, and storing metadata associated with those blobs. Having some taxonomy of blobs. Is this what a so-called data lake or data lakehouse is?

zschira commented 1 year ago

Closing; see the new archiver repo.