Closed bendnorman closed 2 years ago
Re: the awkward data integration flow... yeah. Since this repo is really just for internal use, maybe it would make sense to have it depend on the PUDL dev
branch at least?
All the frictionless data-package metadata is now being generated from the DataSource
objects now. I've also updated this repo to depend on pudl
dev branch. I also wanted to ask if we want to use metadata from the pudl
repo in zs/metadata.py
? Seems a lot of the metadata stored here is slightly different from anything elsewhere, except for the keywords
, which can probably be easily replaced.
I think anything in zs.metadata
that we can accurately generate using the programmatically accessible metadata we're already going to be maintaining int he PUDL repo should be replaced with references to that metadata. Looking at an example of the metadata that's being stored for Zenodo (i.e. the non-datapackage metadata, that's directly associated with the Zenodo deposit) I think we can probably...
us-govt
=> other-pd
)ferc1_uuid = "d3d91c87-c595-49d5-a7f3-e5f5669c8306"
ferc1 = {
"title": "PUDL Raw FERC Form 1",
"language": "eng",
"upload_type": "dataset",
"description": "<p>Raw Federal Energy Regulatory Commission (FERC) Form 1 data, "
"archived from\n"
"<a href=\"https://www.ferc.gov/industries-data/electric/"
"general-information/electric-industry-forms/"
"form-1-electric-utility-annual\">"
"https://www.ferc.gov/industries-data/electric/"
"general-information/electric-industry-forms/"
"form-1-electric-utility-annual</a></p>"
f"{pudl_description}",
"creators": creators,
"access_right": "open",
"license": "other-pd",
"keywords": [
"electricity", "electric", "utility", "plant", "steam", "generation",
"cost", "expense", "price", "heat content", "ferc", "form 1",
"federal energy regulatory commission", "capital", "accounting",
"depreciation", "finance", "plant in service", "hydro", "coal",
"natural gas", "gas", "opex", "capex", "accounts", "investment",
"capacity", "usa", ferc1_uuid
]
}
Sounds good, I'll go ahead and make those changes. For the contributors I'll use all the contributors unless we have a specific subset in mind
Updated the metadata in zs/metadata.py
. Second question I forgot to mention earlier: how do we want to handle eipinfrastructure
and epaipm
? These sources currently don't exist in pudl
.
The epaipm
is deprecated and can/should be removed.
Would it make sense to define a new DataSource
locally for those that don't exist in the main PUDL repo, and use that object to generate the metdata? Then at least the interface is the same for all of them, and we'd be able to easily migrate a dataset from the scraper/archives only into PUDL if/when that made sense.
@bendnorman is the EIP Infrastructure actually coming from the Zenodo archives for the DBCP work? And should it be? Or was that going to go into the GitHub LFS in that repo?
Is there a dependency issue of some kind going on here?
==================================== ERRORS ====================================
_ ERROR collecting src/catalystcoop-pudl/test/unit/workspace/datastore_test.py _
ImportError while importing test module '/home/runner/work/pudl-zenodo-storage/pudl-zenodo-storage/src/catalystcoop-pudl/test/unit/workspace/datastore_test.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/importlib/__init__.py:127: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
src/catalystcoop-pudl/test/unit/workspace/datastore_test.py:8: in <module>
import responses
E ModuleNotFoundError: No module named 'responses'
I'll go ahead and remove epaipm
. And I think it makes sense to have a local DataSource
definition. That's probably the most consistent way to handle it.
And regarding the failing tests, I believe it's because I was using --editable
when installing pudl
dev branch via pip so I could test local changes to pudl
. This was installing it directly in the root directory of this repo, and it looks like pytest was then recursively running tests from the pudl
repo, which were failing due to environment issues. I can fix this, but I believe the tests are still going to fail until the changes I made to pudl
are merged into dev
.
Hmm, @zschira it looks like real test failures having to do with the Zenodo API. Did the tests work locally?
@zaneselvans I believe I have an environment setup issue so I haven't been able to run this specific test yet locally (I need to set ZENODO_SANDBOX_TOKEN_PUBLISH
). If you or someone else can point me to how to get my environment configured properly for this, I'll work on debugging locally
Added changes to use Catalyst as the only contributor in the raw datapackage descriptors, and tests to verify that those descriptors are valid.
The tests currently failing are caused by lack of keywords for censusdp1tract
and ferc714
, which are required to create a valid datapackage descriptor. These issues are addressed in this pull request.
The tests that depend on connecting to Zenodo are failing, and we're getting an error message that indicates we're being blocked. So.... not sure what happened there. I don't know how we could have hit the API limit, even with all this testing and messing around. And the tests pass when I run them locally, so maybe it's an IP address block?
In any case, this all seems to be working as far as our end of things is concerned, and the CI is running fine on all the other tests at the moment...
There is a lot of duplication between our Zenodo metadata, datapackages and
pudl.metadata
. This PR addresses duplications between datapackage generation andpudl.metadata
module.I think a majority of data in our datapackage.json files can be generated from
pudl.metadata
but we need to add a couple of things topudl.metadata
first. I left some questions in my review @zaneselvansResolves catalyst-cooperative/pudl#1419