catalyst-cooperative / pudl-zenodo-storage

Tools for creating versioned archives of raw data on Zenodo using Frictionless data packages.
MIT License
2 stars 2 forks source link

WIP: Create datapackage metadata from pudl metadata #15

Closed bendnorman closed 2 years ago

bendnorman commented 2 years ago

There is a lot of duplication between our Zenodo metadata, datapackages and pudl.metadata. This PR addresses duplications between datapackage generation and pudl.metadata module.

I think a majority of data in our datapackage.json files can be generated from pudl.metadata but we need to add a couple of things to pudl.metadata first. I left some questions in my review @zaneselvans

Resolves catalyst-cooperative/pudl#1419

zaneselvans commented 2 years ago

Re: the awkward data integration flow... yeah. Since this repo is really just for internal use, maybe it would make sense to have it depend on the PUDL dev branch at least?

zschira commented 2 years ago

All the frictionless data-package metadata is now being generated from the DataSource objects now. I've also updated this repo to depend on pudl dev branch. I also wanted to ask if we want to use metadata from the pudl repo in zs/metadata.py? Seems a lot of the metadata stored here is slightly different from anything elsewhere, except for the keywords, which can probably be easily replaced.

zaneselvans commented 2 years ago

I think anything in zs.metadata that we can accurately generate using the programmatically accessible metadata we're already going to be maintaining int he PUDL repo should be replaced with references to that metadata. Looking at an example of the metadata that's being stored for Zenodo (i.e. the non-datapackage metadata, that's directly associated with the Zenodo deposit) I think we can probably...

ferc1_uuid = "d3d91c87-c595-49d5-a7f3-e5f5669c8306"
ferc1 = {
    "title": "PUDL Raw FERC Form 1",
    "language": "eng",
    "upload_type": "dataset",
    "description": "<p>Raw Federal Energy Regulatory Commission (FERC) Form 1 data, "
                   "archived from\n"
                   "<a href=\"https://www.ferc.gov/industries-data/electric/"
                   "general-information/electric-industry-forms/"
                   "form-1-electric-utility-annual\">"
                   "https://www.ferc.gov/industries-data/electric/"
                   "general-information/electric-industry-forms/"
                   "form-1-electric-utility-annual</a></p>"
                   f"{pudl_description}",
    "creators": creators,
    "access_right": "open",
    "license": "other-pd",
    "keywords": [
        "electricity", "electric", "utility", "plant", "steam", "generation",
        "cost", "expense", "price", "heat content", "ferc", "form 1",
        "federal energy regulatory commission", "capital", "accounting",
        "depreciation", "finance", "plant in service", "hydro", "coal",
        "natural gas", "gas", "opex", "capex", "accounts", "investment",
        "capacity", "usa", ferc1_uuid
    ]
}
zschira commented 2 years ago

Sounds good, I'll go ahead and make those changes. For the contributors I'll use all the contributors unless we have a specific subset in mind

zschira commented 2 years ago

Updated the metadata in zs/metadata.py. Second question I forgot to mention earlier: how do we want to handle eipinfrastructure and epaipm? These sources currently don't exist in pudl.

zaneselvans commented 2 years ago

The epaipm is deprecated and can/should be removed.

Would it make sense to define a new DataSource locally for those that don't exist in the main PUDL repo, and use that object to generate the metdata? Then at least the interface is the same for all of them, and we'd be able to easily migrate a dataset from the scraper/archives only into PUDL if/when that made sense.

@bendnorman is the EIP Infrastructure actually coming from the Zenodo archives for the DBCP work? And should it be? Or was that going to go into the GitHub LFS in that repo?

zaneselvans commented 2 years ago

Is there a dependency issue of some kind going on here?

 ==================================== ERRORS ====================================
_ ERROR collecting src/catalystcoop-pudl/test/unit/workspace/datastore_test.py _
ImportError while importing test module '/home/runner/work/pudl-zenodo-storage/pudl-zenodo-storage/src/catalystcoop-pudl/test/unit/workspace/datastore_test.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
src/catalystcoop-pudl/test/unit/workspace/datastore_test.py:8: in <module>
    import responses
E   ModuleNotFoundError: No module named 'responses'
zschira commented 2 years ago

I'll go ahead and remove epaipm. And I think it makes sense to have a local DataSource definition. That's probably the most consistent way to handle it.

And regarding the failing tests, I believe it's because I was using --editable when installing pudl dev branch via pip so I could test local changes to pudl. This was installing it directly in the root directory of this repo, and it looks like pytest was then recursively running tests from the pudl repo, which were failing due to environment issues. I can fix this, but I believe the tests are still going to fail until the changes I made to pudl are merged into dev.

zaneselvans commented 2 years ago

Hmm, @zschira it looks like real test failures having to do with the Zenodo API. Did the tests work locally?

zschira commented 2 years ago

@zaneselvans I believe I have an environment setup issue so I haven't been able to run this specific test yet locally (I need to set ZENODO_SANDBOX_TOKEN_PUBLISH). If you or someone else can point me to how to get my environment configured properly for this, I'll work on debugging locally

zschira commented 2 years ago

Added changes to use Catalyst as the only contributor in the raw datapackage descriptors, and tests to verify that those descriptors are valid.

zschira commented 2 years ago

The tests currently failing are caused by lack of keywords for censusdp1tract and ferc714, which are required to create a valid datapackage descriptor. These issues are addressed in this pull request.

zaneselvans commented 2 years ago

The tests that depend on connecting to Zenodo are failing, and we're getting an error message that indicates we're being blocked. So.... not sure what happened there. I don't know how we could have hit the API limit, even with all this testing and messing around. And the tests pass when I run them locally, so maybe it's an IP address block?

In any case, this all seems to be working as far as our end of things is concerned, and the CI is running fine on all the other tests at the moment...