catalyst-cooperative / pudl-archiver

A tool for capturing snapshots of public data sources and archiving them on Zenodo for programmatic use.
MIT License

Permissions errors when adding new dataset to Zenodo #38

Closed e-belfer closed 1 year ago

e-belfer commented 1 year ago

Ran into some permissions issues trying to test a new source of data (FERC EQR) in the Zenodo sandbox.

pudl_archiver --datasets ferceqr --dry-run requires the production token at present, returning: KeyError: 'ZENODO_TOKEN_UPLOAD'

pudl_archiver --datasets ferceqr --initialize returns the same error, also requiring the production token.

pudl_archiver --datasets ferceqr --sandbox returns a KeyError for the dataset name in the Zenodo API client, breaking on the line settings = self.dataset_settings[data_source_id] in /zenodo/api_client.py.

In other words, it seems to be impossible to test a new dataset (either locally or in the sandbox) without updating the dataset settings entry, which requires access to the production keys. This seems to be an undesirable side effect of some of the more recent refactoring changes.
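
For readers unfamiliar with the failure mode: a bare KeyError like the one above is what Python raises when an environment variable is read with os.environ[...] and is not set. The sketch below is not the pudl-archiver implementation. It only illustrates how a token lookup could select a sandbox or production variable and fail with a clearer message. The function name get_upload_token and the variable name ZENODO_SANDBOX_TOKEN_UPLOAD are assumptions made for illustration; only ZENODO_TOKEN_UPLOAD appears in this thread.

```python
import os
from collections.abc import Mapping


def get_upload_token(sandbox: bool, env: Mapping[str, str] = os.environ) -> str:
    """Return the upload token for the selected Zenodo environment.

    Hypothetical helper: the sandbox variable name is an assumption,
    only ZENODO_TOKEN_UPLOAD is mentioned in the issue.
    """
    name = "ZENODO_SANDBOX_TOKEN_UPLOAD" if sandbox else "ZENODO_TOKEN_UPLOAD"
    target = "the sandbox" if sandbox else "production"
    try:
        return env[name]
    except KeyError:
        # Replace the bare KeyError with a message that says what to set.
        raise RuntimeError(f"Set {name} to run against {target}.") from None
```

With a lookup like this, a sandbox run would only ever ask for the sandbox token, so missing production credentials would not block local testing.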

e-belfer commented 1 year ago

@zschira @jdangerx Flagging this bug as you're working on your refactoring PR.

jdangerx commented 1 year ago

Thanks for flagging this! I'll take a look. The first two dry runs needing production keys make sense. I think that we should not need the production key to update the dataset settings, so I'll look at how that is working...

katie-lamb commented 1 year ago

I have this issue as well but I'm assuming that @e-belfer and I don't have access to the secrets for this repo. So maybe one of @zschira or @jdangerx should just send us the ZENODO_TOKEN_UPLOAD for now and we can set it as an environment variable.

e-belfer commented 1 year ago

Update to note that running pudl_archiver --datasets ferceqr --initialize --sandbox with the sandbox tokens stored in .env returns the known field length error but otherwise does not raise any token errors.

jdangerx commented 1 year ago

@e-belfer do you have a branch that you've been working off of? Running this on main tells me that it doesn't know what ferceqr is.

e-belfer commented 1 year ago

@jdangerx Just pushed to ferceqr but note this is as of yet untested.

jdangerx commented 1 year ago

Sweet, thanks! When I run pudl_archiver --datasets ferceqr --initialize --sandbox I get the "no creators" issue that should be fixed with #42. Is that what you're getting?

    raise ZenodoClientException(
pudl_archiver.depositors.zenodo.ZenodoClientException: ZenodoClientException({'status': 400, 'message': 'Validation error.', 'errors': [{'field': 'metadata.creators', 'message': 'Shorter than minimum length 1.'}]})

jdangerx commented 1 year ago

If I rebase onto the small_fixes branch that is in #42 and run I get:

> pudl_archiver --datasets ferceqr --initialize --sandbox
2023-01-26 16:00:37 [    INFO] catalystcoop.pudl_archiver.depositors.zenodo:90 POST https://sandbox.zenodo.org/api/deposit/depositions - Create new deposition
2023-01-26 16:00:39 [    INFO] catalystcoop.pudl_archiver.archivers.classes:75 Archiving ferceqr
2023-01-26 16:00:41 [ WARNING] catalystcoop.pudl_archiver.archivers.classes:155 The archiver couldn't find any hyperlinks that match re.compile('CSV_(\\d{4})_Q([1-4]).zip').Make sure your filter_pattern is correct or if the structure of the https://eqrreportviewer.ferc.gov/ page changed.
Encountered exceptions, showing traceback for last one: ["('ferceqr', AssertionError())"]
Traceback (most recent call last):
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-archiver-toml/bin/pudl_archiver", line 8, in <module>
    sys.exit(main())
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/cli.py", line 58, in main
    asyncio.run(archive_datasets(**vars(args)))
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-archiver-toml/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-archiver-toml/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/__init__.py", line 99, in archive_datasets
    raise exceptions[-1][1]
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/__init__.py", line 47, in archive_dataset
    await archiver.create_archive()
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/classes.py", line 175, in create_archive
    resource_info = await resource_coroutine
  File "/Users/dazhong-catalyst/mambaforge/envs/pudl-archiver-toml/lib/python3.10/asyncio/tasks.py", line 571, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/ferc/ferceqr.py", line 49, in get_year_dbf
    assert year >= 2012 and year <= 2002
AssertionError
Task exception was never retrieved
future: <Task finished name='Task-8' coro=<FercEQRArchiver.get_year_dbf() done, defined at /Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/ferc/ferceqr.py:45> exception=AssertionError()>
Traceback (most recent call last):
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/ferc/ferceqr.py", line 49, in get_year_dbf
    assert year >= 2012 and year <= 2002
AssertionError
Task exception was never retrieved
future: <Task finished name='Task-9' coro=<FercEQRArchiver.get_year_dbf() done, defined at /Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/ferc/ferceqr.py:45> exception=AssertionError()>
Traceback (most recent call last):
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/ferc/ferceqr.py", line 49, in get_year_dbf
    assert year >= 2012 and year <= 2002
AssertionError
Task exception was never retrieved
future: <Task finished name='Task-10' coro=<FercEQRArchiver.get_year_dbf() done, defined at /Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/ferc/ferceqr.py:45> exception=AssertionError()>
Traceback (most recent call last):
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/ferc/ferceqr.py", line 49, in get_year_dbf
    assert year >= 2012 and year <= 2002
AssertionError
Task exception was never retrieved
future: <Task finished name='Task-11' coro=<FercEQRArchiver.get_year_dbf() done, defined at /Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/ferc/ferceqr.py:45> exception=AssertionError()>
Traceback (most recent call last):
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/ferc/ferceqr.py", line 49, in get_year_dbf
    assert year >= 2012 and year <= 2002
AssertionError
Task exception was never retrieved
future: <Task finished name='Task-12' coro=<FercEQRArchiver.get_year_dbf() done, defined at /Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/ferc/ferceqr.py:45> exception=AssertionError()>
Traceback (most recent call last):
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/ferc/ferceqr.py", line 49, in get_year_dbf
    assert year >= 2012 and year <= 2002
AssertionError
Task exception was never retrieved
future: <Task finished name='Task-14' coro=<FercEQRArchiver.get_year_dbf() done, defined at /Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/ferc/ferceqr.py:45> exception=AssertionError()>
Traceback (most recent call last):
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/ferc/ferceqr.py", line 49, in get_year_dbf
    assert year >= 2012 and year <= 2002
AssertionError
Task exception was never retrieved
future: <Task finished name='Task-15' coro=<FercEQRArchiver.get_year_dbf() done, defined at /Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/ferc/ferceqr.py:45> exception=AssertionError()>
Traceback (most recent call last):
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/ferc/ferceqr.py", line 49, in get_year_dbf
    assert year >= 2012 and year <= 2002
AssertionError
Task exception was never retrieved
future: <Task finished name='Task-16' coro=<FercEQRArchiver.get_year_dbf() done, defined at /Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/ferc/ferceqr.py:45> exception=AssertionError()>
Traceback (most recent call last):
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/ferc/ferceqr.py", line 49, in get_year_dbf
    assert year >= 2012 and year <= 2002
AssertionError
Task exception was never retrieved
future: <Task finished name='Task-17' coro=<FercEQRArchiver.get_year_dbf() done, defined at /Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/ferc/ferceqr.py:45> exception=AssertionError()>
Traceback (most recent call last):
  File "/Users/dazhong-catalyst/work/pudl-archiver/src/pudl_archiver/archivers/ferc/ferceqr.py", line 49, in get_year_dbf
    assert year >= 2012 and year <= 2002
AssertionError
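
A note on the traceback above: the assertion shown on line 49 of ferceqr.py, year >= 2012 and year <= 2002, cannot hold for any year, which is why every get_year_dbf task fails. The sketch below demonstrates that and shows one possible corrected form. The function names and the bounds passed to bounded_check are hypothetical, since the intended year range is not stated in this thread.

```python
def old_check(year: int) -> bool:
    # Condition copied from the traceback: no integer is both
    # greater than or equal to 2012 and less than or equal to 2002.
    return year >= 2012 and year <= 2002


def bounded_check(year: int, first_year: int, last_year: int) -> bool:
    # Hypothetical fix: a chained comparison with the lower bound first.
    # The real bounds for FERC EQR are an assumption left to the caller.
    return first_year <= year <= last_year


# The original condition rejects every year in a wide range.
assert not any(old_check(y) for y in range(1990, 2031))

# With illustrative bounds, in-range years pass and others fail.
assert bounded_check(2015, first_year=2012, last_year=2022)
assert not bounded_check(2005, first_year=2012, last_year=2022)
```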

e-belfer commented 1 year ago

Great, let me rebase in the same way and keep fiddling around with it. Thanks!

jdangerx commented 1 year ago

FYI - #42 got merged so you're good to keep working off main @e-belfer

e-belfer commented 1 year ago

@jdangerx @zschira This was working great before the big merge, but now I'm seeing that line 54 in entities.py, return cls(name=contributor.title, affiliation=contributor.organization), raises AttributeError: 'dict' object has no attribute 'title' when I run --initialize --sandbox on the FERC EQR data.

jdangerx commented 1 year ago

I can take a look at it this afternoon - let me know if you want to pair!

zschira commented 1 year ago

@e-belfer and @jdangerx it looks like what's on ferceqr came from an earlier commit from the small-fixes branch. On ferceqr in entities.py there's the line:

creators = [
    DepositionCreator.from_contributor(CONTRIBUTORS["catalyst-cooperative"])
]

It should be:

creators = [
    DepositionCreator.from_contributor(
        Contributor.from_id("catalyst-cooperative")
    )
]

This change should fix the issue. I think this is just a git issue, and if you get the latest from main you should be good to go.
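
To illustrate why the older line fails, here is a minimal, self-contained sketch. It is not the actual pudl-archiver code: the class bodies and the contents of CONTRIBUTORS are assumptions, and only the names Contributor.from_id, DepositionCreator.from_contributor, CONTRIBUTORS, title and organization come from this thread. The point is that from_contributor reads attributes, so it needs a Contributor object rather than the raw dict stored in CONTRIBUTORS.

```python
from dataclasses import dataclass

# Assumed shape of the lookup table: raw dicts keyed by contributor id.
CONTRIBUTORS = {
    "catalyst-cooperative": {
        "title": "Catalyst Cooperative",
        "organization": "Catalyst Cooperative",
    }
}


@dataclass
class Contributor:
    title: str
    organization: str

    @classmethod
    def from_id(cls, contributor_id: str) -> "Contributor":
        # Wrap the raw dict in an object that exposes attributes.
        return cls(**CONTRIBUTORS[contributor_id])


@dataclass
class DepositionCreator:
    name: str
    affiliation: str

    @classmethod
    def from_contributor(cls, contributor: Contributor) -> "DepositionCreator":
        # Attribute access: works on a Contributor, fails on a plain dict.
        return cls(name=contributor.title, affiliation=contributor.organization)


# Newer form: build a Contributor first, then convert it.
creator = DepositionCreator.from_contributor(
    Contributor.from_id("catalyst-cooperative")
)

# Older form: passing the dict directly raises AttributeError.
try:
    DepositionCreator.from_contributor(CONTRIBUTORS["catalyst-cooperative"])
except AttributeError as err:
    error_message = str(err)
```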

e-belfer commented 1 year ago

Absolutely, my error in rebasing. Fixed, thanks!

e-belfer commented 1 year ago

@jdangerx @zschira Are there any other remaining fundamental permissions issues that are not behaving as expected? If not, I'll go ahead and close out the issue.

jdangerx commented 1 year ago

I think this is behaving as expected!