catalyst-cooperative / pudl-catalog

An Intake catalog for distributing open energy system data liberated by Catalyst Cooperative.
https://catalyst.coop/pudl/

Replace storage_option with AWS S3 bucket #49

Closed: bendnorman closed this 1 year ago

bendnorman commented 1 year ago

PUDL has been accepted to the Open Data Sponsorship Program on AWS, which covers storage and egress fees of S3 buckets that contain our data. This is great news because our users won't have to set up a GCP account to deal with requester pays.
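As a minimal sketch of what this means for users: once the data is in a public AWS Open Data bucket, it can be read anonymously, with no cloud account at all. The bucket name and file path below are illustrative assumptions, not the catalog's actual layout (requires `s3fs` and `pyarrow`):

```python
import pandas as pd

# Anonymous read from a public AWS Open Data bucket: no AWS account,
# no requester-pays billing setup. Bucket/path are hypothetical.
df = pd.read_parquet(
    "s3://pudl.catalyst.coop/nightly/hourly_emissions_epacems.parquet",
    storage_options={"anon": True},
)
print(df.head())
```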

Tasks:

- AWS Notes
- AWS MFA Notes

zaneselvans commented 1 year ago

Notes on the AWS Open Data Program terms and conditions:

> (g) provide AWS with information reasonably requested concerning End User use of Program Content.

Do we have any idea what this would entail?

> Your participation in the Program will terminate two (2) years from the Effective Date.

I think this argues for continuing to deposit the data in the intake.catalyst.coop bucket as well, and for making the storage URL configurable (see the sketch below), so that current and past versions of the data remain available long term regardless of how AWS changes the program.
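A minimal sketch of that configurability, assuming an environment-variable override for the base storage URL; the variable name, default, and bucket paths here are illustrative, not the catalog's actual settings:

```python
import os

# Fall back to the long-term intake.catalyst.coop GCS bucket, but let
# users point the catalog at the AWS mirror (or any other copy) without
# code changes. PUDL_INTAKE_PATH is a hypothetical variable name.
DEFAULT_URL = "gs://intake.catalyst.coop/dev"
BASE_URL = os.environ.get("PUDL_INTAKE_PATH", DEFAULT_URL)

def data_url(filename: str) -> str:
    """Build the full URL for a catalog asset under the active bucket."""
    return f"{BASE_URL.rstrip('/')}/{filename}"

print(data_url("hourly_emissions_epacems.parquet"))
```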

> You may not issue any press release or other public statement with respect to your participation in the Program unless we approve in advance and in writing.

This seems kind of ridiculous. Do we need to get their permission to mention it in the README of the pudl-catalog repo or our documentation? Or to write a blog post? Or to tweet about it? I guess it doesn't particularly matter that much as long as we can mysteriously provide free access to all of the data outputs.

zaneselvans commented 1 year ago

Maybe we should have a quick board meeting item with notes tomorrow to give you the power to enter into The Agreement.

bendnorman commented 1 year ago

Just emailed them with your questions. I'll add this to tomorrow's board meeting agenda if they respond in time.

I think they only need to approve press releases, but they don't need to approve supporting social media posts as long as those follow the PR guidelines (screenshot of the guidelines attached).

bendnorman commented 1 year ago

AWS response:

On Twitter and documentation:

> Hi Ben, B.1 in https://assets.opendata.aws/aws-onboarding-handbook-for-data-providers-en-US.pdf should help answer the first question, but no need to ask us for permission to talk about your participation on Twitter or documentation.

Can we assume we'll get a renewal after two years?

> For renewals, we do not post a percentage, but if we did, it’d be close to 100%. You can see a number of the datasets listed at https://registry.opendata.aws have been there for some time and are not disappearing because we are not renewing. Once data is made available through the program, it’s a bad experience to have it removed if users are depending on it. So while we retain the right not to renew, we generally renew.

What does "(g) provide AWS with information reasonably requested concerning End User use of Program Content." mean?

> To your last question, from time to time we may make requests for use cases around the data usage to help us show the value of the data publicly. If you can share, great! If not, it’s not a problem. Even though the language is a bit vague, we actually cannot even accept any more detailed information (like detailed usage by customer, if you had it) without some sort of data sharing agreement in place.

bendnorman commented 1 year ago

Something I didn't consider was the egress fees from the GCP VMs to the AWS bucket. Each time a nightly build succeeds we'll need to copy the outputs from the VM to the AWS bucket. We currently output about 11 GB of data. Premium network egress pricing is $0.12/GB and standard is $0.085/GB (the VM network is set to premium by default, but this can be changed). Assuming all of our nightly builds succeed, `aws cp` doesn't do any compression, and we use the standard network, that's 11 GB × 365 builds × $0.085/GB ≈ $340 of egress in a year. I think this is a reasonable price to pay for our users to have free access to the data and for us to continue to use our GCP nightly builds. An alternative would be to migrate our nightly build infrastructure to AWS, but that doesn't feel worth it.
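A back-of-the-envelope check of those figures, using the build size and rates quoted above and assuming one successful build per day:

```python
# GCP -> AWS egress estimate from the figures in the comment above.
GB_PER_BUILD = 11        # nightly build output size
BUILDS_PER_YEAR = 365    # assumes every nightly build succeeds
RATES = {"standard": 0.085, "premium": 0.12}  # $/GB network egress

for tier, rate in RATES.items():
    cost = GB_PER_BUILD * BUILDS_PER_YEAR * rate
    print(f"{tier}: ${cost:,.0f}/year")
# standard: $341/year
# premium: $482/year
```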

zaneselvans commented 1 year ago

@bendnorman were there still tasks under this issue that need to be completed?

bendnorman commented 1 year ago

This is done for now. I will hold off on adding PUDL to the AWS quarterly newsletter until the catalog is more usable.