[DevOps] Enable running benchmarks on reference hardware for weight generation

shannonwells commented 2 years ago

Requires @jacobmfoley assistance for infrastructure setup

Once reference hardware is decided we need to run our benchmarks on that hardware and update all the weights files.

Partially Blocked by:

279
352

Expected

Working with DevOps to get the Reference Hardware Spun Up
The EC2 launch templates and the script to run benchmarks for all pallets will be developed in #279
"Somehow" Trigger updating the weights on a branch
Update runs weights on the reference hardware and commits back to the branch the updated weights
User can then merge branch
The benchmarks should be run against --chain mainnet spec per this comment

Acceptance Criteria

[x] As a developer I should be able to request that the weights be run by entering make benchmarks in a PR comment
[ ] As the repo, I should block merging until I must have the weights updated as the last commit if anything changes that might change the weight (rust files?)

Tasks

[x] Contact @jacobmfoley to obtain AWS security credentials
[x] Read Substrate material on benchmarking
[x] Set up local to run benchmarks
[x] Create a proposed solution diagram
[x] Collect and implement feedback on diagram
[x] ~~Obtain AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from @jacobmfoley to access AWS from GitHub~~
[x] Clarify instance on demand vs self hosted runner with @jacobmfoley
[x] Update diagram with the final approach

Blocked by:

377

Helpful Links

demisx commented 2 years ago

@aramikm My understanding that we don't have production profile set up at the moment. I only see profile.release configured, so this will not work:

cargo build --profile=production --features runtime-benchmarks

Instead, I am planning to compile the binary this way on the referenced hardware prior to running benchmarks:

cargo build --release --features runtime-benchmarks

/cc @saraswatpuneet

demisx commented 2 years ago

@wilwade Per this acceptance criteria:

As the repo, I should block merging until I must have the weights updated as the last commit if anything changes that might change the weight (rust files?)

What if we kick off the benchmarking CI job when files matching certain glob pattern get updated? A developer would still be able to trigger it manually with some predefined action (maybe by entering a slash command /update-weights in a comment?), though. Basically, there will be these two action triggers for benchmarking:

Screen Shot 2022-08-17 at 9 38 38 AM

wilwade commented 2 years ago

@demisx I think we can just go with manual kickoff for now. We could likely create a glob that could do it, but then we would end up triggering it a lot of extra times.

demisx commented 2 years ago

@wilwade Understood. Will stick with the manual trigger for now.

demisx commented 2 years ago

Per @jacobmfoley:

AWS Credentials: You can either configure the cli to use SSO as described here: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-sso.html. click on the link as shown in the attached screenshot.
Region: us-east-2
EC2: if starting an instance, launch into private subnet (no public IP addresses), add an owner tag, specify what the purpose of the machine is in the name and or tags, and stop the instance/clean up after you're done.

demisx commented 2 years ago

Here is the proposed approach (high-level). Please let me know your feedback, so I can move forward.

Run Benchmarking on Referenced Hardware (CI)

wilwade commented 2 years ago

@demisx I think in general that looks good. I think the only thing I wonder is if it is worth blocking the dev branch from additional commits instead of just watching for additional commits and stopping the runner if it finds any. I think watching is better, but it might just be easiest to get up and running by not blocking or watching, but instead just fail out if when commit time comes if the head has changed.

PR blocking can happen via a Commit status API: https://docs.github.com/en/rest/commits/statuses

wilwade commented 2 years ago

@demisx One other thing. We should likely limit the commit to just LibertyDSNP org members. That way we can run them on forked PRs, but others cannot trigger it.

demisx commented 2 years ago

@wilwade Thank you for your feedback. Here are my early morning coffeeless brain thoughts:

I think in general that looks good. I think the only thing I wonder is if it is worth blocking the dev branch from additional commits instead of just watching for additional commits and stopping the runner if it finds any.

Great point. If understand correctly, you mean we can utilize the GitHub concurrency group here and have GH cancel running actions if newer commits are pushed to the same ref.

I think watching is better, but it might just be easiest to get up and running by not blocking or watching, but instead just fail out if when commit time comes if the head has changed.

I was originally thinking of this approach, then my concern was what if a developer pushes a new commit while the action is still running (which may take a while)? Do we simply let the old commit in CI fail? What if it was a simple readme file update? The previous weights would be still valid. But, if the benchmarking dependent code was updated instead and the developer didn't include /run-benchmarks slash command then the previously calculated weights may not be valid anymore. Seemed like too many edge cases could arise with this approach. However, if we use concurrency group and let the running action to be abandoned when a new commit is pushed, then I believe we don't have this problem as each new commit will kick off a new run. I would definitely start with the simplest approach.

PR blocking can happen via a Commit status API: https://docs.github.com/en/rest/commits/statuses

Thank you. I'll take a note of that.

One other thing. We should likely limit the commit to just LibertyDSNP org members. That way we can run them on forked PRs, but others cannot trigger it.

Do you mean limit running benchmarks to the LibertyDSNP org members only? In other words, parse the commit and ignore /run-benchmarks slash command in it unless this commit comes from one of the members.

wilwade commented 2 years ago

Do you mean limit running benchmarks to the LibertyDSNP org members only? In other words, parse the commit and ignore /run-benchmarks slash command in it unless this commit comes from one of the members.

Correct.

wilwade commented 2 years ago

Great point. If understand correctly, you mean we can utilize the GitHub concurrency group here and have GH cancel running actions if newer commits are pushed to the same ref.

I wasn't sure if this was going to get kicked off as a GitHub action, but it it does, then that would work.

demisx commented 2 years ago

Updated diagram showing the excluded steps based on the conversation above.

Run Benchmarking on Referenced Hardware (CI) (1)

demisx commented 2 years ago

If I understood correctly, the benchmarks need to be run against a different template depending on whether the pallet is external or not. Here is how I am planning to run benchmarks on referenced hardware in GitHub Actions:

For internal pallets (e.g. pallet_messages, pallet_msa, pallet_schemas):

./target/release/frequency benchmark pallet \
    --chain mainnet \
    --execution=wasm \
    --wasm-execution=compiled \
    --pallet <pallet_name> \
    --extrinsic "*" \
    --steps 20 \
    --repeat 5 \
    --output ./pallets/msa/src/weights.rs \
    --template=./.maintain/frame-weight-template.hbs # <- frame template

For external pallets (e.g. orml_vesting, pallet_scheduler, pallet_utility):

./target/release/frequency benchmark pallet \
    --chain mainnet \
    --execution=wasm \
    --wasm-execution=compiled \
    --pallet <pallet_name> \
    --extrinsic "*" \
    --steps 20 \
    --repeat 5 \
    --output ./runtime/frequency/src/weights \
    --template=./.maintain/runtime-weight-template.hbs # <- runtime template

Please let me know if I missed anything.

demisx commented 2 years ago

@wilwade Just FYI, @shannonwells and I realized that we maybe overlapping here and in #279, because Shannon has already created EC2 launch templates (x86 and ARM), plus the script that runs benchmarks for all pallets in #279 and I was asking questions here on proper commands to run the benchmarks on referenced hardware. We decided that I will focus on the GitHub Actions process/AWS (devOps stuff) here and then we'll plug in Shannon's work in #279 to run benchmarks as part of the GH Actions job. This way we should not duplicate any work.

I also added a note to the description re: templates and script coming from #279

demisx commented 2 years ago

@jacobmfoley What’s the process to add AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY key to GitHub secrets? Do I need to request it through Pivotal? And if GitHub Actions needs to temporarily spin up an EC2 instance, should it do it in the current Dev/Stg account or do we have a CI dedicated AWS account? Thank you.

jacobmfoley commented 2 years ago

@demisx I've avoided adding AWS credentials to Github Actions and have the goal of eliminating all static IAM users all together.

If using actions is a requirement, we can run self hosted actions runners on the required instance sizes.

If not, there is a spot fleet of Jenkins slaves with rust installed with the label "benchmark" ready to run jobs.

demisx commented 2 years ago

@jacobmfoley Thank you for your prompt reply. We already use GitHub actions in this Frequency project and I was planning to keep it simple without introducing Jenkins to the stack. The idea was to launch an EC2 instance on demand when a developer wants to run benchmarks and then destroy it at the end of the actions run. This would let us:

Change referenced hardware instance as needed when the specs change w/o burdening devOps team
Save some $$ compared to keeping it always on as these instances aren't cheap and probably won't be needed daily

Please let me know if it's still OK to proceed with the GitHub Actions and whether it's OK to start the instance on the fly or we should use the self-hosted runner per your suggestion. I'll update the diagram accordingly.

UPDATE: Something to consider if going with self-hosted runners:

A self-hosted runner is automatically removed from GitHub if it has not connected to GitHub Actions for more than 14 days. An ephemeral self-hosted runner is automatically removed from GitHub if it has not connected to GitHub Actions for more than 1 day.
We recommend that you only use self-hosted runners with private repositories. This is because forks of your repository can potentially run dangerous code on your self-hosted runner machine by creating a pull request that executes the code in a workflow. More info here.
Some of the mentioned above concerns can probably be mitigated leveraging auto-scaling with ephemeral self-hosted runners.

wilwade commented 2 years ago

@demisx Reference hardware is confirmed: #279

demisx commented 2 years ago

@wilwade Got it. Waiting on @jacobmfoley to finalize the approach described in previous comment before moving forward with this. Working on #256 in the meantime.

demisx commented 2 years ago

Per Slack conversation with @wilwade @sbendar @jacobmfoley, there are security concerns using AWS credentials in GitHub Actions as a bad actor can potentially gain access to them. Instead, devOps recommendation is to leverage Jenkins CI as it has much better protections against such incidents.

Next steps for @demisx:

[x] Update the diagram and replace GitHub Actions with Jenkins CI (see image below)
[x] Provide approved reference hardware specs (see #279)
[x] Provide link to @shannonwells's EC2 launch templates
- Log in to "Liberty Stg/Dev" account
- Go to EC2->Launch Templates
- The launch template name is "polkadot-reference" with ID lt-06cd50a2dd2a023de
[x] Provide link to @shannonwells's script to run benchmarks
- The script is here, but it needs some tweaking in #377

After these steps are complete the story is expected to be taken over by the devOps team.

demisx commented 2 years ago

The updated flow diagram with Jenkins CI is attached. Please let me know if I missed anything.

Run Benchmarking on Referenced Hardware (CI)

demisx commented 2 years ago

@jacobmfoley I have provided the outstanding items above. Please let me know if you need anything else. Thank you.

demisx commented 2 years ago

Per @jacobmfoley Slack message, devOps have all info they need at this point to move forward and this will be their next story to work on after the collator automation is finished.

demisx commented 2 years ago

Followed up in devops Slack channel on ETA of the story completion (see quoted below). Awaiting response.

Re: CI Run Benchmarks on Reference Hardware (ASAP). is there a rough ETA when it will be completed? It’s kind of urgent. My understanding the only part left is committing generated benchmark files back to the repo. Thanks.

demisx commented 2 years ago

Per devOps, the updated benchmarks are supposed to be getting committed now. However, it didn't work with @wilwade's test. Also, asked the devOps to update the comment trigger to be the original slash command /run-benchmarks.

wilwade commented 2 years ago

Benchmarks are working via manual trigger with Jenkins.

Looking into self-hosted runners for it eventually: https://github.com/LibertyDSNP/frequency/issues/609

wilwade commented 2 years ago

https://github.com/LibertyDSNP/frequency/wiki/Benchmarking

frequency-chain / frequency