NCAR / wrfcloud

WRF Cloud Framework
Apache License 2.0

Document installation procedures #126

Closed fossell closed 1 year ago

fossell commented 1 year ago

Describe the Task

Document the steps to install the system in the Users Guide.

These documentation updates should be done in conjunction with testing the installation of WRF Cloud within the METplus AWS account. After that installation is complete, be sure to test by running WRF with sub-hourly output. That will test that #113 works as expected.

Time Estimate

1 day

JohnHalleyGotway commented 1 year ago

On 1/20/23, @JohnHalleyGotway met with @hahnd to reinstall the existing wrfcloud instance. Made several notes about that process (details below) and recorded the meeting (see Shared Google Drive).

On 1/25/23, @JohnHalleyGotway will step through the instructions to install WRFCloud as johnhg-metplus in the METplus AWS Instance.

@JohnHalleyGotway will update the documentation on the feature/auto-install branch directly with details from installing in the WRFCloud and METplus AWS Instances.

Details...

The compilation of Python3.9 takes a long time during the bootstrap process. Consider searching for a pre-built package through yum for this instead.

Verify the install.

Questions to be answered:

  1. Would you like to enable autocompletion? This will set up your terminal so pressing TAB while typing Angular CLI commands will show possible options and autocomplete arguments. (Enabling autocompletion will modify configuration files in your home directory.) Yes or No, doesn't really matter. Would like to get rid of this question if possible.
  2. Which domain name would you like to use? (wrfcloud.com)
  3. Enter host name for web application: [app.wrfcloud.com]
  4. Enter host name for REST API: [api.wrfcloud.com]:
  5. Enter host name for websocket API: [ws.wrfcloud.com]:
  6. Enter administrator's full name:
  7. Enter email address for application administrator:
  8. Enter administrator's new password:
  9. Do you want to install example model configurations (yes/no)? yes
  10. Do you want to upload an SSH public key for an admin? Without it, you cannot access your clusters to debug. (yes/no)?
  11. Paste your public key, often found at ${HOME}/.ssh/id_rsa.pub:
  12. Confirm via email. (Congratulations! You have successfully verified an email address. You can now start sending email from this address.) Please check your email johnhg@ucar.edu and click the link to confirm.
    • The installer then reports: Creating WRF Image... WrfIntelImageBuilder ... CREATE_IN_PROGRESS ... 2023-01-20 17:46:22 ...
    • It takes approximately 10-20 minutes for the website to become available.
    • Open the CloudFormation AWS service and monitor build progress via the 'Events' tab.
    • Also monitor CloudShell and watch for an alert telling me to go to the URL.
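
The 'Events' tab mentioned in step 12 can also be summarized in code. This is a minimal sketch, assuming the events were already fetched (for example with boto3's CloudFormation `describe_stack_events`, whose `StackEvents` records carry the fields used here). The stack name is not given in this thread, and the sample record is hypothetical apart from the `WrfIntelImageBuilder` / `CREATE_IN_PROGRESS` line quoted above.

```python
from datetime import datetime


def latest_status(events):
    """Return {LogicalResourceId: ResourceStatus} from the newest event per resource."""
    latest = {}
    # Sort oldest to newest so later events overwrite earlier ones.
    for event in sorted(events, key=lambda e: e["Timestamp"]):
        latest[event["LogicalResourceId"]] = event["ResourceStatus"]
    return latest


def still_building(events):
    """True while the newest status of any resource ends in IN_PROGRESS."""
    return any(s.endswith("_IN_PROGRESS") for s in latest_status(events).values())


# Hypothetical sample shaped like describe_stack_events()["StackEvents"].
sample_events = [
    {
        "LogicalResourceId": "WrfIntelImageBuilder",
        "ResourceStatus": "CREATE_IN_PROGRESS",
        "Timestamp": datetime(2023, 1, 20, 17, 46, 22),
    },
]
```

Polling `still_building` every minute or so would be one way to wait out the 10-20 minute build without watching the console.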

Need to add users.

Recommend documenting the process for...

  1. Pre-Requisites REQUIRES ADMIN PRIVILEGE ON AWS ACCOUNT
    • AWS Account
    • [x] Testing as johnhg-metplus in the METplus AWS Instance.
    • [ ] TODO: Define minimum set of permissions required.
    • Domain name in Route 53
    • [ ] TODO: select one
    • AWS account limit for “Running On-Demand All HPC instances” in us-east-2 must be 96 vCPUs or higher
    • [x] After logging in, selected the US East (Ohio) "us-east-2" region.
    • [ ] TODO: Figure out how to check and/or set this.
    • Simple Email Service (SES) must be out of the sandbox in us-east-2
    • [ ] TODO: Check this
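
The prerequisites above could be checked from a script rather than by clicking through the console. This is only a sketch under assumptions: `check_prerequisites` and `fetch_prerequisites` are hypothetical helper names, boto3 must be installed and credentialed for the fetch step, and the quota is matched by the substrings "On-Demand" and "HPC" because the exact quota name in the account may differ from the wording in this list.

```python
REQUIRED_HPC_VCPUS = 96  # threshold taken from the prerequisite list above


def check_prerequisites(hosted_zone_count, hpc_vcpu_limit, ses_production_access):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if hosted_zone_count < 1:
        problems.append("No hosted zone (domain) found in Route 53.")
    if hpc_vcpu_limit < REQUIRED_HPC_VCPUS:
        problems.append(
            f"HPC vCPU limit {hpc_vcpu_limit} is below {REQUIRED_HPC_VCPUS}; request an increase."
        )
    if not ses_production_access:
        problems.append("SES is still in the sandbox; request production access.")
    return problems


def fetch_prerequisites(region="us-east-2"):
    """Collect the three inputs from AWS (assumes boto3 and sufficient read permissions)."""
    import boto3  # third-party; imported lazily so the pure check above has no dependency

    zones = boto3.client("route53", region_name=region).list_hosted_zones()["HostedZones"]
    account = boto3.client("sesv2", region_name=region).get_account()
    quotas = boto3.client("service-quotas", region_name=region)
    hpc_limit = 0.0
    for page in quotas.get_paginator("list_service_quotas").paginate(ServiceCode="ec2"):
        for quota in page["Quotas"]:
            name = quota["QuotaName"]
            if "On-Demand" in name and "HPC" in name:  # assumed name match, see lead-in
                hpc_limit = max(hpc_limit, quota["Value"])
    return len(zones), hpc_limit, account["ProductionAccessEnabled"]
```

Usage would look like `print(check_prerequisites(*fetch_prerequisites()))`. A user without read access to these services would hit an authorization error in the fetch step, which is itself a useful signal for the "minimum set of permissions" TODO.
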
JohnHalleyGotway commented 1 year ago

METplus AWS Installation Steps (refer to the WRFCloud Recording):

  1. Log on to METplus AWS Instance.
  2. Confirm location is Ohio (i.e. US East (Ohio), us-east-2) and reconfirm with each step below.
  3. In Route 53 service confirm at least one domain is available in Hosted zones.
  4. In EC2 service, select Limits, search for HPC. Confirmed that the limit of 768 vCPUs > 96. If < 96, submit a request to AWS to increase that limit.
  5. In SES service (i.e. Amazon Simple Email Service), select Account dashboard and check for a banner message warning "Your Amazon SES account is in the sandbox in US East (Ohio)".
    • NOTE: METplus is in the Sandbox and I did "Request production access".
    • AWS responded requesting more details about our use case. @hahnd notes that we should:
      a. Confirm that we requested production access in the us-east-2 region.
      b. In the SES service, select Verified identities and confirm that we have at least one.
      c. In the Notifications tab of the verified identity, have Feedback forwarding enabled and have feedback notifications set to go to an SNS topic. This might not be required, but sometimes they will ask about it.

NOTE: I don't have sufficient permission in METplus for 5b above:

You do not have sufficient access to perform this action.
User: arn:aws:iam::707838134870:user/johnhg-metplus is not authorized to perform: ses:CreateEmailIdentity on resource: arn:aws:ses:us-east-2:707838134870:identity/metpluscloud.com because no identity-based policy allows the ses:CreateEmailIdentity action
  6. Clicked CloudShell icon from the top menu bar but did not have sufficient permission:
    Unable to start the environment. You don't have required permissions. Ask your IAM administrator for access to AWS CloudShell. System error: User: arn:aws:iam::707838134870:user/johnhg-metplus is not authorized to perform: cloudshell:CreateEnvironment on resource: arn:aws:cloudshell:us-east-2:707838134870:*
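
Both errors above name the exact IAM action that was denied, which is useful input for the "minimum set of permissions" TODO. A small sketch, assuming the error text keeps the "is not authorized to perform: ... on resource: ..." wording shown here; `denied_actions` is a hypothetical helper name.

```python
import re

# Matches the AWS authorization error wording quoted in the two messages above.
_DENIED = re.compile(
    r"User: (?P<user>\S+) is not authorized to perform: "
    r"(?P<action>\S+) on resource: (?P<resource>\S+)"
)


def denied_actions(log_text):
    """Return the sorted, de-duplicated IAM actions that were denied in log_text."""
    return sorted({match.group("action") for match in _DENIED.finditer(log_text)})


# The two error messages from this comment, pasted verbatim.
errors = """
User: arn:aws:iam::707838134870:user/johnhg-metplus is not authorized to perform: ses:CreateEmailIdentity on resource: arn:aws:ses:us-east-2:707838134870:identity/metpluscloud.com because no identity-based policy allows the ses:CreateEmailIdentity action
System error: User: arn:aws:iam::707838134870:user/johnhg-metplus is not authorized to perform: cloudshell:CreateEnvironment on resource: arn:aws:cloudshell:us-east-2:707838134870:*
"""

print(denied_actions(errors))
```

Collecting these across a full install attempt would give a starting list of actions to request from the IAM administrator.
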
JohnHalleyGotway commented 1 year ago

Met with Deidre on 2/1/23:

JohnHalleyGotway commented 1 year ago

On 2/7/23, made it almost all the way through the WRF Cloud install in the METplus AWS instance. But here's the error:

Please check your email johnhg@ucar.edu and click the link to confirm.
Traceback (most recent call last):
  File "/opt/python/bin/wrfcloud-setup", line 33, in <module>
    sys.exit(load_entry_point('wrfcloud==0.1.0', 'console_scripts', 'wrfcloud-setup')())
  File "/opt/python/lib/python3.9/site-packages/wrfcloud/setup/__init__.py", line 41, in setup
    _create_cluster_policy()
  File "/opt/python/lib/python3.9/site-packages/wrfcloud/setup/__init__.py", line 452, in _create_cluster_policy
    res = iam.create_policy(
  File "/opt/python/lib/python3.9/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/python/lib/python3.9/site-packages/botocore/client.py", line 960, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.EntityAlreadyExistsException: An error occurred (EntityAlreadyExists) when calling the CreatePolicy operation: A policy called wrfcloud_parallelcluster already exists. Duplicate names are not allowed.

I see this mentioned in the uninstall instructions and will work on fleshing those out as well. After fleshing out and following the uninstall instructions, I was then able to successfully install in METplus!
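
For reference, the `EntityAlreadyExists` failure could also be tolerated in the setup code instead of requiring an uninstall first. This is not what wrfcloud does (the fix in this thread was to follow the uninstall instructions); it is a sketch of one alternative, assuming a boto3 IAM client is passed in and the policy lives on the default path. `ensure_policy` is a hypothetical name.

```python
import json


def ensure_policy(iam, account_id, name, document):
    """Create a managed policy, or return the ARN of the one that already exists."""
    try:
        response = iam.create_policy(
            PolicyName=name,
            PolicyDocument=json.dumps(document),
        )
        return response["Policy"]["Arn"]
    except iam.exceptions.EntityAlreadyExistsException:
        # Customer managed policies on the default path "/" use this ARN form.
        return f"arn:aws:iam::{account_id}:policy/{name}"


# Example call (not executed here; requires boto3 and IAM permissions):
#   ensure_policy(boto3.client("iam"), "<account id>", "wrfcloud_parallelcluster", policy_doc)
```

The trade-off is that a leftover policy from a partial install is reused as-is, even if its document differs from the one being installed, so a clean uninstall remains the safer route.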

JohnHalleyGotway commented 1 year ago

Found that a previous partial install caused the following installation error. Listed below is direction from @hahnd:

The angular libraries should not be installed until after the layer zip file is created:

  ### Create WRF Cloud Build Artifacts
  create_wrfcloud_lambda_layer
  create_wrfcloud_lambda_function

  ### Compile angular web application
  install_angular14

Somehow, you ended up with angular installed before the lambda layer zip file was created. Probably from a previous install attempt.

We should either delete the Angular libraries before creating the lambda layer zip file:

  rm -Rf ~/.nvm/versions/node/v16.19.0/lib/node_modules/@angular

or be explicit in what we include in the lambda layer zip file:

  zip -r "${build_dir}/lambda_layer.zip" python/lib node/bin node/include node/share node/lib/node_modules/corepack node/lib/node_modules/npm

Either of these changes would go into the create_wrfcloud_lambda_layer function in install_bootstrap.sh.

John Halley Gotway 10:58 AM Makes sense. I think we should expect users to fail at least once in their installation attempts! Is there anything I should add to the uninstall steps? And which of these 2 options do you prefer? Is either a more robust solution?

David Hahn 10:59 AM I suppose the latter is probably better. It would protect against the rare case where users have other node libraries installed in their CloudShell environment. I think we should address it in the scripting and would not need to update the uninstall docs.
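
The allow-list approach preferred above can be illustrated in Python. The real change would be the `zip -r` line in install_bootstrap.sh; this sketch only shows the idea that anything outside the listed paths (such as a stray @angular directory) never reaches the archive. `build_layer_zip` is a hypothetical name, and unlike `zip -r` it stores regular file contents only (no directory entries, and symlinks are followed).

```python
import os
import zipfile

# Paths copied from the zip command proposed above.
LAYER_PATHS = [
    "python/lib",
    "node/bin",
    "node/include",
    "node/share",
    "node/lib/node_modules/corepack",
    "node/lib/node_modules/npm",
]


def build_layer_zip(root, zip_path, include=LAYER_PATHS):
    """Zip only the allow-listed paths under root; return the sorted archive names."""
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for rel in include:
            # os.walk yields nothing for a missing path, so absent entries are skipped.
            for dirpath, _dirs, files in os.walk(os.path.join(root, rel)):
                for filename in files:
                    full = os.path.join(dirpath, filename)
                    arcname = os.path.relpath(full, root).replace(os.sep, "/")
                    archive.write(full, arcname)
                    names.append(arcname)
    return sorted(names)
```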

JohnHalleyGotway commented 1 year ago

Finally got a successful build in the METplus AWS account!

WRF Cloud installation is complete.
Open your browser to https://app.metpluscloud.org

However, going to that URL results in the same odd "Download" behavior that David noted in the past with wrfcloud.com. @hahnd, please advise.

https://user-images.githubusercontent.com/21087144/217919260-e8b1bdf1-b893-4b69-a8ac-2014544f8ea7.mov

JohnHalleyGotway commented 1 year ago

Notes from 2/9/23:

Not sure why? No geo_em files exist for my manually added configuration, and perhaps the logic for automatically creating them as needed doesn't exist either?

hahnd commented 1 year ago

The website works for me now, even going to https://app.metpluscloud.org. I have my browser configured not to cache anything, so you might just need to clear your cache and try again.

Looks like I forgot to update the UI to validate the configuration name value. The API still validates the request, so you get an error. You can only use alphanumeric, -, and _ in the name. Need to add a bugfix/enhancement to address this issue.
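
The naming rule described above (alphanumeric, -, and _ only) is simple to express as a check. This is a sketch of the rule, not the wrfcloud API's actual validator; it additionally assumes an empty name is invalid, which the thread does not state.

```python
import re

# Letters, digits, hyphen, and underscore only; at least one character (assumption).
_CONFIG_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_config_name(name):
    """Return True if name uses only the characters the API is said to accept."""
    return _CONFIG_NAME.fullmatch(name) is not None
```

The same pattern could back a UI-side check so users see the problem before the API rejects the request.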

In general, you can find errors from the API in CloudWatch Logs. Find the log group for production_wrfcloud_handler and search for the reference ID:

[Screenshot: 2023-02-09 at 4:28:47 PM]
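
The same search can be scripted. A minimal sketch, assuming a boto3 CloudWatch Logs client is passed in as `logs`; the full log group name in a given account may carry a prefix beyond `production_wrfcloud_handler`, so treat that argument as something to look up first. `find_log_messages` is a hypothetical name.

```python
def find_log_messages(logs, log_group, reference_id):
    """Return all log messages in log_group that contain reference_id."""
    messages = []
    # Double quotes make the filter pattern match the ID as an exact phrase.
    kwargs = {"logGroupName": log_group, "filterPattern": f'"{reference_id}"'}
    while True:
        page = logs.filter_log_events(**kwargs)
        messages.extend(event["message"] for event in page.get("events", []))
        token = page.get("nextToken")
        if not token:
            return messages
        kwargs["nextToken"] = token  # keep paging until no token is returned


# Example call (not executed here; requires boto3 and read access to CloudWatch Logs):
#   find_log_messages(boto3.client("logs"), "<log group name>", "<reference ID>")
```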
JohnHalleyGotway commented 1 year ago

@hahnd thanks for the tip on the logs. From the perspective of installation instructions for this issue, I'm wondering if I need to document the procedure you described in Slack to avoid the odd "download" behavior. Or if just going to "/login" avoids it in the first place, then that'd be easier. But it'll take another round of testing to determine what's actually required.

I'll do that tomorrow.

If going to ".../login" avoids it entirely, that seems simpler. If it doesn't, then I'll write up your instructions from Slack.

The WRF job failed because there's no WRF AMI in my account to run. But I did get an AMI in prior testing. Any guesses as to the issue? I did merge recent changes from the develop branch into my feature branch today. Perhaps recent develop changes impact the AMI creation step?

hahnd commented 1 year ago

I want to check with AWS again on this behavior. It should not be necessary. Maybe something is missing when we upload the files. The root of the problem is that the mime type of the response is set to application/octet-stream instead of text/html, so the browser handles it differently.

Going to /login (or any other path) will actually download the exact same file. That is how CloudFront is configured, and it must be configured like that for the Angular application. The Angular Router handles content switching based on the path. However, your browser did not have /login cached, so the browser expected a different file and went to the server instead of its cache.

  $ curl https://app.metpluscloud.org/ 2> /dev/null | shasum -a 256
  e4e618d7c0dd9f9ae37cf62ac0feb02174d07902288c4e2a315642db5a8b395c  -
  $ curl https://app.metpluscloud.org/login/ 2> /dev/null | shasum -a 256
  e4e618d7c0dd9f9ae37cf62ac0feb02174d07902288c4e2a315642db5a8b395c  -
  $ curl https://app.metpluscloud.org/jobs/ 2> /dev/null | shasum -a 256
  e4e618d7c0dd9f9ae37cf62ac0feb02174d07902288c4e2a315642db5a8b395c  -
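
The same comparison can be done in Python, which also makes it easy to look at the Content-Type header that is suspected of causing the download behavior. A sketch using only the standard library; `fetch` needs network access and is not called here, and the helper names are hypothetical.

```python
import hashlib
import urllib.request


def sha256_hex(body):
    """Return the SHA-256 hex digest of a bytes body (what shasum -a 256 prints)."""
    return hashlib.sha256(body).hexdigest()


def all_identical(bodies):
    """True if every path in {path: body} returned byte-identical content."""
    return len({sha256_hex(body) for body in bodies.values()}) <= 1


def fetch(url):
    """Return (Content-Type header, body bytes) for url."""
    with urllib.request.urlopen(url) as response:
        return response.headers.get("Content-Type"), response.read()


# Example (requires network access):
#   for path in ("/", "/login/", "/jobs/"):
#       content_type, body = fetch("https://app.metpluscloud.org" + path)
#       print(path, content_type, sha256_hex(body))
```

If the printed Content-Type is application/octet-stream rather than text/html, that would match the root cause described in this comment.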

Is the AMI still building? If not, check for errors in CloudWatch Logs. Find the imagebuilder/wrf-4-4-0 Log Group and look at Log Streams in there.
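
Whether an AMI exists (and whether it is still pending) can also be checked directly. A sketch assuming a boto3 EC2 client for the install region is passed in; matching on the name fragment "wrf" is an assumption about how the image is named, and `find_images` is a hypothetical helper.

```python
def find_images(ec2, name_fragment="wrf"):
    """Return (ImageId, Name, State) for AMIs owned by this account whose name matches."""
    images = ec2.describe_images(Owners=["self"])["Images"]
    return [
        (image["ImageId"], image.get("Name", ""), image["State"])
        for image in images
        if name_fragment.lower() in image.get("Name", "").lower()
    ]


# Example call (not executed here; requires boto3 and ec2:DescribeImages permission):
#   find_images(boto3.client("ec2", region_name="us-east-2"))
```

An empty result with no build in progress would point back to the imagebuilder log streams mentioned above.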

JohnHalleyGotway commented 1 year ago

Remaining work to be done on the feature/126-install-docs branch.

JohnHalleyGotway commented 1 year ago

Feedback from @brukerd sent via email on 2/16/23. Thanks for the feedback! Since these are doc-only changes, I'll just commit them directly to the develop branch...

My suggestions: In the AWS Management Console, use the top-level search bar to find and launch the AWS IAM (Identity and Access Management) Service.

If you are unable to launch the AWS IAM Service, you do not have sufficient permissions.

Make the second bullet point a sub-bullet of the first one; it pertains to the first bullet point.

In Access management > Users, find and select your user name, and inspect the Permissions policies.

Ensure that you have AdministratorAccess, SystemAdministrator, or higher permissions.

Same as above. Make the second bullet a sub-bullet of the first one.

Anyplace you mention contacting AWS Support, if the user isn't utilizing the root account for their AWS VPC, they may need a support policy to be defined. By default, only the root account can make support requests. AWS documentation on creating the policy/policies is here: https://docs.aws.amazon.com/awssupport/latest/user/accessing-support.html

Select the US East (Ohio) / us-east-2 region from the top-right dropdown navigation. Is there a reason you're specifying a region like this instead of just mentioning they need to pick their closest region? This is potentially a limitation for non-US users. This is also something that they will need to take into account when making support requests for higher vCPU limits.