Closed: fossell closed this issue 1 year ago
On 1/20/23, @JohnHalleyGotway met with @hahnd to reinstall the existing wrfcloud instance. Made several notes about that process (details below) and recorded the meeting (see Shared Google Drive).
On 1/25/23, @JohnHalleyGotway will step through the instructions to install WRFCloud as johnhg-metplus in the METplus AWS Instance.
@JohnHalleyGotway will update the documentation on the feature/auto-install branch directly with details from installing in the WRFCloud and METplus AWS Instances.
Details...
Recommend running as an AWS user with 'AdministratorAccess' which is available to every account. Amazon Resource Name = 'arn:aws:iam::aws:policy/AdministratorAccess'
Tested the installation instructions within the existing WRFCloud AWS instance as a member of the 'admin' group, which is a custom group defined for this instance rather than a standard one.
-- was converted to a hyphen.
git clone --branch feature/auto-install https://github.com/NCAR/wrfcloud
Note that feature/auto-install will be replaced by a tagged version number.
Recommend defining the tag or branch name as an environment variable at the beginning of the instructions (e.g. $WRFCLOUD_BUILD_VERSION?).
The compilation of Python 3.9 takes a long time during the bootstrap process. Consider searching for a pre-built package through yum for this instead.
Verify the install.
Questions to be answered:
Need to add users.
Recommend documenting the process for...
METplus AWS Installation Steps (refer to the WRFCloud Recording):
Select the Ohio region (i.e. US East (Ohio), us-east-2) and reconfirm with each step below.
In the Route 53 service, confirm at least one domain is available in Hosted zones.
In the EC2 service, select Limits, search for HPC. Confirmed that the limit of 768 vCPUs > 96. If < 96, submit a request to AWS to increase that limit.
In the SES service (i.e. Amazon Simple Email Service), select Account dashboard and check for a banner message warning "Your Amazon SES account is in the sandbox in US East (Ohio)".
a. ... us-east-2 region.
b. In the SES service, select Verified identities and confirm that we have at least one.
c. In the Notifications tab of the verified identity, have Feedback forwarding enabled, and have feedback notifications set to go to an SNS topic. Might not be required, but sometimes they will ask about it.
NOTE: I don't have sufficient permission in METplus for 5b above (SES Verified identities):
You do not have sufficient access to perform this action.
User: arn:aws:iam::707838134870:user/johnhg-metplus is not authorized to perform: ses:CreateEmailIdentity on resource: arn:aws:ses:us-east-2:707838134870:identity/metpluscloud.com because no identity-based policy allows the ses:CreateEmailIdentity action
Selected the CloudShell icon from the top menu bar but did not have sufficient permission:
Unable to start the environment. You don't have required permissions. Ask your IAM administrator for access to AWS CloudShell. System error: User: arn:aws:iam::707838134870:user/johnhg-metplus is not authorized to perform: cloudshell:CreateEnvironment on resource: arn:aws:cloudshell:us-east-2:707838134870:*
Met with Deidre on 2/1/23:
On 2/7/23, made it almost all the way through the WRF Cloud install in the METplus AWS instance. But here's the error:
Please check your email johnhg@ucar.edu and click the link to confirm.
Traceback (most recent call last):
  File "/opt/python/bin/wrfcloud-setup", line 33, in <module>
    sys.exit(load_entry_point('wrfcloud==0.1.0', 'console_scripts', 'wrfcloud-setup')())
  File "/opt/python/lib/python3.9/site-packages/wrfcloud/setup/__init__.py", line 41, in setup
    _create_cluster_policy()
  File "/opt/python/lib/python3.9/site-packages/wrfcloud/setup/__init__.py", line 452, in _create_cluster_policy
    res = iam.create_policy(
  File "/opt/python/lib/python3.9/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/python/lib/python3.9/site-packages/botocore/client.py", line 960, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.EntityAlreadyExistsException: An error occurred (EntityAlreadyExists) when calling the CreatePolicy operation: A policy called wrfcloud_parallelcluster already exists. Duplicate names are not allowed.
I see this mentioned in the uninstall instructions and will work on fleshing those out as well. After fleshing out and following the uninstall instructions, I was then able to successfully install in METplus!
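For reference, the EntityAlreadyExists failure above could also be tolerated in code rather than only cleaned up by an uninstall. The following is a minimal sketch, not how wrfcloud-setup currently behaves: `ensure_policy` is a hypothetical helper, `iam` is assumed to be a boto3 IAM client, and the ARN is rebuilt on the assumption that the policy was created with the default path.

```python
import json


def ensure_policy(iam, name, document, account_id):
    """Create an IAM policy, or fall back to the existing one of the same name.

    `iam` is assumed to be a boto3 IAM client, e.g. boto3.client("iam").
    Returns (policy_arn, created), where created is False if the policy
    was already present.
    """
    try:
        res = iam.create_policy(PolicyName=name, PolicyDocument=json.dumps(document))
        return res["Policy"]["Arn"], True
    except iam.exceptions.EntityAlreadyExistsException:
        # Left over from an earlier (possibly partial) install.
        # Assumption: the policy uses the default path "/", so its ARN is predictable.
        return f"arn:aws:iam::{account_id}:policy/{name}", False
```

Reusing a leftover policy keeps whatever policy document the earlier attempt wrote, so following the uninstall steps first is still the cleaner route when the document may have changed.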
Found that a previous partial install caused the following installation error. Listed below is direction from @hahnd:
The angular libraries should not be installed until after the layer zip file is created:
### Create WRF Cloud Build Artifacts
create_wrfcloud_lambda_layer
create_wrfcloud_lambda_function
### Compile angular web application
install_angular14
Somehow, you ended up with angular installed before the lambda layer zip file was created. Probably from a previous install attempt.
We should either delete the angular libraries (rm -Rf ~/.nvm/versions/node/v16.19.0/lib/node_modules/@angular) before creating the lambda layer zip file, or be explicit in what we include in the lambda layer zip file:
zip -r "${build_dir}/lambda_layer.zip" python/lib node/bin node/include node/share node/lib/node_modules/corepack node/lib/node_modules/npm
Either of these changes would go into the create_wrfcloud_lambda_layer function in install_bootstrap.sh.
John Halley Gotway 10:58 AM Makes sense. I think we should expect users to fail at least once in their installation attempts! Is there anything I should add to the uninstall steps? And which of these 2 options do you prefer? Is either a more robust solution?
David Hahn 10:59 AM I suppose the latter is probably better. It would protect against the rare case where users have other node libraries installed in their CloudShell environment. I think we should address it in the scripting and would not need to update the uninstall docs.
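To illustrate the "be explicit" option preferred above, here is a rough Python sketch of an allow-list archive. The real change would go into the shell function in install_bootstrap.sh; `build_layer_zip` and `LAYER_PATHS` are hypothetical names, and the sub-paths are copied from the zip command quoted earlier.

```python
import zipfile
from pathlib import Path

# Sub-paths copied from the zip command quoted in the thread.
LAYER_PATHS = [
    "python/lib",
    "node/bin",
    "node/include",
    "node/share",
    "node/lib/node_modules/corepack",
    "node/lib/node_modules/npm",
]


def build_layer_zip(root, zip_path, include=LAYER_PATHS):
    """Zip only the listed sub-paths of `root`; everything else is left out."""
    root = Path(root)
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for rel in include:
            base = root / rel
            if not base.exists():
                continue  # tolerate a listed directory that is not present
            if base.is_file():
                files = [base]
            else:
                files = sorted(p for p in base.rglob("*") if p.is_file())
            for path in files:
                arcname = path.relative_to(root).as_posix()
                zf.write(path, arcname)
                added.append(arcname)
    return added
```

With an allow-list, a stray node_modules/@angular directory (or any other node library already present in the environment) is never picked up, which is the case the second option is meant to protect against.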
Finally got a successful build in the METplus AWS account!
WRF Cloud installation is complete.
Open your browser to https://app.metpluscloud.org
However, going to that URL results in the same odd "Download" behavior that David noted in the past with wrfcloud.com. @hahnd, please advise.
Notes from 2/9/23:
Not sure why? No geo_em files exist for my manually added configuration, and perhaps the logic for automatically creating them as needed doesn't either?
The website works for me now, even going to https://app.metpluscloud.org. I have my browser configured not to cache anything, so you might just need to clear your cache and try again.
Looks like I forgot to update the UI to validate the configuration name value. The API still validates the request, so you get an error. You can only use alphanumeric, -, and _ in the name. Need to add a bugfix/enhancement to address this issue.
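The naming rule described above (alphanumeric, -, and _ only) can be expressed as a single pattern. This is a small Python sketch of the rule itself, not the actual API or UI code; `is_valid_config_name` is a hypothetical name, and whether the API also enforces a length limit is not stated here.

```python
import re

# Letters, digits, hyphen, and underscore only; at least one character.
_CONFIG_NAME_RE = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_config_name(name):
    """Return True if `name` uses only alphanumeric characters, '-', and '_'."""
    return bool(_CONFIG_NAME_RE.fullmatch(name))
```

The same check in the UI would let the user see the problem before the request reaches the API.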
In general, you can find errors from the API in CloudWatch Logs. Find the log group for production_wrfcloud_handler and search for the reference ID:
@hahnd thanks for the tip on the logs. From the perspective of installation instructions for this issue, I'm wondering if I need to document the procedure you described in Slack to avoid the odd "download" behavior. Or if just going to "/login" avoids it in the first place then that'd be easier. But it'll take another round of testing to determine what's actually required.
I'll do that tomorrow.
If going to ".../login" avoids it entirely, that seems simpler. If it doesn't, then I'll write up your instructions from Slack.
The WRF job failed because there's no wrf AMI in my account to run. But I did get an AMI in prior testing. Any guesses as to the issue? I did merge in recent changes from the develop branch today into my feature branch. Perhaps recent develop changes impact the AMI creation step?
I want to check with AWS again on this behavior. It should not be necessary. Maybe something is missing when we upload the files. The root of the problem is that the mime type of the response is set to application/octet-stream instead of text/html, so the browser handles it differently.
Going to /login (or any other path) will actually download the exact same file. That is how CloudFront is configured and must be configured like that for the Angular application. The Angular Router handles content switching based on the path. However, your browser did not have /login cached, so the browser expected a different file and went to the server instead of its cache.
$ curl https://app.metpluscloud.org/ 2> /dev/null | shasum -a 256
e4e618d7c0dd9f9ae37cf62ac0feb02174d07902288c4e2a315642db5a8b395c  -
$ curl https://app.metpluscloud.org/login/ 2> /dev/null | shasum -a 256
e4e618d7c0dd9f9ae37cf62ac0feb02174d07902288c4e2a315642db5a8b395c  -
$ curl https://app.metpluscloud.org/jobs/ 2> /dev/null | shasum -a 256
e4e618d7c0dd9f9ae37cf62ac0feb02174d07902288c4e2a315642db5a8b395c  -
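The same check can be scripted so that it also reports the Content-Type header, which is the suspected root cause (application/octet-stream instead of text/html). A minimal sketch using only the Python standard library; `fetch_summary`, `sha256_hex`, and `same_content` are hypothetical helper names.

```python
import hashlib
import urllib.request


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest, comparable to the output of `shasum -a 256`."""
    return hashlib.sha256(data).hexdigest()


def fetch_summary(url, timeout=10):
    """Return (Content-Type header, SHA-256 of the body) for a GET of `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.headers.get("Content-Type"), sha256_hex(resp.read())


def same_content(summaries):
    """True when every (content_type, digest) pair carries the same digest."""
    return len({digest for _, digest in summaries}) <= 1
```

Calling `fetch_summary` for the `/`, `/login/`, and `/jobs/` paths of the app URL and passing the results to `same_content` reproduces the hash comparison, and printing the first element of each pair shows which MIME type the server is actually returning.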
Is the AMI still building? If not, check for errors in CloudWatch Logs. Find the imagebuilder/wrf-4-4-0 Log Group and look at Log Streams in there.
Remaining work to be done on the feature/126-install-docs branch.
app.metpluscloud.org to confirm that.
Feedback from @brukerd sent via email on 2/16/23. Thanks for the feedback! Since these are doc-only changes, I'll just commit them directly to the develop branch...
My suggestions: In the AWS Management Console, use the top-level search bar to find and launch the AWS IAM (Identity and Access Management) Service.
If you are unable to launch the AWS IAM Service, you do not have sufficient permissions.
Make the second bullet point a sub-bullet of the first one; it pertains to the first bullet point.
In Access management > Users, find and select your user name, and inspect the Permissions policies.
Ensure that you have AdministratorAccess, SystemAdministrator, or higher permissions.
Same as above. Make the second bullet a sub-bullet of the first one.
Anyplace you mention contacting AWS Support, if the user isn't utilizing the root account for their AWS VPC, they may need a support policy to be defined. By default, only the root account can make support requests. AWS documentation on creating the policy/policies is here: https://docs.aws.amazon.com/awssupport/latest/user/accessing-support.html
Select the US East (Ohio) / us-east-2 region from the top-right dropdown navigation. Is there a reason you're specifying a region like this instead of just mentioning they need to pick their closest region? This is potentially a limitation for non-US users. This is also something that they will need to take into account when making support requests for higher vCPU limits.
Describe the Task
Document the steps to install the system in the User's Guide.
These documentation updates should be done in conjunction with testing the installation of WRF Cloud within the METplus AWS account. After that installation is complete, be sure to test by running WRF with sub-hourly output. That will test that #113 works as expected.
Time Estimate
1 day