determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
https://determined.ai
Apache License 2.0

🐛[bug] #7909

Open humbleearth opened 11 months ago

humbleearth commented 11 months ago

Describe the bug

I am trying to deploy Determined to AWS using the det deploy command. When I do, it shows the error below. There is no cloud formation stack created or failing. The prerequisites say it only needs CloudFormation API access and a key, but from the error it seems to need some other permissions too. It is trying to check a service quota on ec2/L-DB2E81BA. Can you please share more info on this? Also is there a direct link to a cloud formation template for determined which can be run using cloud formation command line?

Failed to check AWS instance quota: failed to fetch service quota: An error occurred (AccessDeniedException) when calling the GetServiceQuota operation: User: arn:aws-region:sts::123123213123:assumed-role/mlops is not authorized to perform: servicequotas:GetServiceQuota on resource: arn:aws-region:servicequotas:region:123123123:ec2/L-DB2E81BA with an explicit deny in an identity-based policy
Starting Determined Deployment
Determined Version: 0.25.1
Stack Name: determined
AWS Region: region
Keypair: determined
Checking if the SSH Keypair (determined) exists: True
Checking if the CloudFormation Stack (determined) exists: False - Creating Stack
Creating stack determined. This may take a few minutes... Check the CloudFormation Console for updates
Stack Deployment Failed. Check the AWS CloudFormation Console for details.

Reproduction Steps

  1. det deploy aws up --cluster-id determined --keypair determined --deployment-type simple

Expected Behavior

determined cluster should deploy without any permission error check.

Screenshot

determined cluster deployed

Environment

Additional Context

No response

ioga commented 11 months ago

hello,

There is no cloud formation stack created or failing.

This is unexpected. Are you sure you were looking at the right AWS account and region?

determined cluster should deploy without any permission error check.

the quota check is informational only. per the log output, after the quota error occurred, the code went on to try to create the stack.
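
To illustrate what an "informational only" check looks like, here is a minimal sketch, not the actual det deploy code. It assumes a client object exposing `get_service_quota(ServiceCode=..., QuotaCode=...)`, as `boto3.client("service-quotas")` does. The function name `check_quota_best_effort` and the `DeniedClient` stand-in are hypothetical and exist only for this example.

```python
def check_quota_best_effort(client, service_code="ec2", quota_code="L-DB2E81BA"):
    """Return the quota value, or None if the lookup fails for any reason."""
    try:
        resp = client.get_service_quota(ServiceCode=service_code, QuotaCode=quota_code)
        return resp["Quota"]["Value"]
    except Exception as exc:  # e.g. AccessDeniedException caused by an explicit deny
        # Report the failure, but do not abort the deployment.
        print(f"Failed to check AWS instance quota: {exc}")
        return None


class DeniedClient:
    """Hypothetical stand-in for a service-quotas client that always denies."""

    def get_service_quota(self, ServiceCode, QuotaCode):
        raise PermissionError("AccessDeniedException: explicit deny")


quota = check_quota_best_effort(DeniedClient())
# Execution continues past the failed check, as in the log above.
print("Starting Determined Deployment")
```

With a real client, a denied `servicequotas:GetServiceQuota` call would take the same path: the message is printed and stack creation is still attempted.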

Also is there a direct link to a cloud formation template for determined which can be run using cloud formation command line?

https://github.com/determined-ai/determined/blob/main/harness/determined/deploy/aws/templates/simple.yaml
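
If you want to bypass det deploy and submit that template yourself, a rough sketch of the request is below. It only builds the arguments for the CloudFormation `create_stack` call. The helper name `build_create_stack_args` is hypothetical, and the `Keypair` parameter name and the IAM capabilities are assumptions that should be checked against the Parameters and Resources sections of simple.yaml.

```python
def build_create_stack_args(stack_name, template_body, parameters):
    """Build keyword arguments for a CloudFormation create_stack request."""
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        # CloudFormation expects parameters as a list of key/value records.
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": str(value)}
            for key, value in parameters.items()
        ],
        # Assumption: the template creates IAM resources, which must be acknowledged.
        "Capabilities": ["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
    }


# "Keypair" is an assumed parameter name; the template body here is a placeholder.
args = build_create_stack_args(
    "determined", "<contents of simple.yaml>", {"Keypair": "determined"}
)

# With boto3 installed and credentials configured, this could be submitted as:
#   import boto3
#   boto3.client("cloudformation").create_stack(**args)
```

Going this route skips everything det deploy does around the stack, including the quota check, so the role only needs the permissions CloudFormation itself requires for the resources in the template.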

humbleearth commented 11 months ago

Yes, I think I checked the right account and region.

humbleearth commented 11 months ago

While checking the govcloud.yaml file, I see a usage of AgentConfigFileContents variable parameter which is not defined anywhere.

govcloud.yaml

ioga commented 11 months ago

While checking the govcloud.yaml file, I see a usage of AgentConfigFileContents variable parameter which is not defined anywhere.

govcloud.yaml

thank you for letting us know. we don't have any testing infrastructure for govcloud and mostly rely on user contributions for its maintenance.

this shouldn't be a problem for the simple, efs or fsx templates though. is your issue resolved?

humbleearth commented 11 months ago

No, the issue still persists. The det deploy output does not match the expected CloudFormation template. Also the gov cloud region is not there inside the ami mapping even if passed to the det command. So there are a lot of inconsistencies.

ioga commented 11 months ago

I'll add an internal ticket to track the AgentConfigFileContents issue.

Also the gov cloud region is not there inside the ami mapping even if passed to the det command.

I assume that was fixed by using govcloud.yaml instead of the simple one.