Closed arthurlobo closed 2 years ago
Hi @arthurlobo,
Insufficient-quota errors usually surface as "ResourceLimitExceeded" exceptions. The error you're getting looks like a problem with your credentials. I would look at https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html to make sure you've set up your AWS account credentials correctly.
If you have not run a job successfully with this account before, I would also recommend taking a look at the Hybrid Jobs getting started page https://docs.aws.amazon.com/braket/latest/developerguide/braket-jobs-first.html to ensure you've set up your account to run with Braket. Please verify that the account you're using for Braket is the same one you set up for the AWS CLI.
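One way to check which account a set of credentials resolves to is to ask AWS STS directly. The following is a minimal sketch, assuming boto3 is installed and that the Braket SDK picks up the same default credential chain as boto3; the helper names `caller_account` and `print_default_identity` are made up for illustration.

```python
def caller_account(sts_client):
    """Return (account_id, arn) for the credentials behind an STS client.

    `sts_client` is anything exposing get_caller_identity(), for example
    boto3.client("sts").
    """
    identity = sts_client.get_caller_identity()
    return identity["Account"], identity["Arn"]


def print_default_identity():
    """Print the account and ARN of the default credential chain.

    Not called automatically: it needs boto3 and valid AWS credentials.
    """
    import boto3  # imported lazily so caller_account() works without boto3

    account, arn = caller_account(boto3.client("sts"))
    print(f"account={account} arn={arn}")
```

Calling `print_default_identity()` from the same environment that runs the job script and comparing the account ID with the one shown in the AWS console should reveal a credentials mismatch, if there is one.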
I hope this helps. If you continue to see the same error, please let us know.
Thanks!
@krneta I meant to say Amazon Braket SDK instead of AWS CLI. Note that I was able to run the Bell pair circuit on Aspen-11, Aspen-M-2 and IonQ QPUs: Ref: https://aws.amazon.com/blogs/quantum-computing/setting-up-your-local-development-environment-in-amazon-braket/ and the training job from my first post ran on a ml.m5.2xlarge instance with the lightning.qubit simulator. Just the ml.p3.2xlarge instance with the lightning.gpu simulator does not work.
@arthurlobo, that is much stranger, and I am not able to reproduce it.
I would suggest you reach out to the AWS Support team (https://support.console.aws.amazon.com/support/home#/) and ask them to look into this issue. You can also ask them to raise your default resource limits, if that is the problem.
@krneta AWS Support wanted to know the instance category - (a notebook instance, a processing job instance or a training job instance, etc.). I specified training job instance - qty. 2 for ml.p3.16xlarge with region: us-east-1.
Earlier they had approved two SageMaker Notebook ml.p3.2xlarge instances.
Which type of instance were you using when you were not able to reproduce the issue?
Hi @arthurlobo ,
Yes, a training job instance is correct. I was using the ml.p3.2xlarge instance type, as you specified, for my Braket Hybrid Job.
Did it help with your issue?
@krneta AWS support increased my quota to two ml.p3.16xlarge training job instances but it did not help - gave the same AccessDeniedException. I have asked AWS Support to specifically enable the p3 instance for use within Amazon Braket Hybrid Jobs.
Hi @arthurlobo,
Sorry it took so long, but since you reached out to AWS customer service, I was able to get more information about what went wrong with your attempts. I've tried to increase your quota further, and I believe that should fix the issue. Could you please try again and see if the issue persists?
Thanks, Milan
@krneta I ran my script again. This time I get a ServiceQuotaExceeded Exception instead of the earlier AccessDenied Exception, for both the ml.p3.2xlarge and ml.p3.16xlarge instances. I have included the full output:
python parallelize_training.py
Traceback (most recent call last):
File "/media/arthurlobo/QML/amazon-braket-examples/examples/hybrid_jobs/5_Parallelize_training_for_QML/parallelize_training.py", line 27, in
Hi @arthurlobo, it looks like you still have a service quota of 0 for the ml.p3.2xlarge instance type. Can you verify again that your service quota increase has been approved for the particular instance type you are trying to run the job on?
Let us know if the issue still persists, so that we can get you unblocked.
@virajvchaudhari following is the email I got from AWS Support Accounts and Billing Team on August 19 (Case ID 10451636761):
Hi there,
Greetings from AWS!
Thank you for your patience while I was working on the request.
I am happy to inform you that the quota increase request has now been approved. For your convenience, I've mentioned the details below.
Service: SageMaker Notebook Instances
Region: US East (Northern Virginia)
Resource Type: Training Job Instances
Limit name: ml.p3.16xlarge
New limit value: 2
Please allow 30 minutes for these limits to be updated on your account.
Best regards, Nikitha Amazon Web Services
Is this sufficient information? If you need me to check on my side, how do I find out my service quota for the p3 instance?
Hi @arthurlobo,
Can you please try to run another job using an ml.p3.16xlarge instance (rather than the ml.p3.2xlarge instance) to see if that fixed the problem? We'll also look into whether we can grant you access to the ml.p3.2xlarge instances using your current customer support ticket.
@krneta I ran a job with the ml.p3.16xlarge instance and got the same ServiceQuotaExceeded Exception.
@krneta FYI I had also posted the issue on AWS re:Post including the AWS support details of the increased ml.p3.16xlarge job training instance quota and got the following response from Christian_M of the AWS Quantum Technologies Community Group:
"Hi there- Braket Hybrid Jobs is a separate product and does not relate to your SageMaker limits. Support would need to help you get unblocked for Hybrid Jobs specifically to get access to the desired resource. If you have an open support case feel free to use this statement as a follow-up in that case."
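The two exceptions seen in this thread point at different causes, and both concern Braket rather than SageMaker limits. A minimal sketch of how a script might turn the error code into a hint is shown below. It assumes the exception raised by `AwsQuantumJob.create` is a botocore `ClientError` carrying `response["Error"]["Code"]`, and that the quota error code is spelled `ServiceQuotaExceededException`; the helper name and hint texts are illustrative, not part of the Braket SDK.

```python
# Hints derived from this thread; the wording is illustrative only.
HINTS = {
    "AccessDeniedException": (
        "The account is not authorized for this resource in Braket Hybrid Jobs. "
        "SageMaker quota increases do not apply; contact AWS Support about Braket."
    ),
    "ServiceQuotaExceededException": (
        "The Braket quota for this instance type is too low (possibly 0). "
        "Request an increase under Service Quotas for Amazon Braket."
    ),
}


def explain_create_job_error(error):
    """Return (error_code, hint) for a ClientError-like exception.

    Only the `.response` dict is inspected, so any object with that
    attribute works, which keeps the helper usable without botocore.
    """
    response = getattr(error, "response", {}) or {}
    code = response.get("Error", {}).get("Code", "")
    return code, HINTS.get(code, "No hint available for this error code.")
```

In a job script this could wrap the `AwsQuantumJob.create(...)` call in a `try`/`except botocore.exceptions.ClientError as err` block and print the result of `explain_create_job_error(err)` before re-raising.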
@arthurlobo ,
It seems like things have changed since we started digging into this issue. Could you go to https://console.aws.amazon.com/servicequotas/home/services/braket/quotas, click on the instance types you want to use, and request a limit increase from there?
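The same quotas can also be read programmatically. Below is a minimal sketch, assuming boto3 is installed, that the Service Quotas service code for Braket is `braket` (as the console URL suggests), and that the instance quotas are named like "Maximum number of instances of ml.p3.16xlarge for jobs". The helper names are hypothetical.

```python
def instance_quotas(quotas, instance_type):
    """Map quota name -> value for entries whose name mentions instance_type.

    `quotas` is a list of dicts shaped like the "Quotas" entries returned by
    the Service Quotas ListServiceQuotas API.
    """
    return {
        q["QuotaName"]: q["Value"]
        for q in quotas
        if instance_type in q["QuotaName"]
    }


def fetch_braket_quotas(region="us-east-1"):
    """Fetch all Braket quotas for a region.

    Not called automatically: it needs boto3 and valid AWS credentials.
    """
    import boto3  # imported lazily so instance_quotas() works without boto3

    client = boto3.client("service-quotas", region_name=region)
    paginator = client.get_paginator("list_service_quotas")
    quotas = []
    for page in paginator.paginate(ServiceCode="braket"):  # assumed service code
        quotas.extend(page["Quotas"])
    return quotas
```

For example, `instance_quotas(fetch_braket_quotas(), "ml.p3.2xlarge")` would show whether the Braket job quota for that instance type is still 0 before a job is submitted.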
@krneta I requested a quota increase for 2 instances of ml.p3.16xlarge, Support Center case number: 10722283951
@krneta I received approval for Amazon Braket Maximum number of instances of ml.p3.16xlarge for jobs and submitted a training job using the pennylane lightning-gpu simulator. It is running.
@arthurlobo, I'm very happy to hear that. I hope your job runs successfully.
I know this took a long time to resolve, and I want to thank you for being patient with us (I'm sure it wasn't easy). If there are no further issues, please resolve/close this and any related (re:post) issues. If there are further issues, please don't hesitate to reach out again.
When I run the following code using Amazon Braket SDK on my local ARM processor I get the error message: botocore.errorfactory.AccessDeniedException: An error occurred (AccessDeniedException) when calling the CreateJob operation: This account is not authorized to use this resource. In order to access additional resources, please contact customer support.
from braket.jobs.config import InstanceConfig
from braket.aws import AwsSession
from braket.jobs.image_uris import Framework, retrieve_image

instance_config = InstanceConfig(instanceType='ml.p3.2xlarge')

hyperparameters = {
    "nwires": "10",
    "ndata": "64",
    "batch_size": "64",
    "epochs": "5",
    "gamma": "0.99",
    "lr": "0.1",
    "seed": "42",
}

input_file_path = "data/sonar.all-data"

image_uri = retrieve_image(Framework.PL_PYTORCH, AwsSession().region)

import time
from braket.aws import AwsQuantumJob

job = AwsQuantumJob.create(
    device="local:pennylane/lightning.gpu",
    source_module="qml_script",
    entry_point="qml_script.train_single",
    job_name="qml-single-" + str(int(time.time())),
    hyperparameters=hyperparameters,
    input_data={"input-data": input_file_path},
    instance_config=instance_config,
    image_uri=image_uri,
    wait_until_complete=False,
)
print(job.result())
The code was used from the following Amazon Braket example:
https://github.com/aws/amazon-braket-examples/blob/main/examples/hybrid_jobs/5_Parallelize_training_for_QML/Parallelize_training_for_QML.ipynb
Could the error be related to insufficient quota for the ml.p3.2xlarge SageMaker Notebook instance?