amazon-braket / amazon-braket-examples

Example notebooks that show how to apply quantum computing with Amazon Braket.
https://aws.amazon.com/braket/
Apache License 2.0

Parallelize_training_for_QML #181

Closed arthurlobo closed 2 years ago

arthurlobo commented 2 years ago

When I run the following code using the Amazon Braket SDK on my local ARM processor, I get this error message:

botocore.errorfactory.AccessDeniedException: An error occurred (AccessDeniedException) when calling the CreateJob operation: This account is not authorized to use this resource. In order to access additional resources, please contact customer support.

from braket.jobs.config import InstanceConfig
from braket.aws import AwsSession
from braket.jobs.image_uris import Framework, retrieve_image

instance_config = InstanceConfig(instanceType='ml.p3.2xlarge')

hyperparameters = {
    "nwires": "10",
    "ndata": "64",
    "batch_size": "64",
    "epochs": "5",
    "gamma": "0.99",
    "lr": "0.1",
    "seed": "42",
}

input_file_path = "data/sonar.all-data"

image_uri = retrieve_image(Framework.PL_PYTORCH, AwsSession().region)

import time
from braket.aws import AwsQuantumJob

job = AwsQuantumJob.create(
    device="local:pennylane/lightning.gpu",
    source_module="qml_script",
    entry_point="qml_script.train_single",
    job_name="qml-single-" + str(int(time.time())),
    hyperparameters=hyperparameters,
    input_data={"input-data": input_file_path},
    instance_config=instance_config,
    image_uri=image_uri,
    wait_until_complete=False,
)

print(job.result())

The code was used from the following Amazon Braket example:

https://github.com/aws/amazon-braket-examples/blob/main/examples/hybrid_jobs/5_Parallelize_training_for_QML/Parallelize_training_for_QML.ipynb

Could the error be related to insufficient quota for the ml.p3.2xlarge SageMaker Notebook instance?

krneta commented 2 years ago

Hi @arthurlobo,

Insufficient quota errors usually result in "ResourceLimitExceeded" exceptions. The error you get seems to be a problem with your credentials. I would look at https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html to make sure you've set up your AWS account credentials correctly.

If you have not run a job successfully with this account before, I would also recommend you take a look at the Hybrid Jobs getting started page https://docs.aws.amazon.com/braket/latest/developerguide/braket-jobs-first.html to ensure you've set up your account to run with Braket. Please verify that the account you're using for Braket is the same one you configured for the AWS CLI.
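One way to check which account and identity the local credentials resolve to is to ask STS through boto3. The following is a minimal sketch, assuming boto3 is installed and credentials are configured locally; the helper name `describe_caller` is hypothetical and not part of the Braket SDK.

```python
def describe_caller(sts_client=None):
    """Return the account ID and ARN behind the active AWS credentials."""
    if sts_client is None:
        # Imported lazily so the helper can also be exercised with a stub client.
        import boto3

        sts_client = boto3.client("sts")
    # GetCallerIdentity reports the identity that signed the request.
    identity = sts_client.get_caller_identity()
    return {"account": identity["Account"], "arn": identity["Arn"]}


# Usage (requires valid credentials and network access):
#     print(describe_caller())
```

If the account printed here differs from the one onboarded to Braket in the console, the SDK is picking up a different profile than expected.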

I hope this helps. If you continue to see the same error, please let us know.

Thanks!

arthurlobo commented 2 years ago

@krneta I meant to say Amazon Braket SDK instead of AWS CLI. Note that I was able to run the Bell pair circuit on the Aspen-11, Aspen-M-2 and IonQ QPUs (ref: https://aws.amazon.com/blogs/quantum-computing/setting-up-your-local-development-environment-in-amazon-braket/), and the training job from my first post ran on an ml.m5.2xlarge instance with the lightning.qubit simulator. Only the ml.p3.2xlarge instance with the lightning.gpu simulator does not work.

krneta commented 2 years ago

@arthurlobo, that is much stranger, and I am not able to reproduce it.

I would suggest you reach out to the AWS Support team (https://support.console.aws.amazon.com/support/home#/) to ask them to look into this issue. You can also ask them to raise your default resource limits, if that is the problem.

arthurlobo commented 2 years ago

@krneta AWS Support wanted to know the instance category - (a notebook instance, a processing job instance or a training job instance, etc.). I specified training job instance - qty. 2 for ml.p3.16xlarge with region: us-east-1.
Earlier they had approved two SageMaker Notebook ml.p3.2xlarge instances. Which type of instance were you using when you were not able to reproduce the issue?

krneta commented 2 years ago

Hi @arthurlobo ,

Yes, a training job instance is correct. I was using the ml.p3.2xlarge instance type, as you specified, for my Braket Hybrid Job.

Did it help with your issue?

arthurlobo commented 2 years ago

@krneta AWS Support increased my quota to two ml.p3.16xlarge training job instances, but it did not help: I got the same AccessDeniedException. I have asked AWS Support to specifically enable the p3 instance for use within Amazon Braket Hybrid Jobs.

krneta commented 2 years ago

Hi @arthurlobo,

Sorry it took so long, but since you reached out to AWS customer service, I was able to get more information about what went wrong with your attempts. I've tried to increase your quota further, and I believe that should fix the issue. Could you please try again and see if the issue persists?

Thanks, Milan

arthurlobo commented 2 years ago

"I've tried to increase your quota further, and I believe that should fix the issue. Could you please try again and see if the issue persists?"

@krneta I ran my script again. This time I get a ServiceQuotaExceededException instead of the earlier AccessDeniedException, for both the ml.p3.2xlarge and ml.p3.16xlarge instances. I have included the full output:

python parallelize_training.py
Traceback (most recent call last):
  File "/media/arthurlobo/QML/amazon-braket-examples/examples/hybrid_jobs/5_Parallelize_training_for_QML/parallelize_training.py", line 27, in <module>
    job = AwsQuantumJob.create(
  File "/home/arthurlobo/.conda/envs/braket/lib/python3.10/site-packages/braket/aws/aws_quantum_job.py", line 198, in create
    job_arn = aws_session.create_job(**create_job_kwargs)
  File "/home/arthurlobo/.conda/envs/braket/lib/python3.10/site-packages/braket/aws/aws_session.py", line 211, in create_job
    response = self.braket_client.create_job(**boto3_kwargs)
  File "/home/arthurlobo/.conda/envs/braket/lib/python3.10/site-packages/botocore/client.py", line 508, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/arthurlobo/.conda/envs/braket/lib/python3.10/site-packages/botocore/client.py", line 915, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.ServiceQuotaExceededException: An error occurred (ServiceQuotaExceededException) when calling the CreateJob operation: You have exceeded the service quota of 0 for instance ml.p3.2xlarge in the region us-east-1. Please reach out to AWS Support to increase the service quota for the instance type. In the meanwhile, you may wait for some of your other running jobs on ml.p3.2xlarge to complete before retrying job creation.
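The two failures reported in this thread can be told apart by the error code that botocore attaches to the exception. The sketch below is illustrative only: the helper name `explain_create_job_error` and the hint wording are assumptions, and the hints summarize what was observed in this thread rather than official guidance.

```python
def explain_create_job_error(error_response):
    """Map a botocore error response dict to a short troubleshooting hint."""
    code = error_response.get("Error", {}).get("Code", "")
    hints = {
        # Seen in this thread when the account was not yet enabled for the instance type.
        "AccessDeniedException": (
            "Account is not authorized for this resource; check credentials "
            "and ask AWS Support about Braket Hybrid Jobs access."
        ),
        # Seen in this thread when the Braket quota for the instance type was 0.
        "ServiceQuotaExceededException": (
            "Braket quota for this instance type is exhausted or zero; "
            "request an increase under the Braket service quotas."
        ),
    }
    return hints.get(code, f"Unhandled error code: {code or 'unknown'}")


# Usage sketch around job creation (ClientError exposes the parsed response):
#     from botocore.exceptions import ClientError
#     try:
#         job = AwsQuantumJob.create(...)
#     except ClientError as err:
#         print(explain_create_job_error(err.response))
```

Both exception classes in the traceback are generated subclasses of `botocore.exceptions.ClientError`, so a single `except` clause can cover them.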

virajvchaudhari commented 2 years ago

Hi @arthurlobo, it looks like you still have a service quota of 0 for the ml.p3.2xlarge instance type. Can you verify again whether your service quota increase has been approved for the particular instance type you are trying to run the job on?

Let us know if the issue still persists, so that we can get you unblocked.

arthurlobo commented 2 years ago

@virajvchaudhari following is the email I got from AWS Support Accounts and Billing Team on August 19 (Case ID 10451636761):

Hi there,

Greetings from AWS!

Thank you for your patience while I was working on the request.

I am happy to inform you that the quota increase request has now been approved. For your convenience, I've mentioned the details below.

Service: SageMaker Notebook Instances
Region: US East (Northern Virginia)
Resource Type: Training Job Instances
Limit name: ml.p3.16xlarge
New limit value: 2

Please allow 30 minutes for these limits to be updated on your account.

Best regards, Nikitha Amazon Web Services

Is this sufficient information? If you need me to check on my side, how do I find out my service quota for the p3 instance?

krneta commented 2 years ago

Hi @arthurlobo,

Can you please try to run another job using an ml.p3.16xlarge instance (rather than the ml.p3.2xlarge instance) to see if that fixes the problem? We'll also look into whether we can grant you access to the ml.p3.2xlarge instances using your current customer support ticket.

arthurlobo commented 2 years ago

@krneta I ran a job with the ml.p3.16xlarge instance and got the same ServiceQuotaExceeded Exception.

arthurlobo commented 2 years ago

@krneta FYI I had also posted the issue on AWS re:Post including the AWS support details of the increased ml.p3.16xlarge job training instance quota and got the following response from Christian_M of the AWS Quantum Technologies Community Group:

"Hi there- Braket Hybrid Jobs is a separate product and does not relate to your SageMaker limits. Support would need to help you get unblocked for Hybrid Jobs specifically to get access to the desired resource. If you have an open support case feel free to use this statement as a follow-up in that case."

krneta commented 2 years ago

@arthurlobo ,

It seems like things have changed since we started digging into this issue. Would it be possible for you to go to https://console.aws.amazon.com/servicequotas/home/services/braket/quotas, click on the instances you want to use, and request a limit increase from there?
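The same Braket quotas can also be read programmatically through the Service Quotas API. This is a minimal sketch, assuming boto3 is installed, that the service code is `braket` (as in the console URL above), and that the quota names contain the instance type string; the helper name `find_braket_quotas` is hypothetical.

```python
def find_braket_quotas(instance_type, quotas_client=None):
    """Return {quota name: value} for Braket quotas that mention instance_type."""
    if quotas_client is None:
        # Imported lazily so the helper can also be exercised with a stub client.
        import boto3

        # Quotas are regional; the client uses the session's default region.
        quotas_client = boto3.client("service-quotas")
    matches = {}
    paginator = quotas_client.get_paginator("list_service_quotas")
    for page in paginator.paginate(ServiceCode="braket"):
        for quota in page.get("Quotas", []):
            # Assumption: the instance type appears verbatim in the quota name.
            if instance_type in quota["QuotaName"]:
                matches[quota["QuotaName"]] = quota["Value"]
    return matches


# Usage (requires valid credentials and network access):
#     print(find_braket_quotas("ml.p3.2xlarge"))
```

A value of 0 for the relevant entry would be consistent with the ServiceQuotaExceededException reported earlier, and a SageMaker quota increase would not be expected to change it.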

arthurlobo commented 2 years ago

@krneta I requested a quota increase for 2 instances of ml.p3.16xlarge, Support Center case number: 10722283951

arthurlobo commented 2 years ago

@krneta I received approval for Amazon Braket Maximum number of instances of ml.p3.16xlarge for jobs and submitted a training job using the pennylane lightning-gpu simulator. It is running.

krneta commented 2 years ago

@arthurlobo, I'm very happy to hear that. I hope your job runs successfully.

I know this took a long time to resolve, and I want to thank you for being patient with us (I'm sure it wasn't easy). If there are no further issues, please resolve/close this and any related (re:Post) issues. If there are further issues, please don't hesitate to reach out again.