Hi @llealgt, thank you for submitting a bug report!
Given the issue signature, your IAM user/role may not have write permissions to the S3 bucket being used. I was not able to reproduce on my normal setup; however, using another S3 bucket with no write access, I did get stuck at the same place.
You can adjust the Python logging level to ERROR to see if you are getting a similar exception from the prepare_movielens_data command.
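For example, a minimal sketch (reusing the neptune_ml module and s3_bucket_uri variable already defined in the notebook):

import logging

# Make sure ERROR-level messages from the notebook utilities are printed
logging.basicConfig(level=logging.ERROR)

response = neptune_ml.prepare_movielens_data(s3_bucket_uri)
print(response)  # None typically means the copy to S3 did not complete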
Hi @michaelnchin, thanks for your reply. Indeed it is an "access denied" issue; however, I never considered that possibility because I was able to run the other (non-ML) example notebooks using the same S3 bucket. Is there any difference in how this one downloads and copies the data compared to the other notebooks?
Hi @michaelnchin, thanks for your help. I found that write permission had been revoked and the role was read-only; that's why it worked before but not now. We can close this issue.
Hi @michaelnchin, I know I closed this an hour ago, but I found another issue. When running:
endpoints=neptune_ml.setup_pretrained_endpoints(s3_bucket_uri, setup_node_classification, setup_node_regression, setup_link_prediction, setup_edge_classification, setup_edge_regression)
the result is None, and I see the following in the logs:
ERROR:root:Unable to determine the Neptune ML IAM Role.
Any suggestions on how to fix it, or how to debug it?
You need to include the Neptune ML IAM role (created and attached to your Neptune cluster as documented here) in your local .bashrc file.
Use the following command to add it:
echo "export NEPTUNE_ML_ROLE_ARN=[YOUR_NEPTUNE_ML_IAM_ROLE_ARN]" >> ~/.bashrc
For other ML notebooks, some neptune_ml commands also look in .bashrc for your Neptune Export service endpoint, so you should also include this using something similar to:
echo "export NEPTUNE_EXPORT_API_URI=https://3ui13o134.execute-api.us-east-1.amazonaws.com/v1/neptune-export" >> ~/.bashrc
Hi @michaelnchin, I had created and attached the role per the document you shared last week, so I guess what I'm missing is the work on the .bashrc file (maybe I missed it, but I did not see that in the documentation). I will try that today. Thanks!
Hi @michaelnchin, I tested adding the export you suggested to the .bashrc file. It seems to have solved the previous error; now I get a new one:
ERROR:root:An error occurred (AccessDeniedException) when calling the CreateModel operation: User: arn:aws:sts::732279101103:assumed-role/AWSNeptuneNotebookRole-neptune-test2-lleal-role/SageMaker is not authorized to perform: sagemaker:CreateModel on resource: arn:aws:sagemaker:us-east-1:732279101103:model/classifi-2023-02-13-23-16-27 because no identity-based policy allows the sagemaker:CreateModel action
Here AWSNeptuneNotebookRole-neptune-test2-lleal-role is a role created during notebook configuration by choosing "create an IAM role", so I guess this role needs some permissions added. Is there a list of permissions suggested for this role?
Yes, your Sagemaker notebook also requires additional permissions. Attach a policy containing the following to your notebook's IAM role:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Resource": "arn:aws:cloudwatch:[AWS_REGION]:[AWS_ACCOUNT_ID]:*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:DescribeLogStreams",
                "logs:PutLogEvents",
                "logs:GetLogEvents"
            ],
            "Resource": "arn:aws:logs:[AWS_REGION]:[AWS_ACCOUNT_ID]:*",
            "Effect": "Allow"
        },
        {
            "Action": "neptune-db:*",
            "Resource": "arn:aws:neptune-db:[AWS_REGION]:[AWS_ACCOUNT_ID]:[CLUSTER_RESOURCE_ID]/*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:Put*",
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": "arn:aws:s3:::*",
            "Effect": "Allow"
        },
        {
            "Action": "execute-api:Invoke",
            "Resource": "arn:aws:execute-api:[AWS_REGION]:[AWS_ACCOUNT_ID]:*/*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "sagemaker:CreateModel",
                "sagemaker:CreateEndpointConfig",
                "sagemaker:CreateEndpoint",
                "sagemaker:DescribeModel",
                "sagemaker:DescribeEndpointConfig",
                "sagemaker:DescribeEndpoint",
                "sagemaker:DeleteModel",
                "sagemaker:DeleteEndpointConfig",
                "sagemaker:DeleteEndpoint"
            ],
            "Resource": "arn:aws:sagemaker:[AWS_REGION]:[AWS_ACCOUNT_ID]:*/*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "iam:PassRole"
            ],
            "Resource": "[YOUR_NEPTUNE_ML_IAM_ROLE_ARN]",
            "Effect": "Allow"
        }
    ]
}
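If you prefer to attach this from code rather than the IAM console, a sketch along these lines should work (it assumes boto3 credentials that are allowed to modify IAM; the policy file name and policy name are placeholders):

import boto3

iam = boto3.client("iam")

# "notebook_policy.json" is a placeholder file holding the JSON statements above
with open("notebook_policy.json") as f:
    policy_document = f.read()

# Add the statements as an inline policy on the notebook's IAM role
iam.put_role_policy(
    RoleName="AWSNeptuneNotebookRole-neptune-test2-lleal-role",
    PolicyName="NeptuneMLNotebookAccess",  # any policy name you like
    PolicyDocument=policy_document,
)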
Thanks! It works. Now I'm trying to run the notebook Neptune-ML-01-Introduction-to-Node-Classification-Gremlin, and it fails at
%%neptune_ml export start --export-url {neptune_ml.get_export_service_host()} --export-iam --wait --store-to export_results
${export_params}
but returns the following error (it seems the notebook cannot reach the export service):
Max retries exceeded with url: /v1/neptune-export (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fe5b46f8a10>: Failed to establish a new connection: [Errno 110] Connection timed out'))"))}
I have gone through the guide twice (https://docs.aws.amazon.com/neptune/latest/userguide/export-service.html) with no luck. Is there anything else needed to run the notebook, like configuring an additional permission for the notebook instance?
Please verify that your Neptune cluster, Export service, and Sagemaker notebook instance are all in the same VPC and have the subnets/security groups configured correctly.
One other specific thing to check: please also ensure that the NeptuneExportSecurityGroup created by the Export stack is attached to the Sagemaker instance:
Enable access to the Neptune-Export endpoint from a VPC-based EC2 instance
If you make your Neptune-Export endpoint VPC-only, you can only access it from within the VPC in which the Neptune-Export service is installed. To allow connectivity from an Amazon EC2 instance in the VPC from which you can make Neptune-Export API calls, attach the NeptuneExportSecurityGroup created by the AWS CloudFormation stack to that Amazon EC2 instance.
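One quick way to see which subnet and security groups the notebook instance currently uses (a sketch assuming boto3; the instance name is a placeholder):

import boto3

sm = boto3.client("sagemaker")

# Replace with the name of your Neptune notebook instance
desc = sm.describe_notebook_instance(NotebookInstanceName="<your-notebook-instance-name>")
print(desc["SubnetId"])
print(desc["SecurityGroups"])  # the NeptuneExportSecurityGroup ID should be listed here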
Thanks! I created a new notebook instance, as the original instance did not allow me to change these settings. I think I'm very close now; currently I'm facing the following error:
"status": "failed",
"logs": "https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/batch/job;stream=neptune-export-job-09508250/default/ba46ae8014a548c997d3774e2389d342",
"reason": "An error occurred while connecting to Neptune. Ensure you have not disabled SSL if the database requires SSL in transit. Ensure you have specified the --use-iam-auth flag (and set the SERVICE_REGION environment variable if running in your own environment) if the database uses IAM database authentication. Ensure the database's VPC security group(s) allow access from the export tool."
neptune_ml.get_iam() returns False, should it return True?
It should return True if your cluster is IAM enabled. The result of neptune_ml.get_iam() is based on the notebook config file in your home directory; this should be /home/ec2-user/graph_notebook_config.json on Sagemaker. Could you check that the contents of this config file match your cluster details (especially the auth_mode field)?
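For example, something like this in a notebook cell will print the relevant fields (a sketch; field names other than auth_mode are assumptions based on a typical config file):

import json

# Inspect the config that neptune_ml.get_iam() reads
with open("/home/ec2-user/graph_notebook_config.json") as f:
    config = json.load(f)

# auth_mode should indicate IAM for an IAM-authentication-enabled cluster
print(config.get("auth_mode"))
print(config.get("host"), config.get("port"))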
Thanks for your help. It seems the cluster VPC security group was not configured correctly to allow access from the export security group; now the export service is up and running, and the cell runs successfully.
(Sorry for asking so many questions that may be obvious; I'm more of a data science/ML person than a cloud or ops person.)
Now the cell
%neptune_ml dataprocessing start --wait --store-to processing_results {processing_params}
returns
{
"detailedMessage": "Unable to connect to vpc endpoint. Please check your vpc configuration.",
"requestId": "7a6f5966-4e53-41bb-b568-6f8ab2dc620c",
"code": "BadRequestException"
}
Any hints on that one?
UPDATE: I manually tried to curl the endpoint {the endpoint found in the Neptune console for this cluster}/ml/dataprocessing, but it cannot be reached. So I guess that, similar to how the export service endpoint needed to be enabled from the notebook host, I now need to enable the Neptune endpoint from the notebook host.
The error suggests that Neptune is unable to connect to the Sagemaker VPC endpoints. Can you confirm that you have created endpoints for both sagemaker.runtime and sagemaker.api, as documented here:
Regarding the update note, it is expected that a plain %%bash curl query to the IAM-enabled Neptune cluster will fail, as the request is not signed with the required authentication headers. This should not be a concern with %%neptune_ml commands, which create IAM-authenticated requests automatically; since the command returned an internal engine error directly from Neptune, your Notebook<->Neptune connectivity config appears fine.
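As a quick sanity check for the VPC endpoints mentioned above, something like this can list them from the notebook (a sketch assuming boto3; the region and VPC ID are placeholders, and the filter uses the usual com.amazonaws.<region>.sagemaker.* service naming):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Replace the VPC ID with the VPC shared by Neptune and the notebook instance
resp = ec2.describe_vpc_endpoints(
    Filters=[
        {"Name": "vpc-id", "Values": ["vpc-0123456789abcdef0"]},
        {"Name": "service-name", "Values": [
            "com.amazonaws.us-east-1.sagemaker.api",
            "com.amazonaws.us-east-1.sagemaker.runtime",
        ]},
    ]
)
for endpoint in resp["VpcEndpoints"]:
    print(endpoint["ServiceName"], endpoint["State"])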
Yep, they were created some days ago and were working correctly when running the first notebook example, but it is fixed now. I found that when the notebook instance was restarted some minutes ago, the variables defined in the .bashrc file (which you suggested some days ago) were, as expected, lost. I'm not sure whether it had something to do with the role specified there, but after creating the variables again it worked (I will add those exports to the notebook lifecycle file).
Now I have the following error, but I guess it should be easily fixed by changing the job configuration to a different EC2 instance size:
"detailedMessage": "AmazonSageMakerException: The account-level service limit 'ml.r5.large for processing job usage' is 0 Instances
Thank you for confirming the fix, @llealgt.
For the Sagemaker account service limit exception, you will need to contact AWS support to request an increase: https://aws.amazon.com/premiumsupport/knowledge-center/sagemaker-resource-limit-exceeded-error/
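If you want to see the current quota value before opening a case, a sketch like this should surface it (it assumes boto3 and the Service Quotas API; matching on the quota name from the error message is a guess):

import boto3

sq = boto3.client("service-quotas", region_name="us-east-1")

# Look for the processing-job quota named in the error message
paginator = sq.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if "ml.r5.large" in quota["QuotaName"] and "processing" in quota["QuotaName"].lower():
            print(quota["QuotaName"], quota["Value"])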
Thanks! I thought maybe using a different EC2 size could work because I assumed the limits were instance-size dependent, but I will check the link you just shared.
One last question: now the notebook magic
%neptune_ml dataprocessing
works (assuming I fix the service limit), so I assume Neptune ML and Sagemaker are correctly configured. But suppose I want to call the dataprocessing endpoints via awscurl from the notebook terminal (something like this: https://docs.aws.amazon.com/neptune/latest/userguide/machine-learning-on-graphs-processing.html).
Do I need to set another configuration? Currently it doesn't work; it times out, so I guess the notebook cannot reach the Neptune endpoints. (I use awscurl instead of curl; for example, it works fine for the export service, but not for the Neptune endpoints.)
Can you share the awscurl command you are using?
awscurl \
-X POST https://{my neptune endpoint copied from the neptune console}/ml/dataprocessing \
-H 'Content-Type: application/json' \
-d '{
"inputDataS3Location" : "s3://{my bucket}/neptune-export/20230217_215839",
"id" : "test",
"processedDataS3Location" : "s3://{my bucket}/preprocessed",
"configFileName" : "training-job-configuration.json"
}'
You may need to specify additional options, e.g. setting --service to neptune-db instead of the default execute-api.
Try the following to check against the status endpoint:
awscurl https://<neptune_endpoint>:<neptune_port>/status --service neptune-db --region <region>
I see my issue: I was missing the port. I think that solves it. Thanks for all of your help!
Describe the bug
When trying to run the ML notebooks (specifically Neptune-ML-00-Getting-Started-with-Neptune-ML-Gremlin.ipynb), the process never loads any data.
response = neptune_ml.prepare_movielens_data(s3_bucket_uri)
runs and shows "Processing X" (where X is Movies or Ratings), but it doesn't load any data. This can be checked by inspecting the S3 bucket indicated by the variable s3_bucket_uri; also, print(response) returns None.
To Reproduce
Steps to reproduce the behavior:
s3_bucket_uri="s3://<INSERT S3 BUCKET OR PATH>"
Expected behavior
Sample data should be loaded to the S3 location, but it is not.
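A quick way to confirm that nothing was written under the prefix (a sketch assuming boto3; the URI parsing is simplified and assumes an s3://bucket/prefix form):

import boto3

s3_bucket_uri = "s3://<INSERT S3 BUCKET OR PATH>"
bucket, _, prefix = s3_bucket_uri.replace("s3://", "").partition("/")

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
# A key count of 0 matches the reported behavior: no sample data was uploaded
print(resp.get("KeyCount", 0))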