aws / graph-notebook

Library extending Jupyter notebooks to integrate with Apache TinkerPop, openCypher, and RDF SPARQL.
https://github.com/aws/graph-notebook
Apache License 2.0

[BUG] ML notebook examples not loading data #445

Closed llealgt closed 1 year ago

llealgt commented 1 year ago

Describe the bug: When trying to run the ML notebooks (specifically Neptune-ML-00-Getting-Started-with-Neptune-ML-Gremlin.ipynb), the process never loads any data.

response = neptune_ml.prepare_movielens_data(s3_bucket_uri) runs and shows "Processing X" (where X is Movies or Ratings), but it doesn't load any data. This can be checked by inspecting the S3 bucket indicated by the s3_bucket_uri variable; print(response) also returns None.

To Reproduce: Steps to reproduce the behavior:

  1. Go to a Neptune sagemaker notebook instance
  2. Open the notebook 04-Machine-Learning/Neptune-ML-00-Getting-Started-with-Neptune-ML-Gremlin.ipynb
  3. Set a valid S3 location in s3_bucket_uri="s3://<INSERT S3 BUCKET OR PATH>"
  4. Run all cells

Expected behavior: Sample data should be loaded into the S3 location, but it is not.

michaelnchin commented 1 year ago

Hi @llealgt, thank you for submitting a bug report!

Given the issue signature, your IAM user/role may not have write permissions to the S3 bucket being used. I was not able to reproduce on my normal setup; however, using another S3 bucket with no write access, I did get stuck at the same place:

(screenshot: cell output stuck at the same "Processing" step)

You can adjust the Python logging level to ERROR to see if you are getting a similar exception from the prepare_movielens_data command.
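For example, a minimal sketch, assuming the notebook's neptune_ml helper is already imported; run this in a cell before the prepare step:

import logging

# Surface ERROR-level messages from the root logger so exceptions raised
# inside the helper (e.g. an S3 AccessDenied) show up in the cell output.
logging.getLogger().setLevel(logging.ERROR)

response = neptune_ml.prepare_movielens_data(s3_bucket_uri)
print(response)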

llealgt commented 1 year ago

Hi @michaelnchin, thanks for your reply. It is indeed an "access denied" issue; I never considered that possibility because I was able to run the other (non-ML) example notebooks using the same S3 bucket. Is there any difference in how this notebook downloads and copies the data compared to the others?

llealgt commented 1 year ago

Hi @michaelnchin, thanks for your help. I found that write permission had been revoked and the role was read-only, which is why it worked before but not now. We can close this issue.

llealgt commented 1 year ago

Hi @michaelnchin, I know I closed this an hour ago, but I found another issue. When running:

endpoints = neptune_ml.setup_pretrained_endpoints(s3_bucket_uri, setup_node_classification, setup_node_regression, setup_link_prediction, setup_edge_classification, setup_edge_regression)

the result is None, and I see the following in the logs: ERROR:root:Unable to determine the Neptune ML IAM Role.

Any suggestions on how to fix it, or how to debug it?

michaelnchin commented 1 year ago

You need to include the Neptune ML IAM role (created and attached to your Neptune cluster as documented here) in your local .bashrc file.

Use the following command to add it:

echo "export NEPTUNE_ML_ROLE_ARN=[YOUR_NEPTUNE_ML_IAM_ROLE_ARN]" >> ~/.bashrc

For other ML notebooks, some neptune_ml commands also look in .bashrc for your Neptune Export service endpoint, so you should also include this using something similar to:

echo "export NEPTUNE_EXPORT_API_URI=https://3ui13o134.execute-api.us-east-1.amazonaws.com/v1/neptune-export" >> ~/.bashrc

llealgt commented 1 year ago

Hi @michaelnchin, I had created and attached the role per the documentation you shared last week, so I guess what I was missing is the .bashrc setup (maybe I overlooked it, but I did not see that step in the documentation). I will try it today. Thanks!

llealgt commented 1 year ago

Hi @michaelnchin, I tested adding the export you suggested to the .bashrc file, and it seems to have solved the previous error. Now I get a new one:

ERROR:root:An error occurred (AccessDeniedException) when calling the CreateModel operation: User: arn:aws:sts::732279101103:assumed-role/AWSNeptuneNotebookRole-neptune-test2-lleal-role/SageMaker is not authorized to perform: sagemaker:CreateModel on resource: arn:aws:sagemaker:us-east-1:732279101103:model/classifi-2023-02-13-23-16-27 because no identity-based policy allows the sagemaker:CreateModel action

Here AWSNeptuneNotebookRole-neptune-test2-lleal-role is a role created during notebook configuration by choosing "create an IAM role", so I guess this role needs some additional permissions. Is there a list of permissions suggested for this role?

michaelnchin commented 1 year ago

Yes, your Sagemaker notebook also requires additional permissions. Attach a policy containing the following to your notebook's IAM role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Resource": "arn:aws:cloudwatch:[AWS_REGION]:[AWS_ACCOUNT_ID]:*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:DescribeLogStreams",
                "logs:PutLogEvents",
                "logs:GetLogEvents"
            ],
            "Resource": "arn:aws:logs:[AWS_REGION]:[AWS_ACCOUNT_ID]:*",
            "Effect": "Allow"
        },
        {
            "Action": "neptune-db:*",
            "Resource": "arn:aws:neptune-db:[AWS_REGION]:[AWS_ACCOUNT_ID]:[CLUSTER_RESOURCE_ID]/*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:Put*",
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": "arn:aws:s3:::*",
            "Effect": "Allow"
        },
        {
            "Action": "execute-api:Invoke",
            "Resource": "arn:aws:execute-api:[AWS_REGION]:[AWS_ACCOUNT_ID]:*/*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "sagemaker:CreateModel",
                "sagemaker:CreateEndpointConfig",
                "sagemaker:CreateEndpoint",
                "sagemaker:DescribeModel",
                "sagemaker:DescribeEndpointConfig",
                "sagemaker:DescribeEndpoint",
                "sagemaker:DeleteModel",
                "sagemaker:DeleteEndpointConfig",
                "sagemaker:DeleteEndpoint"
            ],
            "Resource": "arn:aws:sagemaker:[AWS_REGION]:[AWS_ACCOUNT_ID]:*/*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "iam:PassRole"
            ],
            "Resource": "[YOUR_NEPTUNE_ML_IAM_ROLE_ARN]",
            "Effect": "Allow"
        }
    ]
}
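If you prefer the CLI over the console, a hedged sketch for attaching the above as an inline policy (fill in the bracketed placeholders in the JSON first; the policy file name is an assumption, and the role name is taken from your error message):

# Attach the policy document saved locally as neptune-ml-notebook-policy.json
aws iam put-role-policy \
  --role-name AWSNeptuneNotebookRole-neptune-test2-lleal-role \
  --policy-name NeptuneMLNotebookPermissions \
  --policy-document file://neptune-ml-notebook-policy.json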

llealgt commented 1 year ago

Thanks! That works. Now I'm trying to run the notebook Neptune-ML-01-Introduction-to-Node-Classification-Gremlin, which fails at

%%neptune_ml export start --export-url {neptune_ml.get_export_service_host()} --export-iam --wait --store-to export_results
${export_params}

but it returns the following error (it seems the notebook cannot reach the export service):

Max retries exceeded with url: /v1/neptune-export (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fe5b46f8a10>: Failed to establish a new connection: [Errno 110] Connection timed out'))

I have followed the guide (https://docs.aws.amazon.com/neptune/latest/userguide/export-service.html) twice with no luck. Is there anything else needed in order to run the notebook, like configuring some additional permission for the notebook instance?

michaelnchin commented 1 year ago

Please verify that your Neptune cluster, Export service, and Sagemaker notebook instance are all in the same VPC and have the subnets/security groups configured correctly.

One other specific thing to check: please ensure that the NeptuneExportSecurityGroup created by the Export stack is attached to the Sagemaker instance:

Enable access to the Neptune-Export endpoint from a VPC-based EC2 instance

If you make your Neptune-Export endpoint VPC-only, you can only access it from within the VPC in which the Neptune-Export service is installed. To allow connectivity from an Amazon EC2 instance in the VPC from which you can make Neptune-Export API calls, attach the NeptuneExportSecurityGroup created by the AWS CloudFormation stack to that Amazon EC2 instance.
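One way to verify what is currently attached, as a quick CLI sketch (the notebook instance name is a placeholder):

# Show the security groups and subnet attached to the notebook instance.
aws sagemaker describe-notebook-instance \
  --notebook-instance-name <YOUR_NOTEBOOK_INSTANCE_NAME> \
  --query '{SecurityGroups: SecurityGroups, SubnetId: SubnetId}'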
llealgt commented 1 year ago

Thanks! I created a new notebook instance, since the original instance did not allow me to change these settings. I think I'm very close now; currently I'm facing the following error:

 "status": "failed",
  "logs": "https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/batch/job;stream=neptune-export-job-09508250/default/ba46ae8014a548c997d3774e2389d342",
  "reason": "An error occurred while connecting to Neptune. Ensure you have not disabled SSL if the database requires SSL in transit. Ensure you have specified the --use-iam-auth flag (and set the SERVICE_REGION environment variable if running in your own environment) if the database uses IAM database authentication. Ensure the database's VPC security group(s) allow access from the export tool."

neptune_ml.get_iam() returns False, should it return True?

michaelnchin commented 1 year ago

It should return True if your cluster is IAM enabled. The result of neptune_ml.get_iam() is based on the notebook config file in your home directory; on Sagemaker this should be /home/ec2-user/graph_notebook_config.json. Could you check that the contents of this config file match your cluster details (especially the auth_mode field)?
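For reference, a sketch of what that file might look like for an IAM-enabled cluster (host and region values are illustrative):

{
  "host": "<your-cluster>.cluster-xxxxxxxxxxxx.us-east-1.neptune.amazonaws.com",
  "port": 8182,
  "auth_mode": "IAM",
  "load_from_s3_arn": "",
  "ssl": true,
  "aws_region": "us-east-1"
}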

llealgt commented 1 year ago

Thanks for your help. It turned out the cluster VPC security group was not configured to allow access from the export security group; the export service is now up and running, and that cell runs successfully. Sorry for asking so many questions that may be obvious (I'm more of a data science/ML person than a cloud or ops person). Now the cell

%neptune_ml dataprocessing start --wait --store-to processing_results {processing_params}

returns

{
  "detailedMessage": "Unable to connect to vpc endpoint. Please check your vpc configuration.",
  "requestId": "7a6f5966-4e53-41bb-b568-6f8ab2dc620c",
  "code": "BadRequestException"
}

Any hints on that one? UPDATE: I manually tried to curl the endpoint ({the endpoint found in the Neptune console for this cluster}/ml/dataprocessing), but it cannot be reached. So I guess that, similar to how the export service endpoint needed to be enabled from the notebook host, I now need to enable the Neptune endpoint from the notebook host.

michaelnchin commented 1 year ago

The error suggests that Neptune is unable to connect to the Sagemaker VPC endpoints. Can you confirm that you have created endpoints for both sagemaker.runtime and sagemaker.api, as documented here:

https://docs.aws.amazon.com/neptune/latest/userguide/machine-learning-manual-setup.html#ml-manual-setup-endpoints
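If they are missing, a hedged sketch for creating them from the CLI (region, VPC, subnet, and security group IDs are placeholders):

# Interface endpoint for the SageMaker API service.
aws ec2 create-vpc-endpoint \
  --vpc-endpoint-type Interface \
  --vpc-id <VPC_ID> \
  --service-name com.amazonaws.us-east-1.sagemaker.api \
  --subnet-ids <SUBNET_ID> \
  --security-group-ids <SECURITY_GROUP_ID>

# Interface endpoint for the SageMaker Runtime service.
aws ec2 create-vpc-endpoint \
  --vpc-endpoint-type Interface \
  --vpc-id <VPC_ID> \
  --service-name com.amazonaws.us-east-1.sagemaker.runtime \
  --subnet-ids <SUBNET_ID> \
  --security-group-ids <SECURITY_GROUP_ID>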

Regarding the update note: it is expected that a plain %%bash curl request to an IAM-enabled Neptune cluster will fail, as the request is not signed with the required authentication headers. This is not a concern with %%neptune_ml commands, which create IAM-authenticated requests automatically; since the command returned an internal engine error directly from Neptune, your Notebook<->Neptune connectivity config appears fine.

llealgt commented 1 year ago

Yep, they were created some days ago and were working correctly when running the first notebook example. It is fixed now: when the notebook instance was restarted some minutes ago, the variables defined in the .bashrc file (which you suggested some days ago) were lost, as expected. I'm not sure if it had something to do with the role specified there, but after recreating the variables it worked (I will add those exports to the notebook lifecycle file).
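For the lifecycle configuration, a minimal on-start script sketch based on the earlier exports (both values are placeholders; only /home/ec2-user/SageMaker persists across restarts, which is why the .bashrc edits were lost):

#!/bin/bash
# Re-create the Neptune ML environment variables on every notebook start.
echo "export NEPTUNE_ML_ROLE_ARN=[YOUR_NEPTUNE_ML_IAM_ROLE_ARN]" >> /home/ec2-user/.bashrc
echo "export NEPTUNE_EXPORT_API_URI=[YOUR_NEPTUNE_EXPORT_API_URI]" >> /home/ec2-user/.bashrc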

Now I have the following error, but I guess that should be easily fixed by changing the job configuration to a different EC2 instance size: "detailedMessage": "AmazonSageMakerException: The account-level service limit 'ml.r5.large for processing job usage' is 0 Instances

michaelnchin commented 1 year ago

Thank you for confirming the fix, @llealgt.

For the Sagemaker account service limit exception, you will need to contact AWS support to request an increase: https://aws.amazon.com/premiumsupport/knowledge-center/sagemaker-resource-limit-exceeded-error/
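Alternatively, a hedged CLI sketch using the Service Quotas API (the quota code must be looked up first; the desired value is an example):

# Find the quota code for the processing-job instance type.
aws service-quotas list-service-quotas \
  --service-code sagemaker \
  --query "Quotas[?contains(QuotaName, 'ml.r5.large for processing job usage')]"

# Request an increase using the QuotaCode found above.
aws service-quotas request-service-quota-increase \
  --service-code sagemaker \
  --quota-code <QUOTA_CODE> \
  --desired-value 1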

llealgt commented 1 year ago

Thanks! I thought using a different EC2 size could work because I assumed the limits were instance-size dependent, but I will check the link you just shared.

One last question: the notebook magic %neptune_ml dataprocessing now works (assuming I fix the service limit), so I assume Neptune ML and Sagemaker are correctly configured. But if I want to call the dataprocessing endpoints via awscurl from the notebook terminal (something like this: https://docs.aws.amazon.com/neptune/latest/userguide/machine-learning-on-graphs-processing.html), do I need to set any other configuration? Currently it doesn't work; it times out, so I guess the notebook cannot reach the Neptune endpoints. (I use awscurl instead of curl; for the export service it works fine, but not for the Neptune endpoints.)

michaelnchin commented 1 year ago

Can you share the awscurl command you are using?

llealgt commented 1 year ago

awscurl \
  -X POST https://{my neptune endpoint copied from the neptune console}/ml/dataprocessing \
  -H 'Content-Type: application/json' \
  -d '{
        "inputDataS3Location" : "s3://{my bucket}/neptune-export/20230217_215839",
        "id" : "test",
        "processedDataS3Location" : "s3://{my bucket}/preprocessed",
        "configFileName" : "training-job-configuration.json"
      }'

michaelnchin commented 1 year ago

You may need to specify additional options, e.g. setting --service to neptune-db instead of the default execute-api.

Try the following to check against the status endpoint:

awscurl https://<neptune_endpoint>:<neptune_port>/status --service neptune-db --region <region>
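The same flags should apply to your original dataprocessing call; a sketch with placeholders:

awscurl --service neptune-db --region <region> \
  -X POST https://<neptune_endpoint>:<neptune_port>/ml/dataprocessing \
  -H 'Content-Type: application/json' \
  -d '{
        "inputDataS3Location" : "s3://<bucket>/neptune-export/20230217_215839",
        "id" : "test",
        "processedDataS3Location" : "s3://<bucket>/preprocessed",
        "configFileName" : "training-job-configuration.json"
      }'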

llealgt commented 1 year ago

I see my issue: I was missing the port. I think that solves it, thanks for all of your help!