Set up front end to allow public access to query Neptune

baskaufs commented 2 years ago

In this comment I referenced using a load balancer to allow access outside of the VPC.

It is unclear to me how this would be used to manage Query only (non-Update) access. The whole description on the instruction page is focused on restricting access to the Neptune resource (e.g. restrict to a range of IP addresses), and not on creating a publicly accessible resource. In #58, I had been assuming that one would manage the graph on the back end using the "SSH tunneling" approach. However, it occurs to me that allowing access outside the VPC for querying through a load balancer wouldn't necessarily prevent writing to the database if SPARQL Update were allowed.

In the existing sparql.vanderbilt.edu Blazegraph installation, we dealt with this problem by restricting HTTP requests to GET, since Query can be done using either GET or POST and Update requires POST. However, I have since realized that there are common circumstances where queries need to be done using POST, so this isn't a good solution. I don't understand enough about how a load balancer works to know if it could be set up to require authentication for SPARQL Update but not for SPARQL Query.
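To make the GET/POST distinction concrete, here is a minimal sketch of how a request could be classified as Query or Update under the SPARQL 1.1 Protocol. The function name `classify_sparql_request` is made up for illustration. It assumes the check runs somewhere that can see the HTTP request (a proxy or gateway), and it says nothing about what Neptune or a load balancer does internally.

```python
from urllib.parse import parse_qs


def classify_sparql_request(method, query_string="", content_type="", body=""):
    """Return 'query', 'update', or 'unknown' for a SPARQL 1.1 Protocol request.

    Hypothetical helper for illustration only; not part of any AWS or Neptune API.
    """
    method = method.upper()
    # Drop parameters such as "; charset=utf-8" from the Content-Type header.
    media_type = content_type.split(";")[0].strip().lower()

    if method == "GET":
        # The protocol defines GET only for queries (query text in the URL).
        return "query" if "query" in parse_qs(query_string) else "unknown"

    if method == "POST":
        # "Direct" POST: the body is the unencoded query or update string.
        if media_type == "application/sparql-query":
            return "query"
        if media_type == "application/sparql-update":
            return "update"
        # URL-encoded POST: the body carries a query= or update= form field.
        if media_type == "application/x-www-form-urlencoded":
            form = parse_qs(body)
            if "update" in form:
                return "update"
            if "query" in form:
                return "query"

    return "unknown"
```

A filter built on this could let `query` requests through without authentication and require credentials for `update`, which avoids blocking legitimate POST queries.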

Also, does the load balancer handle HTTPS for us? Or do we also need CloudFront for that?

baskaufs commented 2 years ago

Here are the URLs Cliff referenced:

https://docs.aws.amazon.com/neptune/latest/userguide/sparql-api-reference.html generic landing page for HTTP access using SPARQL

https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-endpoints.html Gives examples of endpoint URLs and explains that you should use the cluster endpoints, not the instance endpoints. The endpoint examples make it look like I was using the correct form when I was having the problems here

https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-sparql-http-rest.html (mentions using the EC2 instance, but Cliff doesn't think that's necessary if we configure correctly).

CliffordAnderson commented 2 years ago

I've been looking into this question too and I may be wrong. We may need a proxy on EC2, though I find stray references to a solution that does not need that infrastructure.

baskaufs commented 2 years ago

Interestingly, in one reference on doing federated queries they mention that to federate with a service outside of the VPC, you have to set up a reverse proxy. That's really paranoid since it's not about giving access to data in the triplestore, but I suppose typical for AWS where applications can't communicate with anything else unless explicitly given permission.

baskaufs commented 2 years ago

The first solution here seems to be the most straightforward one for allowing access to external clients. It uses a network load balancer and therefore doesn't need the EC2 instance. So it is a little simpler than the second example that uses an application load balancer, requires the EC2, and relies on some third party software.

There are, however, two issues with the first configuration:

I don't understand if the lambda is actually required. I think maybe not if we are only using a single replica.
It seems like we might have to run two separate network load balancers: one for the read-only endpoint (no authentication required) and a second one with authentication pointing to the writable endpoint. That's assuming we can use it for loading data with SPARQL Update and skip the EC2 SSH tunneling thing.

I think the most straightforward thing is to just try to set up the load balancer pointing to the read-only cluster (not instance) endpoint, and see if it works.

baskaufs commented 2 years ago

@awesolek2 do you think if the two of us got together we could figure out how to set this up using this configuration? It's the main thing still blocking the actual use of the Neptune instance we set up. I'm unscheduled today (2022-02-11) except for the Digital Commons event at noon and a meeting from 2 to 3 PM. I'm also relatively unscheduled on Monday and Tuesday next week.

baskaufs commented 2 years ago

Started working on this 2022-02-14. References:

about Neptune endpoints https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-endpoints.html
some information about target groups and network load balancers: http://chinomsoikwuagwu.com/2020/02/14/AWS-Load-Balancers_How-they-work-and-differences-between-them/

Here are the setup notes:

Click create network
name: neptunebalancer1
scheme: Internet-facing
IP address type: Dualstack (IPv4 and IPv6 addresses)
Left VPC at default (which Andy checked and it was OK)
Mappings: Availability zone subnet us-east-1d since that was the zone our Neptune instance 1 is in

The Listeners and routing section is the tricky part. The port 80 listener has to have a default action forwarding to a target group. There were no target groups to select from, so clicked "Create target group". NOTE: the link leads to a page whose breadcrumb is under "EC2". Is that why we only see EC2 instances?

Created an IP target group called neptunetargetgroup1. Target type: instances. Protocol defaults to TCP and port 80. I think TCP is required for Neptune, but I don't know whether the port should be 8182, so I left it at 80.

Registering targets is optional (only EC2 instances show up, no Neptune). Registered targets ensure that the load balancer routes traffic to this target group. I just clicked next and hoped for the best. It said that it successfully created it.

Returned to the setup of the network balancer and clicked the refresh button. The new target group showed up as an option and I selected it.

Summary: Basic configuration neptunebalancer1

Internet-facing
IPv4

Network mapping VPC vpc-7234050a

us-east-1d
subnet-25d48341

Listeners and routing

TCP:80 defaults to neptunetargetgroup1

Tags vu:owner DiSC

Clicked "Create load balancer"

It successfully created it. Went to the load balancer page and it was provisioning.

baskaufs commented 2 years ago

It is not at all clear how to make this actually do anything. I sent an HTTP GET to neptunebalancer1-ed5863c1d0ef7ec4.elb.us-east-1.amazonaws.com:8182/status and of course nothing happened.
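One thing worth ruling out: the setup notes above record a single listener on TCP:80, while this request went to port 8182. As a rough diagnostic sketch (the helper name `port_accepts_connections` is made up here, and this only tests TCP reachability, not whether Neptune answers), the two ports could be compared like this:

```python
import socket


def port_accepts_connections(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


# Example use against the load balancer named in this thread (needs network access):
# host = "neptunebalancer1-ed5863c1d0ef7ec4.elb.us-east-1.amazonaws.com"
# for port in (80, 8182):
#     print(port, port_accepts_connections(host, port))
```

If neither port connects, the problem is more likely the missing registered targets or the security groups than the port number alone.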

Comparing with the Libra-Canta-1WHIT3CVGOCS4 target group, they used HTTP, port 8182, and IP target type. Their target was to a specific IP address 172.31.0.180. But how can we do this? Neptune only gives a domain name -- there doesn't seem to be any place to find a specific IP address to use.
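One possible way to find an IP address is to resolve the Neptune endpoint's domain name from a machine inside the VPC. The sketch below assumes that the cluster endpoint resolves to a private IPv4 address there, and the host name in the usage comment is a placeholder, not our real endpoint. The address behind a cluster endpoint may change over time, which might be why the referenced configuration includes a lambda, but that is a guess.

```python
import socket


def resolve_ipv4(hostname, port=8182):
    """Return the sorted, de-duplicated IPv4 addresses that a host name resolves to."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return sorted({info[4][0] for info in infos})


# Example use from inside the VPC (placeholder host name, replace with the real
# cluster read-only endpoint):
# print(resolve_ipv4("example-cluster.cluster-ro-xxxxxxxx.us-east-1.neptune.amazonaws.com"))
```

Any address found this way would then be a candidate for an IP-type target on port 8182, like the 172.31.0.180 target in the Libra-Canta group.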

baskaufs commented 2 years ago

2022-07-18 Cliff, Andy, and I tried again. This time we looked at the available subnets because they actually had numeric IP addresses that we could use. However, when we tried to set up a target group with targets based on IPs, the ones we tried to copy and paste from the subnets were not available. Cliff requested support from the Cloud team.

baskaufs commented 2 years ago

@CliffordAnderson @awesolek2 We should pay attention to the fact that Neptune cannot perform outgoing federated queries without additional setup. See https://aws.amazon.com/blogs/database/benefitting-from-sparql-1-1-federated-queries-with-amazon-neptune/ for details. Namely:

If your query returns an error, you may need to create the correct VPC network settings to allow your Neptune cluster to send outbound requests. Complete the following:
Have a public subnet; you may find it easiest to create a new one
Have a NAT gateway linked to your public subnet
Configure your existing route table to target your NAT Gateway
Create a new route table targeting the Internet Gateway
Associate your new subnet with your new route table
If you have enabled IAM database authentication for your Neptune cluster, you must take this into consideration.

baskaufs commented 2 years ago

Notes from 2022-04-21 meeting with Taylor Riggan (AWS). Also attending, Lindsey Beeson (account manager who lives in Nash and works with VU).

Taylor thinks a solution using the API Gateway would be better. It allows HTTPS access to HTTP endpoints inside the VPC. It also allows throttling if somebody attempts a denial of service or something like that. He's going to try to develop a CloudFormation template to simplify the setup for us.

He mentioned Charles Ivy (London) who wrote a blog post on NAT gateway (for federated queries?).

He also mentioned that in late 3rd quarter 2022 they would be offering Neptune Serverless (similar to DynamoDB), where you have a certain number of credits and are charged by number of concurrent queries rather than by instance size. That should make it much cheaper to operate.

baskaufs commented 2 years ago

Another note: Jason Bradley is going to be our new Solutions Architect from AWS when he finishes his training.

baskaufs commented 2 years ago

Email from Taylor Riggan 2022-05-06. The information he sent was attached to the email and a copy is in the DiSC SharePoint here

Hi all,

Attached is the architecture and automated deployment for the configuration that we covered recently. Apologies for taking so long to get back to you with this.

In regards to filtering the UPDATE API – Neptune doesn’t allow UPDATE via a GET request. But I think having the filter there is useful in the event that someone mistakenly changes the /sparql route to allow ANY. At the very least it would filter an UPDATE parameter from the request. I’m still thinking about how to potentially do that for requests that include an UPDATE in the body of the request using this solution.

UNDER NDA, we do have plans to launch Action Based Access Control (ABAC) via IAM for Neptune sometime in the next few months. Once that is available, we’ll be able to lock down a Neptune cluster more rigorously for read-only applications.

As always, let me know if you have any questions.

Have a great weekend!

Cheers,

-- Taylor Riggan | Sr. Graph Architect | Amazon Neptune | AWS |

baskaufs commented 2 years ago

Met on 2022-06-01 with Allen Karns, VUIT's AWS cloud architect, and he was able to get things set up properly using the CloudFormation template provided by Taylor Riggan. The one bump in the road was that I think the S3 bucket needed to store the endpoint URLs that CloudFront is using had to be set up manually. There was also some issue with one of the availability zones not being usable. However, I don't understand the details enough to remember exactly what was done.

In any case, the endpoint URL: https://5j6diw4i0h.execute-api.us-east-1.amazonaws.com/sparql seems to work properly and can be accessed publicly (but only using HTTP GET since it connects to the read-only endpoint -- POST queries aren't possible at this point).
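For anyone wanting to try it, here is a minimal sketch of a GET query against that endpoint using only the Python standard library. The function names and the choice of the `application/sparql-results+json` Accept header are my own assumptions; I have not confirmed which result formats the gateway passes through.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://5j6diw4i0h.execute-api.us-east-1.amazonaws.com/sparql"


def build_query_request(sparql, endpoint=ENDPOINT):
    """Build a SPARQL Protocol GET request with the query URL-encoded in ?query=."""
    url = endpoint + "?" + urlencode({"query": sparql})
    # Assumption: the endpoint honors this Accept header and returns JSON results.
    return Request(url, headers={"Accept": "application/sparql-results+json"}, method="GET")


def run_query(sparql, endpoint=ENDPOINT, timeout=30):
    """Send the query and return the parsed JSON response (needs network access)."""
    with urlopen(build_query_request(sparql, endpoint), timeout=timeout) as response:
        return json.load(response)


# Example use:
# results = run_query("SELECT * WHERE {?s ?p ?o} LIMIT 5")
# for row in results["results"]["bindings"]:
#     print(row)
```

Because the whole query travels in the URL, long queries may exceed URL length limits, which is one of the circumstances where POST would be needed.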

baskaufs commented 2 years ago

Also, we can't currently do federated queries, since Neptune isn't allowed to access anything outside its VPC.

HeardLibrary / vandycite

Set up front end to allow public access to query Neptune #64