Determine most effective way to load triples

baskaufs commented 2 years ago

The goal here is to figure out a way to script most of the process of moving triples from some source (local hard drive, GitHub) into the triplestore without a bunch of manual button clicks or opening and running a Sagemaker Jupyter notebook.

baskaufs commented 2 years ago

I haven't finished reading the documentation, but it appears that a strategy to use in production is to use the command line interface (CLI) to load files into S3, then control the loading process with commands to Neptune through the CLI. See this page for examples. Because the process involves the CLI, it could be scripted using a shell script or programming language like Python.

baskaufs commented 2 years ago

OK, this seems to be more complicated than I thought. The data loading example states the following: "Unless stated otherwise, you must follow these steps from an Amazon Elastic Compute Cloud (Amazon EC2) instance in the same Amazon Virtual Private Cloud (VPC) as your Neptune DB instance."

That means that one can't just issue the POST commands from a local machine -- one would have to SSH into the EC2 instance and issue the commands from there.

There are CLI commands for Neptune here, but they are all related to managing the cluster (delete databases, setting tags, restarting the cluster, etc.) and not actually for interacting with the triplestore using either the loader or SPARQL UPDATE.

baskaufs commented 2 years ago

Having to interact with the database solely by typing stuff into a Sagemaker Jupyter notebook prevents any kind of meaningful automation of the process. However, the documentation for running locally-installed notebooks includes instructions for connecting a local notebook, including authentication instructions.

In particular, they say:

"When connecting the graph notebook to Neptune, make sure you have a network setup to communicate to the VPC that Neptune runs on. If not, you can follow this guide."
In the authentication section, they say "If you are running a SigV4 authenticated endpoint, ensure that your configuration has auth_mode set to IAM: [example]". The example is identical to the generic notebook configuration example above it, except that the auth_mode is set to "IAM" instead of "DEFAULT".
The comment that follows says "Additionally, you should have the following Amazon Web Services credentials available in a location accessible to Boto3", which is the standard setup for doing CLI interactions via Python.

The key thing seems to be using an EC2 instance in the same VPC as the intermediary since it's not possible to connect to Neptune directly. These instructions explain about setting up an EC2 proxy server in the VPC, then setting up an "SSH tunnel" to securely forward traffic to the VPC. Step-by-step instructions are given on that page.

The local connection instructions also mention "Note that this README is not an official recommendation on network setups as there are many ways to connect to Amazon Neptune from outside of the VPC, such as setting up a load balancer or VPC peering." It seems like the load balancer would be read-only, so for actually writing data, the "VPC peering" (whatever that is) might be the only other option.

I'm not sure enough about how the SSH tunneling would work to know whether it could be fully automated by a local script (for example, by opening the SSH connection from the script). One option might be to just write the loading script (shell or Python), then move it to the EC2 instance and run it from there. That might be annoying if the files being loaded into S3 are local, since you'd either have to run the upload command from a separate (local) script or copy the files over to the EC2. However, if they were on GitHub, they could be loaded via a URL and the part of the script to do the S3 upload could then reside on the EC2 instance along with the part of the script that controlled the actual loading commands.

baskaufs commented 2 years ago

Hmm. I might take some of that back. It looks like the "tunnel" setup involves some tricky mapping of a local port to the EC2 port? ... or Neptune port? ... or both?

baskaufs commented 2 years ago

VPC peering appears to be solely for establishing a network connection between different VPCs (e.g. ones in different regions), so it doesn't solve the problem in this issue.

baskaufs commented 2 years ago

Set up EC2 instance. It defaulted to the correct VPC. I used a t2.nano, which it said wasn't eligible for free tier even though it's smaller than t2.micro (the free one). @CliffordAnderson Was that a mistake, or are we past the whole "free tier" thing anyway? Downloaded the .pem file as NeptuneSSHtunnel.pem.

To test, they say to issue:

ssh -i path/to/keypairfilename.pem ec2-user@yourec2instanceendpoint

It wasn't clear which of the EC2 addresses to use for the endpoint, so I chose "Public IPv4 address". However, when I issued the command, I got this error:

Permissions 0644 for '/Users/baskausj/NeptuneSSHtunnel.pem' are too open. It is required that your private key files are NOT accessible by others. This private key will be ignored.

Fix was here:

chmod 400 /Users/baskausj/NeptuneSSHtunnel.pem

which makes the key readable only by its owner (and also disallows any changes to the file), but that should be OK. I was then able to SSH in using

ssh -i /Users/baskausj/NeptuneSSHtunnel.pem ec2-user@35.173.230.91

and got the $ prompt for the server.
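The permissions fix above could also be done from a script. This is only a minimal Python sketch of the same idea as chmod 400; the function names are made up for illustration, and it assumes a POSIX file system like the Mac or the EC2.

```python
import os
import stat


def is_too_open(path):
    """Return True if group or others have any access to the file.

    This mirrors the condition ssh complained about (mode 0644).
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


def restrict_key_file(path):
    """Make the key readable by its owner only (same effect as chmod 400)."""
    os.chmod(path, stat.S_IRUSR)  # 0o400: owner read, nothing else
    return stat.S_IMODE(os.stat(path).st_mode)
```

A script could call is_too_open() on the .pem file before trying to connect and call restrict_key_file() only if needed.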

From the command line on the EC2, tried sending a command to the Reader endpoint as described in part 2 step 2:

curl https://triplestore1.cluster-ro-cml0hq81gymg.us-east-1.neptune.amazonaws.com:8182/status

but for whatever reason, the request never completed (it just hung), and I had to use CTRL-Z to get out of it. I then tried the Instance endpoint:

curl https://triplestore1-instance-1.cml0hq81gymg.us-east-1.neptune.amazonaws.com:8182/status

but it didn't work either. No error message or anything -- it just sat there.
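Since the bare curl command just sits there when the endpoint can't be reached, a scripted version of the status check probably ought to set an explicit timeout. Here is a rough Python sketch using only the standard library; the function names and the default timeout are my own assumptions, not anything from the Neptune documentation.

```python
import json
import urllib.request


def status_url(endpoint, port=8182):
    """Build the /status URL for a Neptune endpoint host name."""
    return f"https://{endpoint}:{port}/status"


def fetch_status(url, timeout=10):
    """Return the parsed JSON status, or None if the endpoint can't be reached.

    URLError and socket timeouts are both subclasses of OSError, so a single
    except clause covers refused connections, DNS failures, and timeouts.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except OSError as error:
        print(f"Could not reach {url}: {error}")
        return None
```

With this, a blocked connection would show up as a printed message after a few seconds rather than as a terminal window that never responds.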

CliffordAnderson commented 2 years ago

Yes, we're not eligible for the free tier. I think it's still a firewall issue because the HTTP commands you are issuing still require external access, even if being used only internally. That's my guess at any rate.

baskaufs commented 2 years ago

Well, that could be it. Part of the problem is that I'm not exactly following the procedure they listed on that page. They want me to remap my localhost IP to the Neptune endpoint prior to doing some port forwarding. I don't understand the implications of that and can't afford to mess up my localhost settings. I was hoping to skip the port forwarding by just issuing the command directly from the EC2 command line, but clearly something is wrong. I'm suspicious I am not actually using the right IP address or port because nothing happened -- not even an error message. But I'm out of time to work on this now and probably need help from somewhere.

baskaufs commented 2 years ago

NOTE about s3 buckets for loading. Loading using SPARQL Update in the Jupyter notebook is successful ONLY if the bucket is public.

%%sparql

LOAD <https://iiif-library-manifests.s3.amazonaws.com/format.nq> INTO GRAPH <http://format>

baskaufs commented 2 years ago

Following instructions for bulk data loading. We've already set up the s3 endpoint. Data are already loaded in the public bucket iiif-library-manifests.

Attempted this request:

[ec2-user@ip-172-31-90-204 ~]$ curl -X POST \
>     -H 'Content-Type: application/json' \
>     https://triplestore1.cluster-cml0hq81gymg.us-east-1.neptune.amazonaws.com:8182/loader -d '
>     {
>       "source" : "s3://iiif-library-manifests/format.nq",
>       "format" : "nquads",
>       "iamRoleArn" : "arn:aws:iam::555751041262:role/NeptuneLoadFromS3",
>       "region" : "us-east-1",
>       "failOnError" : "FALSE",
>       "parallelism" : "MEDIUM",
>       "updateSingleCardinalityProperties" : "FALSE",
>       "queueRequest" : "TRUE"
>     }'

Response was:

Failed to connect to triplestore1.cluster-cml0hq81gymg.us-east-1.neptune.amazonaws.com port 8182 after 129311 ms: Connection timed out

Fell back to the even simpler request suggested and tried earlier with the same result:

[ec2-user@ip-172-31-90-204 ~]$ curl https://triplestore1.cluster-cml0hq81gymg.us-east-1.neptune.amazonaws.com:8182/status
curl: (28) Failed to connect to triplestore1.cluster-cml0hq81gymg.us-east-1.neptune.amazonaws.com port 8182 after 129760 ms: Connection timed out

baskaufs commented 2 years ago

Put in a ticket (852491) with ITS for help with this 2022-02-08.

baskaufs commented 2 years ago

ITS got back to me with this response:

I worked with Allen and we noticed that the Neptune EC2 did not have an IAM role and the VPC Endpoint had a default security group but no access to the EC2 for Neptune. We added a rule allowing access for the security group and EC2. It should work now.

That solved the problem of inability of the EC2 to talk to Neptune.

I was able to load data using SPARQL Update by issuing the following command from the EC2 command line:

curl https://triplestore1.cluster-cml0hq81gymg.us-east-1.neptune.amazonaws.com:8182/sparql -d "update=LOAD <https://iiif-library-manifests.s3.amazonaws.com/format.nq> INTO GRAPH <http://format>"

The JSON response was the same as what I got from the Jupyter notebook. See this page as the reference for using SPARQL Update to do the load.
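For scripting, the same SPARQL Update request could be assembled in Python. This is just a sketch with hypothetical helper names; it builds the LOAD statement and the form-encoded body that curl's -d option sends, but doesn't send anything itself.

```python
from urllib.parse import urlencode


def build_load_update(source_url, graph_iri):
    """Build a SPARQL Update LOAD statement for one file and one named graph."""
    return f"LOAD <{source_url}> INTO GRAPH <{graph_iri}>"


def encode_update_body(update):
    """Encode the statement as an application/x-www-form-urlencoded body.

    Unlike the raw curl -d string, this percent-encodes characters such as
    '<', '>' and spaces.
    """
    return urlencode({"update": update}).encode("ascii")
```

The resulting bytes could be passed as the data argument of urllib.request.urlopen() against the /sparql URL, assuming the script runs somewhere that can reach Neptune.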

Tried issuing this version of the previous loader request:

curl -X POST \
    -H 'Content-Type: application/json' \
    https://triplestore1.cluster-cml0hq81gymg.us-east-1.neptune.amazonaws.com:8182/loader -d '
    {
      "source" : "s3://iiif-library-manifests/format.nq",
      "format" : "nquads",
      "iamRoleArn" : "arn:aws:iam::555751041262:role/neptuneloadfroms3",
      "region" : "us-east-1",
      "failOnError" : "FALSE",
      "parallelism" : "MEDIUM",
      "updateSingleCardinalityProperties" : "FALSE",
      "queueRequest" : "TRUE"
    }'

and got the response "Failed to start new load from the source s3://iiif-library-manifests/format.nq. Couldn't find the aws credential for iam_role_arn: arn:aws:iam::555751041262:role/neptuneloadfroms3"

Sent the message to ITS (copied Andy) to see if they could figure out what was wrong with the role ARN.

baskaufs commented 2 years ago

Changed accessibility of triplestore-upload bucket to public by adding this bucket policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicAccess",
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::triplestore-upload/*"
        }
    ]
}

That should make it usable for loading Neptune.
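If other upload buckets need the same policy, it could be generated rather than hand-edited. A small Python sketch (the function name is hypothetical) that reproduces the policy above for any bucket name:

```python
import json


def public_read_policy(bucket):
    """Return a public-read bucket policy with the same shape as the one above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicAccess",
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["s3:GetObject", "s3:GetObjectVersion"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


if __name__ == "__main__":
    # Print the JSON text that would be pasted into the bucket policy editor.
    print(json.dumps(public_read_policy("triplestore-upload"), indent=4))
```

The printed JSON could be pasted into the console as before; applying it programmatically is left out here since that would depend on how credentials are set up.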

Deleted the sparql-upload bucket I used in the past because it was in us-east-2.

baskaufs commented 2 years ago

@CliffordAnderson @awesolek2 I was cleaning up s3 buckets by deleting public buckets I'm not using any more, since the ITS people don't like having them around. I think it is safe to delete iiif-library-manifests because it got replaced by iiif-manifest.library.vanderbilt.edu when we got the iiif-manifest subdomain to work, right?

baskaufs commented 2 years ago

Loading seems to be allowed based on standard file extensions and not necessarily on media type. Loaded a .jsonld file with no problem from s3, even though I didn't set the content-type when I uploaded it. Did not test whether the media type is a fallback.
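In contrast to that extension-based behavior, the bulk loader requests elsewhere in this issue need an explicit "format" value. A script could pick it from the file extension. This is only a sketch: the mapping is my assumption based on the loader's RDF format names as I understand them, the helper name is hypothetical, and extensions that aren't in the table (JSON-LD, as far as I know) return None so the caller can fall back to another method.

```python
import os

# Assumed mapping from file extension to the bulk loader's "format" value.
LOADER_FORMATS = {
    ".nt": "ntriples",
    ".nq": "nquads",
    ".rdf": "rdfxml",
    ".ttl": "turtle",
}


def loader_format_for(key):
    """Return the loader format for an S3 key, or None if it isn't in the table."""
    _, extension = os.path.splitext(key.lower())
    return LOADER_FORMATS.get(extension)
```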

Kinda cool viz, but the labeling is pretty deficient: graph

baskaufs commented 2 years ago

Here's the resolution to the loader error I was getting: The neptuneloadfroms3 role did exist but it wasn't "attached to the NeptuneDB". That solved the problem.

The final HTTP request to do the loading from the new triplestore-upload bucket in us-east-1 is:

curl -X POST \
    -H 'Content-Type: application/json' \
    https://triplestore1.cluster-cml0hq81gymg.us-east-1.neptune.amazonaws.com:8182/loader -d '
    {
      "source" : "s3://triplestore-upload/format.nq",
      "format" : "nquads",
      "iamRoleArn" : "arn:aws:iam::555751041262:role/neptuneloadfroms3",
      "region" : "us-east-1",
      "failOnError" : "FALSE",
      "parallelism" : "MEDIUM",
      "updateSingleCardinalityProperties" : "FALSE",
      "queueRequest" : "TRUE"
    }'
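To script this instead of pasting the curl command, the request could be built in Python along these lines. It's a sketch, not a tested loader client: the helper names are mine, the option values are simply copied from the curl example above, and nothing is sent until the request object is passed to urlopen.

```python
import json
import urllib.request


def build_loader_payload(source, rdf_format, role_arn, region="us-east-1"):
    """Build the JSON body for the loader, using the same options as the curl example."""
    return {
        "source": source,
        "format": rdf_format,
        "iamRoleArn": role_arn,
        "region": region,
        "failOnError": "FALSE",
        "parallelism": "MEDIUM",
        "updateSingleCardinalityProperties": "FALSE",
        "queueRequest": "TRUE",
    }


def build_loader_request(endpoint, payload, port=8182):
    """Wrap the payload in a POST request aimed at the /loader URL."""
    return urllib.request.Request(
        f"https://{endpoint}:{port}/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Calling urllib.request.urlopen(request, timeout=30) would then send it, provided the script runs on a machine that can actually reach the Neptune endpoint (currently the EC2).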

baskaufs commented 2 years ago

@CliffordAnderson not sure if you're getting notifications generically for this issue, but with the help of @awesolek2 and ITS, I'm now able to load triples using either standard SPARQL Update commands or the Neptune specific loader (supposed to be a lot faster).

So this issue is resolved except for one thing: I haven't figured out how to configure the "passthrough" from the command line on my local computer to the EC2 without remapping my localhost port (which I'm scared to do).

I'm going to let it go for now since it's not blocking anything else in this milestone, but I think it eventually needs to be figured out in order to create a sustainable workflow. As things currently stand, I'm leaving a terminal window open that's SSH'ed into the EC2 and manually typing or pasting things into that window. We need to have the capacity to just use a shell or Python script on the local computer to control the upload process (upload to s3, then POST the loading command via the EC2). I think it's doable, but I probably will need you or somebody else who understands port mapping better than I do to help set up the "tunnel" correctly.
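The "upload to s3" half of that workflow could be driven from Python by calling the AWS CLI. A sketch, assuming the CLI is installed and configured with credentials on the local computer; the helper names are hypothetical and the S3 key is simply taken from the local file name.

```python
import os


def s3_uri(bucket, key):
    """Build the s3:// URI that the loader's "source" field expects."""
    return f"s3://{bucket}/{key}"


def s3_upload_command(local_path, bucket):
    """Build the argument list for 'aws s3 cp <file> s3://<bucket>/<name>'."""
    key = os.path.basename(local_path)
    return ["aws", "s3", "cp", local_path, s3_uri(bucket, key)]
```

Passing the list to subprocess.run(command, check=True) would perform the upload, and the same s3_uri() value could then be used as the source in the loading request.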

CliffordAnderson commented 2 years ago

Thanks for the update. I'm not sure that I have a better understanding, but I bet we can find a tutorial somewhere to help us. At any rate, I'm glad we have Neptune up and (mostly) functional now.

baskaufs commented 2 years ago

Yes, I think we are in a much better place, but #64 is still a major thing blocking actually starting to use it.

baskaufs commented 2 years ago

Did some research on the port mapping. Did not find much except this which basically provided the same solution as this. However, I'm thinking that what was concerning me was a misunderstanding of what

127.0.0.1 localhost YourNeptuneEndpoint

meant in the hosts file. I was assuming YourNeptuneEndpoint was an argument, but actually I'm now pretty sure that it's just a shortcut to say that an additional domain name is being mapped to the localhost IP 127.0.0.1 just as if the mapping to 127.0.0.1 had been added on a separate line (as seen in some of my reading about using the hosts file). So the dire warning

# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.

can be safely ignored, since localhost is still being mapped to the loopback IP address.
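To convince myself of that reading of the hosts file, here is a tiny Python sketch of how such a line is interpreted: the first field is the IP address and every remaining field is a name mapped to it. The parser is a simplification written for illustration, not how the operating system actually reads the file.

```python
def parse_hosts(text):
    """Map each host name in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip_address, *names = line.split()
        for name in names:
            mapping[name] = ip_address
    return mapping
```

Parsing "127.0.0.1 localhost YourNeptuneEndpoint" gives the same result as parsing two separate lines that each map one name to 127.0.0.1, and localhost keeps its loopback address either way.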

So I went ahead and changed my host file by changing

127.0.0.1       localhost

to

127.0.0.1       localhost       triplestore1.cluster-cml0hq81gymg.us-east-1.neptune.amazonaws.com

then flushing the cache with

sudo dscacheutil -flushcache

as suggested by the StackOverflow post. I restarted my computer since I'd previously done some unsuccessful attempts to map port 8182 and didn't want that to interfere. I then issued the command

ssh -i /Users/baskausj/NeptuneSSHtunnel.pem ec2-user@35.173.230.91 -N -L 8182:triplestore1-instance-1.cml0hq81gymg.us-east-1.neptune.amazonaws.com:8182

according to the instructions for setting up the SSH tunnel. However, those instructions said that "An initial successful connection will ask you if you want to continue connecting? Type yes and enter." I did not get a successful connection. Rather, the terminal window stayed in never-never land with no response and no prompt, just like when I couldn't connect directly from the EC2 terminal window before. When I used CTRL-Z to try to get out of it, I got the message

suspended  ssh -i /Users/baskausj/NeptuneSSHtunnel.pem ec2-user@35.173.230.91 -N -L

which seemed to indicate that SSH had at least set up a connection to the EC2, although the tunneling seems to have failed. Attempting to reach Neptune via cURL to its IP address and request its status failed, as expected, with a timeout.

At this point, I'm going to leave this lie. If the Python SDK or accessing directly via POST using a load balancer don't work, we can revisit this and maybe get somebody from the Cloud team who's done SSH tunneling to help figure out why the instructions don't work. I'm highly suspicious it's a permissions issue again, maybe with the EC2 this time.

baskaufs commented 2 years ago

Checked out https://github.com/awslabs/amazon-neptune-tools/tree/master/neptune-python-utils for a potential Neptune Python SDK. Unfortunately it's designed for using Gremlin and Tinkerpop rather than RDF and SPARQL. It might potentially be usable to control bulk loading, although it doesn't show a way to specify the format of the file in the s3 bucket as was the case in the POST examples above. So I'm not sure if one could load N3 or JSON-LD formatted triples with it.

In any case, further experimentation with this is blocked by #64, since the examples involve connecting through a network load balancer, and we don't have that set up yet.

baskaufs commented 2 years ago

Found an additional example of a blog post for setting up a "bastion host", which I think is the same thing as the EC2 in SSH tunneling. The reference is: https://blog.codemine.be/posts/2020/up-and-running-with-aws-neptune/

Everything through "Check your ec2-to-neptune connection" is the same. "Prime your local system" is different, however.

It has the additional step of setting up the SSH configuration file (~/.ssh/config) by adding the following:

host neptune-demo
 ForwardAgent yes
 User ec2-user # when using Amazon Linux
 HostName <your-ec2-address>
 IdentitiesOnly yes
 IdentityFile ~/.ssh/<your-ec2-key-file>.pem
 LocalForward 8182 <cluster-endpoint-neptune>:8182

which I implemented as:

host neptune
 ForwardAgent yes
 User ec2-user
 HostName 35.173.230.91
 IdentitiesOnly yes
 IdentityFile ~/NeptuneSSHtunnel.pem
 LocalForward 8182 triplestore1.cluster-cml0hq81gymg.us-east-1.neptune.amazonaws.com:8182

This seems to be the config file alternative to trying to issue the command on a single line.
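To make the correspondence between the two forms explicit, here is a Python sketch that builds both from the same values. The function names are hypothetical, and the config stanza deliberately has no trailing "#" comments, since that is what triggered the error mentioned in this comment.

```python
def tunnel_argv(key_file, user, bastion, neptune_endpoint, port=8182):
    """Single-line form: ssh -i <key> -N -L <port>:<endpoint>:<port> <user>@<host>."""
    return [
        "ssh", "-i", key_file, "-N",
        "-L", f"{port}:{neptune_endpoint}:{port}",
        f"{user}@{bastion}",
    ]


def ssh_config_stanza(alias, key_file, user, bastion, neptune_endpoint, port=8182):
    """Config-file form: the same settings as an entry for ~/.ssh/config."""
    lines = [
        f"Host {alias}",
        "  ForwardAgent yes",
        f"  User {user}",
        f"  HostName {bastion}",
        "  IdentitiesOnly yes",
        f"  IdentityFile {key_file}",
        f"  LocalForward {port} {neptune_endpoint}:{port}",
    ]
    return "\n".join(lines) + "\n"
```

In both forms the -L / LocalForward part says: listen on the local port and forward the traffic through the EC2 to the Neptune endpoint on the same port.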

The addition to the /etc/hosts file is the same.

The command to start the session is then:

ssh neptune -N

(The first time I tried to run it, I got an error about the #, so I know it's finding the config file.) Executing the command caused no response or prompt in the terminal window. Opened another terminal window and entered:

curl https://triplestore1.cluster-cml0hq81gymg.us-east-1.neptune.amazonaws.com:8182/status

This time I got a response from the server! So it's working! To exit the session, I used CTRL-Z .

baskaufs commented 2 years ago

Talked to Dale about how to get the command prompt back after opening the SSH tunnel (so that I could script it). The fix was to add an "&" after the -N flag:

ssh neptune -N &

This gives a response showing the job number and the process ID, e.g.

[1] 71634

That ID could be captured either by redirecting to a file, or in Python there may just be a way to grab it as a variable. When finished with the connection, one can kill the process by issuing the command:

kill -9 71634

That will close the SSH tunnel. Dale said that if the tunnel needs to be open for a long time (for example, if the Internet connection might be lost), you can use nohup, but I don't think that will be necessary because it only needs to be open long enough to do the loading and then shut down.
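On the Python side, the process ID doesn't have to be captured by hand: subprocess.Popen starts the command in the background and keeps a handle to it. Below is a minimal sketch of that idea; the context manager name is hypothetical, and it sends a normal terminate signal first and only falls back to a hard kill (the kill -9 equivalent) if the process doesn't exit.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def background_process(argv):
    """Run a command in the background and stop it when the block ends."""
    process = subprocess.Popen(argv)  # returns immediately, like adding "&"
    try:
        yield process  # process.pid holds the ID that the shell printed
    finally:
        process.terminate()  # polite shutdown first
        try:
            process.wait(timeout=5)
        except subprocess.TimeoutExpired:
            process.kill()  # hard kill as a last resort
            process.wait()
```

Usage would look something like "with background_process(["ssh", "neptune", "-N"]) as tunnel:" followed by the upload and loading requests inside the block. The tunnel probably needs a moment to come up before the first request, which this sketch doesn't handle.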

HeardLibrary / vandycite

Determine most effective way to load triples #58