HeardLibrary / vandycite


Develop code and test loading #67

Closed baskaufs closed 2 years ago

baskaufs commented 2 years ago

Modify some of the existing Python scripts I have to load data into S3 and then load it into the Neptune triplestore. The Getty AAT thesaurus would be a good test; it's a couple of GB of data. See http://vocab.getty.edu/ for download info.
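For reference, a minimal sketch of that workflow, assuming boto3 plus the Neptune bulk loader's `/loader` HTTP endpoint; the bucket name, cluster endpoint, and IAM role ARN below are placeholders, not the real ones:

```python
# Untested sketch: upload an RDF dump to S3, then start a Neptune bulk-load job
# by POSTing loader instructions (JSON) to the cluster's /loader endpoint.
import boto3
import requests

s3 = boto3.client('s3')
s3.upload_file('AATOut_2Terms.nq', 'my-neptune-staging-bucket', 'getty/AATOut_2Terms.nq')

loader_url = 'https://my-neptune-cluster.cluster-xxxxxxxx.us-east-1.neptune.amazonaws.com:8182/loader'
payload = {
    'source': 's3://my-neptune-staging-bucket/getty/AATOut_2Terms.nq',
    'format': 'nquads',        # n-quads so the graph IRI travels with the triples
    'iamRoleArn': 'arn:aws:iam::123456789012:role/NeptuneLoadFromS3',
    'region': 'us-east-1',
    'failOnError': 'TRUE'
}
response = requests.post(loader_url, json=payload)
print(response.json())         # returns a loadId that can be polled for status
```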

baskaufs commented 2 years ago

See also http://vocab.getty.edu/doc/#Export_Files for ancillary vocabularies to be loaded.

baskaufs commented 2 years ago

First pass at loader script at https://github.com/HeardLibrary/vandycite/blob/master/neptune/load_neptune.ipynb

baskaufs commented 2 years ago

Load testing notes:

  1. The bulk loader has no way to specify the graph into which the triples are to be loaded. That means the only way I can think of to add triples to a specific graph using the bulk loader is to use the n-quads format. I've added some code to add a graph IRI to an n-triples file to turn it into an n-quads file (a sketch of that conversion appears after this list).
  2. Tested loading the AAT terms in AATOut_2Terms.nq, which is 648 MB and contains 4,244,860 quads. The time to load with the bulk loader was 454 s.
  3. I tried deleting the graph containing those triples using SPARQL Update DROP with the graph name. It timed out after 120 s. I was uncertain whether the server would continue to carry out the DROP once the command was issued, even though the server/client interaction timed out. So far it appears not to have, because when I query the database I'm still getting triples from the graph that was supposed to be dropped, even after something like half an hour.
  4. After increasing the timeout default to 5 minutes, I repeated the drop command and it was completed in 291 s (4.8 min), so the timeout was barely long enough for a graph this size.
  5. Tried using SPARQL Update LOAD instead of using the bulk loader. The load time was 289 s (basically the same as the SPARQL Update DROP command).
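The n-triples to n-quads conversion mentioned in item 1 amounts to splicing a graph IRI in front of the terminal " ." of each statement. A rough sketch (file and graph names are just examples, not the actual ones used):

```python
# Convert an n-triples file to n-quads by appending a graph IRI to each statement.
graph_iri = '<http://vocab.getty.edu/aat/graph>'   # placeholder graph name

with open('AATOut_2Terms.nt') as nt_file, open('AATOut_2Terms.nq', 'w') as nq_file:
    for line in nt_file:
        line = line.rstrip('\n')
        if line.strip() == '' or line.lstrip().startswith('#'):
            nq_file.write(line + '\n')             # pass blank lines and comments through
            continue
        # every n-triples statement ends with " .", so insert the graph IRI before it
        nq_file.write(line[:-1].rstrip() + ' ' + graph_iri + ' .\n')
```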

Take-aways: Interestingly, there didn't seem to be any advantage to using the bulk loader; it was actually faster to just use SPARQL Update. The main advantage of the loader seems to be that it never times out. But since a SPARQL Update DROP takes just as long as a LOAD, graphs can't be so big that loading them with Update would time out; otherwise, if you loaded them with the bulk loader, you'd never be able to delete them.
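For the record, the SPARQL Update calls timed above look roughly like this when issued with requests against the cluster's SPARQL endpoint through the SSH tunnel. The host/port and graph name are placeholders, and the exact S3 URI form accepted by Neptune's LOAD is an assumption to check against what the notebook actually used:

```python
# Untested sketch: DROP and re-LOAD a named graph via SPARQL Update over HTTP.
import requests

sparql_endpoint = 'https://localhost:8182/sparql'     # tunneled Neptune endpoint (placeholder)
graph_iri = 'http://vocab.getty.edu/aat/graph'        # placeholder graph name

drop_command = f'DROP GRAPH <{graph_iri}>'
load_command = (f'LOAD <https://my-neptune-staging-bucket.s3.amazonaws.com/getty/AATOut_2Terms.nq> '
                f'INTO GRAPH <{graph_iri}>')

for command in (drop_command, load_command):
    # each request is subject to the neptune_query_timeout limit discussed below
    r = requests.post(sparql_endpoint, data={'update': command},
                      verify=False)                   # tunnel cert won't match localhost
    r.raise_for_status()
```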

baskaufs commented 2 years ago

Information about setting the timeout (default is 120 s): the parameter is neptune_query_timeout.

Page on parameter groups

Note: you can't just change the default; you must create a new parameter group with an overriding value.

Created a new cluster parameter group called neptune1parametergroup and changed the timeout to 300,000 ms (5 min). Since a restart is required to put the new parameters into effect, I told the cluster to stop, which took just about forever.

However, when I re-opened the SSH tunnel and checked the cluster status, it was still using 120,000 for the timeout.

Tried creating a new "DB parameter group" (vs. DB cluster parameter group) with the new timeout. This shouldn't be necessary, since instances are supposed to inherit parameters from the cluster settings. It still didn't work.

The instructions say to reboot every db instance in the cluster. I thought that stopping and restarting would be the same thing, but maybe rebooting is different. That still had no influence on the reported clusterQueryTimeoutInMs when I requested the cluster status.

OK, the problem was that I needed to edit the cluster configuration: Databases > Modify cluster. Under Additional configuration, use the drop-down to select the new DB cluster parameter group that was created, then click Continue. Select "immediately" under scheduling and click Modify. I then rebooted the two instances without restarting the actual cluster, and that seemed to do the trick, with the cluster reporting the correct new timeout.
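The same sequence could in principle be scripted with boto3 rather than the console. This is only a sketch of that sequence, using the parameter group name from this thread; the cluster and instance identifiers are placeholders, and the parameter group family depends on the engine version:

```python
# Untested sketch: create a cluster parameter group that overrides neptune_query_timeout,
# attach it to the cluster, then reboot the instances so they pick it up.
import boto3

neptune = boto3.client('neptune', region_name='us-east-1')

neptune.create_db_cluster_parameter_group(
    DBClusterParameterGroupName='neptune1parametergroup',
    DBParameterGroupFamily='neptune1',                  # family for engine 1.x; adjust if newer
    Description='Cluster parameters with a 5-minute query timeout')

neptune.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName='neptune1parametergroup',
    Parameters=[{'ParameterName': 'neptune_query_timeout',
                 'ParameterValue': '300000',            # milliseconds = 5 min
                 'ApplyMethod': 'pending-reboot'}])

# equivalent of Databases > Modify cluster > Additional configuration, applied immediately
neptune.modify_db_cluster(DBClusterIdentifier='my-neptune-cluster',
                          DBClusterParameterGroupName='neptune1parametergroup',
                          ApplyImmediately=True)

# reboot each instance (not the whole cluster) so the new group takes effect
for instance_id in ['my-neptune-instance-1', 'my-neptune-instance-2']:
    neptune.reboot_db_instance(DBInstanceIdentifier=instance_id)
```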

baskaufs commented 2 years ago

Links from @CliffordAnderson:

Here are my thoughts from a 2022-03-02 email:

With respect to my previous email, it doesn't sound like this would solve the issues with using the Neptune bulk loader, since it does the loading with specific HTTP requests (PUT, POST, etc.) carrying data payloads rather than POSTing JSON loader instructions. So the bulk loader would still have the same limitation of n-quads being the only way to specify the graph. But it doesn't seem like the bulk loader is the best solution anyway, since it seems to be slower than just using SPARQL Update. (It seems to me that when we had that earlier Neptune trial 5 years or so ago I ran these same tests and the bulk loader was faster than SPARQL Update, but that's not what I got this time.)

However, the Graph Store Protocol does offer an alternative loading method. The one thing that's a bit unclear to me is how one would supply the data payload for the PUT and POST commands. Because of the limitations imposed by the VPC, one could not PUT/POST the data from a local computer issuing the HTTP request unless it was going through the SSH tunnel (which I could do). I don't see any way to use the Graph Store Protocol to do the loading from an S3 bucket as with SPARQL Update.
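A rough sketch of what a Graph Store Protocol PUT through the SSH tunnel might look like. The endpoint path and graph name below are assumptions to be checked against the Neptune documentation; the payload is streamed from a local file, so the upload speed is bounded by the local connection:

```python
# Untested sketch: replace the contents of a named graph with a Graph Store Protocol PUT.
import requests

gsp_endpoint = 'https://localhost:8182/sparql/gsp/'   # assumed Graph Store endpoint via tunnel
graph_iri = 'http://vocab.getty.edu/aat/graph'        # placeholder graph name

with open('AATOut_2Terms.nt', 'rb') as rdf_file:
    r = requests.put(gsp_endpoint,
                     params={'graph': graph_iri},
                     data=rdf_file,                    # request body streamed from the local file
                     headers={'Content-Type': 'application/n-triples'},
                     verify=False)                     # tunnel cert won't match localhost
r.raise_for_status()
```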

I think the key thing would be to understand whether these HTTP requests would be subject to the same timeout limits as the SPARQL Update commands I was testing before. I would guess that they would be. In that case, there would be a severe dependency on network speed. Certainly a load from my home computer would be orders of magnitude slower than loading from an S3 bucket because my upload speed is glacial. (This is painfully apparent when I try to upload hundreds of MB to S3 -- basically impossible from home, so I had to do the upload from my office where the wifi/network was much faster.) So the Graph Store Protocol wouldn't be practical over slow network connections.

One thing that I am very interested in is the use of HTTP PUT to actually replace a graph. One of the downsides of using graphs for versioning is that when you want to update to a new version, you have to drop the old version and load the new one. If it takes 5 minutes to drop the graph and another 5 to load the new one, there's at least 5 minutes where the graph is unavailable. Under current circumstances, that wouldn't matter much, but if the data were being used to supply data for an application (like what Dave M. is planning), it might be unacceptable to have it offline for 5 minutes. I would be curious to know if the PUT command leaves the old graph in place until the new one has been loaded, and then applies the graph name to the new data. That would be a better solution than DROP/LOAD.