ge-high-assurance / RACK

DARPA's Automated Rapid Certification of Software (ARCOS) project called Rapid Assurance Curation Kit (RACK)
BSD 3-Clause "New" or "Revised" License

Evaluate Blazegraph opensource triplestore #267

Closed cuddihyge closed 3 years ago

cuddihyge commented 3 years ago

Blazegraph is a higher-performance open-source graph database (triplestore). It is GPL-licensed, so we don't want to distribute it, but we could provide instructions for users.

For this task: (1) download and run Blazegraph, (2) wire up enough SemTK code to ingest data and run queries, (3) kick the tires with basic performance tests.

Then decide whether it is worth completing: (a) OWL upload functions through SemTK, (b) CONSTRUCT query results, (c) other odds and ends.
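As a rough idea of what the basic performance tests in step (3) could look like, here is a minimal timing sketch. It is not the project's PerformanceTest.java: the class name `QueryTimer` and the method `medianMillis` are hypothetical, and the workload in `main` is a placeholder standing in for a real ingest or query call against each triplestore.

```java
import java.util.Arrays;

/** Hypothetical timing helper; not part of SemTK or RACK. */
public final class QueryTimer {

    private QueryTimer() {}

    /** Runs the task the given number of times and returns the median wall-clock time in milliseconds. */
    public static double medianMillis(Runnable task, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        double[] millis = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            millis[i] = (System.nanoTime() - start) / 1_000_000.0;
        }
        Arrays.sort(millis);
        int mid = runs / 2;
        // Odd count: middle element. Even count: mean of the two middle elements.
        return runs % 2 == 1 ? millis[mid] : (millis[mid - 1] + millis[mid]) / 2.0;
    }

    public static void main(String[] args) {
        // Placeholder workload; a real test would send the same query to each store.
        double ms = medianMillis(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unexpected overflow");
            }
        }, 5);
        System.out.println("median ms: " + ms);
    }
}
```

Using the median rather than the mean keeps a single slow run (for example, a cold cache on the first query) from dominating the comparison.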

cuddihyge commented 3 years ago

After a lot of work, it seems we're probably just as well sticking with Fuseki for now.

  1. Fuseki did better than expected up to 2M triples, which is several hundred thousand rows of data
  2. Blazegraph was faster here, slower there; I don't feel compelled to make a change

Results are checked in to documentationFiles/performance-to-2M-fuseki-blaze-virt.xlsx

cuddihyge commented 3 years ago

After taking a couple of weeks off from this one, I'm re-opening it.

I found a possible flaw in my test. New results are coming out differently. There are many variables to consider, so I need to sort this out again.

cuddihyge commented 3 years ago

My early guess is that PerformanceTest.java was allowing some internal SemTK data structures (ImportSpecHandler) to be re-used. This was allowing it to skip a COUNT query that is very expensive in Fuseki.

It will take 1-2 days to re-run all tests to 2,000,000 triples.

If the COUNT query is the problem,

  1. Fuseki will be back in the doghouse and Blazegraph may be much quicker
  2. UNLESS I experiment with changing the COUNT query to an ASK. (I believe it is a COUNT LIMIT 1.)
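
To make the COUNT versus ASK idea concrete, the two query shapes could look roughly like this. This is a sketch under assumptions, not the SPARQL that SemTK actually generates: the class `ExistenceQueries`, its methods, the triple pattern, and the example URI are all made up for illustration.

```java
/** Hypothetical helper contrasting two ways to check whether a URI already has triples. */
public final class ExistenceQueries {

    private ExistenceQueries() {}

    /** COUNT-based check: the store aggregates over the matches, then returns one row. */
    public static String countQuery(String uri) {
        return "SELECT (COUNT(*) AS ?n) WHERE { <" + uri + "> ?p ?o } LIMIT 1";
    }

    /** ASK-based check: the store only has to report whether any match exists. */
    public static String askQuery(String uri) {
        return "ASK WHERE { <" + uri + "> ?p ?o }";
    }

    public static void main(String[] args) {
        String uri = "http://example.org/REQ-1"; // made-up URI for illustration
        System.out.println(countQuery(uri));
        System.out.println(askQuery(uri));
    }
}
```

In SPARQL 1.1 the LIMIT is applied after aggregation, so `LIMIT 1` on a COUNT query only caps the number of result rows and does not let the store stop counting early. An ASK query has no such aggregation step, which is why it may be cheaper, although how much it helps in Fuseki would still need to be measured.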

cuddihyge commented 3 years ago

I've re-run enough tests to trust my original results.

cuddihyge commented 3 years ago

Moving this back to in progress AGAIN. I think the bug from Greg has exposed the type of query Fuseki is really bad at (graph traversal queries like subclassOf*), and this type of query had slipped through my performance test. I'll re-run the performance tests, which I suspect will show that Blazegraph is much better at real-life data ingestion, where it needs to do a lookup on something like REQUIREMENT, which might be subclassed.
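
For reference, a graph traversal lookup of the subclassOf* kind could be written along these lines. This is only a sketch, not the query SemTK emits during ingestion: the class `SubclassLookup`, the method `instancesOf`, and the example class URI are hypothetical.

```java
/** Hypothetical builder for a lookup that must follow the subclass hierarchy. */
public final class SubclassLookup {

    private SubclassLookup() {}

    /**
     * Builds a query for instances of the given class or of any of its subclasses.
     * The property path rdfs:subClassOf* matches zero or more subclass hops.
     */
    public static String instancesOf(String classUri) {
        return "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
             + "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
             + "SELECT ?instance WHERE {\n"
             + "  ?instance rdf:type ?cls .\n"
             + "  ?cls rdfs:subClassOf* <" + classUri + "> .\n"
             + "}";
    }

    public static void main(String[] args) {
        // Made-up URI standing in for a class like REQUIREMENT.
        System.out.println(instancesOf("http://example.org/REQUIREMENT"));
    }
}
```

Because the `*` path has to walk the class hierarchy for every candidate type, this is the kind of query where the choice of triplestore might matter most, so it seems worth including in the performance test.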