ge-high-assurance / RACK

DARPA's Automated Rapid Certification of Software (ARCOS) project called Rapid Assurance Curation Kit (RACK)
BSD 3-Clause "New" or "Revised" License

Investigate GrammaTech ingestion performance troubles. COMPONENT. #285

Closed: cuddihyge closed this issue 3 years ago

cuddihyge commented 3 years ago

I’m wondering what, if any, knowledge you have of the combinatorics or algorithmic performance of the RACK tools with medium or large data sets. Up to this point I have been consuming relatively small items, but today I began trying to load a more complex data set and the system is hanging, sometimes in interesting ways. From the CLI, I just get a ‘hang’; if I kill it, the traceback points to waiting for a response (trace below in case it is helpful). The context is essentially: ingestion-steps:

    $ rack data import --clear import.yaml
    Clearing graph
    Success Update succeeded
    Loading acert-ingest-component-ids... OK Records: 12326 Failures: 0
    Loading acert-ingest-component-definedin...

The first step, although time consuming, completes in about half a minute. The second step does not complete in any reasonable amount of time. Furthermore, on the web interface, the same import operation starts out showing “1%” complete, and then eventually falls back to “0%” and stays there seemingly indefinitely.

This has me suspecting some unfortunate combinatorics in the interface. N*lg(N) would be about 13 times slower, but there’s no visible progress at all in that time frame. N^2 would already be 102 hours, and I haven’t known about the problem that long…
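The estimates above can be reproduced with a short calculation. This is only a sketch: it assumes the first step (about 30 seconds for 12326 records) is a linear-time baseline and that constant factors stay the same for the other growth rates. The variable names are illustrative.

```python
import math

# Figures taken from the report above.
n_records = 12326        # records loaded by the first ingestion step
linear_seconds = 30.0    # "about half a minute" for that step

# Assumption: the first step is O(N), so slower growth rates are
# expressed as a multiple of this baseline.
nlogn_factor = math.log2(n_records)                   # slowdown vs. linear
nlogn_minutes = linear_seconds * nlogn_factor / 60    # N*lg(N) estimate
quadratic_hours = linear_seconds * n_records / 3600   # N^2 estimate

print(f"N*lg(N): about {nlogn_factor:.1f}x slower, {nlogn_minutes:.1f} minutes")
print(f"N^2:     about {quadratic_hours:.0f} hours")
```

Under these assumptions an N*lg(N) step would finish in a few minutes, so a step that shows no progress over a much longer period points to something worse than N*lg(N).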

Do you have any insights? Is it possible that I need to make some small change, like not declaring some field as optional, in order to have this complete? Or will I simply need to reduce the size of my data set dramatically?

Thank you, Greg

cuddihyge commented 3 years ago

@abha I think this is now my highest priority.

Fuseki is taking 6-10 hours to import Greg's 12,000 items. One of the lookup queries is taking 3 seconds; multiplied by 12,000 items, that is ~10 hours.

Many leads to pursue:

cuddihyge commented 3 years ago

It looks like Blazegraph performs great here. I will be updating the Blazegraph evaluation. It also looks like I could teach SemTK to write a query that Fuseki would like better. That is a bit of a bandaid though.

Just jugular results. Significant work remains to resolve, but technical risk is low given two different solutions.

cuddihyge commented 3 years ago

Bizarre performance test results.

I have implemented a hybrid SPARQL generator that checks the connection type and generates the best query for each.
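The general shape of such a generator can be sketched as a dispatch on the connection's server type. This is not SemTK's implementation. The function names, the server-type strings, the example property URI, and the two query shapes (a `FILTER` comparison versus a `VALUES` binding) are all assumptions chosen for illustration, as is the mapping of which store gets which shape.

```python
from typing import Callable, Dict


def _quote(value: str) -> str:
    """Minimal escaping of a string as a SPARQL double-quoted literal."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def lookup_with_filter(prop_uri: str, value: str) -> str:
    # Hypothetical query shape 1: match the value with a FILTER.
    return (
        "SELECT ?s WHERE { "
        f"?s <{prop_uri}> ?v . FILTER(?v = {_quote(value)}) "
        "}"
    )


def lookup_with_values(prop_uri: str, value: str) -> str:
    # Hypothetical query shape 2: bind the value with a VALUES clause.
    return (
        "SELECT ?s WHERE { "
        f"VALUES ?v {{ {_quote(value)} }} ?s <{prop_uri}> ?v . "
        "}"
    )


# Assumed mapping from server type to generator; which shape each
# triple store actually prefers would have to be measured.
GENERATORS: Dict[str, Callable[[str, str], str]] = {
    "fuseki": lookup_with_values,
    "blazegraph": lookup_with_filter,
}


def build_lookup_query(server_type: str, prop_uri: str, value: str) -> str:
    """Pick the query generator registered for this connection type."""
    generator = GENERATORS.get(server_type.lower(), lookup_with_filter)
    return generator(prop_uri, value)


print(build_lookup_query("fuseki", "http://example.org/identifier", "comp-1"))
```

Keeping the per-store choice in one table means a new triple store only needs a new entry, and unknown types fall back to a default shape.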

@tuxji is helping create a dev branch of RACK that will have a Docker container with the latest SemTK but the previous release of the rest of RACK. Stick with Fuseki for now.

Once the container is ready, we'll ask Greg to try it out.

cuddihyge commented 3 years ago

This fix is now incorporated into two containers:

gehighassurance/rack-box:dev
gehighassurance/rack-box:dev-v4

I've tested both and the performance improvements are quite significant. I've emailed Greg.