Investigate GrammaTech ingestion performance troubles. COMPONENT.

cuddihyge commented 3 years ago

I’m wondering what, if any, knowledge you have of the combinatorics or algorithmic performance of the RACK tools with medium or large data sets? Up to this point I have been consuming relatively small items, but today I began trying to load a more complex data set and the system is hanging, sometimes in interesting ways. From the CLI, I just get a ‘hang’ – if I kill it, the traceback points to waiting for a response (trace below in case it is helpful). The context is essentially:

ingestion-steps:
- {nodegroup: "acert-ingest-component-ids", csv: "COMPONENT.IDs.csv"}
- {nodegroup: "acert-ingest-component-definedin", csv: "COMPONENT.DefinedIn.csv"}

$ rack data import --clear import.yaml
Clearing graph
Success Update succeeded
Loading acert-ingest-component-ids... OK Records: 12326 Failures: 0
Loading acert-ingest-component-definedin...

The first step, although time consuming, completes in about half a minute. The second step does not complete in any reasonable amount of time. Furthermore, on the web interface, the same import operation starts out showing “1%” complete, and then eventually falls back to “0%” and stays there seemingly indefinitely.

This has me suspecting some unfortunate combinatorics in the interface. N*lg(N) would be about 13 times slower, but there’s no visible progress at all in that time frame. N^2 would already be 102 hours, and I haven’t known about the problem that long…
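The estimates above can be reproduced with a short calculation. This is only a sketch: it assumes the 30-second first step represents linear work over the 12326 records, and that a worse algorithm multiplies that baseline by lg(N) or by N. The names `projected_seconds` and `BASELINE_S` are made up for illustration.

```python
import math

N = 12326            # records reported by the first ingestion step
BASELINE_S = 30.0    # assumption: the first step (~half a minute) is linear in N


def projected_seconds(n, baseline_s, extra_factor):
    """Scale the linear baseline by the extra cost factor of a slower algorithm."""
    return baseline_s * extra_factor(n)


# N*lg(N) costs lg(N) times the linear baseline; N^2 costs N times the baseline.
nlogn_s = projected_seconds(N, BASELINE_S, math.log2)
quadratic_s = projected_seconds(N, BASELINE_S, lambda n: n)

print(f"N*lg(N): {math.log2(N):.1f}x the baseline, about {nlogn_s / 60:.1f} minutes")
print(f"N^2:     about {quadratic_s / 3600:.1f} hours")
```

Under these assumptions the N*lg(N) case would still finish within minutes, which is why a multi-hour stall points at something worse than a logarithmic factor.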

Do you have any insights? Is it possible that I need to make some small change, like not declaring some field as optional, in order to have this complete? Or will I simply need to reduce the size of my data set dramatically?

Thank you, Greg

cuddihyge commented 3 years ago

@abha I think this is now my highest priority.

Fuseki is taking 6-10 hours to import Greg's 12,000 items. One of the lookup queries takes about 3 sec; multiplied by 12,000 items, that is ~10 hr.

Many leads to pursue:

Can I fix the query so Fuseki doesn't choke?
Why didn't my performance testing show this? What's different about this lookup query?
Is it faster in Blazegraph?
Early quick tests suggest SemTK may be able to cache lots of URIs and do less looking-up. This is a multi-week job.
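To illustrate the caching lead in the list above, here is a minimal sketch of memoizing URI lookups so that each distinct identifier queries the triple store only once. It is not SemTK's design or code (SemTK is a Java project); `CachingUriLookup`, `fake_store_lookup`, and the `example.org` URI scheme are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List


class CachingUriLookup:
    """Memoize URI lookups so each distinct key reaches the store once."""

    def __init__(self, lookup: Callable[[str], str]) -> None:
        self._lookup = lookup              # hypothetical slow call, e.g. one SPARQL query
        self._cache: Dict[str, str] = {}
        self.misses = 0                    # how many times the slow call was made

    def get(self, key: str) -> str:
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._lookup(key)
        return self._cache[key]


def fake_store_lookup(identifier: str) -> str:
    # Stand-in for the ~3 sec lookup query; the URI scheme is invented.
    return f"http://example.org/component#{identifier}"


lookup = CachingUriLookup(fake_store_lookup)
rows: List[str] = ["c1", "c2", "c1", "c1", "c2"]
uris = [lookup.get(r) for r in rows]
print(f"{len(rows)} rows resolved with {lookup.misses} store lookups")
```

A cache like this only helps when identifiers repeat across rows; if every row refers to a different item, the number of lookups is unchanged and the query itself still has to be fixed.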

cuddihyge commented 3 years ago

It looks like Blazegraph performs great here. I will be updating the Blazegraph evaluation. It also looks like I could teach SemTK to write a query that Fuseki would like better. That is a bit of a band-aid, though.

These are just quick-test results. Significant work remains to resolve this, but technical risk is low given two different solutions.

cuddihyge commented 3 years ago

Bizarre performance test results.

Fuseki works amazingly fast with VALUES { class0 subclass1 subclass2 }, but it was unusable with rdfs:subClassOf* class0
Blazegraph works well enough with rdfs:subClassOf* but is very slow with VALUES

I have implemented a hybrid SPARQL generator that checks the connection type and generates the best query for each.
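A rough sketch of what such a hybrid generator could look like follows. It is not the actual SemTK implementation (SemTK is written in Java): the function name `class_constraint`, the `server_type` strings, and the assumption that the subclass list has already been computed from the ontology are all hypothetical.

```python
from typing import List


def class_constraint(var: str, class_uri: str,
                     subclass_uris: List[str], server_type: str) -> str:
    """Return a SPARQL fragment restricting var's type to class_uri or a subclass."""
    type_var = f"{var}_type"
    if server_type == "fuseki":
        # Fuseki was fast with an explicit list of the class and its subclasses.
        values = " ".join(f"<{u}>" for u in [class_uri, *subclass_uris])
        return f"{var} a {type_var} . VALUES {type_var} {{ {values} }} ."
    # Blazegraph (and, in this sketch, any other store) gets the property-path form.
    return f"{var} a {type_var} . {type_var} rdfs:subClassOf* <{class_uri}> ."


fuseki_q = class_constraint(
    "?c", "http://example.org/A", ["http://example.org/B"], "fuseki")
blazegraph_q = class_constraint(
    "?c", "http://example.org/A", ["http://example.org/B"], "blazegraph")
print(fuseki_q)
print(blazegraph_q)
```

The trade-off in this sketch is that the VALUES form needs the subclass closure to be known before the query is generated, while the property-path form leaves that work to the triple store.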

@tuxji is helping create a dev branch of RACK that will have a docker container with the latest SemTK but previous release of the rest of RACK. Stick with Fuseki for now.

Once the container is ready, we'll ask Greg to try it out.

cuddihyge commented 3 years ago

This fix is now incorporated into two containers: gehighassurance/rack-box:dev and gehighassurance/rack-box:dev-v4

I've tested both and the performance improvements are quite significant. I've emailed Greg.

ge-high-assurance / RACK

Investigate GrammaTech ingestion performance troubles. COMPONENT. #285