ge-high-assurance / RACK

DARPA's Automated Rapid Certification of Software (ARCOS) project called Rapid Assurance Curation Kit (RACK)
BSD 3-Clause "New" or "Revised" License

Programmatically optimize the default graph #762

Closed: cuddihyge closed this issue 1 year ago

cuddihyge commented 2 years ago

Add a RACK CLI program that shuts down Fuseki, runs the stats computation, and restarts Fuseki.

Ptival commented 2 years ago

From what I can see, Fuseki only ever picks a reorder transform (i.e. decides how triples will be reordered for the current database) when a DatasetGraphTDB object is built, here:

https://github.com/apache/jena/blob/5a297a868562be8fd5880ac1f78faec65aff5ee1/jena-db/jena-tdb2/src/main/java/org/apache/jena/tdb2/store/TDB2StorageBuilder.java#L112

Even uploading additional data does not seem to re-trigger this path, so unless we find a means of triggering a refresh dynamically, it seems like we'd have to restart Fuseki for it to acknowledge a new stats.opt file.

Ptival commented 2 years ago

@tuxji

We are considering our options for setting up the Fuseki optimizer for users of RACK.

The crux of the problem is that the optimization relies on statistics that can only be computed while holding the lock on the TDB database, which is only possible when Fuseki is not running. We also need to make sure that the generated file ends up in exactly the right location, that is, in .../run/databases/<dataset>/Data-0001/stats.opt.

We would like to automate this process, both for Docker users and, more generally, for VM users. The process would be as follows:

  1. Shut down Fuseki.
  2. Run tdbstats and redirect the output to stats.opt.
  3. Restart Fuseki.

For Docker users, it seems like the process could be automated fairly easily in a script that lives inside the Docker image. We would just need to trigger the script, which would know how to stop and restart Fuseki (I think via systemctl), have access to tdbstats, and know where to write stats.opt.
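The three steps listed above could be sketched as a small shell function. This is only a sketch under stated assumptions: the systemd unit name `fuseki`, the function name `optimize_dataset`, and its arguments are hypothetical and not taken from the RACK repository, and it assumes `tdbstats` is on the PATH and takes `--loc` to point at the database directory.

```shell
#!/usr/bin/env bash
# Sketch of the stop / compute stats / restart cycle.
# Assumptions: the systemd unit is called "fuseki" (hypothetical) and
# tdbstats is on the PATH. The caller supplies both the database location
# and the final path of stats.opt, so no directory layout is assumed here.
set -eu

optimize_dataset() {
    local loc="$1"              # database directory handed to tdbstats
    local stats_file="$2"       # where stats.opt must end up
    local service="${3:-fuseki}"
    local rc=0
    local tmp
    tmp="$(mktemp)"

    # 1. Shut down Fuseki so that tdbstats can take the database lock.
    systemctl stop "$service"

    # 2. Write to a temporary file first, so that a failed run cannot
    #    replace an existing stats.opt with a truncated one.
    if tdbstats --loc="$loc" > "$tmp"; then
        chmod 644 "$tmp"        # mktemp creates the file as owner-only
        mv "$tmp" "$stats_file"
    else
        rc=1
        rm -f "$tmp"
        echo "tdbstats failed; stats.opt was not updated" >&2
    fi

    # 3. Restart Fuseki whether or not the stats run succeeded.
    systemctl start "$service"
    return "$rc"
}
```

A call would pass the database directory and the destination file, for example `optimize_dataset "$DB_DIR" "$DB_DIR/stats.opt"`, where `DB_DIR` is a placeholder for the real location.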

For VM users, can we set up the same workflow, or is there some environment diversity that makes this harder?

Happy to discuss the details in person if that would help.

tuxji commented 2 years ago

Both Docker and VM images are nearly identical, so a script that works in the Docker container probably will work in the VM too. Go for it, test in Docker container, and see how well it runs in the VM afterwards.

Ptival commented 2 years ago

@tuxji

I tried building the Docker image, first natively on my MacBook, and ran into:

==> docker: ERROR: Pillow-9.0.1-cp310-cp310-macosx_10_10_universal2.whl is not a supported wheel on this platform.

Alright, so I switched to a standard Ubuntu VM and still got:

==> docker: ERROR: Pillow-9.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl is not a supported wheel on this platform.

So, are you just building the Docker image from within another Docker image? Or is there a trick to get a compatible wheel from some other system?

tuxji commented 2 years ago

How are you building the Docker image? I usually let CI build the Docker image because there are so many commands which need to be run correctly to build the Docker image manually. If you look at a recently completed CI job, drill into both the cache/Download RACK files and build/Build rack-box dev image steps and expand everything expandable, you'll see how many individual commands these steps ran. You can compare these commands and their output with the commands you're running and their output to see if a difference might explain why you're getting the error messages.

If I had to guess, you are using a more recent version of Python (3.10) than GitHub's Ubuntu runner (Python 3.8), and the old Pillow 9.0.1 version hasn't been built for Python 3.10 because it's not the latest version of Pillow. I remember that CI encountered a build problem with a newer Pillow version a few months ago, and Paul had to work around the problem by forcing the older 9.0.1 version to be used:

interran@GH3WPL13E:~/ARCOS/RACK$ git log -SPillow
commit d4b6deae01df79d3c4525d9c06299d3d9de97175
Author: Paul Cuddihy <cuddihy@research.ge.com>
Date:   Mon Jul 11 10:44:14 2022 -0400

    Try forcing Pillow==9.0.1 for the github CI build, and reference the semtk-python3 build which does the same.

I searched for Pillow in the CI job's cache/Download RACK files step and found these lines of output:

Collecting Pillow==9.0.1
  Using cached Pillow-9.0.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.3 MB)
Saved ./wheels/Pillow-9.0.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Processing ./RACK/cli/wheels/Pillow-9.0.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

Yours was using cp310 and the CI was using cp38. I also found the CI job that failed when Pillow's newer version came out, and it failed for the same reason your commands failed: the newer Pillow version's wheel wasn't available for the CI's platform (Ubuntu 20.04 and Python 3.8).

ERROR: Pillow-9.2.0-cp38-cp38-manylinux_2_28_x86_64.whl is not a supported wheel on this platform.
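To compare the tags discussed above, the Python tag can be read straight from a wheel's file name, because wheel names end in a Python tag, an ABI tag, and a platform tag. The helper below is a sketch; the function name `wheel_python_tag` is made up for illustration.

```shell
# Print the Python tag (e.g. cp38 or cp310) encoded in a wheel file name.
# Wheel names end in -<python tag>-<abi tag>-<platform tag>.whl, so the
# Python tag is the third field from the end when splitting on "-".
wheel_python_tag() {
    name="${1##*/}"      # drop any leading directory
    name="${name%.whl}"  # drop the extension
    printf '%s\n' "$name" | awk -F- '{ print $(NF-2) }'
}

wheel_python_tag Pillow-9.0.1-cp310-cp310-macosx_10_10_universal2.whl
wheel_python_tag ./wheels/Pillow-9.0.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
```

The two calls print `cp310` and `cp38` respectively, matching the mismatch between the local build and the CI job.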

According to PyPI, 9.2.0 is still the latest Pillow version. Please try removing 9.0.1 from the requirements.txt and see if you are able to build the Docker image this time. If that works, we should remove the 9.0.1 from the requirements.txt in both RACK and semtk-python3 and transition the CI from ubuntu-20.04 to ubuntu-22.04 to make RACK more future-proof.

Ptival commented 2 years ago

@cuddihyge @tuxji

From what I can see (/etc/fuseki/configuration/RACK.ttl), we're set up to use TDB1. In my own experience, I've been using TDB2.

Do we have reasons to favor one over the other?

tuxji commented 2 years ago

I don't think anyone ever discussed TDB1 vs TDB2 as a team decision. I would caution that changing from TDB1 to TDB2 means changing RACK's entire database engine from one implementation to another. Some queries may get slower while others get faster, although large bulk loads should be faster. I suggest doing the changeover after v11.0's release, not 2 weeks before it. However, a single-line change is all that we need to switch the database type from tdb to tdb2:

rack-box/scripts/install.sh
132:curl -Ss -d 'dbName=RACK' -d 'dbType=tdb' 'http://localhost:3030/$/datasets'

Committing that change to the main branch and pulling a brand new image (docker pull gehighassurance/rack-box:dev) will allow people to start testing their RACK box with TDB2 as the database engine.
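For reference, the changed call might look like the sketch below. It wraps the same curl invocation shown above in a function so that the database type becomes a parameter. The function name `create_dataset` is hypothetical, and the sketch assumes the Fuseki server at `http://localhost:3030` accepts `tdb2` as a `dbType` value, as the comment above implies.

```shell
# Create a Fuseki dataset through the admin endpoint used in install.sh.
# $1 = dataset name, $2 = database type (tdb or tdb2).
create_dataset() {
    # Single quotes keep the literal "$" in the endpoint path.
    curl -Ss -d "dbName=$1" -d "dbType=$2" 'http://localhost:3030/$/datasets'
}
```

With this helper, `create_dataset RACK tdb2` would issue the request with `dbType=tdb2` instead of `dbType=tdb`.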

weisenje commented 1 year ago

Gathered performance data showing that (1) using the default graph speeds up some queries and (2) running the optimization routine on the default graph further speeds up a subset of these queries. The queries below were selected because they took at least a few seconds to return in the non-default graph, and thus were candidates for improvement.

[Image: performance data for the selected queries]

Ptival commented 1 year ago

I have prepared this PR, which should add tdbstats to our RACK images and the optimize script to the CLI directory.

weisenje commented 1 year ago

PR above is merged. Confirmed in the Dev Docker container that optimize.sh is now present and runs successfully.