etf-validator / governance

ETF Steering Group and the Technical Committee documents

Performance improvement for tests with small spatial test data sets [draft] #76

Open jonherrmann opened 5 years ago

jonherrmann commented 5 years ago

ETF Improvement Proposal (EIP)

An EIP is a GitHub issue which proposes an improvement of the ETF validator. It must follow the template below and provide at least information for the first 4 sections.

Background and Motivation:

Starting the tests takes some time, partly because the hard disk must be accessed to write index structures.

Proposed change

Small spatial data sets could be tested in-memory.

It should be possible to configure a limit for the data set size. If the data set size is below the limit and memory utilisation is non-critical (free RAM is at least the size of the data set, and at most 50 % of RAM is occupied), the test is executed in-memory.
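The rule above could be sketched roughly as follows. This is only an illustration under assumptions: the class `InMemoryTestDecision`, its method names and the example limit are hypothetical and not part of ETF, the JVM heap figures stand in for "RAM", and the 50 % threshold is read as the utilisation before the data set is loaded.

```java
/** Hypothetical helper illustrating the proposed rule; not part of the ETF code base. */
final class InMemoryTestDecision {

    /** Maximum share of memory that may already be in use (the 50 % from the proposal). */
    static final double MAX_MEMORY_UTILISATION = 0.5;

    private InMemoryTestDecision() {}

    /**
     * Pure decision function: all memory figures are passed in, so the rule
     * can be checked without depending on the state of the running JVM.
     */
    static boolean shouldRunInMemory(long dataSetBytes, long limitBytes,
            long usedBytes, long maxBytes) {
        if (dataSetBytes < 0 || maxBytes <= 0) {
            return false;
        }
        final long freeBytes = maxBytes - usedBytes;
        // The data set must be strictly below the configured limit.
        final boolean belowLimit = dataSetBytes < limitBytes;
        // At least the size of the data set must still be free.
        final boolean fitsInFreeMemory = freeBytes >= dataSetBytes;
        // Memory must be occupied to a maximum of 50 %.
        final boolean utilisationOk =
                (double) usedBytes / maxBytes <= MAX_MEMORY_UTILISATION;
        return belowLimit && fitsInFreeMemory && utilisationOk;
    }

    /** Convenience overload that reads the current heap figures from the JVM. */
    static boolean shouldRunInMemory(long dataSetBytes, long limitBytes) {
        final Runtime rt = Runtime.getRuntime();
        final long usedBytes = rt.totalMemory() - rt.freeMemory();
        return shouldRunInMemory(dataSetBytes, limitBytes, usedBytes, rt.maxMemory());
    }

    public static void main(String[] args) {
        final long dataSetBytes = 10L * 1024 * 1024; // example: 10 MiB data set
        final long limitBytes = 50L * 1024 * 1024;   // hypothetical configured limit
        System.out.println(shouldRunInMemory(dataSetBytes, limitBytes));
    }
}
```

Keeping the decision in a pure function makes the thresholds easy to unit-test. The overload that queries `Runtime` is the only part that depends on the actual machine.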

Alternatives

-

Funding

-

Additional information

-

carlospzurita commented 4 years ago

We have researched the BaseX documentation on the database creation option MAINMEM=true. This option can be configured in the BsxDataStorage class from the repository https://github.com/etf-validator/etf-bsxds, by including it in the method initBsxDatabase().

Applying this change causes all the unit tests to fail when executing ./gradlew build jar. Attached you can find the report from JUnit. As you can see, this change causes a series of exceptions during the creation of the database on validator startup, given that every operation relies on the disk representation of the XML files.

Additionally, this option can only be activated when recreating the database, which can't be done at runtime. We would need to create a second database and rewire the rest of the components to use this auxiliary in-memory database. Also, a method to move the test results from the in-memory database to the main storage service would have to be put in place, to keep the data consistent and to generate the XML representation on the file system.

jonherrmann commented 4 years ago

> This option can be configured in the BsxDataStorage class from the repository https://github.com/etf-validator/etf-bsxds, by including it in the method initBsxDatabase().

Since the idea is to speed up the tests, the change would be made in the test driver, not in the data storage.

> Additionally, this option can only be activated when recreating the database, which can't be done at runtime.

Correct, which is no problem when creating the databases for new test objects.

> We would need to create a second database and rewire the rest of the components to use this auxiliary in-memory database. Also, a method to move the test results from the in-memory database to the main storage service would have to be put in place, to keep the data consistent and to generate the XML representation on the file system.

Please have a closer look at the test driver.

carlospzurita commented 4 years ago

We added the MAINMEM option to the constructor of the class etf-bsxtd/src/main/java/de/interactive_instruments/etf/testdriver/bsx/partitioning/DatabasePartitioner.java in the BaseX test driver, as you can see in this code snippet:

public DatabasePartitioner(long maxDbSizeSizePerChunk, final Logger logger,
        final String dbName, final int filenameCutIndex) throws BaseXException {
    this.dbBaseName = dbName;
    this.logger = logger;
    this.filenameCutIndex = filenameCutIndex;
    this.maxDbSizeSizePerChunk = maxDbSizeSizePerChunk;
    this.currentDbName = dbBaseName + "-0";

    new org.basex.core.cmd.Set("AUTOFLUSH", "false").execute(ctx);
    new org.basex.core.cmd.Set("TEXTINDEX", "true").execute(ctx);
    new org.basex.core.cmd.Set("ATTRINDEX", "true").execute(ctx);
    new org.basex.core.cmd.Set("FTINDEX", "true").execute(ctx);
    new org.basex.core.cmd.Set("MAXLEN", "160").execute(ctx);
    // already filtered
    new org.basex.core.cmd.Set("SKIPCORRUPT", "false").execute(ctx);
    // In memory option for metadata testing
    new org.basex.core.cmd.Set("MAINMEM", "true").execute(ctx);
    new CreateDB(currentDbName).execute(ctx);
}

For testing purposes, we built this jar separately and included it in the WAR file of the webapp. The test driver is loaded correctly on ETF startup. We created a new TestRun with the TestSuite https://github.com/inspire-eu-validation/ets-repository/blob/v1.0.8/metadata/2.0/common/ets-md-common-bsxets.xml . This TestRun ends abruptly and generates a TestReport with this message: [image]

In the TestRun log, we only find this information:

12.12.2019 13:04:09 - Preparing Test Run Test run on 14:03 - 12.12.2019 with test suite Common Requirements for ISO/TC 19139:2007 based INSPIRE metadata records. (initiated Thu Dec 12 13:04:09 UTC 2019)
12.12.2019 13:04:09 - Resolving Executable Test Suite dependencies
12.12.2019 13:04:09 - Preparing 1 Test Task:
12.12.2019 13:04:09 -  TestTask 1 (b3c4c4a9-f656-476e-8b41-0aadac9173f7)
12.12.2019 13:04:09 -  will perform tests on Test Object 'csw.xml' by using Executable Test Suite 'Common Requirements for ISO/TC 19139:2007 based INSPIRE metadata records. (EID: 59692c11-df86-49ad-be7f-94a1e1ddd8da, V: 0.1.1 )'
12.12.2019 13:04:09 -  with parameters: 
12.12.2019 13:04:09 - testRunTags = 
12.12.2019 13:04:09 - tests_to_execute = .*
12.12.2019 13:04:09 - files_to_test = .*
12.12.2019 13:04:09 - Test Tasks prepared and ready to be executed. Waiting for the scheduler to start.
12.12.2019 13:04:09 - Setting state to CREATED
12.12.2019 13:04:09 - Changed state from CREATED to INITIALIZING
12.12.2019 13:04:09 - Starting TestRun.1798a7ac-8301-4f05-8c9f-9b8222879763 at 2019-12-12T13:04:10Z
12.12.2019 13:04:10 - Changed state from INITIALIZING to INITIALIZED
12.12.2019 13:04:10 - TestRunTask initialized
12.12.2019 13:04:10 - Creating new tests databases to speed up tests.
12.12.2019 13:04:10 - Skipping schema validation because no schema file has been set in the test suite. Data are only checked for well-formedness.
12.12.2019 13:04:11 - Releasing resources
12.12.2019 13:04:11 - Changed state from INITIALIZED to RUNNING
12.12.2019 13:04:11 - Duration: 1sec
12.12.2019 13:04:11 - TestRun finished
12.12.2019 13:04:11 - Changed state from RUNNING to COMPLETED

In the ETF log we don't find any more information about this message:

2019-12-11 11:17:11.559 [qtp1973538135-15] INFO  d.i.e.w.c.TestRunController - TestRun 'Test run on 12:16 - 11.12.2019 with test suite Common Requirements for ISO/TC 19139:2007 based INSPIRE metadata records. (EID: 247fcf0f-f41c-4666-a83a-f7d63a5acd41 )' initialized

2019-12-11 11:17:20.434 [qtp1973538135-56] INFO  d.i.e.w.c.TestRunController - Test Run completed, notifying web client

According to the BaseX documentation, the flush operation is not supported for in-memory databases, so we commented out the body of the method flushAndOptimize:

private static void flushAndOptimize(final Context ctx) throws BaseXException {
    //new Flush().execute(ctx);
    //new OptimizeAll().execute(ctx);
    //new Close().execute(ctx);
}

But the results are still the same.

If you could share any insight on how to proceed with this issue, it would be very helpful.

jonherrmann commented 4 years ago

Please use a debugger to determine where this exception comes from.

carlospzurita commented 4 years ago

After connecting the debugger to an ETF instance running locally, we found that the exception came from the DatabasePartitioner class. The error occurred while creating the database index with the full-text option (FTINDEX) set to true. As you can see in https://github.com/BaseXdb/basex/blob/89059c609f582b20aad4283dbb85fef8dbe90548/basex-core/src/main/java/org/basex/data/MemData.java#L93 , a NO_MAINMEM error is thrown if the full-text index is activated.
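One way to guard against this combination is to derive the option values from a single flag, so that FTINDEX is never enabled together with MAINMEM. The following is a minimal sketch under assumptions: the class `PartitionerOptions` and the method `forDatabase` are hypothetical names and not part of etf-bsxtd. Only the option names and values are taken from the constructor shown in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; not part of etf-bsxtd. */
final class PartitionerOptions {

    private PartitionerOptions() {}

    /**
     * Returns the option names and values used in the DatabasePartitioner
     * constructor, with FTINDEX forced to false whenever MAINMEM is requested.
     */
    static Map<String, String> forDatabase(boolean mainMem) {
        // LinkedHashMap keeps the order in which the options are applied.
        final Map<String, String> options = new LinkedHashMap<>();
        options.put("AUTOFLUSH", "false");
        options.put("TEXTINDEX", "true");
        options.put("ATTRINDEX", "true");
        // The full-text index was found to trigger NO_MAINMEM in main-memory mode.
        options.put("FTINDEX", mainMem ? "false" : "true");
        options.put("MAXLEN", "160");
        options.put("SKIPCORRUPT", "false");
        options.put("MAINMEM", Boolean.toString(mainMem));
        return options;
    }

    public static void main(String[] args) {
        // Print the options that would be set for an in-memory database.
        forDatabase(true).forEach((name, value) ->
                System.out.println("SET " + name + " " + value));
    }
}
```

In the test driver, each entry could then be applied with `new org.basex.core.cmd.Set(name, value).execute(ctx)`, as in the constructor above.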

We changed FTINDEX to false, and after this change the test proceeded past the index drop. But then we found another exception during the execution of the TestSuite. In this case, the exception came from https://github.com/BaseXdb/basex/blob/f7a5492c46d55e1c1f58df24b8ed9567c176e8c1/basex-core/src/main/java/org/basex/core/cmd/Open.java#L92 , and the message shown in the application is:

[bxerr:BXDB0002] Database 'etf-tdb-b2d8df5c-69d2-4f72-aac2-be61cb8a863d-0' was not found.

We need to keep digging to find out where this comes from.