Background and Motivation:

As part of creating a cloud deployment of the ETF on AWS, we put a load balancer in place that automatically creates new instances of the application in order to handle heavy workloads. To avoid consistency problems and to make sure every user sees the same information, we set up a centralized data storage that is mounted in every ETF container.

However, the BaseX process locks the data storage folder, and an exception is thrown if any other process tries to access it. This prevents the load balancer from adding new instances.

Proposed change

As suggested in the BaseX documentation, we propose adding an optional configuration that switches the ETF deployment from the embedded library to a client-server architecture. This way the database can be used concurrently by multiple ETF instances. The server can be hosted in a separate container, and the ETF instances can be pointed at this database through the configuration files.

This alternate interface should be developed on the data storage module of the ETF, to direct all the CRUD operations through the REST API of BaseX.
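To make the idea of routing a read through the BaseX REST API more concrete, here is a minimal sketch in Java. It only builds the HTTP request and does not send it. The database name `etf-ds`, the host, and the credentials are assumptions made for illustration, and the class is hypothetical rather than part of the ETF code base. The only BaseX conventions it relies on are the `/rest/<database>` path and the `query` parameter of the REST API.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: builds a GET request that evaluates an XQuery
// expression against a database exposed by the BaseX REST API.
public class BaseXRestQuery {

    static HttpRequest buildQuery(String baseUrl, String database,
                                  String xquery, String user, String password) {
        // URLEncoder produces form encoding ('+' for spaces); switch to %20
        // so the value is unambiguous inside a URI query string.
        String encoded = URLEncoder.encode(xquery, StandardCharsets.UTF_8)
                .replace("+", "%20");
        URI uri = URI.create(baseUrl + "/rest/" + database + "?query=" + encoded);

        // HTTP Basic authentication header (credentials are placeholders).
        String credentials = Base64.getEncoder().encodeToString(
                (user + ":" + password).getBytes(StandardCharsets.UTF_8));

        return HttpRequest.newBuilder(uri)
                .header("Authorization", "Basic " + credentials)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Assumed values: a BaseX HTTP server on localhost:8984 and a
        // database called "etf-ds". Neither name comes from the ETF sources.
        HttpRequest request = buildQuery(
                "http://localhost:8984", "etf-ds", "count(/*)", "user", "secret");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request would be a matter of passing it to `java.net.http.HttpClient`. Write operations would use other HTTP methods of the same API.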

Alternatives

An alternative could be to deploy a separate folder for each ETF instance and establish a way to synchronize all the folders to avoid consistency issues across the instances. This could be complex, and it would require a synchronization job scheduled at a high frequency, consuming a non-trivial amount of resources.

Funding

N/A

Additional information

jonherrmann commented 5 years ago

When a client accesses the ETF instances, is the data always somehow retrieved from the BaseX database server? Doesn't this have a negative impact on performance (which seems very important for JRC)?

If not, then are you planning to use a cache in the ETF instances? If something is deleted in one ETF instance, will it be communicated somehow to the other instances and you are not expecting consistency issues?

Started test runs (not the persisted ones) do not have to be synchronized across instances?

This alternate interface should be developed on the data storage module of the ETF,

Which interface?

carlospzurita commented 5 years ago

Maybe the initial issue was a little obscure, so we will try to clarify.

BaseX holds everything database-related in binary files in a folder. All the TestSuites, TestRuns (running or terminated), TestObjects, and TestResults are stored in this database. Every time there is a request, the ETF uses its DataStorageService, which is set up with BaseX using the /etf/ds folder as the location for the binary files.

[Diagram: single-instance]

To keep several instances of the ETF running and synchronized, we tried to move the data store to a common location and use it to set up each DataStorageService. But we found that the BaseX process locks this folder on startup and doesn't let other processes interact with these files. This caused an exception that prevented a new instance from starting and made the original instance crash.

[Diagram: multiple-instance]

Reading through the BaseX documentation, we found that concurrent operations on the same database require the client-server architecture. So what we propose is to add an alternate interface, through a REST API, for the DataStorageService to access the database, and to set up a BaseX server separate from all the ETF instances.

[Diagram: basex-server]

This doesn't have to replace the current approach, but could be a useful configuration option.

jonherrmann commented 5 years ago

Thanks for the diagrams, I have some comments:

If the big boxes represent the layers/modules, then the DataStorageService is in the wrong box. It is not part of the CORE
The DAOs do not call the DataStorageService, but vice versa the DataStorageService serves the DAOs
Therefore, if the DataStorageService would call the BaseX Server directly, it would violate the principles of the layered architecture
I am still not sure where the "main" changes will be made. Is there a new module?
Why is REST used as protocol and not the BaseX own client/server protocol?
The last diagram shows the filesystem in an EFS box, which would then no longer be needed. Probably an oversight?
I cannot find answers to my original questions about caching, consistency and performance
As you have already written, everything is stored in the database. XML-Data Test runs are currently being performed in the database as well. I can't find any information about this either: should this also be done in the BaseX server in the future -in one single BaseX server? What is the advantage of the whole approach if the bottleneck is the database?

carlospzurita commented 5 years ago

Thanks for your comments, Jon. I'll try to clarify a bit more to address your points.

If the big boxes represent the layers/modules, then the DataStorageService is in the wrong box. It is not part of the CORE

Yes, we were aware that this is part of another component. We wanted to depict the separation of the webapp-related operations from the backend persistence operations.

The DAOs do not call the DataStorageService, but vice versa the DataStorageService serves the DAOs Therefore, if the DataStorageService would call the BaseX Server directly, it would violate the principles of the layered architecture

OK, noted. We did not intend to alter the layered architecture; it was just a misunderstanding.

I am still not sure where the "main" changes will be made. Is there a new module?

The main changes should be made in etf-bsxds, either by modifying the BsXQuery class to issue the queries through the BaseX server or by adding a new class to handle this. Which of the two alternatives is used can be controlled from the configuration files, so some changes would also be needed in the EtfConfigController class in etf-webapp.
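As a rough illustration of how a configuration entry could select between the embedded and the client-server variant, here is a small Java sketch. All names in it are hypothetical: the property keys `etf.datastore.mode`, `etf.datastore.dir` and `etf.datastore.server.url`, as well as the classes, are invented for this example and are not taken from etf-bsxds or EtfConfigController. The backends are stubs that only report how they were configured.

```java
import java.util.Properties;

// Hypothetical sketch: choose a data store backend from configuration.
public class DataStoreSelector {

    /** Stand-in for "something that can run queries against BaseX". */
    interface QueryBackend {
        String describe();
    }

    /** Current behaviour: BaseX embedded, files in a local folder. */
    static final class EmbeddedBackend implements QueryBackend {
        private final String dataDir;
        EmbeddedBackend(String dataDir) { this.dataDir = dataDir; }
        public String describe() { return "embedded:" + dataDir; }
    }

    /** Proposed option: queries are sent to a separate BaseX server. */
    static final class ServerBackend implements QueryBackend {
        private final String url;
        ServerBackend(String url) { this.url = url; }
        public String describe() { return "server:" + url; }
    }

    static QueryBackend fromConfig(Properties config) {
        // Default to the embedded mode so existing deployments keep working.
        String mode = config.getProperty("etf.datastore.mode", "embedded");
        switch (mode) {
            case "embedded": {
                return new EmbeddedBackend(
                        config.getProperty("etf.datastore.dir", "/etf/ds"));
            }
            case "server": {
                String url = config.getProperty("etf.datastore.server.url");
                if (url == null) {
                    throw new IllegalArgumentException(
                            "etf.datastore.server.url is required in server mode");
                }
                return new ServerBackend(url);
            }
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("etf.datastore.mode", "server");
        config.setProperty("etf.datastore.server.url", "http://basex:8984");
        System.out.println(fromConfig(config).describe());
    }
}
```

Defaulting to the embedded mode reflects the intent stated earlier in the thread that the new option should not replace the current approach.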

Why is REST used as protocol and not the BaseX own client/server protocol?

To interact with the BaseX server, there are two alternatives: using a CLI to connect to it, or using REST or RESTXQ over HTTP to issue the queries.

The last diagram shows the filesystem in an EFS box, which would then no longer be needed. Probably an oversight?

The EFS would still be in place in order to provide an elastic storage for the ETF. Of course, this is completely optional and a traditional disk can be used.

I cannot find answers to my original questions about caching, consistency and performance

Keeping a cache on the ETF instances would be quite inefficient, given that every time a resource is requested the instance would have to check that it is still valid and hasn't changed in the database. The alternative is to rely on the BaseX internal cache and make an HTTP request every time a query is issued. Either way, the performance of CRUD operations would be affected.
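The cost of the first option can be sketched as follows. This is a hypothetical per-instance cache in Java, not ETF code: the version lookup and the loader stand in for remote calls to the shared database, and the counter shows that even a cache hit still needs one round trip to confirm the entry is current.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a per-instance cache that revalidates on every read.
public class ValidatingCache<V> {

    private static final class Entry<T> {
        final String version;
        final T value;
        Entry(String version, T value) { this.version = version; this.value = value; }
    }

    private final Map<String, Entry<V>> entries = new HashMap<>();
    private final Function<String, String> versionLookup; // remote call in practice
    private final Function<String, V> loader;             // remote call in practice

    /** Counts simulated round trips to the shared database. */
    int remoteCalls = 0;

    public ValidatingCache(Function<String, String> versionLookup,
                           Function<String, V> loader) {
        this.versionLookup = versionLookup;
        this.loader = loader;
    }

    public V get(String id) {
        // Round trip 1: ask the shared database for the current version.
        remoteCalls++;
        String current = versionLookup.apply(id);

        Entry<V> cached = entries.get(id);
        if (cached != null && cached.version.equals(current)) {
            return cached.value; // hit, but validation already cost one call
        }

        // Round trip 2: the entry is missing or stale, so reload it.
        remoteCalls++;
        V fresh = loader.apply(id);
        entries.put(id, new Entry<>(current, fresh));
        return fresh;
    }

    public static void main(String[] args) {
        Map<String, String> versions = new HashMap<>();
        versions.put("run-1", "v1");
        ValidatingCache<String> cache = new ValidatingCache<>(
                versions::get, id -> "payload@" + versions.get(id));
        cache.get("run-1"); // miss: version check plus load
        cache.get("run-1"); // hit: version check only
        System.out.println("remote calls: " + cache.remoteCalls);
    }
}
```

The sketch shows why such a cache saves payload transfer at best, not round trips.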

As you have already written, everything is stored in the database. XML-Data Test runs are currently being performed in the database as well. I can't find any information about this either: should this also be done in the BaseX server in the future -in one single BaseX server? What is the advantage of the whole approach if the bottleneck is the database?

To be able to execute BaseX tests, the BsxTestDriver should also be modified to execute the queries over HTTP requests, in the same manner as in the data store.
The main improvement this change would achieve is the ability to handle many more requests in parallel, avoiding queuing issues. This comes with a tradeoff of slower read/write operations, but we think it could be beneficial to process more requests, even if each takes more time to execute.

jonherrmann commented 4 years ago

❏ Contractors to build a prototype and test in scenarios whether the test can be run faster with the alternative data storage architecture https://github.com/etf-validator/governance/blob/master/Meetings/SG/20190618.adoc#eips-for-sg-discussion

etf-validator / governance

Use of BaseX with client-server architecture #75
