etf-validator / governance

ETF Steering Group and the Technical Committee documents

Use of BaseX with client-server architecture #75

Open carlospzurita opened 5 years ago

carlospzurita commented 5 years ago

ETF Improvement Proposal (EIP)

Background and Motivation:

As part of a cloud deployment of the ETF on AWS, we put a load balancer in place that automatically creates new instances of the application in order to handle heavy workloads. To avoid consistency problems and to give every user the same information, we set up a centralized data store and mounted it in every ETF container.

However, the BaseX process locks the data storage folder, and an exception is thrown if any other process tries to access it. This prevents the load balancer from adding new instances.
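The effect can be illustrated with a small analogy. This is only a sketch: how BaseX locks its folder internally is not described in this issue, so a plain lock file stands in for it, and `DataDirLock`, `start_instance` and `store.lock` are hypothetical names.

```python
import os
import tempfile


class DataDirLock:
    """Hypothetical exclusive lock on a data folder (a stand-in, not BaseX code)."""

    def __init__(self, data_dir):
        self.path = os.path.join(data_dir, "store.lock")
        self.fd = None

    def acquire(self):
        # O_EXCL makes the call fail if the lock file already exists.
        self.fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)

    def release(self):
        if self.fd is not None:
            os.close(self.fd)
            os.remove(self.path)
            self.fd = None


def start_instance(data_dir):
    """Return the held lock, or raise RuntimeError if the folder is taken."""
    lock = DataDirLock(data_dir)
    try:
        lock.acquire()
    except FileExistsError as exc:
        raise RuntimeError(f"data folder {data_dir} is already in use") from exc
    return lock


def second_instance_is_rejected():
    """Start two 'instances' on one shared folder; report whether the second fails."""
    with tempfile.TemporaryDirectory() as shared_dir:
        first = start_instance(shared_dir)
        try:
            start_instance(shared_dir)
            rejected = False
        except RuntimeError:
            rejected = True
        finally:
            first.release()
        return rejected
```

As long as the first holder keeps the lock, every further instance pointed at the same folder fails on startup, which is the situation the load balancer runs into.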

Proposed change

As suggested in the BaseX documentation, we propose to add an optional configuration for ETF deployments that switches BaseX from an embedded library to a client-server architecture. This way, multiple ETF instances can use the database concurrently. The server can be hosted in a separate container, and the ETF instances can be configured through their configuration files to connect to it.

This alternate interface should be developed on the data storage module of the ETF, to direct all the CRUD operations through the REST API of BaseX.
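To make the idea more concrete, here is a rough sketch of how CRUD operations could map to requests against the BaseX REST API. It only builds the requests and sends nothing. The host name, database name and credentials are made-up placeholders, and port 8984 with the `/rest` path is assumed to be the default of the BaseX HTTP server. It is written in Python for brevity and is not the ETF implementation.

```python
import base64
from urllib.parse import quote
from urllib.request import Request

# Assumptions: default BaseX HTTP port and REST path; host and database are placeholders.
BASE_URL = "http://basex-server:8984/rest"
DATABASE = "etf-ds"

# CRUD operation -> HTTP method. PUT both adds and replaces a resource.
METHODS = {"create": "PUT", "read": "GET", "update": "PUT", "delete": "DELETE"}


def _basic_auth(user, password):
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_request(operation, resource, body=None, user="etf", password="secret"):
    """Build (but do not send) the HTTP request for one CRUD operation.

    `resource` is the path of the document inside the database,
    e.g. "TestRun/run-1.xml". The caller would pass the result to urlopen().
    """
    if operation not in METHODS:
        raise ValueError(f"unknown operation: {operation}")
    url = f"{BASE_URL}/{quote(DATABASE)}/{quote(resource)}"
    request = Request(url, data=body, method=METHODS[operation])
    request.add_header("Authorization", _basic_auth(user, password))
    if body is not None:
        request.add_header("Content-Type", "application/xml")
    return request
```

For example, `build_request("create", "TestRun/run-1.xml", b"<TestRun/>")` yields a PUT request to `http://basex-server:8984/rest/etf-ds/TestRun/run-1.xml`. Every ETF instance would issue such requests against the same server instead of opening the folder itself.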

Alternatives

An alternative could be to deploy a separate folder for each ETF instance and establish a way to synchronize all the folders to avoid consistency issues across the instances. This could be complex, and it would require a synchronization job scheduled at a high frequency, consuming a non-trivial amount of resources.

Funding

N/A

Additional information

jonherrmann commented 5 years ago

When a client accesses the ETF instances, is the data always somehow retrieved from the BaseX database server? Doesn't this have a negative impact on performance (which seems very important for JRC)?

If not, then are you planning to use a cache in the ETF instances? If something is deleted in one ETF instance, will it be communicated somehow to the other instances and you are not expecting consistency issues?

Started test runs (not the persisted ones) do not have to be synchronized across instances?

This alternate interface should be developed on the data storage module of the ETF,

Which interface?

carlospzurita commented 5 years ago

Maybe the initial issue was a little obscure, so we will try to clarify.

BaseX keeps everything database-related in binary files in a folder. All the TestSuites, TestRuns (running or terminated), TestObjects, and TestResults are stored in this database. Every time there's a request, the ETF uses its DataStorageService, which is set up with BaseX using the /etf/ds folder as the location for the binary files.

[Diagram: single-instance]

To run several ETF instances and keep them synchronized, we tried to move the data store to a common location and use it to set up each DataStorageService. But we found that the BaseX process locks this folder on startup and doesn't let other processes interact with these files. This caused an exception that prevented a new instance from starting and made the original instance crash.

[Diagram: multiple-instance]

Reading through the BaseX documentation, we found that concurrent operations on the same database require the client-server architecture. So what we propose is to add an alternate interface, through a REST API, for the DataStorageService to access the database, and to set up a BaseX server separate from all the ETF instances.

[Diagram: basex-server]

This doesn't have to replace the current approach, but it could be a useful configuration option.
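A rough sketch of what such a configuration option could look like is shown below. The property names (`etf.datastorage.mode`, `etf.datastorage.dir`, `etf.datastorage.url`) are hypothetical and do not exist in the ETF today; only the /etf/ds default is taken from the current setup. The embedded mode stays the default, so existing deployments would not change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageConfig:
    mode: str      # "embedded" (current behaviour) or "server" (proposed)
    location: str  # folder path for embedded mode, base URL for server mode


def parse_storage_config(properties):
    """Pick the data storage backend from a dict of configuration properties.

    All property names used here are hypothetical.
    """
    mode = properties.get("etf.datastorage.mode", "embedded")
    if mode == "embedded":
        # Each instance opens its own folder, as it does today.
        return StorageConfig(mode, properties.get("etf.datastorage.dir", "/etf/ds"))
    if mode == "server":
        # All instances point at the same, separately hosted BaseX server.
        url = properties.get("etf.datastorage.url")
        if not url:
            raise ValueError("server mode requires etf.datastorage.url")
        return StorageConfig(mode, url.rstrip("/"))
    raise ValueError(f"unsupported data storage mode: {mode}")
```

With an empty configuration this returns the embedded setup on /etf/ds. Setting the mode to `server` and providing a URL would switch an instance to the shared database.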

jonherrmann commented 5 years ago

Thanks for the diagrams, I have some comments:

carlospzurita commented 5 years ago

Thanks for your comments, Jon. I'll try to clarify a bit more to address your issues.

If the big boxes represent the layers/modules, then the DataStorageService is in the wrong box. It is not part of the CORE.

The DAOs do not call the DataStorageService, but vice versa: the DataStorageService serves the DAOs. Therefore, if the DataStorageService called the BaseX server directly, it would violate the principles of the layered architecture.

I am still not sure where the "main" changes will be made. Is there a new module?

Why is REST used as the protocol and not BaseX's own client/server protocol?

The last diagram shows the filesystem in an EFS box, which would then no longer be needed. Probably an oversight?

I cannot find answers to my original questions about caching, consistency and performance.

As you have already written, everything is stored in the database. XML-Data Test runs are currently being performed in the database as well. I can't find any information about this either: should this also be done in the BaseX server in the future, in one single BaseX server? What is the advantage of the whole approach if the bottleneck is the database?

jonherrmann commented 4 years ago

❏ Contractors to build a prototype and test in scenarios whether the test can be run faster with the alternative data storage architecture https://github.com/etf-validator/governance/blob/master/Meetings/SG/20190618.adoc#eips-for-sg-discussion