Ontoportal global re-achitecture (2024)

Summary

This issue describes a roadmap and vision proposition for what needs to be updated in our current Architecture.

But first, to know why we need to change, I gave some points of the current existent issues and tried to guess why they were originally made as it is because of course people behind it taught about it and made their decision according to their time and needs at that time.

Global architecture feedbacks (Optional section)
- Do we need to separate the UI and API?
Frontend architecture feedbacks
- Do we need ontologies_api_client separated from the main UI code?
- Do we need MySQL?
- Do we need Memcache?
Backend architecture feedbacks
- Do we need ontologies_liked_data separated from the main API code?
- Do we need NCBO_CRON to be separated from the main API code?
- How can we handle better our Jobs?
- Do we need ncbo_ontology_recommender and ncbo_annotator separated from the main API code?
- Do we need multiple Redis stores?
- Do we index enough data with SOLR?
- Can we make Goo work with different TripleStores?
Action items

Goal

Simplifying our architecture implies making it easier to deploy, and maintain and allowing junior developers to iterate and develop more effectively. Below is the targeted architecture of this proposal, and in contrast you can find here the current architecture state

Action items (still to be continued)

In order of application

[ ] Integrate into the UI codebase, the project ontologies_api_client
[ ] Remove the project ontologies_api_client
[ ] Remove the usage of Mysql
[ ] Integrate into the API codebase, the two projects ontologies_linked_data and NCBO_CRON
[ ] Remove the projects ontologies_linked_data and NCBO_CRON
[ ] Re-architecture their business logic to extract them into jobs and add a job processor

Global

This section is not important as no action item will be acted on at the end, but it presents some interesting concepts that may be useful in the future

Do we need to separate the UI and API?

Context/Issue

In brief, no we don't need to have them separated but it will take too much time and effort now to merge them into one code base, and the benefit is not worth the effort.

In detail, rails our ruby framework used in the UI is capable of rendering HTML or JSON depending on what the client (an application or browser) asked Source: https://guides.rubyonrails.org/action_controller_overview.html#rendering

The benefit of having a unique code base is that it makes everything simpler for developers in the projects, the deployment, and the development environment, which would be immensely easier. As there is a unique component to deploy, run, and develop.

However, why did they come up with this decision in the beginning? I think maybe Rails could not render JSON at that time as the API mode was introduced in 2016 in version 5.0.0, as at that time (2013) they were at version 2.0.0. But the main idea behind I think was the "separation of concerns" principle, which stipulates that we need to separate applications into small programs, with their own goal. The "API" is for the business logic and the "UI" is for displaying the data and views. Making it easy to separate teams, and the technologies into frontend and backend.

Proposition

Nowadays, Rails is a full-stack framework that can handle JSON as HTML, and we can still have a good separation of concerns if we organize well our codes into Modules. But we will not do it this year as it will demand too much effort.

Frontend

Do we need `ontologies_api_client` separated from the main UI code

Context/Issue

In brief, _no we don't need to have ontoogies_api_client separated from the main UI code_

In detail, as described earlier we have separated our UI from the API, so each time the UI needs information it does an HTTP call to the API to get back the data in JSON that will be then rendered as HMTL, to make this process more convenient, an API client was made transforming ruby code as Ontology.all to HTTP call as GET https://data.agroportal.lirmm.fr/ontologies.

All of this is good and the API client was made cleverly, the unique issue that I had was it was separated from the main UI code, so we have the main code here https://github.com/ontoportal-lirmm/bioportal_web_ui/ and the API client here https://github.com/ontoportal-lirmm/ontologies_api_ruby_client. The Rails framework is heavily focused on the MVC (Model-Controller-View) design pattern, wherein in the model (e.g. Ontology) we implement all the business logic. In our case, this model layer does not exist as the main logic is in API, but still, we do some business logic in the UI also, typically because it is too troublesome to update the API and re-deploy for a small need, but with the accumulation and time, this small needs became, a lot of business logic done in the UI, which is still normal and not the real issue.

The real issue is where do we put this business logic accomplished by the UI, the response is the Models, and that was what was done in the begining, as we can see in the evolution of changes in the API client diagram and also in the code as we can find function like obsolete? for the class model.

But as time went on and due to the separation of the API client from the UI, we were encouraged to let behind the API client and put the logic a little bit everywhere, in the view helpers, in the controllers, inside the views, in javascript code.

So why this speciation in the biggin, the cause was that wanted to make the API key generic enough and external so that people could use it in their programs to fetch data from the API. alike the Python client and Java client, so that if I develop an application in ruby, python, or java, I will just plug in my package fetch the data, without rewriting their API client.

Proposition

Integrate into the main UI codebase the API Client, but still make as an extractable component (gem), so that we easily continue developing on it, and re-locate the logic into the models (more exactly into concerns/modules that will extend the models), and still share a public gem for people that want to use it (as the way the rails framework code source is structured, in submodules separately usable but in one main codebase).

Do we need `MySQL`?

No, it adds only additional complexity and is used only for licensing and appliance ID functionality (and other historical relics)

Do we need `Memcache`?

No, we can directly use Redis and have one less dependency to handle and maintain.

Backend

Do we need `ontologies_liked_data` separated from the main API code

Context/Issue

The ontologies_liked_data stands out as a crucial project within our overall architecture, originally conceived as the central hub for all business operations (as submission processing, and mapping computation). However, over time, the codebase has proliferated across various layers, with some functions implemented in the API controller layer, some in the NCBO CRON layer, and others in the UI.

In my opinion, this dispersion of code can be attributed, largely, to the separation of the ontologies_liked_data module from the main API project. This division has made it challenging (bothersome) to seamlessly switch between the two codebases, leading to a preference for directly implementing ad-hoc or duplicated functions within the API or CRON layers.

The initial reason for this separation was the project's widespread usage in other projects (cron, annotator, recommended), each existing as a distinct project. Consequently, ontologies_liked_data was needed to be a gem (importable module) in its repository. However, this approach has made development, understanding, and maintenance cumbersome, not to mention the issues related to code and test duplication.

Proposition

Proposing to merge the ontologies_liked_data project into the ontologies_api project. This can be achieved by adopting the below propositions, essentially advocating for the consolidation of all the backend projects (except goo) into a singular entity, the existing ontologies_api.

Do we need `NCBO_CRON` to be separated from the main API code

Context/Issue

The ncbo_cron is a backend service with two main objectives. First, it defines a list of jobs (tasks such as processing an ontology and removing submissions) and schedules them to run automatically periodically or ad hoc through scripts. Second, it provides an interactive shell for the Ruby console, allowing real-time testing/debugging of ad-hoc code using production data.

Originally conceived as an interface for the business logic handled by ontologies_liked_data, the ncbo_cron project has evolved to manage more than just running jobs. Currently, it includes functions like do_remote_ontology_pull(), which handles the entire process of nightly ontology pulls, from downloading and hash checking to submission creation. This logic logically belongs to the ontologies_liked_data project, as some functions are duplicated here.

Proposition

The proposal is to merge the ncbo_cron code into the ontologies_api project and also establish a clearer separation of concerns between the logic that should be in the ontologies_liked_data module and the job runner.

Certainly, the creation of a separate repository for this project was initially done for scalability reasons. Running ncbo_cron and ontologies_api on different servers helps prevent overloading the production API during resource-intensive processing by ncbo_cron. However, since most Ontoportal deployments are in single-server mode, scalability benefits are limited. Even in cases like Bioportal, where they may run on different servers, doing it in a unique code base is still feasible. This can be achieved by having the same project run in different modes, in one server as API mode and in the other as CRON mode.

How we handle our Jobs Currently

to be continued

Do we need `ncbo_ontology_recommender` and `ncbo_annotator` separated from the main API code

Not entirely certain about this, but let's first address the issue of keeping them separated. Firstly, because they exist in distinct projects, there is a lack of consistent maintenance. They are rarely revisited, resulting in infrequent updates or refactoring/cleaning efforts. With time this contributes also to a loss of expertise in these projects as forget about them.

On the other hand, merging them into the ontologies_api project may introduce a potential challenge. Given that both projects have relatively large code bases, won't this integration make the ontologies_api codebase unwieldy?

Do we need multiple `Redis` stores

It depends. For caching purposes, a single Redis instance might be sufficient for most applications. However, if Redis is used as the primary data store for multiple services, as is the case for us, employing multiple instances can enhance separation and scalability.

Currently, our code is generic enough to function based on either a single or multiple Redis instances, depending on the configurations.

The proposal here is to default to a single Redis instance in the distributed appliance. Depending on the portal's evolution, additional instances can be added as needed.

Do we index enough data with `SOLR`

Certainly not enough at the moment; currently, we are indexing only concepts/classes data and properties data—nothing more.

Why don't we expand our indexing to include more information, such as users, ontology acronyms and descriptions, submission metadata, etc.? Could it be due to concerns about performance? It's unclear, but I don't think so. We can constrain search queries to specific collections, making it easier to locate data. For example, if we are searching for ontologies, we can instruct Solr to search only the ontologies collection.

The proposition here is to index anything that needs to be searchable in the UI. To facilitate this, the Goo project needs to be updated to handle the indexation of Goo Models more seamlessly.

This change will be mandatory to have a working Ontoportal Federation ecosystem as we can't relate much in the Triple store to be efficient enough to handle that.

Can we make `Goo` work with different TripleStores?

The response is Yes, see #229 for more details.

Current Architecture

9 different GitHub repositories
3 Redis servers
1 Mysql server
1 Memcache server
1 SOLR server
1 Mgrep server

agroportal / project-management