freme-project / technical-discussion

This repository is used for technical discussions.

FREME stability / backup solutions #59

Closed jnehring closed 9 years ago

jnehring commented 9 years ago

@koidl raised the question of what to do if FREME live does not work. He asked for a backup solution.

First of all, we have two installations of FREME: freme-live and freme-dev. The two installations are completely separate from each other. freme-dev is unstable by definition, so the considerations here apply only to freme-live.

I think the solution is that FREME live has to work all the time. There should not be a backup API like freme-live-backup.

Here I look at some possible causes of failure for FREME and discuss our countermeasures.

Crash of FREME NER

Soon we will have a distributed setup with one broker and one or more instances of FREME NER. I plan to use Spring Cloud, which distributes the load across the FREME NER instances. It is smart, meaning that it automatically detects when a new instance of FREME NER starts up or when an instance of FREME NER crashes. This makes it a self-healing system, so a failure is only fatal when all instances of FREME NER crash. I expect this setup to come with version 0.4.
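To illustrate the idea of distributing load and routing around crashed instances, here is a minimal Python sketch. It is not Spring Cloud and not FREME code: `InstancePool`, `dispatch` and the instance names are hypothetical, and a real setup would also use service discovery and periodic health checks instead of only reacting to failed calls.

```python
from typing import Callable, Dict, List


class NoHealthyInstanceError(RuntimeError):
    """Raised when every registered instance is marked as down."""


class InstancePool:
    """Round-robin pool that skips instances currently marked as down."""

    def __init__(self) -> None:
        self._instances: List[str] = []
        self._healthy: Dict[str, bool] = {}
        self._next = 0

    def register(self, name: str) -> None:
        # A new or restarted instance announces itself and becomes eligible again.
        if name not in self._healthy:
            self._instances.append(name)
        self._healthy[name] = True

    def mark_down(self, name: str) -> None:
        if name in self._healthy:
            self._healthy[name] = False

    def pick(self) -> str:
        # Look at each instance at most once, starting at the round-robin cursor.
        for _ in range(len(self._instances)):
            name = self._instances[self._next % len(self._instances)]
            self._next += 1
            if self._healthy[name]:
                return name
        raise NoHealthyInstanceError("all FREME NER instances are down")


def dispatch(pool: InstancePool, call: Callable[[str], str]) -> str:
    """Send a request to a healthy instance; on a connection failure,
    mark that instance as down and try the next one."""
    while True:
        name = pool.pick()  # raises once no healthy instance is left
        try:
            return call(name)
        except ConnectionError:
            pool.mark_down(name)
```

With two registered instances, a crash of one only removes it from the rotation. Requests fail only when `pick` finds no healthy instance at all, which matches the "only fatal when all instances crash" case described above.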

Crash of Broker

Then there is still a single point of failure, which is a crashing broker. We can also think of having multiple brokers, but this requires some code changes and an additional load balancer. Right now the broker is very resilient against crashes in individual API calls: a crash in an individual API call brings down only that call and not the whole system. I think the most likely, and perhaps the only likely, reason for a broker crash is overload. We will investigate this when we do the big data analysis.
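The per-call isolation can be sketched roughly as follows. This is an illustration in Python and not the broker's actual code; `handle_call` and the handler functions are hypothetical names. The point is only that an exception raised inside one call is caught at the call boundary and turned into an error response.

```python
from typing import Any, Callable, Dict, Tuple


def handle_call(handler: Callable[[Any], Any], payload: Any) -> Tuple[int, Dict[str, Any]]:
    """Run one API call and return (status_code, body).

    Any exception raised by the handler is caught here, so it fails
    this call only and never propagates into the serving loop.
    """
    try:
        return 200, {"result": handler(payload)}
    except Exception as exc:
        return 500, {"error": type(exc).__name__, "message": str(exc)}


def broken_service(text: str) -> str:
    # Stands in for a service call that crashes.
    raise ValueError("cannot process input")


def working_service(text: str) -> str:
    # Stands in for a service call that succeeds.
    return text.upper()
```

Calling `handle_call(broken_service, "abc")` yields a 500 response, and a following `handle_call(working_service, "abc")` still succeeds. Overload is a different failure mode that this pattern does not address.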

Crash of MySQL database

Once we have also eliminated this single point of failure, only one component is left that is not redundant: the MySQL database. But MySQL is a very reliable technology, so it should not fail. I have used MySQL databases many times and they never failed. A MySQL database can also be made redundant, for example through replication.

Other reasons for crashes

FREME can also crash for reasons like bad programming or a hacker attack. These two causes of crashes can never be totally eliminated, but through our extensive quality management (especially long testing periods, code reviews, unit tests and integration tests) we minimize these risks.

fsasaki commented 9 years ago

Hi @jnehring, thanks for the detailed write-up. One comment: you write "The two installations are completely separate from each other." This is not 100% true, because three services use technology that we have no control over: e-Link -> dbpedia spotlight, e-Terminology and e-Translation. My point in https://github.com/freme-project/e-Entity/issues/39#issuecomment-140408473 was that it would be good to have FREME NER stable in the live version. What you describe above is the technical solution via several NER instances, and that's great. One item for our service level agreement (cf. @koidl and @tatgor) is that we need to explain these stability expectations, also for the Tilde-based services. Just a reminder to myself.