Support multiple Elasticsearch nodes and create backup

sbesson commented 2 years ago

Two syntactic warnings

WARNING  Listing 2 violation(s) that are fatal
    [206] Variables should have spaces before and after: {{ var_name }}
    idr-searchengine.yml:110
        with_sequence: start=1 count={{ elasticsearch_no_nodes}}

    [206] Variables should have spaces before and after: {{ var_name }}
    idr-searchengine.yml:194
        with_sequence: start=2 count={{ elasticsearch_no_nodes | int -1}}

Otherwise as discussed on last Monday IDR call, the plan is to recreate a new idr-testing environment (test112) after the upcoming prod111 release and include this PR in the deployment.

khaledk2 commented 2 years ago

@will-moore The elastic search cluster has 3 nodes. The first step is checking the cluster. You can ssh test112-searchengine and then run the following command:

curl localhost:9201/_cluster/health?pretty

The results should be like that:

{
  "cluster_name" : "searchengine-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 2,
  "active_shards" : 4,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

You can check the cluster statuswhich should be green and the number_of_nodes which should be 3. You may repeat the same command but instead of 9201you may use 9202 or 9203and you should get the same results.

You may also run the indexing. You should use the indexing command but you should replace the openmicroscopy/omero-searchengine:0.3.1 image with khaledk2/searchengine:latest which supports the elasticsearch cluster.

khaledk2 commented 2 years ago

This PR should work with the SearchEngine PR 65

The cluster has three nodes. Each of them is a master and a data node. The searchengine is configured to connect to the elasticsearch using a list containing the three nodes. It will try to connect to the first node in the list; if it is down, it will try to connect to the second node; if it is down, it will try to connect to the last node. The cluster itself is up if at least two nodes are running. I have created some scripts to help the tester to go through that. The PR should be tested in test112-searchengine. The scripts are saved in the /data/searchengine/test_package/ folder. The check_cluster_health.sh script is used to check the cluster status at any time. The searchEngine functions can be tested using the ider-testing website http://idr-testing.openmicroscopy.org/ Alternatively, the reviewer can use test_searchengine.sh script to test searchEngine functions. The script takes about 15 minutes to finish. The script output is saved to a text file check_report.txt in the/data/searchengine/searchengine/ folder. Test cases: If the data folder is corrupted or deleted, restoring Elasticsearch index data is possible by using the search engine restore function. The tester can test restoring the cluster by stoping the elasticsearch nodes usingthe following commands

bash stop_node.sh 1
bash stop_node.sh 2
bash stop_node.sh 3

Then he can delete the Elasticsearch data folder using the following command: bash delete_elasticsearch_nodes_data_folders.sh Run this playbook command to restore the Elasticsearch cluster but without data: sudo ansible-playbook idr-searchengine.yml The tester can restore the Elasticsearch data from the backup (snapshot) using the following command: bash restore_elasticsearch_data.sh It may take up to 15 minutes to restore the data. Testing the cluster: Stop one Elasticsearch node and check the searchEngine. The following command will stop node 1: bash stop_node.sh 1 SearchEngine should work fine and the Elasticsearch cluster should work with the remaining two nodes; the status of the Elasticsearch cluster will be yellow for a while then it will turn green. Stop another Elasticsearch node. The following command will stop node 2: bash stop_node.sh 2 The cluster should be down but the searchEngine should still work fine. Restoring the cluster nodes by running an ELasticsearch node, for example, to run node 1: bash run_node1.sh So, even if two nodes are down, the searchEngine can still function. It is possible to run the second node using the following command: bash run_node2.sh

sbesson commented 1 year ago

While reviewing this with @jburel, I realised it will require a new search engine backend and effectively has an undocumented dependency on https://github.com/ome/omero_search_engine/pull/61 which contains the relevant code to work with multiple nodes. @khaledk2 it is really critical we are transparent as a team about these cross-repository dependencies. Otherwise we run into the risk of deploying broken components to production environments. Ideally this should be stated in the PR description itself.

Practically, proposed next steps are:

first the search engine PR should be simplified and limited it to the minimal set of functional changes required for its evaluation (https://github.com/ome/omero_search_engine/pull/61#pullrequestreview-1195648560) in order to facilitate the job of the reviewers and testers
once this is done, we can build a local image on the searchengine VM from the HEAD of https://github.com/ome/omero_search_engine/pull/61 and use this container
this should give us both the multi-node elasticsearch cluster as well as the search engine and we can follow the detailed testing scenarios described in https://github.com/IDR/deployment/pull/387#issuecomment-1313335151
once approved, the upstream PR can be merged, the repo tagged and this PR will require another commit bumping the omero_search_engine tag to 0.4.0