IDR / deployment

Deployment infrastructure for the Image Data Resource
https://idr.openmicroscopy.org/about/deployment.html
BSD 2-Clause "Simplified" License

Support multiple Elasticsearch nodes and create backup #387

Closed khaledk2 closed 1 year ago

sbesson commented 2 years ago

Two syntactic warnings

WARNING  Listing 2 violation(s) that are fatal
    [206] Variables should have spaces before and after: {{ var_name }}
    idr-searchengine.yml:110
        with_sequence: start=1 count={{ elasticsearch_no_nodes}}

    [206] Variables should have spaces before and after: {{ var_name }}
    idr-searchengine.yml:194
        with_sequence: start=2 count={{ elasticsearch_no_nodes | int -1}}

Otherwise, as discussed on last Monday's IDR call, the plan is to create a new idr-testing environment (test112) after the upcoming prod111 release and include this PR in the deployment.

khaledk2 commented 2 years ago

@will-moore The Elasticsearch cluster has 3 nodes. The first step is to check the cluster. You can ssh into test112-searchengine and then run the following command:

curl localhost:9201/_cluster/health?pretty

The output should look like this:

{
  "cluster_name" : "searchengine-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 2,
  "active_shards" : 4,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

You can check the cluster status, which should be green, and the number_of_nodes, which should be 3. You may repeat the same command using 9202 or 9203 instead of 9201, and you should get the same results.
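
The check described above can also be scripted. The following is a minimal sketch, not part of this PR: the function names are hypothetical, and it assumes the three nodes answer on localhost ports 9201 to 9203 as in the curl command. It only looks at the two fields mentioned in this comment (status and number_of_nodes).

```python
import json
from urllib.request import urlopen

EXPECTED_NODES = 3  # from this comment: the cluster has 3 nodes


def fetch_health(port: int, host: str = "localhost", timeout: float = 5.0) -> dict:
    """Fetch /_cluster/health from one node (host and port are assumptions)."""
    with urlopen(f"http://{host}:{port}/_cluster/health", timeout=timeout) as response:
        return json.load(response)


def cluster_is_healthy(health: dict, expected_nodes: int = EXPECTED_NODES) -> bool:
    """Return True when the health document is green and reports every node."""
    return (
        health.get("status") == "green"
        and health.get("number_of_nodes") == expected_nodes
    )


# Trimmed copy of the sample response shown above, so no network is needed here.
sample = json.loads("""
{
  "cluster_name": "searchengine-cluster",
  "status": "green",
  "number_of_nodes": 3,
  "number_of_data_nodes": 3
}
""")

print(cluster_is_healthy(sample))
```

On the searchengine host itself, something like `all(cluster_is_healthy(fetch_health(p)) for p in (9201, 9202, 9203))` would repeat the check against each node.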

You may also run the indexing. Use the indexing command, but replace the openmicroscopy/omero-searchengine:0.3.1 image with khaledk2/searchengine:latest, which supports the Elasticsearch cluster.

khaledk2 commented 2 years ago

This PR should work with the SearchEngine PR 65

The cluster has three nodes. Each of them is both a master and a data node. The searchengine is configured to connect to Elasticsearch using a list containing the three nodes. It will try to connect to the first node in the list; if that node is down, it will try the second node; if that one is also down, it will try the last node. The cluster itself is up if at least two nodes are running.

I have created some scripts to help the tester go through this. The PR should be tested on test112-searchengine. The scripts are saved in the /data/searchengine/test_package/ folder. The check_cluster_health.sh script can be used to check the cluster status at any time.

The searchEngine functions can be tested using the idr-testing website:
http://idr-testing.openmicroscopy.org/
Alternatively, the reviewer can use the test_searchengine.sh script to test the searchEngine functions. The script takes about 15 minutes to finish. Its output is saved to a text file, check_report.txt, in the /data/searchengine/searchengine/ folder.

Test cases:

If the data folder is corrupted or deleted, the Elasticsearch index data can be restored using the search engine restore function. The tester can test restoring the cluster by stopping the Elasticsearch nodes using the following commands:

bash stop_node.sh 1
bash stop_node.sh 2
bash stop_node.sh 3

Then they can delete the Elasticsearch data folder using the following command:

bash delete_elasticsearch_nodes_data_folders.sh

Run this playbook command to restore the Elasticsearch cluster, but without data:

sudo ansible-playbook idr-searchengine.yml

The tester can then restore the Elasticsearch data from the backup (snapshot) using the following command:

bash restore_elasticsearch_data.sh

It may take up to 15 minutes to restore the data.

Testing the cluster:

Stop one Elasticsearch node and check the searchEngine. The following command will stop node 1:

bash stop_node.sh 1

The searchEngine should work fine and the Elasticsearch cluster should keep working with the remaining two nodes; the status of the Elasticsearch cluster will be yellow for a while, then it will turn green.

Stop another Elasticsearch node. The following command will stop node 2:

bash stop_node.sh 2

The cluster should be down, but the searchEngine should still work fine. So, even if two nodes are down, the searchEngine can still function.

Restore the cluster by starting an Elasticsearch node. For example, to start node 1:

bash run_node1.sh

The second node can be started using the following command:

bash run_node2.sh
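
The connection order described at the start of this comment (first node, then the second, then the last) can be sketched as follows. This is an illustration only and not the search engine's actual code: the function name, the node URLs and the `fake_connect` stand-in are all hypothetical, and a real Elasticsearch client may handle failover internally.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def connect_to_first_available(nodes: Sequence[str], connect: Callable[[str], T]) -> T:
    """Try each node in list order and return the first successful connection."""
    last_error = None
    for node in nodes:
        try:
            return connect(node)
        except ConnectionError as error:
            last_error = error  # this node is down; move on to the next one
    raise ConnectionError(f"no node reachable among {list(nodes)}") from last_error


# Hypothetical node URLs, matching the ports used in this thread.
NODES = ["http://localhost:9201", "http://localhost:9202", "http://localhost:9203"]


def fake_connect(url: str) -> str:
    """Stand-in for a real client: pretend nodes 1 and 2 have been stopped."""
    if url.endswith(("9201", "9202")):
        raise ConnectionError(f"{url} is down")
    return url


print(connect_to_first_available(NODES, fake_connect))
```

With nodes 1 and 2 stopped, the sketch falls through to the third node, which mirrors the "two nodes down" test case above.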

sbesson commented 1 year ago

While reviewing this with @jburel, I realised it will require a new search engine backend and effectively has an undocumented dependency on https://github.com/ome/omero_search_engine/pull/61 which contains the relevant code to work with multiple nodes. @khaledk2 it is really critical that we are transparent as a team about these cross-repository dependencies. Otherwise we run the risk of deploying broken components to production environments. Ideally this should be stated in the PR description itself.

Practically, proposed next steps are:

jburel commented 1 year ago

General comment: the bash scripts under /data/searchengine/test_package/test_scripts/ are useful for evaluating the system, but they are not in any repository as far as I can see. They should be added somewhere, and some should be adjusted to make them configurable.

jburel commented 1 year ago

After stopping one node:

curl localhost:9203/_cluster/health?pretty
{
  "cluster_name" : "searchengine-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 9,
  "active_shards" : 18,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

I can see that one node is down. But the status is marked as green; it never turned yellow and then green.

Edit: maybe I was too slow, it turns green in follow-up checks
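
To avoid missing a short-lived yellow state, one option is to poll the health endpoint right after stopping the node and record every status seen. The sketch below is an assumption about how that could be done, not one of the existing test scripts: `wait_for_status` is a hypothetical helper, and the canned responses stand in for real calls to /_cluster/health.

```python
import time
from typing import Callable, Dict, List


def wait_for_status(
    fetch_health: Callable[[], Dict],
    wanted: str = "green",
    attempts: int = 30,
    delay_seconds: float = 1.0,
) -> List[str]:
    """Poll until the wanted status appears; return every status observed."""
    seen: List[str] = []
    for _ in range(attempts):
        seen.append(fetch_health().get("status", "unknown"))
        if seen[-1] == wanted:
            break
        time.sleep(delay_seconds)
    return seen


# Canned responses standing in for real health requests.
canned = iter([{"status": "yellow"}, {"status": "yellow"}, {"status": "green"}])
history = wait_for_status(lambda: next(canned), delay_seconds=0)
print(history)
```

The returned list makes any transition visible even when it happens between two manual curl calls.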

jburel commented 1 year ago

I stopped a second node:

 sudo docker ps
CONTAINER ID   IMAGE                                                  COMMAND                  CREATED        STATUS        PORTS                                                                                  NAMES
a966fcbd8331   docker.elastic.co/elasticsearch/elasticsearch:7.16.2   "/bin/tini -- /usr/l…"   11 hours ago   Up 11 hours   0.0.0.0:9203->9200/tcp, :::9203->9200/tcp, 0.0.0.0:9303->9300/tcp, :::9303->9300/tcp   searchengine_elasticsearch_node3
c24e6f17f246   khaledk2/searchengine:test                             "bash run_app.sh run…"   16 hours ago   Up 16 hours   0.0.0.0:5577->5577/tcp, :::5577->5577/tcp, 8080/tcp                                    searchengine

Then

curl localhost:9203/_cluster/health?pretty
{
  "error" : {
    "root_cause" : [
      {
        "type" : "master_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}

I understood that each node is a master, so it should still be running.
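
A possible explanation, offered as an assumption rather than something confirmed in this thread: in Elasticsearch 7.x every node here is master-eligible, but electing a master needs a majority of the master-eligible nodes. With three such nodes the majority is two, which matches both the statement that the cluster is up if at least two nodes are running and the master_not_discovered_exception above. The arithmetic, with hypothetical helper names:

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


def can_elect_master(running_nodes: int, master_eligible_nodes: int = 3) -> bool:
    """True when enough master-eligible nodes are running to form a majority."""
    return running_nodes >= quorum(master_eligible_nodes)


for running in (3, 2, 1):
    print(running, can_elect_master(running))
```

With a single node left, no majority is possible, so that node cannot elect itself master even though it is master-eligible.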

jburel commented 1 year ago

Restarted node 1:

 curl localhost:9203/_cluster/health?pretty
{
  "cluster_name" : "searchengine-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 9,
  "active_shards" : 18,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 7,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 4,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 4021,
  "active_shards_percent_as_number" : 72.0
}

jburel commented 1 year ago

Then restarted node 2:

curl localhost:9203/_cluster/health?pretty
{
  "cluster_name" : "searchengine-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 9,
  "active_shards" : 24,
  "relocating_shards" : 0,
  "initializing_shards" : 1,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 96.0
}
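
The active_shards_percent_as_number values in the two responses above are consistent with dividing the active shards by the sum of active, initializing and unassigned shards. This formula is an assumption that happens to match the numbers in this thread; it is not taken from the Elasticsearch source, and the function name is hypothetical.

```python
def active_shards_percent(health: dict) -> float:
    """Active shards as a percentage of active + initializing + unassigned."""
    active = health["active_shards"]
    total = active + health["initializing_shards"] + health["unassigned_shards"]
    return 100.0 * active / total if total else 100.0


# Shard counts copied from the two health responses above.
after_node1 = {"active_shards": 18, "initializing_shards": 0, "unassigned_shards": 7}
after_node2 = {"active_shards": 24, "initializing_shards": 1, "unassigned_shards": 0}

print(active_shards_percent(after_node1))  # 18 / 25 -> 72.0
print(active_shards_percent(after_node2))  # 24 / 25 -> 96.0
```

The status stays yellow in both responses because some shard copies are still unassigned or initializing; it should return to green once the percentage reaches 100.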

jburel commented 1 year ago

Clarification: the cluster is down after stopping 2 nodes, but the search is still working with one node.

jburel commented 1 year ago

Tested various aspects of the set-up:

Overall we can integrate that work.

jburel commented 1 year ago

omero_search_engine to be bumped to 0.4.1

jburel commented 1 year ago

@khaledk2 could you update the version cf. https://github.com/IDR/deployment/blob/master/ansible/group_vars/searchengine-hosts.yml#L9?

khaledk2 commented 1 year ago

The version has been updated to 0.4.1.

jburel commented 1 year ago

@sbesson I think we can go ahead.