Closed khaledk2 closed 1 year ago
@will-moore The Elasticsearch cluster has 3 nodes. The first step is checking the cluster. You can ssh into test112-searchengine and then run the following command:
curl localhost:9201/_cluster/health?pretty
The result should look like this:
{
"cluster_name" : "searchengine-cluster",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 3,
"active_primary_shards" : 2,
"active_shards" : 4,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
You can check the cluster status, which should be green, and the number_of_nodes, which should be 3. You may repeat the same command with port 9202 or 9203 instead of 9201; you should get the same results.
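As a quick sanity check, the two values above can be verified automatically. The snippet below is a minimal sketch: the sample response is embedded in a heredoc for illustration, whereas against the live cluster it would come from `curl -s localhost:9201/_cluster/health?pretty`.

```shell
# Minimal health-check sketch. The response is embedded here for illustration;
# in practice it would be fetched with:
#   health=$(curl -s localhost:9201/_cluster/health?pretty)
health=$(cat <<'EOF'
{
  "cluster_name" : "searchengine-cluster",
  "status" : "green",
  "number_of_nodes" : 3
}
EOF
)
# The cluster is considered healthy when the status is green and all 3 nodes joined.
if echo "$health" | grep -q '"status" : "green"' \
   && echo "$health" | grep -q '"number_of_nodes" : 3'; then
  echo "cluster OK"
else
  echo "cluster NOT healthy"
fi
```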
You may also run the indexing. Use the indexing command, but replace the openmicroscopy/omero-searchengine:0.3.1 image with khaledk2/searchengine:latest, which supports the Elasticsearch cluster.
This PR should work with the SearchEngine PR 65
The cluster has three nodes. Each of them is a master and a data node.
The searchengine is configured to connect to Elasticsearch using a list containing the three nodes. It will try to connect to the first node in the list; if that node is down, it will try the second node, and if that is also down, the last node.
The cluster itself is up if at least two nodes are running.
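The failover order described above can be sketched as a loop over the node list. This is an illustrative simulation only, not the actual searchengine client code: node availability is hard-coded via a flag, whereas the real client would attempt an HTTP connection to each host (9201, 9202, 9203) in turn.

```shell
# Illustrative failover loop (not the actual searchengine code): try each
# node in order and stop at the first reachable one. Availability is
# simulated by the up/down flag after the colon.
try_node() {
  # $1 = node name, $2 = "up" or "down"; a real check would curl the node
  [ "$2" = "up" ] && echo "connected to $1"
}
connected=""
for spec in "node1:down" "node2:down" "node3:up"; do
  name=${spec%%:*}
  state=${spec##*:}
  if try_node "$name" "$state"; then
    connected=$name
    break
  fi
done
echo "active node: ${connected:-none}"
```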
I have created some scripts to help the tester go through these checks. The PR should be tested on test112-searchengine. The scripts are saved in the /data/searchengine/test_package/ folder. The check_cluster_health.sh script can be used to check the cluster status at any time.
The searchEngine functions can be tested using the idr-testing website http://idr-testing.openmicroscopy.org/. Alternatively, the reviewer can use the test_searchengine.sh script to test the searchEngine functions. The script takes about 15 minutes to finish. The script output is saved to a text file, check_report.txt, in the /data/searchengine/searchengine/ folder.
Test cases:
If the data folder is corrupted or deleted, the Elasticsearch index data can be restored using the search engine restore function. The tester can test restoring the cluster by stopping the Elasticsearch nodes using the following commands:
bash stop_node.sh 1
bash stop_node.sh 2
bash stop_node.sh 3
Then the Elasticsearch data folder can be deleted using the following command:
bash delete_elasticsearch_nodes_data_folders.sh
Run this playbook command to restore the Elasticsearch cluster, but without data:
sudo ansible-playbook idr-searchengine.yml
The tester can restore the Elasticsearch data from the backup (snapshot) using the following command:
bash restore_elasticsearch_data.sh
It may take up to 15 minutes to restore the data.
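The restore test above is a fixed four-step sequence. The sketch below only labels and prints the steps so the ordering is explicit; on test112-searchengine each step maps to the script shown above (stop_node.sh for all nodes, delete_elasticsearch_nodes_data_folders.sh, the idr-searchengine.yml playbook, restore_elasticsearch_data.sh).

```shell
# Sketch of the restore flow as labelled steps (simulation only; the real
# scripts exist on test112-searchengine, so nothing is executed here).
steps="stop_all_nodes delete_data_folders redeploy_cluster restore_snapshot"
n=0
for step in $steps; do
  n=$((n + 1))
  echo "step $n: $step"
done
```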
Testing the cluster:
Stop one Elasticsearch node and check the searchEngine. The following command will stop node 1:
bash stop_node.sh 1
SearchEngine should work fine and the Elasticsearch cluster should work with the remaining two nodes; the status of the Elasticsearch cluster will be yellow for a while then it will turn green.
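Because the yellow-to-green recovery takes a little while, a polling loop makes the check less timing-sensitive. In this sketch the sequence of statuses is simulated; against the real cluster each iteration would read the "status" field from the health endpoint.

```shell
# Poll-until-green sketch; statuses are simulated here. Against the live
# cluster, each iteration would extract "status" from
#   curl -s localhost:9201/_cluster/health?pretty
final=""
for status in yellow yellow green; do
  echo "status: $status"
  if [ "$status" = "green" ]; then
    final=$status
    echo "cluster recovered"
    break
  fi
  # sleep 10   # polling interval against a live cluster
done
```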
Stop another Elasticsearch node. The following command will stop node 2:
bash stop_node.sh 2
The cluster should be down but the searchEngine should still work fine.
Restore the cluster nodes by starting an Elasticsearch node; for example, to start node 1:
bash run_node1.sh
So, even if two nodes are down, the searchEngine can still function.
It is possible to run the second node using the following command:
bash run_node2.sh
While reviewing this with @jburel, I realised it will require a new search engine backend and effectively has an undocumented dependency on https://github.com/ome/omero_search_engine/pull/61 which contains the relevant code to work with multiple nodes. @khaledk2 it is really critical we are transparent as a team about these cross-repository dependencies. Otherwise we run into the risk of deploying broken components to production environments. Ideally this should be stated in the PR description itself.
Practically, proposed next steps are:
- redeploy the searchengine VM from the HEAD of https://github.com/ome/omero_search_engine/pull/61 and use this container
- bump the omero_search_engine tag to 0.4.0
General comment: the bash scripts under /data/searchengine/test_package/test_scripts/ are useful for evaluating the system, but they are not in any repository as far as I can see. They should be added somewhere, and some should be adjusted to make them configurable.
After stopping one node:
curl localhost:9203/_cluster/health?pretty
{
"cluster_name" : "searchengine-cluster",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 9,
"active_shards" : 18,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
I can see one node being down, but the status is marked as green; it never turned yellow then green.
Edit: maybe I was too slow, it turns green in follow-up checks.
I stopped a second node:
sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a966fcbd8331 docker.elastic.co/elasticsearch/elasticsearch:7.16.2 "/bin/tini -- /usr/l…" 11 hours ago Up 11 hours 0.0.0.0:9203->9200/tcp, :::9203->9200/tcp, 0.0.0.0:9303->9300/tcp, :::9303->9300/tcp searchengine_elasticsearch_node3
c24e6f17f246 khaledk2/searchengine:test "bash run_app.sh run…" 16 hours ago Up 16 hours 0.0.0.0:5577->5577/tcp, :::5577->5577/tcp, 8080/tcp searchengine
Then
curl localhost:9203/_cluster/health?pretty
{
"error" : {
"root_cause" : [
{
"type" : "master_not_discovered_exception",
"reason" : null
}
],
"type" : "master_not_discovered_exception",
"reason" : null
},
"status" : 503
}
I understood that each node is a master, so the cluster should still be running.
I restarted node 1:
curl localhost:9203/_cluster/health?pretty
{
"cluster_name" : "searchengine-cluster",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 3,
"active_primary_shards" : 9,
"active_shards" : 18,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 7,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 4,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 4021,
"active_shards_percent_as_number" : 72.0
}
Then I restarted node 2:
curl localhost:9203/_cluster/health?pretty
{
"cluster_name" : "searchengine-cluster",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 3,
"active_primary_shards" : 9,
"active_shards" : 24,
"relocating_shards" : 0,
"initializing_shards" : 1,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 96.0
}
Clarification: the cluster is down after stopping 2 nodes, but the search still works with the one remaining node.
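The master_not_discovered_exception seen above is expected: being master-eligible is not enough on its own. Electing a master requires a majority (quorum) of the master-eligible nodes, which for a 3-node cluster is 2, so the cluster survives one node failure but not two. A small sketch of the arithmetic:

```shell
# Quorum arithmetic for master election: with n master-eligible nodes,
# a majority of (n/2)+1 is needed. For n=3 that is 2, which is why the
# cluster tolerates one node failure but not two.
total=3
quorum=$(( total / 2 + 1 ))
echo "quorum: $quorum"
for alive in 3 2 1; do
  if [ "$alive" -ge "$quorum" ]; then
    echo "$alive node(s) up -> master election possible"
  else
    echo "$alive node(s) up -> master_not_discovered"
  fi
done
```

This also matches the observation that the searchengine itself keeps working: it only needs one reachable node to answer queries, even when the cluster cannot elect a master.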
Tested various aspects of the set-up. Overall we can integrate that work; omero_search_engine to be bumped to 0.4.1.
@khaledk2 could you update the version cf. https://github.com/IDR/deployment/blob/master/ansible/group_vars/searchengine-hosts.yml#L9?
The version has been updated to 0.4.1.
@sbesson I think we can go ahead.
Two syntactic warnings.
Otherwise, as discussed on last Monday's IDR call, the plan is to recreate a new idr-testing environment (test112) after the upcoming prod111 release and include this PR in the deployment.