Handling HANA scale-out/failover usecases

elturkym commented 3 years ago

SCALE-OUT/FAILOVER

The idea is to run the exporter on all the HANA database hosts.
The exporter on each host checks that the master host is up.

On the master host, the exporter emits the HANA system metrics. The exporters on the other hosts stay in standby mode and reply to any curl request with only the Python metrics.

The exporter installation fails on all nodes if it cannot connect to the master node.

To enable this mode, a new config option, scale_out_mode, is added. It is a boolean flag and is false by default.
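
As a rough illustration of the behaviour described above, the sketch below decides whether an exporter should emit HANA metrics or stay in standby. It is not the code in this PR: the helper names `hosts_from_records` and `exporter_role` are hypothetical, and how the master host name is obtained is left as an assumption (it is simply passed in). The record shape matches the query records shown in the logs below.

```python
from typing import Iterable, List, Tuple


def hosts_from_records(records: Iterable[Tuple[str, ...]]) -> List[str]:
    """Flatten single-column query records, e.g. [('imdbmaster',)], into host names."""
    return [row[0] for row in records]


def exporter_role(scale_out_mode: bool, current_host: str, master_host: str) -> str:
    """Return 'active' if this exporter should emit HANA metrics, else 'standby'.

    Hypothetical helper: assumes the caller already knows the master host name.
    """
    if not scale_out_mode:
        # Flag disabled (the default): behave like a regular single-host exporter.
        return "active"
    return "active" if current_host == master_host else "standby"


# Records in the same shape as the connector log output.
records = [("imdbworker02",), ("imdbworker01",), ("imdbmaster",)]
hosts = hosts_from_records(records)
role = exporter_role(True, "imdbworker02", "imdbmaster")
```

With the flag enabled, a worker such as imdbworker02 ends up in standby, while the same call with the flag disabled keeps the exporter active.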

Examples:

Standby mode on worker nodes

INFO:shaptools.hdb_connector.connectors.base_connector:query records: [('imdbworker02',), ('imdbworker01',), ('imdbmaster',)]
INFO:hanadb_exporter.db_manager:Current HANA system hosts: ['imdbworker02', 'imdbworker01', 'imdbmaster']
current_host imdbworker02
INFO:hanadb_exporter.main:scale_out_mode mode is enabled
INFO:hanadb_exporter.main:Exporter is in stand by mode for scale-out handling
INFO:shaptools.hdb_connector.connectors.base_connector:executing sql query: SELECT HOST from M_LANDSCAPE_HOST_CONFIGURATION WHERE HOST_ACTIVE='YES'
INFO:shaptools.hdb_connector.connectors.base_connector:query records: [('imdbworker02',), ('imdbworker01',), ('imdbmaster',)]
INFO:hanadb_exporter.db_manager:Current HANA system hosts: ['imdbworker02', 'imdbworker01', 'imdbmaster']
INFO:hanadb_exporter.main:starting to serve metrics

curl response:

[ec2-user@imdbworker02 ~]$ curl localhost:9668
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 215.0
python_gc_objects_collected_total{generation="1"} 326.0
python_gc_objects_collected_total{generation="2"} 0.0
# HELP python_gc_objects_uncollectable_total Uncollectable object found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 91.0
python_gc_collections_total{generation="1"} 8.0
python_gc_collections_total{generation="2"} 0.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="8",patchlevel="4",version="3.8.4"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 4.07556096e+08
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 3.3619968e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.62826749985e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.22999999999999998
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 8.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024.0

Installation failure on all hosts when the connection to the system database fails

ERROR:hanadb_exporter.db_manager:the connection to the system database failed. error message: connection failed: [Errno -2] Name or service not known
Traceback (most recent call last):
  File "/home/ec2-user/.local/bin/hanadb_exporter", line 9, in <module>
    main.run()
  File "/home/ec2-user/.local/lib/python3.8/site-packages/hanadb_exporter/main.py", line 189, in run
    start(
  File "/home/ec2-user/.local/lib/python3.8/site-packages/hanadb_exporter/main.py", line 117, in start
    db_manager.start(
  File "/home/ec2-user/.local/lib/python3.8/site-packages/hanadb_exporter/db_manager.py", line 138, in start
    raise hdb_connector.connectors.base_connector.ConnectionError(
shaptools.hdb_connector.connectors.base_connector.ConnectionError: timeout reached connecting the System database

elturkym commented 3 years ago

Hi @arbulu89,

My initial thought was to split the changes into two PRs. Keeping one PR helps to avoid conflicts, but I will split them for better reviewing and scoping.
I missed that there is a unit test package. I will add unit tests and address the other comments.

Thanks,

Mohamed

elturkym commented 3 years ago

I have moved the secrets manager changes to this new pull request https://github.com/SUSE/hanadb_exporter/pull/97 as recommended.

I will keep this PR for scale-out handling.

elturkym commented 3 years ago

Hi @arbulu89,

I have updated this PR with the new modifications for scale-out handling. I used the same PR to keep the conversation history.

This approach depends on starting the exporter with the master host:

The exporter fetches all the hosts.
The exporter on the master node (or on a standalone machine) will only register the database connectors.
Worker nodes only test whether the master connection is up.
When the master connection fails and the standby host becomes master, all the hosts fetch the new active host with the system database and call db.start again with the new master host.
This approach doesn't block the HTTP calls either.
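
The failover step in the list above could look roughly like the following single health-check pass. This is only a sketch under assumptions, not the PR's implementation: `check_master` and its three callables are hypothetical stand-ins, where `is_reachable` represents a successful system database connection on a host, `fetch_active_hosts` represents the query for active hosts, and `restart_with` represents calling db.start again with the new master.

```python
from typing import Callable, List


def check_master(
    master_host: str,
    is_reachable: Callable[[str], bool],
    fetch_active_hosts: Callable[[], List[str]],
    restart_with: Callable[[str], None],
) -> str:
    """Run one health-check pass and return the host now treated as master."""
    if is_reachable(master_host):
        return master_host
    # Master connection failed: look for another active host that answers.
    for host in fetch_active_hosts():
        if host != master_host and is_reachable(host):
            restart_with(host)  # stands in for calling db.start again
            return host
    # No replacement found yet; keep the old master and retry on the next pass.
    return master_host


# Simulated failover: only imdbworker01 accepts connections.
restarted: List[str] = []
reachable = {"imdbworker01"}
new_master = check_master(
    "imdbmaster",
    lambda host: host in reachable,
    lambda: ["imdbworker02", "imdbworker01", "imdbmaster"],
    restarted.append,
)
```

Running such a pass from a background timer rather than inside the request handler is one way to keep the HTTP calls unblocked, although how the PR schedules the check is not shown here.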

Unit tests and documentation are still pending, but I will add them while addressing any comments you have on these changes.

arbulu89 commented 3 years ago

Hi @elturkym, I'm back at work. I will have a look at this in the first days of the week and get back to you with my feedback.

elturkym commented 3 years ago

> Hi @elturkym, many things commented below. I think we need to rethink many things.

> I think many parts of the code must be moved to the database manager, which should handle scale-out connections

I have replied to all the comments, and we can discuss offline.

> We should most probably return some metric for standby nodes, otherwise they just don't do anything, and I don't know why we should collect their information

It returns Python metrics such as python_info{implementation="CPython",major="3",minor="8",patchlevel="4",version="3.8.4"} 1.0, as mentioned in the description.

> I don't really like the idea of connecting to the master node from all the active nodes. If this is the case, don't they all return the same values? Is this something logical? (Or am I missing something?) If the data is duplicated, maybe we should just return a metric stating their role and only return data from the master (the first thing that came to my mind)

I am expecting to have only one active master node at a time; the data should be duplicated. Currently the workers return some Python metrics, as mentioned in the description. Let me know if you think we should add a specific metric to indicate that a node is on standby; I am not sure what the cost of that would be.
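
For reference, a role metric of the kind discussed here could be as small as one gauge sample. The sketch below only renders it in the Prometheus text exposition format by hand; the metric name `hanadb_exporter_node_role` and its `host` and `role` labels are hypothetical and not part of the exporter.

```python
def render_role_metric(host: str, role: str) -> str:
    """Render a single gauge sample describing this exporter's role."""
    name = "hanadb_exporter_node_role"  # hypothetical metric name
    return "\n".join(
        [
            f"# HELP {name} Role of this exporter in the scale-out landscape",
            f"# TYPE {name} gauge",
            # Doubled braces produce the literal { } around the label set.
            f'{name}{{host="{host}",role="{role}"}} 1.0',
        ]
    )


sample = render_role_metric("imdbworker02", "standby")
```

The cost on a standby node would be one extra time series per host, in addition to the Python and process metrics it already returns.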

lee-martin commented 2 years ago

@arbulu89 and @stefanotorresi Is there any update on where we are with this? We had planned to ship this and https://github.com/SUSE/hanadb_exporter/pull/97 , see https://jira.suse.com/browse/SLE-20632 .

elturkym commented 2 years ago

Hi All,

I am going to close this PR. Since the priorities have changed on our end, I won't be able to continue working on this feature this year.

Thanks so much for helping me with this PR and https://github.com/SUSE/hanadb_exporter/pull/97 as well.

Looking forward to working with you again.

Best regards,

Mohamed Elturky

SUSE / hanadb_exporter

Handling HANA scale-out/failover usecases #94
