cityindex-attic / logsearch

[unmaintained] A development environment for ELK
Apache License 2.0

Verify our day/night scaling #327

Closed dpb587 closed 10 years ago

dpb587 commented 10 years ago

Once we have a smaller cluster this becomes easier; just make sure the scaling script is working properly without heavy performance impact.

dpb587 commented 10 years ago

I tested this before and after Thursday's business hours.

I scaled up from an incomplete snapshot; it took time to replicate the remaining half of the data:

...snip...
2014-02-05T04:03:48Z = we will be scaling up
...snip...
2014-02-05T04:03:50Z + updating stack...
...snip...
2014-02-05T04:10:04Z > node 85gptV-_T_qzfJJg-_TWHg (10.238.126.63) joined the cluster
...snip...
2014-02-05T05:59:10Z - cluster is 'green'

It was quick to scale down after hours...

...snip...
2014-02-06T17:35:26Z = we will be scaling down
...snip...
2014-02-06T17:35:29Z + updating stack...
...snip...
2014-02-06T17:36:22Z > node 2mFg3YYzS-GqGT9721c0Lw (10.238.126.63) left the cluster
...snip...
2014-02-06T17:36:34Z - cluster is 'green'

I'm not sure why the node ID changed, but other than that it all behaved as expected.
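
For reference, the "cluster is 'green'" lines come from a cluster health check; a minimal sketch of that kind of check follows (the host, port, and timeout are placeholders, not necessarily what the scaling script actually uses):

```ruby
# Minimal sketch of a "wait until green" check via the cluster health API.
# Host, port, and timeout are placeholder assumptions.
require 'net/http'
require 'json'
require 'time'

def wait_for_green(host = 'localhost', port = 9200, timeout = '30m')
  uri = URI("http://#{host}:#{port}/_cluster/health?wait_for_status=green&timeout=#{timeout}")
  health = JSON.parse(Net::HTTP.get(uri))
  puts "#{Time.now.utc.iso8601} - cluster is '#{health['status']}'"
  health['status'] == 'green'
end

wait_for_green
```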

dpb587 commented 10 years ago

I tested this again, but this time discovered an issue which should be fixed.

I scaled up from 2 nodes, 1 replica to 4 nodes, 2 replicas; it took some time for one node to reload from its fresh snapshot:

...snip...
2014-02-06T04:11:28Z = we will be scaling up
...snip...
2014-02-06T04:11:30Z + updating stack...
...snip...
2014-02-06T04:18:09Z > node f6u7d6NGQB-Mzm2e7OBlEg (10.239.23.98) joined the cluster
2014-02-06T04:18:41Z > node fIea437CS5m97U7SkEOH4Q (10.237.153.41) joined the cluster
...snip...
2014-02-06T05:04:37Z - cluster is 'green'

This time, however, once the elasticsearch scaling settings were initially sent, it caused us to stop processing the queue until 04:43:30. I couldn't figure out why at first, since I hadn't noticed it the previous night. Eventually I realized that with the additional replicas, elasticsearch wasn't indexing data until all replicas had been written (there is an async option to avoid this, but logstash isn't currently using it). By 04:43, both new nodes had allocated the logstash-2014.02.07 indices and the queue was quickly emptied.

I believe this can be fixed by updating the scaling script to send a cluster reroute command for the current day's index to get it back online faster. However, I think there will continue to be a slight delay during scale-up operations until the current date's index is allocated - I'll think on that more.
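
For reference, bumping the replica count on an index is just a settings call along these lines (sketch only - the index name, replica count, and endpoint here are placeholders rather than what the script actually sends):

```ruby
# Sketch: raise the replica count on the current day's index via the
# index settings API. Names and values are illustrative only.
require 'net/http'
require 'json'

def set_replicas(index, replicas, host = 'localhost', port = 9200)
  uri = URI("http://#{host}:#{port}/#{index}/_settings")
  req = Net::HTTP::Put.new(uri, 'Content-Type' => 'application/json')
  req.body = { 'index' => { 'number_of_replicas' => replicas } }.to_json
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
end

set_replicas("logstash-#{Time.now.utc.strftime('%Y.%m.%d')}", 2)
```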

I went ahead and scaled down again too:

...snip...
2014-02-06T22:09:40 = we will be scaling down
...snip...
2014-02-06T22:09:43 + updating stack...
...snip...
2014-02-06T22:10:28 > node gW8rzew3TW2GP0uLftAZiQ (10.239.23.98) left the cluster
2014-02-06T22:10:28 > node puFR7m6TQECxrbfKL4fdHQ (10.237.153.41) left the cluster
...snip...
2014-02-06T22:10:39 - cluster is 'green'

mrdavidlaing commented 10 years ago

@dpb587 - most interesting; thanks for all this experimenting. I'm seeing that:

In your experiments, how are you adding data to the new node? How much data is being copied?

Does it make any difference if you ensure that the day/night nodes reuse persistent volumes (ie, ones that shouldn't be more than 12 hours out of date)?

dpb587 commented 10 years ago

I was thinking about this more and I'm fairly certain the "cluster reroute" call wouldn't actually fix the delay issue I mentioned. Those shards are being initialized, not relocated, and I haven't seen anything suggesting you can control the order in which shards are initialized - it seems like it'd be an odd feature to have.

I have now realized what caused the lag to become an issue. Increasing from one replica to three meant the quorum for data consistency could not be met, so elasticsearch was waiting until a third replica was online. This was not an issue during the first test because only one replica was added. The lag will not be an issue as long as we only ever increase replicas by one. The more likely scenario is to increase replicas by one and nodes by 2+, which wouldn't cause lag either.
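
For reference, the write quorum elasticsearch requires (per the 1.x docs) is int((primary + number_of_replicas) / 2) + 1 active shard copies, which lines up with the behaviour above; a quick sketch:

```ruby
# Write-consistency quorum per the Elasticsearch 1.x docs:
# int((primary + number_of_replicas) / 2) + 1 active shard copies.
def write_quorum(number_of_replicas)
  (1 + number_of_replicas) / 2 + 1
end

write_quorum(1) # => 2 copies (primary + one replica)
write_quorum(2) # => 2 copies, so adding a single replica doesn't block writes
write_quorum(3) # => 3 copies required before a write is acknowledged
```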

Regarding your note about it taking quite a long time, I agree. However, these two scaling sessions were intentionally hefty: the first required copying half the cluster's data, and the second started from a completely new snapshot (which, as noted with our previous deploy methods, means very slow initial loading).

They're currently using persistent volumes, and when a volume is reused it is very quick to reload the metadata from disk. My next scaling test will be slightly more typical - 1 replica and 2 nodes, fully reusing existing data - but it may not be an accurate time test. The test after that will be the most accurate, allowing for some decisions to be made.

sopel commented 10 years ago

Great analysis, the quorum impact in particular. Looking forward to the results of the next tests accordingly - this stuff is most interesting indeed :)

dpb587 commented 10 years ago

Just had another thought on a different scaling strategy which might be exciting. Example:

It would require supporting multiple elasticsearch nodes on the same instance (overriding default memory limits and using custom data directory paths), dynamically mounting volumes, and dynamically controlling services on the instances.
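
To make that concrete, a purely illustrative sketch of what starting a second node on an instance might look like - the paths, ports, heap size, and the 0.90/1.x-style -Des.* overrides are all placeholder assumptions:

```ruby
# Purely illustrative: spawn a second elasticsearch node on the same instance
# with its own heap, data directory, and ports (1.x-era -Des.* overrides).
# Every path, port, and size below is a made-up placeholder.
env = { 'ES_HEAP_SIZE' => '2g' }
pid = spawn(env, '/path/to/elasticsearch/bin/elasticsearch',
            '-Des.node.name=extra-node-b',
            '-Des.path.data=/mnt/volume-b/elasticsearch',
            '-Des.http.port=9201',
            '-Des.transport.tcp.port=9301')
Process.detach(pid)
```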

Update - named "multiple data volumes / node" strategy

dpb587 commented 10 years ago

Scaled up with 1 more replica, two nodes:

...snip...
2014-02-07T18:24:57Z = we will be scaling up
...snip...
2014-02-07T18:25:00Z + updating stack...
...snip...
2014-02-07T18:31:28Z > node rJnHQHKoTf2d4SwQN7JdAQ (10.236.90.230) joined the cluster
2014-02-07T18:31:28Z > node vnEgv_tORQmwXtbS7w3PjA (10.238.127.60) joined the cluster
...snip...
2014-02-07T18:41:15Z - cluster is 'green'

Performance was much better given both nodes had existing volumes. One node started trashing its duplicate replica data early (expected), requiring subsequent rebalancing.

Scale down was as expected:

...snip...
2014-02-07T21:09:43Z = we will be scaling down
...snip...
2014-02-07T21:09:48Z + updating stack...
...snip...
2014-02-07T21:10:33Z > node EX611JzHTQW2_NsX65YEUQ (10.238.127.60) left the cluster
2014-02-07T21:10:33Z > node BmEch7DJQG6v8WLmvf0cbQ (10.236.90.230) left the cluster
...snip...
2014-02-07T21:10:45Z - cluster is 'green'

I'll do the next scaling Monday night. It'll have additional weekend data to sync with, but it'll now be what I expect our "typical" case to look like, and I hope it will be fairly quick.

dpb587 commented 10 years ago

Scaled up with 1 more replica, two nodes:

...snip...
2014-02-10T21:02:17 = we will be scaling up
...snip...
2014-02-10T21:02:19 + updating stack...
...snip...
2014-02-10T21:08:44 > node MJlqa0SqReGhO3Hdu7MXeA (10.241.61.32) joined the cluster
2014-02-10T21:08:55 > node PBA5zA1MRcSuqy-lYKn8-A (10.238.158.150) joined the cluster
...snip...
2014-02-10T21:21:44 - cluster is 'green'

It took <20 minutes to get things loaded and synced back up with the weekend data, which is marginally better than I was expecting.

Scale down was as expected.

...snip...
2014-02-10T21:36:17 = we will be scaling down
...snip...
2014-02-10T21:36:19 + updating stack...
...snip...
2014-02-10T21:37:05 > node 1rFgGPjrTD-LDVuqwYjx-A (10.241.61.32) left the cluster
2014-02-10T21:37:05 > node PBA5zA1MRcSuqy-lYKn8-A (10.238.158.150) left the cluster
...snip...
2014-02-10T21:37:17 - cluster is 'green'

mrdavidlaing commented 10 years ago

@dpb587 - a thought - these logs should be shipped into our labs cluster for long-term analysis...

mrdavidlaing commented 10 years ago

@dpb587 - another thought - Can we run this script through our new Jenkins build server?

dpb587 commented 10 years ago

The script can run wherever we like so long as:

mrdavidlaing commented 10 years ago

Super; the Jenkins build server fulfills both of those requirements (since it has an elastic IP).

Which version of Ruby?

dpb587 commented 10 years ago

I'm using 2.0.0p247 locally, but the scripts are very simple with regard to version requirements and should run on 1.9.3 as well.
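
If it ends up on Jenkins, a hypothetical guard at the top of the script could make that assumption explicit:

```ruby
# Hypothetical guard: fail fast if the Ruby running this script is older
# than the version it has been tested against.
if Gem::Version.new(RUBY_VERSION) < Gem::Version.new('1.9.3')
  abort "expected Ruby >= 1.9.3, running #{RUBY_VERSION}"
end
```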