Use only absolute time in `UnassignedInfo`

elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine

https://www.elastic.co/products/elasticsearch

Other

1.22k stars 24.85k forks source link

Use only absolute time in `UnassignedInfo` #95800

Open DaveCTurner opened 1 year ago

DaveCTurner commented 1 year ago

Today we initialise UnassignedInfo#unassignedTimeNanos to System.nanoTime() locally on every node and use this for computing delayed allocation. This number is not sent over the wire (which makes sense as it's not comparable across nodes) so different nodes will have subtly different ClusterState values due to this.

I propose we drop this behaviour and instead rely on UnassignedInfo#unassignedTimeMillis for computing delayed allocation. This number is an absolute time so can be shared over the wire, and it'll work the same as long as the master nodes all have clocks that are well-enough synchronized. We rely on master nodes having accurate clocks in various other places (ILM and SLM scheduling for instance) and even if they don't then nothing unsafe will happen.

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-distributed (Team:Distributed)

henningandersen commented 1 year ago

Only using millis for that makes sense to me, also avoids problems like:

https://github.com/elastic/elasticsearch/blob/db17d38c55938f40e78f30468fbf690c94a2932e/server/src/main/java/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java#L454

(that comparison is in principle unsafe).

However, the nanos were explicitly introduced in https://github.com/elastic/elasticsearch/pull/14808/files#diff-eeb1b984a52beb0adebf345c84952921f7797499bca02d79f45241927835508a and I am not exactly sure I understand why?

DaveCTurner commented 1 year ago

I am not exactly sure I understand why?

Nor I. But to speculate: 2015 was around the time that folks were running Jepsen tests against Elasticsearch, and possibly these delays helped avoid some anomalies and/or avoided some invalid assumptions about clock synchronisation. I think the reasons are lost in time these days however. (I can't think of a security issue in this area, although if there were then this'd be a good way to fix it quietly - maybe @tvernum would know?)

volodk85 commented 1 year ago

I think I know why nanoTime is used here. In short, it is because of its monotonically increasing properly to schedule delayed allocation. Key moving components involved here are 1) AllocationService 2) ShardAllocator and 3) DelayedAllocationService

If AllocationService detects delayed shards, shard allocator does not move forward with a NO decision Then DelayedAllocationService comes into play.

It calculates delay time (X nanos) and schedules its next execution after X nanos. Important details here that delay calculation is based on System.nanoTime. Taking its non-decreasing property (within same JVM) it is guaranteed that eventually Unassigned delay drops to 0 and shard will be moved by the allocator.

Changing UnassignedInfo#getRemainingDelay logic to rely only on millis value could potentially increase time when shard sits in unallocated state due to clock drift anomaly. Doing that we need to ensure that delay time forms non-increasing sequence to guarantee eventual shard allocation.

tvernum commented 1 year ago

I can't think of a security issue in this area ... maybe @tvernum would know?

I'm not aware of one. Any answer more conclusive than that would require that I invest more time into understanding the original commit and the proposed change. Would that be helpful?