Open TomonoriSoejima opened 4 years ago
[failed shard, shard [index-2020.06.20-000115][7], node[aaa], relocating [BVBS4C1uR0SRCGBEGWTTPg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=oeYkS4F6QjmWkRBvo-x4Gw, rId=ByuTpxo_S5-zXnVH1QvAIg], expected_shard_size[38786827396], message [failed recovery], failure [RecoveryFailedException[[ipfixdata-2020.06.20-000115][7]: Recovery failed from {ccc}{t1vqzdlNQnyCSolqHhYFAw}{i1-SgYkTTgyn_kFuxG-dkg}{172.25.178.85}{172.25.178.85:9301}{dl}{ml.machine_memory=540447649792, rack=r3, ml.max_open_jobs=20, xpack.installed=true} into {aaa}{DQs9Qvg2TYa__C6_l924MQ}{5oIy4EgJQ0OIWiA0A9mN5A}{172.19.226.76}{172.19.226.76:9301}{dl}{ml.machine_memory=540447649792, rack=r8, xpack.installed=true, ml.max_open_jobs=20}]; nested: RemoteTransportException[[ccc][172.25.178.85:9301][internal:index/shard/recovery/start_recovery]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [30665603222/28.5gb], which is larger than the limit of [30079536332/28gb], real usage: [30664554016/28.5gb], new bytes reserved: [1049206/1mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=2098436/2mb, accounting=0/0b]]; ], markAsStale [true]]
The message above is built by the following toString() method:
@Override
public String toString() {
return "failed shard, shard " + routingEntry + ", message [" + message + "], failure [" +
ExceptionsHelper.detailedMessage(failure) + "], markAsStale [" + markAsStale + "]";
}
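To make the layout concrete, here is a minimal sketch that splits such a message back into the four values concatenated by toString(). The class name FailedShardMessage and its parse method are hypothetical helpers, not part of Elasticsearch. The sketch assumes the input is the bare toString() output, without the outer brackets the logger adds, and that the message text does not itself contain "], failure [".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an Elasticsearch class.
class FailedShardMessage {
    // The literal separators are the ones concatenated in toString().
    private static final Pattern LAYOUT = Pattern.compile(
        "^failed shard, shard (.*?), message \\[(.*?)\\], failure \\[(.*)\\], markAsStale \\[(true|false)\\]$",
        Pattern.DOTALL);

    final String routingEntry;
    final String message;
    final String failure;
    final boolean markAsStale;

    private FailedShardMessage(String routingEntry, String message, String failure, boolean markAsStale) {
        this.routingEntry = routingEntry;
        this.message = message;
        this.failure = failure;
        this.markAsStale = markAsStale;
    }

    // Splits the text into its four parts, or throws if the layout does not match.
    static FailedShardMessage parse(String text) {
        Matcher m = LAYOUT.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not a failed shard message");
        }
        return new FailedShardMessage(m.group(1), m.group(2), m.group(3), Boolean.parseBoolean(m.group(4)));
    }

    public static void main(String[] args) {
        // Shortened, made-up sample in the same layout as the real message.
        FailedShardMessage parsed = parse("failed shard, shard [idx][7], node[aaa], [R], s[INITIALIZING], "
            + "message [failed recovery], failure [RecoveryFailedException[[idx][7]: Recovery failed]; ], "
            + "markAsStale [true]");
        System.out.println("routingEntry = " + parsed.routingEntry);
        System.out.println("message      = " + parsed.message);
        System.out.println("failure      = " + parsed.failure);
        System.out.println("markAsStale  = " + parsed.markAsStale);
    }
}
```

The failure group is matched greedily so that nested brackets inside the exception text stay together; the routing entry and message groups are matched lazily so they stop at the first separator.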
routingEntry
[index-2020.06.20-000115][7], node[aaa], relocating [BVBS4C1uR0SRCGBEGWTTPg], [R],
recovery_source[peer recovery],
s[INITIALIZING], a[id=oeYkS4F6QjmWkRBvo-x4Gw, rId=ByuTpxo_S5-zXnVH1QvAIg], expected_shard_size[38786827396]
[index-2020.06.20-000115][7]: shard id 7 of index index-2020.06.20-000115.
node[aaa]: the node this shard copy is assigned to (aaa).
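relocating [BVBS4C1uR0SRCGBEGWTTPg]: the other node involved in the relocation. Because this shard copy is in state INITIALIZING, it is the relocation target, so BVBS4C1uR0SRCGBEGWTTPg is the node it is moving from. For a copy in state RELOCATING, this field would be the node it is moving to.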
[R] means replica, while [P] stands for primary.
recovery_source[peer recovery]: peer recovery means recovery from a primary on another node. The available recovery types are listed in the Javadoc of RecoverySource:
/**
* Represents the recovery source of a shard. Available recovery types are:
*
* - {@link EmptyStoreRecoverySource} recovery from an empty store
* - {@link ExistingStoreRecoverySource} recovery from an existing store
* - {@link PeerRecoverySource} recovery from a primary on another node
* - {@link SnapshotRecoverySource} recovery from a snapshot
* - {@link LocalShardsRecoverySource} recovery from other shards of another index on the same node
*/
s[INITIALIZING]: the current state of the shard copy (one of UNASSIGNED, INITIALIZING, STARTED, RELOCATING).
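a[id=oeYkS4F6QjmWkRBvo-x4Gw, rId=ByuTpxo_S5-zXnVH1QvAIg]: the allocation ID of this shard copy. Every shard copy, primary or replica, has its own allocation ID.
id=oeYkS4F6QjmWkRBvo-x4Gw: allocation ID
rId=ByuTpxo_S5-zXnVH1QvAIg: relocation ID
expected_shard_size[38786827396]: expected shard size in bytes. Here 38786827396 bytes is about 36.12GB (counting 1GB as 1024^3 bytes).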
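The field breakdown above can be sketched as a small parser. This is an illustration only: RoutingEntryFields, first, parse and toGiB are hypothetical names, and the regular expressions are written against the single sample entry in this issue, so other routing entries (for example unassigned shards) may need different patterns.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an Elasticsearch class.
class RoutingEntryFields {
    // Returns the first capture group of regex in text, or null when the field is absent.
    static String first(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Extracts the fields described above from one routing entry string.
    static Map<String, String> parse(String entry) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("index", first("^\\[(.+?)\\]\\[\\d+\\]", entry));
        fields.put("shard", first("^\\[.+?\\]\\[(\\d+)\\]", entry));
        fields.put("node", first("node\\[([^\\]]*)\\]", entry));
        fields.put("relocating", first("relocating \\[([^\\]]*)\\]", entry));
        fields.put("role", first(", \\[(P|R)\\]", entry));
        fields.put("state", first(" s\\[([A-Z_]+)\\]", entry));
        fields.put("allocationId", first("a\\[id=([^,\\]]+)", entry));
        fields.put("relocationId", first("rId=([^,\\]]+)", entry));
        fields.put("expectedShardSize", first("expected_shard_size\\[(\\d+)\\]", entry));
        return fields;
    }

    // Converts bytes to gigabytes, counting 1GB as 1024^3 bytes.
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        String entry = "[index-2020.06.20-000115][7], node[aaa], relocating [BVBS4C1uR0SRCGBEGWTTPg], [R], "
            + "recovery_source[peer recovery], s[INITIALIZING], "
            + "a[id=oeYkS4F6QjmWkRBvo-x4Gw, rId=ByuTpxo_S5-zXnVH1QvAIg], expected_shard_size[38786827396]";
        Map<String, String> fields = parse(entry);
        fields.forEach((name, value) -> System.out.println(name + " = " + value));
        long size = Long.parseLong(fields.get("expectedShardSize"));
        System.out.println(String.format(Locale.ROOT, "size in GB = %.2f", toGiB(size)));
    }
}
```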
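message
The text "failed recovery" is the string passed to failAndRemoveShard:
failAndRemoveShard(shardRouting, sendShardFailure, "failed recovery", failure, clusterService.state());
ExceptionsHelper.detailedMessage(failure)
The first part of the failure text is the exception message built by the RecoveryFailedException constructor: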
public RecoveryFailedException(ShardId shardId,
DiscoveryNode sourceNode,
DiscoveryNode targetNode,
@Nullable String extraInfo,
Throwable cause) {
super(shardId + ": Recovery failed " + (sourceNode != null ? "from " + sourceNode + " into " : "on ") +
targetNode + (extraInfo == null ? "" : " (" + extraInfo + ")"), cause);
}
For reference, this is the full line as it appears in the Elasticsearch log:
[2020-06-24T14:10:02,478][WARN ][o.e.c.r.a.AllocationService] [xxx] failing shard [failed shard, shard [index-2020.06.20-000115][7], node[aaa], relocating [BVBS4C1uR0SRCGBEGWTTPg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=oeYkS4F6QjmWkRBvo-x4Gw, rId=ByuTpxo_S5-zXnVH1QvAIg], expected_shard_size[38786827396], message [failed recovery], failure [RecoveryFailedException[[ipfixdata-2020.06.20-000115][7]: Recovery failed from {ccc}{t1vqzdlNQnyCSolqHhYFAw}{i1-SgYkTTgyn_kFuxG-dkg}{172.25.178.85}{172.25.178.85:9301}{dl}{ml.machine_memory=540447649792, rack=r3, ml.max_open_jobs=20, xpack.installed=true} into {aaa}{DQs9Qvg2TYa__C6_l924MQ}{5oIy4EgJQ0OIWiA0A9mN5A}{172.19.226.76}{172.19.226.76:9301}{dl}{ml.machine_memory=540447649792, rack=r8, xpack.installed=true, ml.max_open_jobs=20}]; nested: RemoteTransportException[[ccc][172.25.178.85:9301][internal:index/shard/recovery/start_recovery]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [30665603222/28.5gb], which is larger than the limit of [30079536332/28gb], real usage: [30664554016/28.5gb], new bytes reserved: [1049206/1mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=2098436/2mb, accounting=0/0b]]; ], markAsStale [true]]
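The innermost nested exception carries the actual cause: the parent circuit breaker rejected the start_recovery request. As a rough sketch, the numbers can be pulled out of the message and checked against each other. ParentBreakerCheck and bytesAfter are hypothetical names, and the relationship "would be = real usage + new bytes reserved" is read off the figures in this log line rather than taken from a documented contract.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an Elasticsearch class.
class ParentBreakerCheck {
    // Extracts the raw byte count that follows a label, e.g. "would be [30665603222/28.5gb]".
    static long bytesAfter(String label, String text) {
        Matcher m = Pattern.compile(Pattern.quote(label) + " \\[(\\d+)/").matcher(text);
        if (!m.find()) {
            throw new IllegalArgumentException("label not found: " + label);
        }
        return Long.parseLong(m.group(1));
    }

    public static void main(String[] args) {
        // Copied from the CircuitBreakingException part of the log line.
        String text = "[parent] Data too large, data for [<transport_request>] would be [30665603222/28.5gb], "
            + "which is larger than the limit of [30079536332/28gb], real usage: [30664554016/28.5gb], "
            + "new bytes reserved: [1049206/1mb]";

        long wouldBe = bytesAfter("would be", text);
        long limit = bytesAfter("limit of", text);
        long realUsage = bytesAfter("real usage:", text);
        long reserved = bytesAfter("new bytes reserved:", text);

        // In this log line the projected usage equals current usage plus the new reservation.
        System.out.println("sum matches: " + (realUsage + reserved == wouldBe));
        // The request is rejected because the projected usage is above the limit.
        System.out.println("over limit:  " + (wouldBe > limit));
        System.out.println("excess bytes: " + (wouldBe - limit));
    }
}
```

Note that the real usage alone was already above the limit here, so the 1mb reserved for this request only tipped a heap that was already under pressure.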