elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Analyze disk usage action doesn't respect `error_trace` #84816

Closed: nik9000 closed this issue 2 years ago

nik9000 commented 2 years ago

Elasticsearch Version

master branch

Installed Plugins

none

Java Version

bundled

OS Version

Linux porco 5.16.10-arch1-1 #1 SMP PREEMPT Wed, 16 Feb 2022 19:35:18 +0000 x86_64 GNU/Linux

Problem Description

If I run into an error running the `_disk_usage` API and send `error_trace=true`, I don't get back the stack trace in the response. It isn't logged anywhere either.

Steps to Reproduce

Run ES from the master branch, then run the Rally geopointshape track:

esrally race --track-path ../rally-tracks/geopointshape/ --test-mode --pipeline benchmark-only --client-options="basic_auth_user:'elastic',basic_auth_password:'password'"

Fetch disk telemetry:

curl -uelastic:password -XPOST 'localhost:9200/osmgeoshapes/_disk_usage?error_trace=true&run_expensive_tasks=true'

Returns:

{"_shards":{"total":1,"successful":0,"failed":1,"failures":[{"shard":0,"index":"osmgeoshapes","status":"INTERNAL_SERVER_ERROR","reason":{"type":"array_index_out_of_bounds_exception","reason":"Array index out of range: 20"}}]}}

There's some other error hiding in here. It shouldn't get index out of range. But life happens sometimes.
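For comparison, when error_trace=true is honored each entry in failures should carry a stack_trace string alongside type and reason. Roughly the shape I'd expect (illustrative, stitched together from the response above and the trace in my follow-up comment below, not captured output):

"reason": {
  "type": "array_index_out_of_bounds_exception",
  "reason": "Array index out of range: 20",
  "stack_trace": "java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 20\n\tat java.base/java.util.Arrays.rangeCheck(Arrays.java:725)\n\t..."
}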

Logs (if relevant)

No response

elasticmachine commented 2 years ago

Pinging @elastic/es-search (Team:Search)

nik9000 commented 2 years ago

Here's the error trace I'd expect:

[[osmgeoshapes][0] failed, reason [[osmgeoshapes/tt5UjNGDTDGjLQpYIJ7uTQ][[osmgeoshapes][0]] org.elasticsearch.action.support.broadcast.BroadcastShardOperationFailedException:
    at org.elasticsearch.action.support.broadcast.TransportBroadcastAction$AsyncBroadcastAction.setFailure(TransportBroadcastAction.java:271)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastAction$AsyncBroadcastAction.onOperation(TransportBroadcastAction.java:213)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastAction$AsyncBroadcastAction$1.handleException(TransportBroadcastAction.java:191)
    at org.elasticsearch.transport.TransportService$4.handleException(TransportService.java:724)
    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1349)
    at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1458)
    at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1432)
    at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:50)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastAction$ShardTransportHandler.lambda$messageReceived$1(TransportBroadcastAction.java:299)
    at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:144)
    at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:77)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:764)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:28)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.elasticsearch.transport.RemoteTransportException: [runTask-0][127.0.0.1:9300][indices:admin/analyze_disk_usage[s]]
Caused by: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 20
    at java.base/java.util.Arrays.rangeCheck(Arrays.java:725)
    at java.base/java.util.Arrays.compareUnsigned(Arrays.java:5919)
    at org.elasticsearch.action.admin.indices.diskusage.IndexDiskUsageAnalyzer$PointsVisitor.compare(IndexDiskUsageAnalyzer.java:421)
    at org.apache.lucene.index.PointValues.intersect(PointValues.java:342)
    at org.apache.lucene.index.PointValues.intersect(PointValues.java:337)
    at org.elasticsearch.action.admin.indices.diskusage.IndexDiskUsageAnalyzer.analyzePoints(IndexDiskUsageAnalyzer.java:389)
    at org.elasticsearch.action.admin.indices.diskusage.IndexDiskUsageAnalyzer.doAnalyze(IndexDiskUsageAnalyzer.java:112)
    at org.elasticsearch.action.admin.indices.diskusage.IndexDiskUsageAnalyzer.analyze(IndexDiskUsageAnalyzer.java:86)
    at org.elasticsearch.action.admin.indices.diskusage.TransportAnalyzeIndexDiskUsageAction.shardOperation(TransportAnalyzeIndexDiskUsageAction.java:94)
    at org.elasticsearch.action.admin.indices.diskusage.TransportAnalyzeIndexDiskUsageAction.shardOperation(TransportAnalyzeIndexDiskUsageAction.java:43)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastAction.lambda$asyncShardOperation$0(TransportBroadcastAction.java:317)
    at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47)
    at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:776)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
]]
nik9000 commented 2 years ago

This seems to be caused by minPackedValue and maxPackedValue coming back as 16 bytes while our offset is also 16, so the comparison starts past the end of the packed value. I don't actually know how any of this works, so I have no idea whether this is an ES bug, a Lucene one, or something else.
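To make that concrete, here's a minimal standalone Java sketch (not Elasticsearch code; the 16-byte packed value, the offset of 16, and 4 bytes per dimension are taken from the numbers above and from the error message, and the real layout in IndexDiskUsageAnalyzer may differ) showing how Arrays.compareUnsigned produces exactly this exception when the requested range runs past the end of the packed value:

import java.util.Arrays;

public class PackedValueOverrunSketch {
    public static void main(String[] args) {
        byte[] minPackedValue = new byte[16]; // the 16-byte min/max packed value the reader returns
        byte[] queryPoint = new byte[20];     // hypothetical buffer the visitor compares against
        int offset = 16;                      // comparison starts at byte 16, i.e. past the end of a 16-byte value
        int bytesPerDim = 4;                  // requested range is [16, 20)
        // Arrays.compareUnsigned range-checks [offset, offset + bytesPerDim) against minPackedValue.length
        // and throws java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 20
        Arrays.compareUnsigned(minPackedValue, offset, offset + bytesPerDim,
                               queryPoint, offset, offset + bytesPerDim);
    }
}

Whatever feeds that offset is where the real bug is; this just shows why the message reports index 20.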

dnhatn commented 2 years ago

I think this is a more general issue than the disk_usage API. I have opened https://github.com/elastic/elasticsearch/issues/84831. I will close this issue and dig into the point values problem.

dnhatn commented 2 years ago

@nik9000 I can reproduce the issue with the geopointshape track and have the fix in https://github.com/elastic/elasticsearch/pull/84909.