elastic / support-diagnostics

Support diagnostics utility for elasticsearch and logstash

monitoring export: Read timeout causes truncated data without retrying #681

Open jguay opened 5 months ago

jguay commented 5 months ago

Here is the output from extract.log:

15:30:29.403 [main] INFO  co.elastic.support.monitoring.MonitoringExportService - Now extracting index_stats...
15:30:40.905 [main] INFO  co.elastic.support.monitoring.MonitoringExportService - 2920748 documents retrieved. Writing to disk.
15:30:40.929 [main] INFO  co.elastic.support.monitoring.MonitoringExportService - 1000 of 2920748 processed.
15:30:51.980 [main] INFO  co.elastic.support.monitoring.MonitoringExportService - 2000 of 2920748 processed.
15:31:03.594 [main] INFO  co.elastic.support.monitoring.MonitoringExportService - 3000 of 2920748 processed.
15:31:33.481 [main] INFO  co.elastic.support.monitoring.MonitoringExportService - 4000 of 2920748 processed.
15:33:33.487 [main] ERROR co.elastic.support.rest.RestClient - Unexpected Execution Error
java.net.SocketTimeoutException: Read timed out

As a result, index_stats.json has exactly 4000 lines, and the diagnostics execution did not fail.

There are two issues here:
1- A network read timeout, which in this case is likely caused by a performance problem on the monitoring cluster, is a retriable error (as long as the scroll ID has not expired), so we should probably retry, for example 3 times.
2- It may be better to fail the whole execution than to produce a monitoring export zip file that is missing most of its data.
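
To illustrate both points, here is a minimal sketch of a bounded retry around a single scroll-page fetch. It is not taken from the support-diagnostics code base: the class `ScrollRetry`, the method `withRetry`, and the `pauseMillis` parameter are hypothetical names, and the sketch assumes the fetch can safely be repeated while the scroll ID is still valid.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of support-diagnostics.
public class ScrollRetry {

    /**
     * Runs the fetch, retrying only when it fails with a read timeout.
     * Any other exception propagates immediately. Once maxAttempts
     * timeouts have occurred, an IOException is thrown so the caller
     * can abort the export instead of writing a truncated file.
     */
    public static <T> T withRetry(Callable<T> fetch, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.call();
            } catch (SocketTimeoutException e) {
                last = e;
                // Pause before the next attempt, but not after the final one.
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw new IOException(
                "Giving up after " + maxAttempts + " read timeouts; export is incomplete", last);
    }
}
```

A caller would wrap each scroll request in `withRetry(..., 3, pauseMillis)` and let the final `IOException` propagate, so that the run fails visibly rather than producing a zip with partial data.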

As a side note, https://github.com/elastic/support-diagnostics/issues/254 discussed exposing timeout settings. However, from reading the 8.5.0 code, I assume they are not exposed yet, and it appears the retry logic does not apply to monitoring extraction.