elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

[META] PQ Robustness Improvements #9494

Closed andrewvc closed 4 years ago

andrewvc commented 6 years ago

We have, unfortunately, seen some reports of PQ issues in the field. This meta issue tracks our approach to dealing with them. We were previously tracking this in https://github.com/elastic/logstash/pull/9322 , but that PR covers only one issue, and we should be more comprehensive.

TL;DR

If users are using a version < 6.3.0 they should drain their queue before upgrading, then upgrade to 6.3.0+ ONLY.

Our only reasonable fix is to:

  1. Increment the PQ version
  2. If users attempt to read pre-6.3.0 queues, attempt to validate the data and its replay-ability. If it is replayable, proceed as normal. Otherwise, display an error message asking them to drain the queue in the old version.
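
To make the two steps above concrete, here is a minimal sketch of a version-gated open with a dry-run validation pass. It is not the Logstash implementation: `CURRENT_PQ_VERSION`, `QueuePage`, `QueueUpgradeError`, `open_queue`, and the `deserialize` callback are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical value standing in for the incremented PQ version.
CURRENT_PQ_VERSION = 2


class QueueUpgradeError(Exception):
    """Raised when a queue cannot be safely read by this version."""


@dataclass
class QueuePage:
    version: int          # version recorded when the page was written
    records: List[bytes]  # raw serialized events


def open_queue(pages: Iterable[QueuePage],
               deserialize: Callable[[bytes], object]) -> List[QueuePage]:
    """Return the pages if every old-version record survives a dry run."""
    pages = list(pages)
    for page_no, page in enumerate(pages):
        if page.version > CURRENT_PQ_VERSION:
            raise QueueUpgradeError(
                f"page {page_no} was written by a newer version")
        if page.version == CURRENT_PQ_VERSION:
            continue  # written by this version, nothing to validate
        # Old page: deserialize every record without consuming the queue.
        for rec_no, raw in enumerate(page.records):
            try:
                deserialize(raw)
            except Exception as exc:
                raise QueueUpgradeError(
                    f"page {page_no}, record {rec_no} is not replayable; "
                    "drain this queue with the old version before upgrading"
                ) from exc
    return pages
```

The point of the design is that validation happens before any event is handed to the pipeline, so a failing queue is left untouched and can still be drained by the version that wrote it.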

Serialization Bug History

| Impact | Bug | Version Introduced | Version Fixed | Issue |
|---|---|---|---|---|
| Some non-ASCII strings are serialized incorrectly, and cannot be read by 6.2.4. Users should drain their queue before upgrading. These strings would be readable pre-6.2.4 but be corrupted. | String serialization corruption | TBD | 6.2.4 | https://github.com/elastic/logstash/pull/9307 |
| Timestamp values in fields other than @timestamp would be serialized as strings. Upgrading to 6.1.0+ will not fix previously serialized data, but will serialize new Events correctly. | Timestamp encoding fixed | 5.4.0 | 6.1.0 | https://github.com/elastic/logstash/pull/8239 |
| Versions prior to 6.1.0 would sometimes corrupt BigDecimal, BigInteger, Timestamp, Boolean and potentially other values. These queues must be drained prior to upgrading. | BiValues | 5.4.0 | 6.1.0 | https://github.com/elastic/logstash/pull/8239 |
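
The timestamp entry above can be illustrated with a small round-trip sketch. JSON is used here only as a stand-in and is not the PQ's actual wire format; the `__type__` tag and all function names are assumptions made for this example. It shows why a fixed serializer helps new events but cannot repair data that was already written as a plain string.

```python
import json
from datetime import datetime, timezone

event = {"created_at": datetime(2018, 5, 1, 12, 0, tzinfo=timezone.utc)}


def naive_serialize(evt):
    # Buggy behaviour: anything JSON cannot encode is flattened to str.
    return json.dumps(evt, default=str)


def typed_serialize(evt):
    # Fixed behaviour: timestamps carry an explicit (hypothetical) type tag.
    def encode(value):
        if isinstance(value, datetime):
            return {"__type__": "timestamp", "value": value.isoformat()}
        raise TypeError(f"cannot serialize {type(value).__name__}")
    return json.dumps(evt, default=encode)


def typed_deserialize(payload):
    # Rebuild timestamps from tagged objects; leave everything else alone.
    def decode(obj):
        if obj.get("__type__") == "timestamp":
            return datetime.fromisoformat(obj["value"])
        return obj
    return json.loads(payload, object_hook=decode)


# New data written by the fixed serializer keeps its type.
new_data = typed_deserialize(typed_serialize(event))

# Old data written by the buggy serializer stays a string, even when it
# is read back with the fixed deserializer.
old_data = typed_deserialize(naive_serialize(event))
```

Here `new_data["created_at"]` is a `datetime` equal to the original, while `old_data["created_at"]` is a `str`: the type information was lost at write time, which is why such queues need to be drained by the version that wrote them.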

File/IO Consistency Issues

| Impact | Bug | Version Introduced | Version Fixed | Issue |
|---|---|---|---|---|
| Version byte not readable on file open | Root cause is msync not being called, or not finishing, during an OOM. A partial msync can cause this. | TBD | 5.6.10, 6.3.0 | https://github.com/elastic/logstash/issues/9483 |
| Observed primarily in PQ tests | Checkpoint files occasionally missing | N/A | 6.3.0 | https://github.com/elastic/logstash/issues/9364 |
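
As a rough illustration of the version-byte failure mode, the sketch below writes a page header through a memory map, flushes it explicitly, and checks the version byte defensively on open. The page layout, `PAGE_VERSION`, `PAGE_SIZE`, and the rule that a zero byte means "never written" are assumptions for this example, not the real PQ page format.

```python
import mmap
import os
import tempfile

PAGE_VERSION = 2   # hypothetical version marker stored in byte 0
PAGE_SIZE = 4096   # hypothetical page size


class PageCorruptError(Exception):
    """Raised when the version byte of a page cannot be trusted."""


def create_page(path):
    # Preallocate a zero-filled file, then write the header via mmap.
    with open(path, "wb") as f:
        f.truncate(PAGE_SIZE)
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), PAGE_SIZE) as m:
            m[0] = PAGE_VERSION
            # On POSIX systems flush() issues msync; without it the
            # header may never reach disk if the process dies.
            m.flush()


def read_version(path):
    with open(path, "rb") as f:
        first = f.read(1)
    if len(first) != 1:
        raise PageCorruptError("version byte not readable: empty file")
    version = first[0]
    # Assumption: 0 means the header was never flushed to disk.
    if version == 0 or version > PAGE_VERSION:
        raise PageCorruptError(f"unexpected version byte {version}")
    return version


with tempfile.TemporaryDirectory() as tmp:
    page_path = os.path.join(tmp, "page.0")
    create_page(page_path)
    version_on_disk = read_version(page_path)
```

Failing loudly on an unreadable or zero version byte lets the caller distinguish a page that was never fully written from one that holds real data in an unsupported format.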

Tasks

andrewvc commented 6 years ago

Updated to reflect @tsg's idea to make a best effort to recover old data where possible.

tsg commented 6 years ago

Linking https://github.com/elastic/logstash/pull/9538 here, which implements the immediate solution of bumping the PQ major version and attempting to migrate old queues via a dry run.