fangli / fluent-plugin-influxdb

A buffered output plugin for fluentd and InfluxDB
MIT License

Using nanosecond precision instead of sequence_tag #87

Open candlerb opened 6 years ago

candlerb commented 6 years ago

This is just a suggestion for discussion/consideration.

Currently the plugin increments sequence_tag when multiple entries with the same timestamp occur, and resets it to zero when an event with a different timestamp is seen. There are two problems with this.

  1. You could end up generating a lot of time series. Suppose, for example, that in one particular second you get an unusual burst of 1000 messages. You'll end up creating 1000 time series, and in future all queries will have to span all 1000 of those series, even though most of them will be almost empty.

  2. If the records are being written with timestamps that are not monotonically increasing - e.g. there is some "backfill" going on - then there is a high risk of overwriting previous data, because the sequence tag is reset to zero every time the source timestamp changes. This is mainly a problem when the source timestamp has only 1-second resolution.
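For reference, the current behaviour is roughly as follows (a simplified sketch, not the plugin's actual code; `records` and `write_point` are stand-ins):

```ruby
# Simplified sketch of the existing sequence_tag logic (illustrative only).
# The counter is written as a *tag*, so every distinct value creates a new series.
seq = 0
prev_time = nil

records.each do |time, record|
  if time == prev_time
    seq += 1    # same timestamp as the previous event: bump the tag
  else
    seq = 0     # timestamp changed: reset to zero (the cause of problem 2)
  end
  prev_time = time
  record['sequence_tag'] = seq   # a burst of 1000 events yields tag values 0..999 (problem 1)
  write_point(time, record)      # stand-in for the actual write
end
```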

I would like to suggest another approach, which is to make use of the nanosecond timestamp precision of influxdb.

There are several approaches to this, but I propose the following one:
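Roughly, and with illustrative names only (`records` and `write_point` are stand-ins): keep a single counter that is never reset when the timestamp changes, and add it to the event time as a nanosecond offset, always writing with nanosecond precision[^1]:

```ruby
# Sketch of the proposal (names illustrative).
counter = 0
WRAP = 1_000_000   # offset stays below 10^6 ns, i.e. below one millisecond

records.each do |time, record|
  ns = time.to_i * 1_000_000_000 + time.nsec  # assuming `time` behaves like Ruby's Time
  write_point(ns + counter, record)           # stored with nanosecond precision
  counter = (counter + 1) % WRAP              # never reset on timestamp change, just wraps
end
```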

This means that the stored timestamps are at worst in error by one millisecond. The chances of a conflict are extremely low: if your input records have 1-second resolution then you would either have to have one million events in a single second, or be backfilling to a previous second and be unlucky enough to hit the same range of timestamps.

Even if your input records have millisecond resolution, as long as they arrive with monotonically increasing timestamps there should not be a problem, although some reordering is possible around the point where the counter wraps. Maybe when the time precision is millisecond or better, the counter should only run between 0 and 999, so that the error is no worse than one microsecond. (This upper limit could be made configurable anyway - see the sketch below.)
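In the plugin that could be a hypothetical option along these lines:

```ruby
# Inside the output plugin class - a hypothetical option for the counter's
# upper limit (the name time_offset_range is illustrative):
#   1_000_000 keeps the timestamp error under one millisecond;
#   1_000 keeps it under one microsecond, suitable for millisecond-resolution input.
config_param :time_offset_range, :integer, default: 1_000_000
```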

I did consider some other options - e.g. generating a random offset between 0 and 999999 for each record, or using a hash of the record. The former has a higher probability of collision: by the birthday problem, there is roughly a 40% chance of at least one collision when 1000 records are stored in a single second. The latter would not record repeated records at all, which is undesirable for logs, where you may actually want to count multiple instances of the same event.
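The collision figure for the random-offset variant is just the birthday problem; a quick check:

```ruby
# Probability of at least one collision among n random offsets drawn from d values
# (standard birthday-problem approximation).
def collision_probability(n, d)
  1 - Math.exp(-n * (n - 1) / (2.0 * d))
end

puts collision_probability(1000, 1_000_000)  # => ~0.39
```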


[^1]: If time_precision rounding is required then do it in the plugin and convert back to nanoseconds; but I don't really see why anyone would want to reduce the resolution of their existing timestamps.
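For completeness, that rounding could look like this (a sketch only; the precision labels are assumptions):

```ruby
# Apply time_precision rounding in the plugin, then keep the result in nanoseconds
# so the point is still written with nanosecond precision.
PRECISION_NS = { 's' => 1_000_000_000, 'ms' => 1_000_000, 'u' => 1_000, 'ns' => 1 }

def rounded_ns(ns, precision)
  unit = PRECISION_NS.fetch(precision)
  (ns / unit) * unit   # truncated to the configured precision, still expressed in ns
end
```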

candlerb commented 6 years ago

There were some limitations to that algorithm. I think this one is much better, inspired by the current sequence_tag logic.

Let's say that era starts at 123, and we get a series of events with timestamps 00:00:00, 00:00:03, 00:00:08 (three events), 00:00:10, 00:00:13. These would be recorded as:

Should time ever go backwards (i.e. backfilling takes place) then a new random era will be chosen; the probability of collision is low. The larger the era range you choose, the lower the collision probability but the higher the timestamp error. Even with an era range of up to 10^6 the error is only up to one millisecond.

This algorithm works when the source events have nanosecond resolution, and it allows an unlimited number of events with the same timestamp.
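A rough sketch of the bookkeeping (the exact reset rules and names are illustrative, not final; `records` and `write_point` are stand-ins):

```ruby
# Rough sketch of the era-based offset (details illustrative).
era    = rand(1_000_000)   # random starting offset, chosen once at startup
offset = era
prev   = nil

records.each do |time, record|
  ns = time.to_i * 1_000_000_000 + time.nsec
  if prev && ns == prev
    offset += 1              # same timestamp again: keep counting, no upper limit
  elsif prev && ns < prev
    era = rand(1_000_000)    # time went backwards (backfill): pick a fresh random era
    offset = era
  else
    offset = era             # time moved forward: drop back to the era base
  end
  prev = ns
  write_point(ns + offset, record)
end
```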

Finally: the offset value (123 to 126 above) could be emitted as a field, not a tag. This would allow the true original timestamp to be calculated, without creating a new time series for each distinct value.
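For illustration (hypothetical measurement, tag and values):

```ruby
# A point written with the offset as a field, in InfluxDB line protocol:
#
#   logs,host=web1 message="hello",offset=125i 1546300808000000125
#
# Because "offset" is a field rather than a tag, it creates no extra series,
# and the original timestamp is recovered by simple subtraction:
original_ns = 1_546_300_808_000_000_125 - 125  # => 1546300808000000000, i.e. ...:00:08 exactly
```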