OpenTSDB / opentsdb

A scalable, distributed Time Series Database.
http://opentsdb.net
GNU Lesser General Public License v2.1

OpenTSDB can return unparseable JSON with duplicate timestamp keys #2166

Open · annettejanewilson opened this issue 3 years ago

annettejanewilson commented 3 years ago

Observed this when data submitted at millisecond precision was queried back at second precision.

Query with millisecond precision:

curl --location --request GET 'http://opentsdb.local:4242/api/query?m=zimsum%3Aprod.undercity.search_cycle.meter.count_delta%7B%7D%7BdataCenter%3Dliteral_or%28EU_WEST_1%29%2Cendpoint%3Dliteral_or%28search%29%2Cfrom_bot%3Dliteral_or%28false%29%7D&start=1630064720&end=1630064727&ms'

Correct response with millisecond precision, formatted:

[
  {
    "metric": "prod.undercity.search_cycle.meter.count_delta",
    "tags": {
      "endpoint": "search",
      "dataCenter": "EU_WEST_1",
      "from_bot": "false"
    },
    "aggregateTags": [
      "ip",
      "traffic_source"
    ],
    "dps": {
      "1630064726121": 5,
      "1630064726122": 3,
      "1630064726731": 6
    }
  }
]

Query without millisecond precision:

curl --location --request GET 'http://opentsdb.local:4242/api/query?m=zimsum%3Aprod.undercity.search_cycle.meter.count_delta%7B%7D%7BdataCenter%3Dliteral_or%28EU_WEST_1%29%2Cendpoint%3Dliteral_or%28search%29%2Cfrom_bot%3Dliteral_or%28false%29%7D&start=1630064720&end=1630064727'

~~Invalid~~ Broadly unsupported JSON response, formatted (duplicate keys):

[
   {
      "metric":"prod.undercity.search_cycle.meter.count_delta",
      "tags":{
         "endpoint":"search",
         "dataCenter":"EU_WEST_1",
         "from_bot":"false"
      },
      "aggregateTags":[
         "ip",
         "traffic_source"
      ],
      "dps":{
         "1630064726":5,
         "1630064726":3,
         "1630064726":6
      }
   }
]

It looks like the timestamps are truncated to second precision in the JSON serializer here: https://github.com/OpenTSDB/opentsdb/blob/7b4994080ebf823042dda0de4d44da494c1d398c/src/tsd/HttpJsonSerializer.java#L853-L854. When that code was introduced, there was apparently logic to force downsampling if a query did not request downsampling and did not specify ms resolution. However, that no longer seems to be the case.
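
As a standalone illustration (not the actual serializer code; the class name and variables are made up), this is roughly how the duplicate keys come about: Jackson, which OpenTSDB uses for its JSON handling, does not reject duplicate field names by default, so once the millisecond timestamps are divided down to seconds nothing stops the same key being written three times.

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import java.io.StringWriter;

public class DuplicateKeyWriteDemo {
    public static void main(String[] args) throws Exception {
        // The three millisecond timestamps from the example above all fall
        // within the same second.
        long[] timestamps = {1630064726121L, 1630064726122L, 1630064726731L};
        long[] values = {5, 3, 6};

        StringWriter out = new StringWriter();
        JsonGenerator json = new JsonFactory().createGenerator(out);
        json.writeStartObject();
        for (int i = 0; i < timestamps.length; i++) {
            // Integer division truncates all three timestamps to 1630064726,
            // and the generator happily writes the duplicate field names.
            json.writeNumberField(Long.toString(timestamps[i] / 1000), values[i]);
        }
        json.writeEndObject();
        json.close();

        // Prints: {"1630064726":5,"1630064726":3,"1630064726":6}
        System.out.println(out);
    }
}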

I was surprised to discover that it's not actually invalid to have duplicate keys in a JSON object. However, duplicate keys are broadly unsupported in JSON tools, and the most common behaviour appears to be to silently discard all but one of the values.
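
For instance, here is a small sketch (hypothetical class name) of reading the dps object above back with Jackson under its default settings: only the last value for the repeated key survives, and the other two datapoints vanish without any error.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class DuplicateKeyReadDemo {
    public static void main(String[] args) throws Exception {
        String dps = "{\"1630064726\":5,\"1630064726\":3,\"1630064726\":6}";

        // With Jackson's defaults (STRICT_DUPLICATE_DETECTION disabled),
        // later values overwrite earlier ones, so the 5 and 3 are silently lost.
        JsonNode parsed = new ObjectMapper().readTree(dps);
        System.out.println(parsed);  // {"1630064726":6}
    }
}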

I think we're not entirely up to date, but I couldn't see any recent issues or commits that appear to relate to this. Is this enough information to understand the problem, or would you like me to attempt to replicate it on master?

manolama commented 3 years ago

D'oh, you're right, and good catch. I don't think there is anything to force the 1m downsampling any more, particularly as we don't know whether a data set has ms timestamps or not.

I'm kinda torn on the fix for this one:

  1. Apply a forced downsampler again, but what would be a good default? Average? It also hides a data issue from the caller, as they wouldn't necessarily know there was aggregated data in the results.
  2. Take the earliest or latest value for each timestamp? That also hides information from the caller.
  3. Throw an error? I'm leaning towards this, as it lets the caller know they need to select a downsampling algorithm appropriate to the data they're fetching.

@annettejanewilson what do you think would make sense?

annettejanewilson commented 3 years ago

Hmmmm, hard to say. In our case the problem is that the Java libraries record timestamps with millisecond precision even though the data is only recorded every ten seconds or so. So the value of the downsample isn't really that it reduces the number of datapoints, but that it makes sure they align and there are no "near misses" during aggregation.

In an ideal world, so long as no time series has more than one datapoint whose timestamp truncates to the same second, I would like to see the same results as we would have seen if the datapoints had been truncated at recording time. From that point of view, the type of downsample applied is unimportant. Probably 1s-last-none would be closest to what we expect?
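
For reference, a sketch of the query from the issue description with that downsampler spelled out explicitly; the only change is 1s-last-none inserted between the aggregator and the metric name in the m parameter (untested here):

curl --location --request GET 'http://opentsdb.local:4242/api/query?m=zimsum%3A1s-last-none%3Aprod.undercity.search_cycle.meter.count_delta%7B%7D%7BdataCenter%3Dliteral_or%28EU_WEST_1%29%2Cendpoint%3Dliteral_or%28search%29%2Cfrom_bot%3Dliteral_or%28false%29%7D&start=1630064720&end=1630064727'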

I'm sympathetic to the desire to make it an error. We do discourage users from making queries with no downsample at all, so I consider this query unwise. But I haven't trawled through the query logs to see how often people make such queries; I suspect it could be pretty common. I'd be reluctant to break lots of queries that don't know or care about millisecond precision and for which no individual time series ever records datapoints that close together.