Rogue collector emitting high cardinality tagv - how to find metric / collectors doing this?!

OpenTSDB / opentsdb

A scalable, distributed Time Series Database.

http://opentsdb.net

GNU Lesser General Public License v2.1

5k stars 1.25k forks source link

Rogue collector emitting high cardinality tagv - how to find metric / collectors doing this?! #1294

Open HariSekhon opened 6 years ago

HariSekhon commented 6 years ago

I've discovered somebody is emitting millions of unique incrementing integers as tagv values (Bosun alert triggered on large tagv uid increased in last 24 hours).

I'm pretty certain this is a misbehaving collector that has been introduced somewhere in the infrastructure by one of the teams, but they have their own git repos so I can't just do a simple git log to find changes across the company.

I previously wrote a script to get metrics/tagk/tagv by age which I used to find these tagv (script is in my PyTools repo along with other opentsdb and hbase tools) which I've just used to find these recently added tagv values and they appear to be large increasing integer numbers (but not epochs by the look of it), so it's probably a bug in a newly added collector.

In the opentsdb logs I can see dozens of entries like the following per second in each of the write TSD logs with slightly different numbers where 'nnnnn....' represents a large integer:

2018-08-02T09:58:03.939+0000 INFO  [UniqueId.allocateUid] - Creating an ID for kind='tagv' name='nnnnnnnnnn'
2018-08-02T09:58:03.939+0000 INFO  [UniqueId.createReverseMapping] - Got ID=nnnnnnnfor kind='tagv' name='nnnnnnnnnn'
2018-08-02T09:58:03.941+0000 INFO  [UniqueId.done] - Wrote UIDMeta for: nnnnnnnnnn
2018-08-02T09:58:03.941+0000 INFO  [UniqueId.done] - Completed pending assignment for: nnnnnnnnnn

How can I find which metrics these tagv values are belonging to so that I can find out what team is doing this? (our metrics are prefixed by team)

Alternatively how can I find the host(s) / IP addresses where these tagv registrations are coming from?

HariSekhon commented 6 years ago

Ok so the logs for hosts aren't gonna help because the ConnectoinMagnager logs in opentsdb.log are showing the addresses of our per colo write proxies and not the tcollector hosts.

2018-08-02T10:26:07.515+0000 INFO  [ConnectionManager.handleUpstream] - [id: 0xffffffff, /x.x.x.x:59690 => /y.y.y.y:4242] OPEN

I tried increasing log level to DEBUG in logback.xml and restarted but didn't get any more useful information from the logs that I could see.

Even grep DEBUG opentsdb.log showed mainly UniqueId.preloadUidCache and DEBUG [CompactionQueue.run] - flush() took Nms and nothing useful to determining the source of these excessive high cardinality tagv registrations.

HariSekhon commented 6 years ago

Using the same opentsdb_list_metrics.sh --by-age script against metrics and tagk I found similar registrations for the time these high cardinality tagv started being registered:

./opentsdb_list_metrics_hbase.sh --tagk --by-age
...
1533136937276 2018-08-01 15:22:17 partition
1533136937279 2018-08-01 15:22:17 logendoffset
1533136937286 2018-08-01 15:22:17 currentoffset
...

and for metrics only this one matched around that time:

./opentsdb_list_metrics_hbase.sh --by-age
...
1533136937272 2018-08-01 15:22:17 sandbox.<company>.metron.kafka.lag
...

So I managed to chase this down - it turned out to be a developer being naughty by logging Kafka partition offsets in tagv.

I also feel a little lucky that this was a new collector that introduced a new metric and tagk for me to correlated.

If somebody modified an existing metric and introduced a bug to a collector generating these high cardinality tagk/tagv then how would I find out where those were coming from?

Perhaps OpenTSDB logging needs to log the source of UID creation so we can find badly behaving collectors?

Logging the IP and the metric for which the new tagk / tagv values are being created may be a good idea:

the IP alone isn't enough as there could be lots of collectors running on that host or it could be behind a proxy as in our environment)
metric would allow us to chase down the right team basic on the metric name of whoever is working on that technology / metric.

johann8384 commented 6 years ago

We could possibly add more detail to the log events, I would need to look at the code to see what we know at that time.

You may also want to consider configuring a whitelist for metric and tagk creation, of course in this specific case it was tagv.

On Thu, Aug 2, 2018, 5:57 AM Hari Sekhon notifications@github.com wrote:

Using the same opentsdb_list_metrics.sh --by-age script against metrics and tagk I found similar registrations for the time these high cardinality tagv started being registered:

./opentsdb_list_metrics.sh --tagk --by-age ... 1533136937276 2018-08-01 15:22:17 partition 1533136937279 2018-08-01 15:22:17 logendoffset 1533136937286 2018-08-01 15:22:17 currentoffset ...

and for metrics only this one matched around that time:

./opentsdb_list_metrics.sh --by-age ... 1533136937272 2018-08-01 15:22:17 sandbox..metron.kafka.lag ...

So I managed to chase this down but I also feel a little lucky that this was a new collector that introduced a new metric and tagk for me to correlated.

If somebody modified an existing metric and introduced a bug to a collector generating these high cardinality tagk/tagv then how would I find out where those were coming from?

Perhaps OpenTSDB logging needs to log the source of UID creation so we can find badly behaving collectors?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/OpenTSDB/opentsdb/issues/1294#issuecomment-409888190, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJcMY6OtMoXOvxiTVZ5O7jn7C0gwUftks5uMtsMgaJpZM4VsAx0 .

manolama commented 5 years ago

Yeah we'd need to enable trace logging that includes the packets and IPs but man that could get expensive.

johann8384 commented 5 years ago

I have had to do this a few times, I ended up doing it two ways, I modified TCollector to store a count of metrics, tagk, and tagv on each node periodically and shipped those as facts with my config mgmt system. The second way was just using tshark and a filter on the edge nodes where all the ingest went to see where it was coming from.

You can use the whiltelist/blacklist functionality to help clean this up a bit for metrics and tagk. It's not too useful for tagv in most cases.

https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=webmail&utm_term=icon Virus-free. www.avast.com https://www.avast.com/sig-email?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=webmail&utm_term=link <#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>

On Tue, Dec 11, 2018 at 2:44 PM Chris Larsen notifications@github.com wrote:

Yeah we'd need to enable trace logging that includes the packets and IPs but man that could get expensive.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/OpenTSDB/opentsdb/issues/1294#issuecomment-446355078, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJcMdA470Oq6o_t3FzxnkeLhM-ms9XYks5u4Bk8gaJpZM4VsAx0 .