
Customers reporting users counting issue #387

Closed: winsmith closed this issue 4 months ago

winsmith commented 4 months ago

Customers are reporting that users seem to be double-counted, e.g. in the comments on https://github.com/TelemetryDeck/Issues/issues/386.

winsmith commented 4 months ago

Looking at the data, April 17 seems completely fine, as does April 20. April 18 has a few outliers towards the end of the day, and April 19 is wildly out of range, especially around 17:00 UTC. This points towards an ingestion error of some sort, which I very likely introduced when trying to fix the dataset for April 17 and 18.

Let's say the affected time interval is 2024-04-17/2024-04-20, so we have some margin.

I’m going to try re-ingesting the data into a separate data source and see if that data source then looks good. This’ll take a few hours. If the re-ingested data looks fine, I'll overwrite the affected days in telemetry-signals with it.

Image
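For illustration, the re-ingest-and-overwrite step could look roughly like the sketch below: a Druid native batch re-index task that reads the verified interval out of a separate test datasource and replaces the same interval in telemetry-signals, then polls until the task finishes. The Overlord URL, the test datasource name, and the abbreviated schema are assumptions, not the actual TelemetryDeck setup.

```python
import time
import requests

OVERLORD = "http://druid-overlord:8090"          # assumed Overlord URL
SOURCE_DS = "telemetry-signals-reingest-test"    # hypothetical verified datasource
TARGET_DS = "telemetry-signals"
INTERVAL = "2024-04-17/2024-04-20"               # affected interval, with some margin

# Native batch re-index task: read the verified rows back out of the test
# datasource and replace the same interval in the target datasource.
task_spec = {
    "type": "index_parallel",
    "spec": {
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {"type": "druid", "dataSource": SOURCE_DS, "interval": INTERVAL},
            "appendToExisting": False,   # replace, don't append
            "dropExisting": True,        # drop the old (bad) segments in the interval
        },
        "dataSchema": {
            "dataSource": TARGET_DS,
            "timestampSpec": {"column": "__time", "format": "millis"},
            # Dimensions and metrics omitted for brevity; a real spec would
            # mirror the telemetry-signals schema.
            "dimensionsSpec": {"dimensions": []},
            "granularitySpec": {
                "segmentGranularity": "day",
                "queryGranularity": "none",
                "intervals": [INTERVAL],
            },
        },
        "tuningConfig": {"type": "index_parallel"},
    },
}

task_id = requests.post(f"{OVERLORD}/druid/indexer/v1/task", json=task_spec).json()["task"]
print("Submitted task", task_id)

# Poll until the task finishes (this is the part that takes hours).
while True:
    status = requests.get(f"{OVERLORD}/druid/indexer/v1/task/{task_id}/status").json()["status"]
    state = status.get("statusCode") or status.get("status")  # field name varies by Druid version
    if state in ("SUCCESS", "FAILED"):
        print("Task finished:", state)
        break
    time.sleep(60)
```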

winsmith commented 4 months ago

The ingestion task completed successfully after around 4 hours. Investigating...

alexstaravoitau commented 4 months ago

I also see this for data from 19/4 onward.

winsmith commented 4 months ago

> I also see this for data from 19/4 onward.

I also noticed this. While the ingested test data looks correct, there is ANOTHER spike in the main data.

Image

So this is an ongoing issue with live ingestion. We've recently switched from self-hosted Kafka to Azure Event Hubs' Kafka-compatible endpoint and reduced the number of partitions to 1.

However, there seem to be two Druid supervisors accidentally ingesting from the same partition, which might introduce errors. I disabled one of them around 12:52 today, and at least according to Azure, this is when the Event Hubs stopped misbehaving.

Image

However, there are still spikes hours after that. Investigating some more.
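For reference, suspending the duplicate supervisor is a single call against the Druid Overlord API. A minimal sketch, with the Overlord URL and supervisor ID assumed:

```python
import requests

OVERLORD = "http://druid-overlord:8090"   # assumed Overlord URL

# List the supervisor IDs currently known to the Overlord.
supervisor_ids = requests.get(f"{OVERLORD}/druid/indexer/v1/supervisor").json()
print("Active supervisors:", supervisor_ids)

# Suspend the duplicate so only one consumer reads the single partition.
DUPLICATE_ID = "telemetry-signals-eventhubs-2"   # hypothetical supervisor ID
resp = requests.post(f"{OVERLORD}/druid/indexer/v1/supervisor/{DUPLICATE_ID}/suspend")
resp.raise_for_status()
print("Suspended supervisor", DUPLICATE_ID)
```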

winsmith commented 4 months ago

The spikes in the Azure Event Hub have completely stopped, so they were probably caused by too many readers and not enough partitions. However, we're still seeing two worrying things:

  1. User numbers that seem excessively high
  2. Stray signals that seem to be from the future

Image
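To quantify item 2 above, a Druid SQL query along these lines could count the signals whose timestamps lie in the future, bucketed by hour. The router URL is an assumption; the datasource name is the one from this thread:

```python
import requests

DRUID_SQL = "http://druid-router:8888/druid/v2/sql"   # assumed router URL

query = """
SELECT
  TIME_FLOOR(__time, 'PT1H') AS "hour",
  COUNT(*) AS "future_signals"
FROM "telemetry-signals"
WHERE __time > CURRENT_TIMESTAMP
GROUP BY TIME_FLOOR(__time, 'PT1H')
ORDER BY 1
"""

# The default result format is a list of JSON objects, one per row.
for row in requests.post(DRUID_SQL, json={"query": query}).json():
    print(row["hour"], row["future_signals"])
```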

We recently switched from Kafka as a message queue to Azure Event Hubs, and I slightly updated the ingest API code to allow streaming to both the old and the new message queue servers.

I'm now running an experiment where I ingest data via the old configuration, using the old Kafka server, into a separate data source to see if that makes a difference. In the meantime, I'll also review the API code and add some more tests to see if that narrows down the problem space.
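The ingest API itself isn't shown in this thread (and isn't written in Python), but conceptually the "stream to both message queues" change amounts to producing each signal to both brokers. A hypothetical sketch using the documented Event Hubs Kafka endpoint settings; hostnames, credentials, the topic name, and the signal shape are all placeholders:

```python
import json
from confluent_kafka import Producer

# Old self-hosted Kafka cluster (placeholder address).
old_kafka = Producer({"bootstrap.servers": "old-kafka.internal:9092"})

# Azure Event Hubs Kafka-compatible endpoint: SASL/PLAIN with the literal
# username "$ConnectionString" and the namespace connection string as password.
event_hubs = Producer({
    "bootstrap.servers": "example-namespace.servicebus.windows.net:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "PLAIN",
    "sasl.username": "$ConnectionString",
    "sasl.password": "Endpoint=sb://example-namespace.servicebus.windows.net/;...",
})

TOPIC = "telemetry-signals"   # hypothetical topic name


def ingest(signal: dict) -> None:
    """Write one incoming signal to both message queues."""
    payload = json.dumps(signal).encode("utf-8")
    old_kafka.produce(TOPIC, value=payload)
    event_hubs.produce(TOPIC, value=payload)


# Illustrative signal; the real payload shape is not shown in this thread.
ingest({"appID": "...", "type": "exampleSignal", "receivedAt": "2024-04-19T17:00:00Z"})
old_kafka.flush()
event_hubs.flush()
```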

winsmith commented 4 months ago

Update: the old Kafka setup works as expected; Event Hubs somehow sends wrong data. I'm going to consider reverting to the old Kafka server, even though running it is more expensive and less compatible.

winsmith commented 4 months ago

A customer is reporting that a lot of their signals are missing certain parameters. This also seems to point to Azure Event Hubs not transporting all signals the way Kafka does.
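One way to check whether signals are being dropped or altered in transit is to compare hourly row counts between the Event Hubs-fed datasource and the Kafka-fed test datasource. A sketch, with the test datasource name and router URL assumed:

```python
import requests

DRUID_SQL = "http://druid-router:8888/druid/v2/sql"   # assumed router URL


def hourly_counts(datasource: str) -> dict:
    """Return {hour: row count} for the affected window in one datasource."""
    query = f"""
    SELECT TIME_FLOOR(__time, 'PT1H') AS "hour", COUNT(*) AS "signals"
    FROM "{datasource}"
    WHERE __time >= TIMESTAMP '2024-04-17 00:00:00'
      AND __time <  TIMESTAMP '2024-04-24 00:00:00'
    GROUP BY TIME_FLOOR(__time, 'PT1H')
    """
    rows = requests.post(DRUID_SQL, json={"query": query}).json()
    return {row["hour"]: row["signals"] for row in rows}


live = hourly_counts("telemetry-signals")             # fed via Event Hubs
test = hourly_counts("telemetry-signals-kafka-test")  # hypothetical Kafka-fed copy

# Any hour where the counts diverge is a candidate for dropped or duplicated signals.
for hour in sorted(set(live) | set(test)):
    if live.get(hour) != test.get(hour):
        print(hour, "live:", live.get(hour), "kafka:", test.get(hour))
```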

winsmith commented 4 months ago

I've switched ingestion of new signals back to the old Kafka server and will re-ingest all signals since and including April 17. This should take a few hours.
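Pointing the streaming ingestion back at the old Kafka cluster amounts to resubmitting the supervisor spec with updated consumer properties. A heavily abbreviated sketch; the topic, hostnames, and the trimmed-down dataSchema/tuningConfig are placeholders for the real spec:

```python
import requests

OVERLORD = "http://druid-overlord:8090"   # assumed Overlord URL

supervisor_spec = {
    "type": "kafka",
    "spec": {
        # Heavily abbreviated: a real spec carries the full dataSchema
        # (dimensions, metrics, granularity) and tuningConfig.
        "dataSchema": {"dataSource": "telemetry-signals"},
        "ioConfig": {
            "type": "kafka",
            "topic": "telemetry-signals",   # hypothetical topic name
            "consumerProperties": {
                # Back to the old self-hosted Kafka instead of Event Hubs.
                "bootstrap.servers": "old-kafka.internal:9092",
            },
            "useEarliestOffset": False,
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Submitting a spec for the same dataSource replaces the running supervisor.
resp = requests.post(f"{OVERLORD}/druid/indexer/v1/supervisor", json=supervisor_spec)
resp.raise_for_status()
print(resp.json())   # typically {"id": "<supervisor id>"}
```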

winsmith commented 4 months ago

An ingest-and-replace did not fix the data between April 17 and 23. Instead, I'm going to completely remove all segments in that time period and re-ingest, as in the test at the beginning of this conversation with myself.
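Dropping the affected segments before re-ingesting can be done by marking them unused for the interval via the Druid Coordinator API. A sketch, with the Coordinator URL assumed (a separate kill task would be needed to delete the unused segments permanently):

```python
import requests

COORDINATOR = "http://druid-coordinator:8081"   # assumed Coordinator URL
DATASOURCE = "telemetry-signals"
INTERVAL = "2024-04-17/2024-04-24"              # April 17 through 23, end-exclusive

# Mark every segment overlapping the interval as unused so it disappears
# from queries; the interval can then be rebuilt by a fresh batch ingest.
resp = requests.post(
    f"{COORDINATOR}/druid/coordinator/v1/datasources/{DATASOURCE}/markUnused",
    json={"interval": INTERVAL},
)
resp.raise_for_status()
print(resp.json())
```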

winsmith commented 4 months ago

Removing the segments and re-ingesting fixed the numbers for April 17 through 23, and switching back to the existing Kafka server fixed the user counting issue. Marking this as resolved.