DataCater / datacater

The developer-friendly ETL platform for transforming data in real-time. Based on Apache Kafka® and Kubernetes®.
https://datacater.io

Wrong number of records displayed in pipeline designer #155

Closed: olis1996 closed this issue 1 year ago

olis1996 commented 1 year ago

Description: Inside the pipeline designer, the wrong number of records is displayed.

Steps to reproduce:
1.) Load 200 records into the Kafka topic.
2.) Go to the pipeline designer.
3.) By default, 100 sample records are displayed.
4.) Open the sidebar and set the sample size to 200.
5.) Save the settings and close the sidebar. => 186 records are displayed


6.) Set the sample size to any value >= 241. => All 200 records are displayed correctly


flippingbits commented 1 year ago

This might be related to the implementation of the KafkaStreamsAdmin.inspect method and is certainly confusing for users.

Are you using a Kafka topic with multiple partitions, where some partitions contain no or only a few records?

olis1996 commented 1 year ago

There are 3 partitions, which may well contain very different numbers of records.

flippingbits commented 1 year ago

When we retrieve n sample records from a topic with m partitions, we currently try to consume n/m records from each partition.

If any of the partitions holds fewer than n/m records, this can lead to the unexpected situation where the /streams/:uuid/inspect endpoint returns fewer records than requested, even if the topic holds >= n records across all partitions.
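A minimal sketch of that even-split behavior (a hypothetical Python model, not the actual `KafkaStreamsAdmin.inspect` implementation; the partition record counts below are illustrative, not taken from the report):

```python
def sample_even_split(partitions, n):
    """Request n // m records from each of the m partitions,
    mirroring the even-split sampling described above."""
    quota = n // len(partitions)
    sample = []
    for records in partitions:
        # A skewed partition may hold fewer than `quota` records,
        # in which case it contributes all it has and nothing more.
        sample.extend(records[:quota])
    return sample

# 200 records total, skewed across 3 partitions (illustrative counts).
partitions = [list(range(100)), list(range(60)), list(range(40))]
sampled = sample_even_split(partitions, 200)
print(len(sampled))  # 166: short of the requested 200, although 200 records exist
```

With a per-partition quota of 200 // 3 = 66, the two smaller partitions contribute only 60 and 40 records, so the shortfall is never made up from the larger partition.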

I propose to change the sampling approach such that we always return n records if the topic holds >= n records across all partitions, regardless of any skew.

HknLof commented 1 year ago

@olis1996 For now, it might be useful to create a topic with a single partition to manually test your data-generator.

ChrisRousey commented 1 year ago

We will change the Stream/Inspect to offer two modes of operation, which can be switched with a flag.

The first mode will be the new default and will retrieve the messages top-down: we try to get all messages from the first partition; if that set doesn't contain enough records, we continue with the next partition, and so on.

The second mode will be the current implementation, where we try to spread the number of records retrieved evenly across all partitions.
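The two modes could be sketched roughly as follows (a hypothetical Python model of the logic under the assumptions above; the function names and partition counts are illustrative, not taken from the actual change):

```python
def sample_top_down(partitions, n):
    """Proposed default: drain partitions in order until n records are collected."""
    sample = []
    for records in partitions:
        if len(sample) >= n:
            break
        # Take only as many records as are still needed.
        sample.extend(records[: n - len(sample)])
    return sample

def sample_even_spread(partitions, n):
    """Current behavior: request n // m records from each of the m partitions."""
    quota = n // len(partitions)
    return [r for records in partitions for r in records[:quota]]

# 200 records total, skewed across 3 partitions (illustrative counts).
partitions = [list(range(100)), list(range(60)), list(range(40))]
print(len(sample_top_down(partitions, 200)))     # 200: skew no longer matters
print(len(sample_even_spread(partitions, 200)))  # 166: falls short on skewed topics
```

Top-down sampling always returns n records whenever the topic holds at least n across all partitions, at the cost of over-representing the earlier partitions in the sample; the even-spread mode keeps the sample balanced but can under-deliver on skewed topics.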

ChrisRousey commented 1 year ago

This has been fixed with PR #158.