I noticed missing sequences when using GenerateSequence + KafkaIO to produce records to Kafka while relying on the automatic topic creation feature (auto.create.topics.enable).
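The producer pipeline itself is not reproduced here; a minimal sketch of the GenerateSequence + KafkaIO shape, where the rate and the sequence-to-record mapping are my assumptions, not the exact pipeline used:

```java
// Hypothetical producer sketch: the rate and mapping are assumptions.
// The topic is NOT pre-created, so the first write triggers
// auto.create.topics.enable on the broker.
Pipeline producer = Pipeline.create(pipelineOptions);
producer
    .apply(GenerateSequence.from(0).withRate(1000, Duration.standardSeconds(1)))
    .apply(MapElements
        .into(TypeDescriptors.kvs(TypeDescriptors.longs(), TypeDescriptors.strings()))
        .via(n -> KV.of(n, Long.toString(n))))
    .apply(KafkaIO.<Long, String>write()
        .withBootstrapServers(BOOTSTRAP_SERVERS)
        .withTopic(TOPIC_NAME)
        .withKeySerializer(LongSerializer.class)
        .withValueSerializer(StringSerializer.class));
producer.run();
```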
Records are then consumed with KafkaIO.read() and written to files:
Pipeline p = Pipeline.create(pipelineOptions);
List<TopicPartition> partitions = new ArrayList<>();
for (int i = 0; i < 20; i++) {
  partitions.add(new TopicPartition(TOPIC_NAME, i));
}
PCollection<String> ticks = p.apply(
        KafkaIO.<Long, String>read()
            .withBootstrapServers(BOOTSTRAP_SERVERS)
            .withKeyDeserializer(LongDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withTopicPartitions(partitions))
    .apply(MapElements.into(TypeDescriptors.strings()).via(record -> record.getKV().getValue()));
ticks.apply("Window", Window.into(FixedWindows.of(Duration.standardMinutes(1))))
    .apply(
        TextIO.write().to("gs://test-bucket/" + TOPIC_NAME + "/keys-")
            .withNumShards(0)
            .withWindowedWrites());
p.run();
Then I copied the files from GCS (gsutil -m cp gs://test-bucket/{TOPIC}/* /tmp/test) and wrote a simple class to check for continuity:
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.TreeSet;

public class TestContiguous {

  public static void main(String[] args) {
    findMissingNumbersInFiles(new File("/tmp/test/"));
  }

  // Collect every sequence number from every shard file, then report the
  // gaps between the smallest and largest observed values.
  public static List<Integer> findMissingNumbersInFiles(File folder) {
    File[] files = folder.listFiles();
    TreeSet<Integer> numbers = new TreeSet<>();
    for (File file : files) {
      numbers.addAll(readNumbersFromFile(file));
    }
    List<Integer> missingNumbers = new ArrayList<>();
    int minNumber = numbers.first();
    int maxNumber = numbers.last();
    for (int i = minNumber; i <= maxNumber; i++) {
      if (!numbers.contains(i)) {
        missingNumbers.add(i);
      }
    }
    System.out.println(
        "Min: " + minNumber + ", Max: " + maxNumber + ". Missing: " + missingNumbers);
    return missingNumbers;
  }

  public static List<Integer> readNumbersFromFile(File file) {
    List<Integer> numbers = new ArrayList<>();
    try (Scanner scanner = new Scanner(file)) {
      while (scanner.hasNextLine()) {
        String line = scanner.nextLine().trim();
        if (!line.isEmpty()) {
          try {
            numbers.add(Integer.parseInt(line.split(",")[0]));
          } catch (NumberFormatException e) {
            e.printStackTrace();
          }
        }
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
    return numbers;
  }
}
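The same gap check can also be exercised on an in-memory list, which makes it easy to sanity-test the logic without shard files; a self-contained sketch (the class and method names here are mine):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

public class FindGaps {

  // Return every integer absent between the min and max of the input,
  // mirroring the continuity check used against the shard files.
  static List<Integer> findMissing(Collection<Integer> input) {
    TreeSet<Integer> numbers = new TreeSet<>(input);
    List<Integer> missing = new ArrayList<>();
    if (numbers.isEmpty()) {
      return missing;
    }
    for (int i = numbers.first(); i <= numbers.last(); i++) {
      if (!numbers.contains(i)) {
        missing.add(i);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    System.out.println(findMissing(Arrays.asList(1, 2, 5, 6, 9))); // [3, 4, 7, 8]
  }
}
```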
It is expected that some records will be missing at the tail end due to parallelism, but I have consistently seen roughly ~1k sequences missing at the beginning of the stream.
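One way to test whether the auto-creation race is responsible is to pre-create the topic before the producer starts; a sketch using the kafka-clients AdminClient, where the partition count and replication factor are assumptions:

```java
// Hypothetical pre-creation sketch: create the topic explicitly so that
// auto.create.topics.enable never has to kick in mid-write.
Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, BOOTSTRAP_SERVERS);
try (AdminClient admin = AdminClient.create(props)) {
  NewTopic topic = new NewTopic(TOPIC_NAME, 20, (short) 1); // 20 partitions, RF 1 assumed
  admin.createTopics(Collections.singletonList(topic)).all().get();
}
```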
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components