confluentinc / kafka-connect-hdfs

Kafka Connect HDFS connector
Other
6 stars 396 forks source link

low rate of consumption in some of partitions in topic #473

Open kullatomer opened 4 years ago

kullatomer commented 4 years ago

Hello team, we are running the hdfs connector with following config: { "name": "hdfs-connector", "config": { "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector", "tasks.max": "30", "topics": "kulla_hdfs_test", "hdfs.url": URL, "flush.size": "30000", "rotate.interval.ms" : 10000, "name": "hdfs-connector", "format.class":"io.confluent.connect.hdfs.parquet.ParquetFormat", "key.converter":"org.apache.kafka.connect.json.JsonConverter", "value.converter":"org.apache.kafka.connect.json.JsonConverter", "key.converter.schemas.enable":"true", "value.converter.schemas.enable":"true", "schemas.enable":"true", "errors.tolerance": "all" } }

our topic has 30 partitions and the messages format is Json (with schema and payload) as you can see we save the files on hdfs in parquet..

when we start the connector there are about 15mil messages in the topic (split equally on each partition)

we're running the connector with 9 workers instances (3 machines and 3 dockers on each machine) in cluster mode (they all share the same group.id)..

we noticed a strange behavior in consumption: some of the partitions are neglected, meaning: while in some of the partitions the offset is advancing fast and others very slow..

note: in each run different partitions are neglected also, all the tasks are in RUNNING state.

why this behavior is happening?

thanks.

ncliang commented 4 years ago

@kullatomer do you see any particular task or worker lagging behind? Perhaps due to slower machine, docker imposed CPU limits on the container, other processes interfering on a host machine, etc? You can use JMX to monitor sink-record-send-rate to see the send rate per task. Another counter to look at might be partition-count to see if distribution of partitions to tasks is uneven.

Many more connect-related counters can be found at http://kafka.apache.org/documentation.html#connect_monitoring.

You can monitor these with simple jconsole or setup something like prometheus or grafana to get graphs and dashboards.

OneCricketeer commented 4 years ago

Note - Tasks aren't necessarily evenly distributed over partitions. They frequently toggle and rebalance between file flushes