apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Mitigate ingestion failure when the partition id exceeds Short.MAX_VALUE for a given interval #15356

Closed AmatyaAvadhanula closed 1 day ago

AmatyaAvadhanula commented 10 months ago

Problem

At present, Druid has a limitation that the maximum partition id for a given interval may not exceed Short.MAX_VALUE (32767). Appending data with dynamic partitioning may fail when the allocated segments have a greater partition id.

Users may notice that the failing job has an error message like java.lang.IllegalArgumentException: fromKey > toKey, which is quite vague.
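As a rough illustration of how such a message can arise, the sketch below narrows an int partition id to a short and then asks a `java.util.TreeMap` for a range whose bounds have wrapped around. This is an assumed mechanism for illustration only, not Druid's actual code path; the class and method names are hypothetical.

```java
import java.util.TreeMap;

class ShortOverflowSketch {
    // Narrowing an int to a short wraps around once the value passes Short.MAX_VALUE.
    static short toShortId(int partitionId) {
        return (short) partitionId;
    }

    // Performs a range lookup with bounds narrowed to short.
    // Returns the exception message if the lookup is rejected, or null if it succeeds.
    static String rangeLookupError(int startId, int endId) {
        TreeMap<Short, String> byPartitionId = new TreeMap<>();
        try {
            byPartitionId.subMap(toShortId(startId), toShortId(endId));
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(toShortId(32767));               // still positive
        System.out.println(toShortId(32768));               // wrapped to a negative value
        System.out.println(rangeLookupError(32767, 32768)); // lower bound now exceeds upper bound
    }
}
```

The point of the sketch is that the exception comes from the map's bounds check, not from anything that mentions partition ids, which is why the message is hard to interpret.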

This limit exists today because of segment locking: partition ids from 32768 to 65535 are reserved for second generation partitions, which leaves 0 through 32767 for appended segments. (Please refer to https://github.com/apache/druid/issues/7491 for more details)
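The split of the partition id space can be sketched as follows. This assumes that appended (first generation) segments use ids 0 through Short.MAX_VALUE and that the reserved range starts immediately after that and ends at 65535. The class, constants and helpers are hypothetical names for illustration, not Druid's own.

```java
class PartitionIdSpaceSketch {
    // Assumed bounds, taken from the description in this issue.
    static final int APPEND_RANGE_START = 0;
    static final int APPEND_RANGE_END = Short.MAX_VALUE; // 32767, inclusive
    static final int RESERVED_RANGE_END = 65535;         // inclusive

    // True if the id can be given to an appended (first generation) segment.
    static boolean isAppendable(int partitionId) {
        return partitionId >= APPEND_RANGE_START && partitionId <= APPEND_RANGE_END;
    }

    // True if the id falls in the range reserved for second generation partitions.
    static boolean isReserved(int partitionId) {
        return partitionId > APPEND_RANGE_END && partitionId <= RESERVED_RANGE_END;
    }

    // Number of ids still available for appends, given the highest id allocated so far.
    // Pass -1 when nothing has been allocated in the interval yet.
    static int remainingAppendIds(int highestAllocatedId) {
        return Math.max(0, APPEND_RANGE_END - highestAllocatedId);
    }

    public static void main(String[] args) {
        System.out.println(isAppendable(32767));
        System.out.println(isReserved(32768));
        System.out.println(remainingAppendIds(32000));
    }
}
```

A check like `remainingAppendIds` could be used to raise an alert before an interval runs out of ids, instead of waiting for the append to fail.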

In general, having more than 32k segments per interval is not recommended and may impact other areas of the system.

Mitigation

A potential solution to this problem is to pause the appending jobs, compact the data to have fewer partitions, and then resume the appending jobs.

If the volume of ingestion for the targeted segment granularity is very high, one may also consider ingesting with a finer segment granularity.
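To get a feel for how much a finer segment granularity helps, here is a back-of-the-envelope sketch. It assumes segments are spread evenly across the finer intervals, which real late-arriving data may not satisfy, and that each interval can hold ids 0 through Short.MAX_VALUE. All names and the sample figure of 40000 segments are hypothetical.

```java
class GranularitySketch {
    // Ids 0..Short.MAX_VALUE are assumed usable for appends, so 32768 segments per interval.
    static final long MAX_APPEND_SEGMENTS = Short.MAX_VALUE + 1L;

    // Segments per finer interval, assuming an even spread (ceiling division).
    static long segmentsPerFineInterval(long segmentsPerCoarseInterval, int fineIntervalsPerCoarse) {
        return (segmentsPerCoarseInterval + fineIntervalsPerCoarse - 1) / fineIntervalsPerCoarse;
    }

    // True if the segment count fits within one interval's append range.
    static boolean fitsInAppendRange(long segments) {
        return segments <= MAX_APPEND_SEGMENTS;
    }

    public static void main(String[] args) {
        long perDay = 40000; // hypothetical load that overflows a DAY interval
        long perHour = segmentsPerFineInterval(perDay, 24);
        System.out.println(fitsInAppendRange(perDay));
        System.out.println(perHour);
        System.out.println(fitsInAppendRange(perHour));
    }
}
```

Under these assumptions, a load that overflows a DAY interval fits comfortably once it is split across HOUR intervals, although the total segment count in the datasource is unchanged.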

Future work

A common reason for partition space exhaustion is multiple append failures after pending segment allocation. A fix for this could be to clean up such pending segments so that the next task may reuse the partition id for its own allocation.

Another common reason is the addition of many small segments due to late-arriving data. Configuring auto-compaction can significantly reduce the number of segments in such cases.

Concurrent compaction [Experimental] can also be used when the user intends to compact data without pausing ingestion for that interval.

Once the above feature is mature enough to not be labelled experimental, segment locking could be deprecated and the limitation on the partition ids may no longer be needed.

github-actions[bot] commented 4 weeks ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

github-actions[bot] commented 1 day ago

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.