apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

[PROPOSAL] Support only finiteFirehose for native batch ingestion #7071

Open jihoonson opened 5 years ago

jihoonson commented 5 years ago

Motivation

Currently, native batch tasks (local and parallel index tasks) support any firehose implementation. However, this isn't very useful when the firehose is an infinite one, because these tasks don't have any context about stream ingestion.

Proposed changes

I propose changing the type of the firehose in IndexIOConfig and ParallelIndexIOConfig from FirehoseFactory to FiniteFirehoseFactory.

Rationale

FiniteFirehoseFactory is designed for any type of batch ingestion. It assumes that the input data is finite (and provides an optional hint for parallel indexing). It makes more sense to support only FiniteFirehoseFactory for native batch tasks than to improve them to support any kind of FirehoseFactory, which may be designed for streaming input data.
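To illustrate the effect of narrowing the type, here is a minimal sketch. It does not use Druid's real classes: the interfaces below are simplified, hypothetical stand-ins that only borrow the names from this proposal, and `StreamFirehoseFactory`, `FileListFirehoseFactory`, `describe()`, and the use of plain strings as splits are assumptions made for the example. The point is only that once the config field is typed as FiniteFirehoseFactory, passing an infinite firehose becomes a compile-time error rather than a runtime surprise.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Entry point for the sketch; all types below are simplified stand-ins.
class IoConfigSketch {
    public static void main(String[] args) {
        FiniteFirehoseFactory files =
            new FileListFirehoseFactory(Arrays.asList("a.json", "b.json"));
        IndexIOConfig config = new IndexIOConfig(files);
        System.out.println("splits: " + config.getFirehoseFactory().getSplits());

        // The next line would be rejected by the compiler, because
        // StreamFirehoseFactory is not a FiniteFirehoseFactory:
        // new IndexIOConfig(new StreamFirehoseFactory());
    }
}

// Stand-in for the general firehose factory type (hypothetical shape).
interface FirehoseFactory {
    String describe();
}

// Stand-in for the finite variant: the input can be enumerated as splits.
interface FiniteFirehoseFactory extends FirehoseFactory {
    List<String> getSplits();
    FiniteFirehoseFactory withSplit(String split);
}

// Hypothetical infinite firehose, e.g. one fed by a stream.
final class StreamFirehoseFactory implements FirehoseFactory {
    @Override
    public String describe() {
        return "unbounded stream";
    }
}

// Hypothetical finite firehose over a fixed list of files.
final class FileListFirehoseFactory implements FiniteFirehoseFactory {
    private final List<String> files;

    FileListFirehoseFactory(List<String> files) {
        this.files = Collections.unmodifiableList(new ArrayList<>(files));
    }

    @Override
    public String describe() {
        return "files " + files;
    }

    @Override
    public List<String> getSplits() {
        return files;
    }

    @Override
    public FiniteFirehoseFactory withSplit(String split) {
        // Narrow this factory to a single split, as a sub-task would.
        return new FileListFirehoseFactory(Collections.singletonList(split));
    }
}

// Stand-in for the IO config after the proposed change: the field type
// is the finite interface instead of the general one.
final class IndexIOConfig {
    private final FiniteFirehoseFactory firehoseFactory;

    IndexIOConfig(FiniteFirehoseFactory firehoseFactory) {
        this.firehoseFactory = firehoseFactory;
    }

    FiniteFirehoseFactory getFirehoseFactory() {
        return firehoseFactory;
    }
}
```

The field keeps its name in this sketch; only its declared type changes.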

Operational impact

There's no change in the task spec because the variable name isn't changed.

Custom FirehoseFactory implementations used with native batch tasks need to be updated to implement FiniteFirehoseFactory.

Future work

This change effectively makes native batch tasks support only text file formats by default, because all implementations of FiniteFirehoseFactory use StringInputRowParser. https://github.com/apache/incubator-druid/issues/5584 should be solved to support various file formats.

jihoonson commented 5 years ago

This should be resolved after https://github.com/apache/incubator-druid/pull/7048.

glasser commented 5 years ago

How would this apply to the delegating implementations like CombiningFirehoseFactory, ClippedFirehoseFactory, and FixedCountFirehoseFactory? I don't know to what degree they are actually used, but Clipped seems like something that could be useful with the LocalFirehoseFactory and local index task. Would these need to implement FiniteFirehoseFactory too?

jihoonson commented 5 years ago

Good question. ClippedFirehoseFactory is for Tranquility, so I don't think it needs to be a FiniteFirehoseFactory. I'm not sure who is using FixedCountFirehoseFactory, but it was added in https://github.com/apache/incubator-druid/pull/3856 and it looks like its purpose was testing.

For CombiningFirehoseFactory, I think it would be useful and worthwhile to add a CombiningFiniteFirehoseFactory that supports splitting.
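One possible shape for such a class, as a rough sketch only: the combining factory exposes every delegate's splits tagged with the delegate's position, and narrowing to one split forwards to the owning delegate. Everything here is an assumption for illustration, including the generic `FiniteFirehoseFactory<S>` stand-in, the `DelegateSplit` wrapper, and `FileListFirehoseFactory`; none of it is Druid's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Entry point for the sketch; all types below are simplified stand-ins.
class CombiningSketch {
    public static void main(String[] args) {
        List<FiniteFirehoseFactory<String>> delegates = new ArrayList<>();
        delegates.add(new FileListFirehoseFactory(Arrays.asList("a.json", "b.json")));
        delegates.add(new FileListFirehoseFactory(Arrays.asList("c.json")));

        CombiningFiniteFirehoseFactory<String> combined =
            new CombiningFiniteFirehoseFactory<String>(delegates);
        for (DelegateSplit<String> s : combined.getSplits()) {
            System.out.println("delegate " + s.delegateIndex + " -> " + s.split);
        }
    }
}

// Hypothetical stand-in for a finite factory whose splits have type S.
interface FiniteFirehoseFactory<S> {
    List<S> getSplits();
    FiniteFirehoseFactory<S> withSplit(S split);
}

// A delegate's split, tagged with the position of the delegate that owns it.
final class DelegateSplit<S> {
    final int delegateIndex;
    final S split;

    DelegateSplit(int delegateIndex, S split) {
        this.delegateIndex = delegateIndex;
        this.split = split;
    }
}

// Hypothetical finite firehose over a fixed list of files.
final class FileListFirehoseFactory implements FiniteFirehoseFactory<String> {
    private final List<String> files;

    FileListFirehoseFactory(List<String> files) {
        this.files = Collections.unmodifiableList(new ArrayList<>(files));
    }

    @Override
    public List<String> getSplits() {
        return files;
    }

    @Override
    public FiniteFirehoseFactory<String> withSplit(String split) {
        return new FileListFirehoseFactory(Collections.singletonList(split));
    }
}

// Sketch of a combining factory that is itself finite and splittable.
final class CombiningFiniteFirehoseFactory<S>
        implements FiniteFirehoseFactory<DelegateSplit<S>> {
    private final List<FiniteFirehoseFactory<S>> delegates;

    CombiningFiniteFirehoseFactory(List<? extends FiniteFirehoseFactory<S>> delegates) {
        this.delegates = new ArrayList<FiniteFirehoseFactory<S>>(delegates);
    }

    @Override
    public List<DelegateSplit<S>> getSplits() {
        // Concatenate the delegates' splits, remembering where each came from.
        List<DelegateSplit<S>> all = new ArrayList<>();
        for (int i = 0; i < delegates.size(); i++) {
            for (S s : delegates.get(i).getSplits()) {
                all.add(new DelegateSplit<S>(i, s));
            }
        }
        return all;
    }

    @Override
    public FiniteFirehoseFactory<DelegateSplit<S>> withSplit(DelegateSplit<S> split) {
        // Narrow only the owning delegate; the result wraps that one delegate.
        FiniteFirehoseFactory<S> narrowed =
            delegates.get(split.delegateIndex).withSplit(split.split);
        return new CombiningFiniteFirehoseFactory<S>(Collections.singletonList(narrowed));
    }
}
```

Tagging each split with its delegate index keeps the delegates independent of each other; a real implementation would also need to handle delegates with different split types and non-splittable delegates, which this sketch ignores.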

github-actions[bot] commented 1 year ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.