apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0
13.29k stars 3.66k forks source link

realtime cleanup and renaming #16652

Open clintropolis opened 6 days ago

clintropolis commented 6 days ago

changes:

Discussion: should we move metric ingest/sink/count to ingest/appendableSegment/count or something similar? That is kind of disruptive since there is like a continuity of measurement issue which is kind of sad. We could also always double emit it, but also sad in its own way.

Release note

todo

This PR has:

kfaraz commented 5 days ago

Discussion: should we move metric ingest/sink/count to ingest/appendableSegment/count or something similar? That is kind of disruptive since there is like a continuity of measurement issue which is kind of sad. We could also always double emit it, but also sad in its own way.

I think it is okay to deprecate the older metric but keep emitting it along with the new one for a couple of releases. We are already emitting a good number of metrics, one more is not going to hurt. But at least this way we would be able to eventually get rid of the old one.

Also, @clintropolis , what do you think about calling it RealtimeSegment instead of AppendableSegment? "Realtime segment" is a term that Druid community is already familiar with. "Appendable segment" seems slightly ambiguous as it could be misinterpreted by some to mean a segment to which more data can be added even after it has been committed.

clintropolis commented 5 days ago

Also, @clintropolis , what do you think about calling it RealtimeSegment instead of AppendableSegment? "Realtime segment" is a term that Druid community is already familiar with. "Appendable segment" seems slightly ambiguous as it could be misinterpreted by some to mean a segment to which more data can be added even after it has been committed.

My reasoning for AppendableSegment was because they are built with an Appenderator. Also "realtime" isn't really appropriate because they are also created during batch processing. I would be ok changing the metric name to something else, but I think the current name is more appropriate in code at least. I am open to other suggestions though, I considered 'mutable' segment but that is really only part of the AppendableSegment which is composed of one mutable PartialSegment and several other immutable intermediary persisted PartialSegment (using the new naming).

kfaraz commented 5 days ago

Also "realtime" isn't really appropriate because they are also created during batch processing.

Yeah, I wondered if this was the case, didn't see that BatchAppenderator itself is using sinks. If I am not mistaken, a sink should always corresponding to a pending segment i.e. an entry in the pendingSegments metadata table. So maybe PendingSegment could be a better name?

clintropolis commented 5 days ago

If I am not mistaken, a sink should always corresponding to a pending segment i.e. an entry in the pendingSegments metadata table. So maybe PendingSegment could be a better name?

PendingSegment doesn't seem quite right to me because it seems like an odd name for the role that this thing plays in code. I think my problem here is the word 'pending' which seems like something you wouldn't want to be using in stuff like query processing because it sounds like something that isn't 'official' yet - which makes sense for how it is used for task locks, but less so for the internal code stuff imo.

What don't you like about AppendableSegment as an internal code construct name? Should we rename Appenderator too? I don't think it would be useful for the docs to talk about them as this and agree it might be confusing, but I don't think the docs widely called them Sink either since this is just internal naming stuff.

kfaraz commented 5 days ago

I think my problem here is the word 'pending' which seems like something you wouldn't want to be using in stuff like query processing because it sounds like something that isn't 'official' yet

Agreed. I have myself been on the fence about the 'pending' part too as it is certainly confusing. But then, I am not sure what a better alternative would have been either (maybe open, appending or appendable itself 😛 ).

In that regard, I do like AppendableSegment as it does clarify that this is a segment to which we can currently append more rows. I just wanted to be sure before we introduced a new(-ish) term in the code.

You seem to have given it enough thought and I do agree with the arguments against the other options, so let's go ahead with AppendableSegment!

I don't think we need to rename Appenderator as it works fine for its role.