Druid does not intermittently drop segments past retention time

mounikanakkala commented 2 years ago

Druid does not intermittently drop segments past retention time. This led to org.apache.druid.segment.SegmentMissingException on our systems

Affected Version

0.22.1

Description

What happened We have a datasource where we set Retention rules - loadByPeriod(P24M+future), dropForever. Datasource segment granularity is Hour.

We encountered an erroneous case where a segment that is past 24 months did not get deleted properly.

Segments page shows the segment as available and after some refreshes, it doesn't show. But after some more refreshes it reappears.
We ran a query on sys.server_segments table
```
select * 
from sys.server_segments
where segment_id = <segment_id>
```
It returned two historicals having that segment. Since we have Druid cluster setup on Kubernetes, we deleted the two historical pods and that's when the segments were no longer available on Druid and the issue was resolved.

How often is this issue occurring It doesn't happen with all segments but happens for 1-2 segments once in a few days.

More details on Druid cluster setup

Druid processes - Coordinator, middle managers, historicals, broker, router are on Kubernetes.
Historicals use AWS EBS as Persistence volume. This means data is actually stored on EBS and when Historical pod is removed, another pod is created within minutes and the EBS gets attached to this new pod.
When we deleted the pod as mentioned above, the issue got resolved. Since EBS is not affected, I suppose it means that there was some main-memory information that was still there on Historical but it was not supposed to.

How did we come across this issue Time was 2022-04-19T03. Segment that did not get deleted was 2020-02-19T00 even though it was past 24 months. We ran time boundary query

{
    "dataSource": "our_datasource",
    "queryType": "timeBoundary",
    "bound": "minTime"
}

We got the following exception

org.apache.druid.server.QueryResource - Exception handling request: {class=org.apache.druid.server.QueryResource, exceptionType=class 
org.apache.druid.segment.SegmentMissingException, 
exceptionMessage=No results found for segments[[SegmentDescriptor{interval=2020-04-19T00:00:00.000Z/2020-04-19T01:00:00.000Z, version='2022-04-11T17:18:50.095Z', partitionNumber=0}]], 
query={
    "queryType": "timeBoundary",
    "dataSource": {
        "type": "table",
        "name": "our_datasource"
    },
    "intervals": {
        "type": "intervals",
        "intervals": [
            "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
        ]
    },
    "bound": "minTime",
    "filter": null,
    "descending": false,
    "granularity": {
        "type": "all"
    }
}, peer=xx.xx.xx.xx} 

(org.apache.druid.segment.SegmentMissingException: No results found for segments[[SegmentDescriptor{interval=2020-04-19T00:00:00.000Z/2020-04-19T01:00:00.000Z, version='2022-04-11T17:18:50.095Z', partitionNumber=0}]])

As the segment is past retention time, that's when we started checking on Segments page as mentioned above and the sys.server_segments table.

Kindly help us resolve this issue. Please let us know if you need further details.

mounikanakkala commented 2 years ago

Facing the same issue again.

Observations Druid console segments UI page with every refresh within seconds shows different list of segments.

My understanding is segments UI page shows results of sys.segments. Can you please add which process and how often creates or refreshes sys.segments information?

Result of the below query

select * from sys.segments
where segment_id like 'our_datasource_2022-04-09T05:00:00.000Z_2022-04-09T06:00:00.000Z%'
order by partition_num

But http://:8081/unified-console.html#segments UI page does not show any segment for 2022-04-09.

Also verified in Metadata store that segments are no longer in used state.

select used from druid_segments
where id like 'our_datasource_2022-04-09T05:00:00.000Z_2022-04-09T06:00:00.000Z%'

If they are not published (which means used=0 in metadata store), what does is_available = true mean? Where is is_available information coming from? I suppose it is historical but where within a historical?

Another point to add is, following query returns bunch of historicals that has all the 12 partitions of 2022-04-09T05/2022-04-09T06

select * from sys.server_segments
where segment_id like 'our_datasource_2022-04-09T05:00:00.000Z_2022-04-09T06:00:00.000Z%'
order by segment_id

tanisdlj commented 2 years ago

Happening to us too

mounikanakkala commented 2 years ago

Happening to us too

@tanisdlj Thank you for sharing. Can you please share the Druid version that you are running? Just want to confirm if this started happening in the new version. Also, may I know if you are running your Druid cluster on Kubernetes?

tanisdlj commented 2 years ago

@mounikanakkala 0.22.1, running on hosts, not containers

mounikanakkala commented 2 years ago

Hi Team,

May I know if there is any update on this one?

mounikanakkala commented 2 years ago

Will this resolve with version upgrade?

OurNewestMember commented 1 year ago

Coordinator is worth focusing on. Why?

segments may not be dropped: coordinator duty to mark used
- ...although historical to execute it...although coordinator can affect health of historical based on, eg, load/drop workload including segment balancing...and so on and so on...
inconsistent query results on broker: its metadata can be intensive and has reliance on coordinator
- (obviously impacted by broker performance/workload itself)
overall historical load: can be heavily dependent on coordinator...could prevent proper segment unloading
- (eg, loading/dropping segments, even coordinator -> poor ingestion -> suboptimal segments -> more query workload, etc, etc)
- same as "segments may not be dropped" above...but this is from "point of view" of historical (especially relevant for segment missing query exception moreso than inconsistencies in the segments druid console page)
it touches the metadata datastore which can be an effective way for something like ingest (eg, heavy ingests, compaction, etc) to stall the coordinator (eg, you could heavily fragment an RDBMS with heavy ingest)

So "problem on historical" also a appears very good candidate here. However, the "inconsistencies" (in/across time...and in space: like on different historicals and pods, with different segments, different queries, upon different browser refreshes -> possibly calls to different brokers, etc) demand more commonality between the failures rather than "persistent set of coincidences" as the explanation (of course "persistent coincidences" not impossible). So I'd look at the coordinator as a relevant commonality. (...And of course coordinator can be affected by other cluster activity, like heavy ingest destabilizing the overlord running on the same hardware or the metadata store which is shared with the coordinator...all things are connected)

The point of mentioning all of this is because an upgrade may not fix a problem like this one. (Actually it could make it worse -- sometimes it happens, like potentially around 2021-12 [some new feature side effect causing much higher resource requirement for a recently enhanced in-memory column info, IIRC] and maybe also around 2022-10 [massive increase in heap requirements for streaming and batch indexing]...upgrade problems are pretty understandable with a large, complex system). Not saying you shouldn't upgrade -- just saying that regardless of that, the system could be running too close to some limits for your needs, for example. If so, the info above is about examining wherever that gap may live between desired and actual performance.

Also asking these questions to figure out if this ticket could be resolved via config or if it can support more specific design discussions (eg, if it's a configuration limitation, then maybe this issue provides test cases, etc for something like this: https://github.com/apache/druid/issues/10606)

Some questions worth mulling over...

How many segments in the cluster? (best to breakdown by used/unused...because that affects coordinator workload, plus the possibly very highly relevant workload of the overlord if sharing resources for computation/network/state/etc)

How smooth is ingest workload (demand) and performance (actual)? (Also consider compaction, even kill tasks, etc)

Any general observations related to stability and performance? (eg, dying processes, failed ingest tasks, slow publish times, indexing error messages about retries/errors in HTTP calls, ongoing logs/alerts on throttling segment balancing, etc)

Could I achieve a similar but less disruptive effect of dropping the historical pod by instead calling the markUsed/markUnused APIs of the leader coordinator? ...If so, is it acceptable for queries to potentially return "partial" results (without those segments) during the time of the coordinator/historicals/brokers doing the work -- in between marking unused and marking used again?

github-actions[bot] commented 11 months ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

winsmith commented 10 months ago

This is happening to our cluster as well. Running on Kubernetes, deleting and recreating one of our four historicals fixes this temporarily, but it seems to always return until I completely drop the relevant segments and re-import the data which is annoying and takes a while. Any advice on how to at least fix if not prevent this?

I have a feeling this has something to do with compaction, as this seems to happen a lot with segments that have been recently compacted.

github-actions[bot] commented 1 month ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

github-actions[bot] commented 1 week ago

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

apache / druid

Druid does not intermittently drop segments past retention time #12458

Affected Version

Description