We're using the same version, and several months ago we deleted more than 300K segments without any error.
Have you checked the available memory on your server? Could you attach the full task log here?
Hi @AKarbas, the kill task first retrieves all segments to kill in memory, so the OOM error probably was because of too many segments compared to the max Java heap size. You can either increase the max heap size or reduce the number of segments per kill task.
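For illustration, a minimal sketch of submitting a kill task over a narrower interval through the Overlord task API (the Overlord address, datasource name, and interval below are placeholders, not taken from this issue):

```sh
# Minimal sketch: submit a kill task covering only a narrow interval so that
# fewer segments have to be held in the task's heap at once.
# OVERLORD, the datasource name, and the interval are placeholders.
OVERLORD=http://overlord:8090

curl -X POST "$OVERLORD/druid/indexer/v1/task" \
  -H 'Content-Type: application/json' \
  -d '{
        "type": "kill",
        "dataSource": "my_datasource",
        "interval": "2020-01-01/2020-02-01"
      }'
```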
BTW, have you considered setting `druid.coordinator.kill.on`? The coordinator will issue kill tasks automatically. See https://druid.apache.org/docs/latest/configuration/index.html#coordinator-operation for more details.
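For illustration, a minimal sketch of what that could look like in the Coordinator's `runtime.properties` (the values are placeholders, not recommendations; see the linked docs for exact semantics and defaults):

```properties
# Minimal sketch: let the Coordinator issue kill tasks automatically.
# Values are illustrative only; see the linked docs for defaults and semantics.
druid.coordinator.kill.on=true
druid.coordinator.kill.period=P1D
druid.coordinator.kill.durationToRetain=P90D
druid.coordinator.kill.maxSegments=1000
```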
@AKarbas I opened https://github.com/apache/druid/issues/10002. It could help in your use case in the future.
> Hi @AKarbas, the kill task first retrieves all segments to kill in memory, so the OOM error probably was because of too many segments compared to the max Java heap size. You can either increase the max heap size or reduce the number of segments per kill task.
Hi @jihoonson, that makes sense. I have tried reducing the number of segments by specifying small(er) time periods, and that worked. I haven't tried increasing the max Java heap size; IIRC that's a peon configuration, and I'd rather not increase the Java heap size for all the other tasks that are working fine. So increasing the heap size is a last resort kinda thing.
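For context, the peon JVM settings in question live in the MiddleManager's `runtime.properties` and apply to every task that MiddleManager launches; a minimal sketch, with illustrative values only:

```properties
# Minimal sketch: peon JVM options are set on the MiddleManager and apply to
# all tasks it launches. The heap and direct memory values here are illustrative.
druid.indexer.runner.javaOptsArray=["-server","-Xms1g","-Xmx1g","-XX:MaxDirectMemorySize=1500m"]
```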
> BTW, have you considered setting `druid.coordinator.kill.on`? The coordinator will issue kill tasks automatically.
This should do it. With the `druid.coordinator.kill.maxSegments` configuration, I can make it do the whole thing in multiple runs.
Thanks. (:
@FrankChen021: Haven't checked the server memory during a task run, but I recall having enough memory on the server for the number of tasks and the task memory sizes configured. Will double-check it, though.
> Have you checked the available memory on your server?
@FrankChen021: I tested the memory usage.
It is configured to use 1GB of heap and up to 1.5GB of direct memory (as you can see in the issue description).
The metrics emitted by Druid show no more than about 200 MB of heap and non-heap memory usage (each), but looking at server metrics not emitted by Druid, the task used about 1.8 GB in total, and that much was freed after it failed. Also, about 50% of the configured memory for the whole MiddleManager was free the whole time.
For anyone interested: you could use the DELETE API mentioned in https://druid.apache.org/docs/latest/operations/api-reference.html. For example, send an HTTP DELETE request to:
`<druid url>/druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}`
where the interval can be, for example, `2020-01-01T00:00:00.000Z_2020-02-01T00:00:00.000Z`.
This way you can control how many segments you delete in a single run.
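For example, a minimal curl sketch (the Coordinator address, datasource name, and interval are placeholders; what exactly this endpoint does can vary by Druid version, so check the linked API reference):

```sh
# Minimal sketch: HTTP DELETE for one interval of a datasource via the Coordinator API.
# COORDINATOR, the datasource name, and the interval are placeholders.
COORDINATOR=http://coordinator:8081

curl -X DELETE \
  "$COORDINATOR/druid/coordinator/v1/datasources/my_datasource/intervals/2020-01-01T00:00:00.000Z_2020-02-01T00:00:00.000Z"
```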
This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.
Hi.
I'm on Druid 0.16.0-incubating on OpenJDK 8u162-jre.
I had this overly large datasource with ~170K segments, being ingested from Kafka. The problem was that the data was wrongly timestamped, and on each run of the ingestion task hundreds, if not thousands, of segments were created.
I realized this and, in an attempt to clean up, stopped the supervisor, dropped the datasource, and issued a kill task to clean it up all the way -- all through the Druid console.
Now, the kill task fails with this as its last log line:
Some logged configurations (tell me what to add):
Any ideas? I was thinking that submitting kill tasks with smaller time chunks than `1000-...` to `3000-...` could reduce the number of segments to kill in each run (sketched below), but shouldn't this be handled automatically? Lastly, if this is in fact a bug that still exists in the latest version, I'd be happy to submit a PR to fix it if you point me in the right direction.
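For illustration, a minimal sketch of splitting the work into per-year kill tasks (the Overlord address, datasource name, and year range are placeholders):

```sh
# Minimal sketch: submit one kill task per year instead of one huge interval.
# OVERLORD, the datasource name, and the year range are placeholders.
OVERLORD=http://overlord:8090

for year in $(seq 2015 2020); do
  curl -X POST "$OVERLORD/druid/indexer/v1/task" \
    -H 'Content-Type: application/json' \
    -d "{\"type\": \"kill\", \"dataSource\": \"my_datasource\", \"interval\": \"${year}-01-01/$((year + 1))-01-01\"}"
  echo
done
```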
Cheers