apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Add datasegment copier interface and s3 impl #17430

Open jtuglu-netflix opened 3 weeks ago

jtuglu-netflix commented 3 weeks ago

This PR adds a DataSegmentCopier interface and a corresponding S3DataSegmentCopier implementation. The goal is to provide an alternative for those who want to copy datasegments between clusters. These classes are used in a CLI tool for copying datasources between clusters, similar to the older, now-deprecated migration tool; we plan to release that tool as open source soon as well.

This also adds the ability for these transfer tools to move segments larger than 5GB using an S3 Transfer Manager.
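
To illustrate why segments above 5GB need special handling: a single S3 copy request is capped at roughly that size, so a larger object has to be copied as a series of byte-range parts. The sketch below only plans those ranges and does not call S3. `MultipartCopyPlanner`, `PartRange`, and the exact 5 GiB constant are illustrative assumptions, not code from this PR, and the Transfer Manager performs this kind of splitting internally.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of this PR): plans byte ranges for a multipart copy. */
final class MultipartCopyPlanner {
    /** Assumed single-request copy limit of 5 GiB. */
    static final long SINGLE_COPY_LIMIT = 5L * 1024 * 1024 * 1024;

    /** Inclusive byte range [first, last] covered by one part. */
    static final class PartRange {
        final long first;
        final long last;

        PartRange(long first, long last) {
            this.first = first;
            this.last = last;
        }
    }

    /** True when the object is too large for a single copy request. */
    static boolean needsMultipart(long objectSize) {
        return objectSize > SINGLE_COPY_LIMIT;
    }

    /** Splits [0, objectSize) into consecutive ranges of at most partSize bytes. */
    static List<PartRange> plan(long objectSize, long partSize) {
        if (objectSize <= 0 || partSize <= 0) {
            throw new IllegalArgumentException("objectSize and partSize must be positive");
        }
        List<PartRange> parts = new ArrayList<>();
        for (long first = 0; first < objectSize; first += partSize) {
            // The final part may be shorter than partSize.
            long last = Math.min(first + partSize, objectSize) - 1;
            parts.add(new PartRange(first, last));
        }
        return parts;
    }
}
```

For example, `MultipartCopyPlanner.plan(10, 4)` yields three ranges: 0-3, 4-7, and 8-9.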

Description

Currently, Druid only provides a means of moving a datasegment from one deep storage location to another, which deletes it from the source. This PR adds the flexibility to copy instead, and refactors the code common to S3DataSegmentMover and S3DataSegmentCopier into a shared S3DataSegmentTransferUtility.
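
A minimal sketch of the copy-versus-move distinction, using a toy in-memory store in place of S3. `InMemoryDeepStorage` and its methods are hypothetical stand-ins and do not reflect the actual signatures in this PR. The point is only that both operations can share one transfer routine that differs in whether the source is deleted afterwards.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for deep storage (hypothetical, not the PR's classes). */
final class InMemoryDeepStorage {
    private final Map<String, byte[]> objects = new HashMap<>();

    void put(String path, byte[] data) {
        objects.put(path, data.clone());
    }

    boolean exists(String path) {
        return objects.containsKey(path);
    }

    /** Shared transfer logic: copy always happens, deletion is optional. */
    private void transfer(String source, String target, boolean deleteSource) {
        byte[] data = objects.get(source);
        if (data == null) {
            throw new IllegalArgumentException("No object at " + source);
        }
        objects.put(target, data.clone());
        // Never delete when source and target are the same location.
        if (deleteSource && !source.equals(target)) {
            objects.remove(source);
        }
    }

    /** Copy: the source object is left in place. */
    void copy(String source, String target) {
        transfer(source, target, false);
    }

    /** Move: the source object is removed after the copy succeeds. */
    void move(String source, String target) {
        transfer(source, target, true);
    }
}
```

After `copy("a", "b")` both paths exist; after `move("a", "c")` only the target remains.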

kfaraz commented 3 weeks ago

Thanks for the PR, @jtuglu-netflix! Could you share some details on how you plan to use this feature?