apache / accumulo

Apache Accumulo
https://accumulo.apache.org
Apache License 2.0

Clone table should optionally allow specifying a range to clone #4123

Open FineAndDandy opened 9 months ago

FineAndDandy commented 9 months ago

Is your feature request related to a problem? Please describe.

Cloning an entire table is overkill when only a small subset of it needs to be cloned. Cloning the whole table for a small fragment of data produces a large amount of unnecessary GC overhead when the clone is cleaned up.

Describe the solution you'd like

Adding an optional range to the clone table operation would allow a subset of the table to be cloned. This would limit the GC overhead to only the files relevant to the clone.

ArbaazKhan1 commented 8 months ago

I can take a look at this

ctubbsii commented 8 months ago

I'm not so sure we should do this. This would require a new API, so it can't be done in 2.1 where it'd be most useful. Users already have the ability to clone and efficiently truncate a table. That efficiency is limited in 2.1, due to chop compactions, which go away in 3.1. In 3.1, it'd probably be better to implement support for allowing range deletion to occur on an offline table, since it doesn't need to be online for chop compactions. That would support an offline truncate, for the situations where users don't want to bring the table online and host it in order to perform the operation. For the elasticity branch, I believe the truncate operation can already happen on an unhosted table, so it's not needed there.
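
For illustration, the clone-then-truncate approach mentioned above can be sketched roughly as follows. This is a minimal in-memory model, not Accumulo code: `CloneRangeSketch`, `RowRange`, `rangesToDelete`, and `cloneRange` are hypothetical names, and a `TreeMap` stands in for a table. The comments show where the existing `TableOperations.clone` and `TableOperations.deleteRows` calls would go, assuming `deleteRows` treats its start row as exclusive and its end row as inclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: models "clone, then truncate to a range" in memory.
class CloneRangeSketch {

    /** Row interval (startExclusive, endInclusive]; a null bound means unbounded. */
    static final class RowRange {
        final String startExclusive;
        final String endInclusive;

        RowRange(String startExclusive, String endInclusive) {
            this.startExclusive = startExclusive;
            this.endInclusive = endInclusive;
        }
    }

    /** Ranges to delete from a full clone so only rows in (keepStart, keepEnd] remain. */
    static List<RowRange> rangesToDelete(String keepStart, String keepEnd) {
        List<RowRange> out = new ArrayList<>();
        if (keepStart != null) {
            out.add(new RowRange(null, keepStart)); // everything up to and including keepStart
        }
        if (keepEnd != null) {
            out.add(new RowRange(keepEnd, null)); // everything after keepEnd
        }
        return out;
    }

    /** In-memory stand-in for a row-range delete: removes rows in (start, end]. */
    static void deleteRows(NavigableMap<String, String> table, RowRange r) {
        NavigableMap<String, String> view = table;
        if (r.startExclusive != null) {
            view = view.tailMap(r.startExclusive, false);
        }
        if (r.endInclusive != null) {
            view = view.headMap(r.endInclusive, true);
        }
        view.clear(); // clearing the view removes those rows from the backing map
    }

    /** Clone the whole table, then trim the clone down to (keepStart, keepEnd]. */
    static NavigableMap<String, String> cloneRange(
            NavigableMap<String, String> source, String keepStart, String keepEnd) {
        // With a real client this step would be something like:
        //   client.tableOperations().clone(src, dst, true, Map.of(), Set.of());
        // Here the data is copied; a real clone shares files instead.
        NavigableMap<String, String> clone = new TreeMap<>(source);
        for (RowRange r : rangesToDelete(keepStart, keepEnd)) {
            // With a real client: client.tableOperations().deleteRows(dst, start, end);
            deleteRows(clone, r);
        }
        return clone;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> source = new TreeMap<>();
        for (String row : new String[] {"a", "b", "c", "d"}) {
            source.put(row, "value");
        }
        System.out.println(cloneRange(source, "a", "c").keySet()); // prints [b, c]
    }
}
```

The sketch only shows which row ranges need to be removed from the clone. It says nothing about the cost of the deletes, which is the part that depends on the Accumulo version as described above.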