apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.61k stars 1.02k forks source link

Make FST BytesStore grow smoothly #12619

Open gf2121 opened 12 months ago

gf2121 commented 12 months ago

Description

Too bad we don't have a writer that uses tiny (like 8 bytes) block at first, but doubles size for each new block (16 bytes, 32 bytes next, etc.). Then we would naturally use log(size) number of blocks without over-allocating.

But then reading bytes is a bit tricky because we'd need to take discrete log (base 2) of the address. Maybe it wouldn't be so bad -- we could do this with Long.numberOfLeadingZeros maybe? But that's a bigger change ... we can do this separately/later.

From https://github.com/apache/lucene/pull/12604#discussion_r1344639608

mikemccand commented 10 months ago

Note that oal.store.ByteBuffersDataOutput takes a different and neat approach to gracefully growing: it picks an initial block size (1024 bytes by default), and appends new blocks as you write bytes, but then if it reaches 100 blocks, it "resizes" itself by doubling the block size and copying over, so that now you have 50 blocks.

So it's still O(N) amortized cost of that doubling/copying with time, and at any given moment you will not be wasting too many %tg of the bytes you've written, except at the start 1 KB block size.

dungba88 commented 10 months ago

In https://github.com/apache/lucene/pull/12624, I moved the main FST body out of BytesStore into ByteBuffersDataOutput, and BytesStore becomes only a single byte[] for the currently written node so maybe we don't need to do this?