TheCandidStartup / TheCandidStartup.github.io

The Candid Startup Blog
https://www.thecandidstartup.org
BSD 3-Clause "New" or "Revised" License
0 stars 0 forks source link

2023/05/01/spreadsheet-snapshot-data-structures #17

Open utterances-bot opened 1 year ago

utterances-bot commented 1 year ago

Data Structures for Spreadsheet Snapshots

I have a plan. After a round of brainstorming and benchmarking, I’ve decided to use Event Sourcing to store the sequence of operations applied to a spreadsheet. Every so often I’ll create snapshots of the current spreadsheet state. I can then load the spreadsheet at any point in time by loading a...

https://www.thecandidstartup.org/2023/05/01/spreadsheet-snapshot-data-structures.html

Wayne82 commented 1 year ago

When you say "every so often, I’ll create snapshots of the current spreadsheet state", will the new snapshot contains the same segments as the previous one besides its own new ones, or some sort of merging will happen whenever a new snapshot is created?

Is the bloom filter per segment? And what is "the key within the segment"?

In the "Let’s look at some concrete numbers" section, how the least indexed columns and rows are calculated by fitting the index into a 1MB chunk? I was thinking 64K 128 for the least columns, and 64K 2 for the least rows, but resulted in different numbers.

timwiegand commented 1 year ago

It depends on the approach used. I show three approachs in the Segments section. The first always creates a new segment with the previous segment merged in. The second contains all the previous segments as well as the one just written out. The third (which is what I intend to implement) is a hybrid where there is some merging and some reuse of previous segments.

The way I described it was a bloom filter per segment. In the case of a spreadsheet the key is the row and column index of a cell.

The minimum number of indexes colums/rows is explained in footnote 1. The index uses a single entry for a run of stripes with the same width. If you actually had a segment with 64K 128 wide stripes, you would only need 1 index entry. If all stripes are the same width you have no real limit on number of columns. The worst case is where you have alternating stripe widths. For example 128,256,128,256,... Now you need an entry per stripe and the number of columns you can fit is (128+256)*32K = 12M.