Do you have any input on why upload is that much slower than download? I would expect almost the same amount of data to go back and forth, and thus I would expect the durations to be similar.
The opposing trends of the total duration for upload vs download seem strange to me. What type of bandwidth caps are on the connection to your ISP? Measurements from http://www.speedtest.net or similar may give useful clues.
The reason upload is slower than download is that, in order to upload, the client first has to download the current index page: it needs to append the new entry to that page (or copy all its entries to a new page).
Thus, it makes total sense that upload is always slower than download. This is also the case in the current implementation, because the client has to download the index in order to know the address of the previous entry, so that it can link the new entry to the previous entry.
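Both variants boil down to a read-modify-write of the index. Here's a minimal sketch of the current (linked) variant, with a plain dict standing in for blob storage; all the names here are illustrative, not AtomEventStore's actual schema or API:

```python
# Read-before-write in the linked model: download the index, write the new
# entry with a link back to its predecessor, then update the index.
# Hypothetical sketch only; a dict stands in for blob storage.

store = {"index": {"latest": None}, "entries": {}}

def append_entry_linked(event):
    index = dict(store["index"])                  # download the index first
    address = "entry-{0}".format(len(store["entries"]))
    entry = {"event": event, "previous": index["latest"]}
    store["entries"][address] = entry             # upload the new entry, linked back
    store["index"] = {"latest": address}          # upload the updated index
```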
However, in the current implementation, upload time is constant (O(1)), because there's always exactly one entry in the index.
In the new implementation, that would simply be a special case, where the page size is 1.
For page size 10, for example, adding the very first event causes 0 entries to be downloaded and 1 entry to be uploaded. For the next event, the client downloads the index page, which already contains 1 entry, and then saves the index page again, now with two entries.
This continues until the index contains 10 entries. When the client adds the next event, it downloads the index page, but since the index now holds the maximum number of entries, it copies those 10 entries to a new page and uploads that. It then removes the 10 entries from the index page and adds the new event, for a total of 10 entries downloaded and 11 (10 + 1) entries uploaded.
Then the cycle starts over, because after the roll-over the index page again contains only a single entry.
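The whole paged write path, as described above, amounts to something like the following sketch. Again, this is hypothetical code with a dict in place of blob storage, not the library's actual implementation:

```python
# Paged append: the index page is always downloaded first; when it is full,
# its entries roll over into a new page and the index restarts with only
# the new event. Illustrative sketch only.

PAGE_SIZE = 10
store = {"index": [], "pages": []}

def append_event_paged(event):
    index = list(store["index"])        # download the current index page
    if len(index) < PAGE_SIZE:
        index.append(event)             # room left: index grows by one entry
    else:
        store["pages"].append(index)    # full: copy all entries to a new page
        index = [event]                 # index restarts with only the new event
    store["index"] = index              # upload the index page
```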
This means that while upload time is still independent of the total number of events in the store, an upload operation now costs on the order of the page size: in the worst case, pagesize entries downloaded and pagesize + 1 entries uploaded.
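To put concrete numbers on that, here's a back-of-the-envelope cost model (not a measurement) following the sketch above:

```python
# Entries transferred per append, given how many entries the index currently
# holds. The roll-over append is the worst case. Model only, not measured.

PAGE_SIZE = 10

def append_cost(entries_in_index):
    if entries_in_index < PAGE_SIZE:
        return entries_in_index, entries_in_index + 1  # (downloaded, uploaded)
    return PAGE_SIZE, PAGE_SIZE + 1                    # roll-over: 10 down, 11 up

print(append_cost(0))    # (0, 1): the very first event
print(append_cost(10))   # (10, 11): the worst case
```

Averaged over a full cycle of 10 appends, that works out to a bit more than half a page of entries transferred per event in each direction.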
Updated, so it's now ahead of master.
Now available on https://www.myget.org/feed/Packages/grean as AtomEventStore 0.4.0 and AtomEventStore.AzureBlob 0.4.0.
This rather big refactoring changes how the underlying storage model works. Instead of using a myriad of small files (one for each Atom entry), it uses paged Atom feeds for chunkier access. Performance measurements (see the attached images) indicate that a significant performance improvement can be gained on the read side, possibly at the cost of increased write time.
In the following graph, we see that download times drop significantly when using paged feeds instead of individual entries. However, we also see that upload times increase. The graph is a bit misleading, because I have no data for the upload time of individual entries; so while it looks like the upload time explodes (from so-small-it's-not-even-visible-in-the-graph to several minutes) when going from individual entries to paged feeds, this isn't necessarily the case. There's simply no data for that scenario.
The large upload numbers mask the differences between page sizes on the download side, so the next graph shows only the download times for various page sizes. For each page size, download time with and without pre-fetching is compared. It shows a significant improvement when pre-fetching is enabled, so this feature is part of this pull request.
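To illustrate what pre-fetching buys on the read side, here's a hypothetical sketch: the next page is requested while the current one is being processed. This is not the library's implementation, and it assumes the page addresses are known up front, which a linked feed wouldn't guarantee:

```python
# Overlap the download of the next page with the processing of the current
# one. Purely illustrative; download_page and process are assumed callables.

from concurrent.futures import ThreadPoolExecutor

def read_all_pages(download_page, process, page_addresses):
    if not page_addresses:
        return
    with ThreadPoolExecutor(max_workers=2) as pool:
        future = pool.submit(download_page, page_addresses[0])
        for address in page_addresses[1:]:
            page = future.result()                        # wait for current page
            future = pool.submit(download_page, address)  # pre-fetch the next one
            process(page)                                 # process while downloading
        process(future.result())
```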
The final graph compares the size of the stored events. As we can see, the only significant difference is that moving from individual entry files to paged files noticeably reduces the size on disk. However, this measurement was made on my local file system, so it may be different on e.g. Windows Azure Blob Storage.
All measurements were made from my local Lenovo X1 Carbon laptop in Copenhagen, Denmark against Windows Azure Blob Storage in Western Europe. The pre-fetching feature introduces a degree of parallelism, so results may vary based on the number of processors available to the client.