Dyalog / vecdb

A simple "columnar database" based on memory-mapped files, written in APL
MIT License

Summarize function in sharded db #4

Open e9gille opened 8 years ago

e9gille commented 8 years ago

The Summarize function in sharded databases doesn't "summarize" the individual shard results. It also attempts to re-summarize partial results on WS FULL, but I believe it is doing so incorrectly by using the same summary function as originally used on the raw data.

e9gille commented 8 years ago

Added test cases to highlight the issue:

5

mkromberg commented 8 years ago

I believe the WS FULL implementation is currently correct, but only because the only summary functions supported are count, sum, max and min. If you needed to add avg or similar functions, you'd need to do more work. I will look at the sharding issue.
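To illustrate why avg would need more work than sum, max or min, here is a minimal sketch in Python (not APL, and not vecdb code). The names `summarize_avg` and `merge_avg` are hypothetical and exist only for this example. The idea is that sum and max can be re-applied to their own partial results, while an average has to carry a (sum, count) pair and divide only at the end.

```python
def summarize_avg(chunk):
    """Partial summary for an average: keep (sum, count), not the mean."""
    return (sum(chunk), len(chunk))


def merge_avg(partials):
    """Combine (sum, count) partials and only then divide."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Two chunks of raw data, e.g. two passes after a WS FULL (illustrative values).
chunks = [[1, 2, 3], [10]]
flat = [x for c in chunks for x in c]

# sum and max give the same answer when re-applied to partial results.
sum_ok = sum(sum(c) for c in chunks) == sum(flat)
max_ok = max(max(c) for c in chunks) == max(flat)

# Averaging the per-chunk averages weights the chunks equally, which is wrong.
naive = sum(sum(c) / len(c) for c in chunks) / len(chunks)

# Merging (sum, count) pairs gives the true mean of all four values.
correct = merge_avg([summarize_avg(c) for c in chunks])
```

Here `naive` comes out as 6.0 while `correct` is 4.0, the true mean of 1, 2, 3 and 10.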

e9gille commented 8 years ago

Well, count would be incorrect too, since re-summarizing should sum the individual counts rather than count them again. It is buggy anyway, because the groupfn takes vectors of columns as its argument. I've fixed this in my fork and added new functions to re-summarize.
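The point about count can be sketched as follows, again in Python rather than APL. This is an assumption about the general shape of the fix, not the code in the fork; `count_by_group` and `merge_counts` are hypothetical names. Each shard produces per-group counts, and the merge step adds those counts instead of counting the partial rows.

```python
from collections import Counter


def count_by_group(keys):
    """Summarize raw data: number of records per group key."""
    return Counter(keys)


def merge_counts(partials):
    """Re-summarize: add the per-group counts from every partial result."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts for matching keys
    return dict(merged)


# Group keys as seen by two shards (illustrative data).
shard_keys = [["a", "b", "a"], ["b", "b", "c"]]
partials = [count_by_group(keys) for keys in shard_keys]

# Correct: sum the partial counts.
merged = merge_counts(partials)

# Incorrect: applying count again counts partial rows, not records.
recounted = dict(Counter(group for partial in partials for group in partial))
```

With this data `merged` is `{'a': 2, 'b': 3, 'c': 1}`, whereas `recounted` is `{'a': 1, 'b': 2, 'c': 1}`, which only says how many shards contained each group.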