Open retorquere opened 1 year ago
First issue I see is that you're creating a BlinkDB index on firstmail instead of firstname.
But I'll take a look at the query performance without indexes, BlinkDB shouldn't be 18x as slow as Loki when you search for a single key.
Oopsie.
Both indexed:
loki x 1,501,305 ops/sec ±1.24% (90 runs sampled)
blink x 1,290,563 ops/sec ±0.93% (92 runs sampled)
Fastest is loki
Both unindexed:
loki x 17,738 ops/sec ±2.96% (91 runs sampled)
blink x 1,158 ops/sec ±6.18% (82 runs sampled)
Fastest is loki
I'm also not seeing the performance in the sample from the front-page; unless I mucked it up again, I'm getting this (script below):
search should find 52 users
scan x 4,240 ops/sec ±4.25% (83 runs sampled)
sharding x 1,693,713 ops/sec ±1.56% (81 runs sampled)
loki x 42,372 ops/sec ±1.31% (86 runs sampled)
blinkdb x 933 ops/sec ±1.84% (80 runs sampled)
Fastest is sharding
script:
var Benchmark = require('benchmark');
const Loki = require('lokijs')
const { createDB, createTable, insertMany, many } = require("blinkdb")

// data at https://gist.github.com/retorquere/c71d72f63a6a79ad3d0f6cd217e71b03
const items = require('./mock.json')

async function main() {
  const blinkdb = createDB()
  const blinktable = createTable(blinkdb, 'users')({
    primary: "id",
    indexes: ["age", "name"],
  });
  await insertMany(blinktable, items)

  const DB = new Loki('better-bibtex', {})
  const coll = DB.addCollection('citekeys', {
    indices: [ 'id', 'age', 'name' ],
    unique: [ 'id' ],
  })
  coll.insert(items)

  var suite = new Benchmark.Suite()

  const names = ['Davies', 'Gosling']
  const age = 24
  const n = items.filter(item => item.age > age && names.includes(item.name)).length
  console.log('> search should find', n, 'users')

  const shards = new Map
  for (const user of items) {
    if (!shards.has(user.name)) {
      shards.set(user.name, [ user ])
    }
    else {
      shards.get(user.name).push(user)
    }
  }

  suite
    .add('scan', function() {
      if (items.filter(item => item.age > age && names.includes(item.name)).length !== n) throw new Error('')
    })
    .add('sharding', function() {
      let found = []
      for (const name of names) {
        found = found.concat((shards.get(name) || []).filter(user => user.age > age))
      }
      if (found.length !== n) throw new Error('')
    })
    .add('loki', function() {
      if (coll.find({ name: { $in: names }, age: { $gt: age } }).length !== n) throw new Error('')
    })
    .add('blinkdb', {
      defer: true,
      fn: async function(deferred) {
        if ((await many(blinktable, { where: { name: { in: names }, age: { gt: age } } })).length !== n) throw new Error('')
        deferred.resolve()
      }
    })
    .on('cycle', function(event) { console.log('> ' + String(event.target)) })
    .on('complete', function() {
      console.log('> Fastest is ' + this.filter('fastest').map('name'));
    })
    .run({ async: true });
}

main()
And if I'm looking for names that are not in the dataset:
search should find 0 users
scan x 5,207 ops/sec ±4.94% (87 runs sampled)
sharding x 13,567,067 ops/sec ±1.86% (87 runs sampled)
loki x 1,055,781 ops/sec ±1.19% (87 runs sampled)
blinkdb x 1,261 ops/sec ±1.72% (77 runs sampled)
Fastest is sharding
but not only is loki faster in both cases, a table scan is also faster.
And for 100.000 items (as per the case on the front page):
search should find 3 users
scan x 1,139,156 ops/sec ±1.15% (88 runs sampled)
sharding x 3,938,455 ops/sec ±3.28% (84 runs sampled)
loki x 255,922 ops/sec ±0.82% (90 runs sampled)
blinkdb x 119,386 ops/sec ±1.09% (79 runs sampled)
Fastest is sharding
Is the benchmark from the front page in the benchmark suite? I can't find it but I'd love to tinker around with it. My numbers don't align with that graph at all, so there must be something relevantly different between how these tests are set up.
| (index) | Task Name | ops/sec | Average Time (ns) | Margin | Samples |
| --- | --- | --- | --- | --- | --- |
| 0 | 'scan' | '505' | 1978472.635444445 | '±1.45%' | 253 |
| 1 | 'partitions' | '386,787' | 2585.400346700469 | '±1.11%' | 193440 |
| 2 | 'loki' | '8,665' | 115399.97140832458 | '±0.63%' | 4333 |
| 3 | 'blinkdb' | '140' | 7106147.1843383685 | '±3.94%' | 71 |
Thanks for taking a look at this :)
If you check the official benchmarks for BlinkDB, do you get similar results there (with BlinkDB consistently being slower than alternatives)? The blinkdb benchmarks only compare my DB to LokiJS, but in my test runs BlinkDB generally reaches 0.8x - 1.2x of the performance of LokiJS, depending on the test case.
I'll try implementing your testcase within the BlinkDB benchmarks from scratch, then we can see if there are any issues on either side.
Is the benchmark from the front page in the benchmark suite?
Not really, I either did that manually a few versions ago or deleted it from the benchmarks in the last refactor. In either case, I should add it back in and update the values on the front page.
I added a benchmark myself, available on the feature/add-front-page-test branch here: https://github.com/blinkdb-js/blinkdb/blob/feature/add-front-page-test/packages/benchmarks/src/benchmarks/blinkdb/front-page.ts
You can run it by cloning the repo, navigating to the benchmark package, and running npm run start -- "blinkdb/front-page.ts".
My results look like this for 100.000 items in the DB:
blinkdb/front-page.ts --- map is 1.43x faster than scan
┌─────────┬───────────┬────────────────────┬────────────────────┬───────────┬─────────┐
│ (index) │ name │ ops/sec │ Average Time (ns) │ Margin │ Samples │
├─────────┼───────────┼────────────────────┼────────────────────┼───────────┼─────────┤
│ 0 │ 'map' │ 318.43057225700716 │ 3140401.981229661 │ '±2.76%' │ 160 │
│ 1 │ 'scan' │ 221.94537258402815 │ 4505613.198226972 │ '±2.49%' │ 111 │
│ 2 │ 'blinkdb' │ 105.78578801070266 │ 9453065.660378002 │ '±10.83%' │ 53 │
│ 3 │ 'lokijs' │ 79.3651495976927 │ 12599988.849880174 │ '±9.05%' │ 40 │
└─────────┴───────────┴────────────────────┴────────────────────┴───────────┴─────────┘
Comparing our respective implementations, a few things I noticed:
- Both BlinkDB and LokiJS have the option to clone returned objects. In BlinkDB this is turned on by default, in LokiJS off by default. You can turn it off with createDB({ clone: false }). This doubles the performance of BlinkDB in this specific benchmark for me.
- Your scan is particularly slow, and your map/partitions particularly fast compared to mine. I haven't found a specific reason for this yet.

Maybe you can gain other insights or discover issues looking through the code of my benchmark?
Testing with your dataset gets me a benchmark more comparable to yours, but BlinkDB is somehow still faster on my side.
blinkdb/front-page.ts --- map is 42.64x faster than lokijs
┌─────────┬───────────┬────────────────────┬────────────────────┬──────────┬─────────┐
│ (index) │ name │ ops/sec │ Average Time (ns) │ Margin │ Samples │
├─────────┼───────────┼────────────────────┼────────────────────┼──────────┼─────────┤
│ 0 │ 'map' │ 1489335.8240912817 │ 671.4402378725764 │ '±1.76%' │ 744668 │
│ 1 │ 'lokijs' │ 34925.248617732905 │ 28632.580713892505 │ '±1.07%' │ 17463 │
│ 2 │ 'blinkdb' │ 29151.137594377647 │ 34303.978593029904 │ '±3.88%' │ 14576 │
│ 3 │ 'scan' │ 2969.203635917305 │ 336790.6424144804 │ '±0.64%' │ 1485 │
└─────────┴───────────┴────────────────────┴────────────────────┴──────────┴─────────┘
How do you run this? I'm getting Cannot use import statement outside a module when I start it with npx ts-node.
If you check the official benchmarks for BlinkDB, do you get similar results there (with BlinkDB consistently being slower than alternatives) ?
I haven't been able to get them to work yet.
and running npm run start -- "blinkdb/front-page.ts".
That gets me
Unknown file extension ".ts" for /Users/emile/github/blinkdb/packages/benchmarks/src/index.ts
- Both BlinkDB and LokiJS have the option to clone returned objects. In BlinkDB this is turned on by default, in LokiJS off by default.
That is true. But when I turn clone off for blinkdb, my tests still show everything else coming out faster than blinkdb.
- Your scan is particularly slow,
It's supposed to be a benchmark for the worst case. Nothing should be slower than a full scan.
and your map/partitions particularly fast compared to mine. I haven't found a specific reason for this yet.
JavaScript's Map is just ridiculously fast, and after selection of the partition it's all integer comparisons. It's doing half the work, and the fast half at that.
I was working on a lokijs partition command where you could select one partition and then filter on the much smaller resultset as usual. But lokijs has not been active for a long while, and I'd much prefer to move to blinkdb if we can get this apparent disparity sorted out.
Testing with your dataset gets me a benchmark more comparable to yours, but BlinkDB is somehow still faster on my side.
Well now... I've gotten this script to run and I'm now getting
| (index) | Task Name | ops/sec | Average Time (ns) | Margin | Samples |
| --- | --- | --- | --- | --- | --- |
| 0 | 'scan' | '503' | 1987200.3532576538 | '±0.44%' | 252 |
| 1 | 'map' | '306,251' | 3265.293322546093 | '±2.15%' | 153126 |
| 2 | 'lokijs' | '9,114' | 109709.38964793888 | '±1.07%' | 4558 |
| 3 | 'blinkdb' | '20,498' | 48784.63455380463 | '±1.24%' | 10250 |
And this is the same script with clone turned on for both:
| (index) | Task Name | ops/sec | Average Time (ns) | Margin | Samples |
| --- | --- | --- | --- | --- | --- |
| 0 | 'scan' | '483' | 2066123.4032791948 | '±0.46%' | 243 |
| 1 | 'map' | '511,727' | 1954.164901172087 | '±2.96%' | 255864 |
| 2 | 'lokijs' | '1,874' | 533549.3187751692 | '±0.59%' | 938 |
| 3 | 'blinkdb' | '18,850' | 53049.23583965835 | '±0.84%' | 9426 |
How does cloning make such a big difference when a search returns so few records? There is no cloning during the search right, just for the hits that are returned from the search?
and your map/partitions particularly fast compared to mine. I haven't found a specific reason for this yet.
I would imagine that partitions could be implemented as hooks on insert/update/create/delete that would maintain T[] records in a Map keyed on the partition field values, and that there would be an equivalent for get that would return that partition's T[]. These are not as flexible as general query methods (you can use them only at the start of the query chain), but given the results in these benchmarks they would provide a massive performance benefit where you can use them (and I have such a use case).
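As a rough illustration of the idea in plain JavaScript (no BlinkDB involved; onInsert/onRemove/getPartition are hypothetical names, not an existing API), maintaining such a partition map only touches one bucket per mutation:

```javascript
// Hypothetical sketch of maintaining partitions via insert/remove hooks.
// The Map is keyed on the partition field ("name"); each mutation only
// touches the one bucket for that key, instead of rebuilding the map.
const partitions = new Map();

function onInsert(user) {
  const bucket = partitions.get(user.name);
  if (bucket) bucket.push(user);
  else partitions.set(user.name, [user]);
}

function onRemove(user) {
  const bucket = partitions.get(user.name) || [];
  const i = bucket.indexOf(user);
  if (i >= 0) bucket.splice(i, 1);
  if (bucket.length === 0) partitions.delete(user.name);
}

// "get" equivalent: return the partition for one key.
function getPartition(name) {
  return partitions.get(name) || [];
}

const alice = { name: 'Alice', age: 30 };
onInsert(alice);
onInsert({ name: 'Alice', age: 20 });
onInsert({ name: 'Bob', age: 40 });

// The query shape from the benchmarks: one partition, then filter on age.
const found = getPartition('Alice').filter(u => u.age > 24);
```

After selecting the partition, only the small bucket is filtered, which is the "half the work" effect described above.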
and running npm run start -- "blinkdb/front-page.ts".
I can just run "npm run test" on the toplevel -- that works.
| (index) | Task Name | ops/sec | Average Time (ns) | Margin | Samples |
| --- | --- | --- | --- | --- | --- |
| 0 | 'scan' | '456' | 2189178.0046982015 | '±1.19%' | 229 |
| 1 | 'partition' | '26,279' | 38052.23829942207 | '±1.35%' | 13140 |
| 2 | 'lokijs' | '1,697' | 589064.4982071029 | '±0.76%' | 849 |
| 3 | 'blinkdb' | '14,259' | 70128.07589656506 | '±0.93%' | 7130 |
edit: I had previously overlooked that I still needed to clone. With the clone overhead partitioning still comes out ahead, but the performance boost varies between 85% and 1300% depending on the size of the result set (smaller being faster regardless of the size of the DB being searched) rather than the consistent several orders of magnitude I was seeing. I'm not sure how I would make dynamic partition maintenance work without cloning. Loki seems to take a really hard hit from cloning. AAMOF I can substantially improve lokijs performance just by replacing its clone with rfdc. Still not as fast as blinkdb, but usually twice as fast as with lokijs's own clone.
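The clone swap described here can be sketched like this (not Loki's actual internals; `fastClone` is a minimal stand-in for what `require('rfdc')()` returns, so the sketch runs without the package):

```javascript
// Sketch: swapping the deep-clone implementation. A JSON parse/stringify
// round-trip (Loki's default style of cloning) vs. a direct recursive
// clone in the spirit of rfdc. Both produce equal but distinct objects;
// only the speed differs.
const jsonClone = (obj) => JSON.parse(JSON.stringify(obj));

// Stand-in for rfdc's generated clone function (plain data only).
const fastClone = (obj) => {
  if (obj === null || typeof obj !== 'object') return obj;
  if (Array.isArray(obj)) return obj.map(fastClone);
  const out = {};
  for (const key of Object.keys(obj)) out[key] = fastClone(obj[key]);
  return out;
};

const user = { id: 1, name: 'Davies', tags: ['a', 'b'] };
const a = jsonClone(user);
const b = fastClone(user);
```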
Is there a general hook for "each deleted(/inserted/updated) entry" regardless of whether it was called through removeMany/remove/removeWhere?
I must admit your use of typescript is a lot more advanced than mine; what I had envisioned was something along the lines of
const items = await many(userTable, {
  where: {
    name: { partition: ["Alice", "Charlie"] },
    age: { gt: 24 },
  },
});
but I've run aground in adding this to the query evaluation.
Is it correct that get is synchronous? I could write my own wrapper to implement one and many to get sync reads?
If you need an array of items from the DB that is regenerated whenever items are inserted/updated/deleted, you don't need to implement partitions, you can just use watch. This gets you the fastest possible speed on read operations with small performance hits on every write.
How does cloning make such a big difference when a search returns so few records? There is no cloning during the search right, just for the hits that are returned from the search?
Can't speak for LokiJS, but BlinkDB only clones items returned from a call to get.
Is there a general hook for "each deleted(/inserted/updated) entry" regardless of whether it was called through removeMany/remove/removeWhere?
watch(...) doesn't give you the "changed" items (yet), but it does give you all items that match a given query. It's basically a simplified use(...) hook.
Is it correct that get is synchronous? I could write my own wrapper to implement one and many to get sync reads?
Yes. The entire query engine is synchronous. BlinkDB is just async in order to support async middleware.
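That pattern — a synchronous engine behind an async surface so middleware can be awaited — can be sketched like this (hypothetical names, not BlinkDB's actual internals):

```javascript
// Sketch: a fully synchronous query engine exposed through an async API,
// purely so that (possibly async) middleware can run around each query.
const middleware = [];
const use = (fn) => middleware.push(fn);

// The engine itself is synchronous.
const syncQuery = (table, filter) => table.filter(filter);

// The public API is async only to accommodate async middleware.
async function many(table, filter) {
  let next = () => Promise.resolve(syncQuery(table, filter));
  for (const mw of [...middleware].reverse()) {
    const prev = next;
    next = () => mw({ next: prev });
  }
  return next();
}

const users = [{ name: 'Alice', age: 30 }, { name: 'Bob', age: 20 }];
use(async (ctx) => {
  const result = await ctx.next(); // could await I/O here
  return result;
});

// Resolves immediately when no middleware actually awaits anything,
// because the underlying engine is sync.
const adultsPromise = many(users, (u) => u.age > 24);
```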
If you need an array of items from the DB that is regenerated whenever items are inserted/updated/deleted, you don't need to implement partitions, you can just use watch. This gets you the fastest possible speed on read operations with small performance hits on every write.
That's not entirely the same though. With the partition solution, you get the speedup without knowing the names in advance. The watch as used here only works if you know up front which names you're going to select. I need something where the names can be arbitrary. As far as I can tell, this watch setup would only work if I rebuilt the partitions for every mutation, which is going to be prohibitive for large tables. The map approach I posted only makes one or two changes to the map for each mutation, rather than rebuilding it.
watch(...) doesn't give you the "changed" items (yet), but it does give you all items that match a given query. It's basically a simplified use(...) hook.
I do need the changed/deleted/added items though, as I use them in my own event handling. How would I implement my use on a removeWhere? I would have to run the query again before the call to ctx.next, I suppose?
Yes. The entire query engine is synchronous. BlinkDB is just async in order to support async middleware.
Gotcha -- that is not a problem for me, I just need read/search to be sync, so I can start working on blink integration.
I'm trying to understand how the query infrastructure works, is there any documentation on the design of it?
I do need the changed/deleted/added items though, as I use them in my own event handling. How would I implement my use on a removeWhere? I would have to run the query again before the call to ctx.next I suppose?
I think it best to implement this in watch already. The callback could simply provide a diff argument which contains the elements that were added, updated, or deleted that triggered the watch hook.
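What such a diff argument could contain can be illustrated by computing it from the previous and current result sets (a sketch with hypothetical shapes, keyed on the primary key):

```javascript
// Sketch: computing the diff a watch callback could receive, given the
// previous and current query results, keyed on the primary key "id".
function computeDiff(before, after) {
  const prev = new Map(before.map((item) => [item.id, item]));
  const next = new Map(after.map((item) => [item.id, item]));
  const diff = { added: [], updated: [], removed: [] };
  for (const [id, item] of next) {
    if (!prev.has(id)) diff.added.push(item);
    else if (prev.get(id) !== item) diff.updated.push(item);
  }
  for (const [id, item] of prev) {
    if (!next.has(id)) diff.removed.push(item);
  }
  return diff;
}

const before = [{ id: 1, age: 24 }, { id: 2, age: 30 }];
const after = [{ id: 2, age: 31 }, { id: 3, age: 40 }];
const diff = computeDiff(before, after);
```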
If you want to implement this using removeWhere, you could use the query engine yourself (the get, although I don't know if that API is publicly exposed yet) to retrieve all items that match the where filter, then perform some operation on them, then pass them to ctx.next.
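The removeWhere route would follow this general middleware shape (a generic sketch over a plain array, not BlinkDB's actual use(...) API):

```javascript
// Sketch: in a removeWhere-style hook, re-run the query first to capture
// the items about to be removed, hand them to event handlers, then let
// the actual operation proceed (the ctx.next() equivalent).
function removeWhere(table, predicate, hooks) {
  // "query engine": find the matching items before they disappear.
  const matching = table.filter(predicate);
  for (const hook of hooks) hook(matching); // e.g. own event handling
  // ctx.next() equivalent: perform the actual removal.
  for (const item of matching) {
    table.splice(table.indexOf(item), 1);
  }
  return matching;
}

const table = [{ id: 1, m: 'math' }, { id: 2, m: 'text' }, { id: 3, m: 'math' }];
const seen = [];
const removed = removeWhere(table, (row) => row.m === 'math', [
  (items) => seen.push(...items),
]);
```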
I think it best to implement this in watch already. The callback could simply provide a diff argument which contains the elements that were added, updated, or deleted that triggered the watch hook.
That would work.
If you want to implement this using removeWhere, you could use the query engine yourself (the get, although I don't know if that API is publicly exposed yet)
Sort of -- I can fetch it using import { get } from 'blinkdb/src/query' -- edit: but not without a host of errors. I really do need to be able to use get.
to retrieve all items that match the where filter, then perform some operation on them, then pass them to ctx.next.
I'd much prefer the former option :)
Looks like in the interim I could just subscribe to the dispatcher using
table[BlinkKey].events.onInsert.register((changes) => console.log(changes))
table[BlinkKey].events.onUpdate.register((changes) => console.log(changes))
table[BlinkKey].events.onRemove.register((changes) => console.log(changes))
table[BlinkKey].events.onClear.register((changes) => console.log(changes))
correct?
Me again -- I took your last script and modified it (to avoid screwups like before) for a new case I'm working on, script + data are here, but I'm getting this for single-field indexed lookup:
┌─────────┬───────────┬──────────┬───────────────────┬──────────┬─────────┐
│ (index) │ Task Name │ ops/sec │ Average Time (ns) │ Margin │ Samples │
├─────────┼───────────┼──────────┼───────────────────┼──────────┼─────────┤
│ 0 │ 'lokijs' │ '18,257' │ 54772.43624664317 │ '±0.51%' │ 9129 │
│ 1 │ 'blinkdb' │ '13,989' │ 71482.33448462115 │ '±0.25%' │ 6995 │
└─────────┴───────────┴──────────┴───────────────────┴──────────┴─────────┘
For m: { in: ['math', 'text'] } the difference is even more pronounced:
┌─────────┬───────────┬──────────┬────────────────────┬──────────┬─────────┐
│ (index) │ Task Name │ ops/sec │ Average Time (ns) │ Margin │ Samples │
├─────────┼───────────┼──────────┼────────────────────┼──────────┼─────────┤
│ 0 │ 'lokijs' │ '20,296' │ 49269.873422405995 │ '±0.51%' │ 10149 │
│ 1 │ 'blinkdb' │ '11,252' │ 88871.98824261357 │ '±0.38%' │ 5627 │
└─────────┴───────────┴──────────┴────────────────────┴──────────┴─────────┘
I've managed to get the benchmarks to run; I've taken the front-page.ts benchmark and modified it to this, so I'm pretty sure I got it right this time:
import { createDB, createTable, insertMany, many } from "blinkdb";
import loki from "lokijs";
import { Bench } from "tinybench";

interface Char {
  id: number;
  u: string;
  s: string;
  t: string;
  m: string;
  p: string;
}

const blinkdb = createDB({ clone: false });
let blinkUserTable = createTable<Char>(blinkdb, "users")({
  primary: "id",
  indexes: [ 'u', 's', 't', 'm', 'p' ],
});

const lokidb = new loki("users.db");
let lokiUserTable = lokidb.addCollection<Char>("users", {
  unique: ["id"],
  indices: [ 'u', 's', 't', 'm', 'p' ],
});

let chars = require('./config2.json')
chars.forEach((char: Char, i: number) => char.id = i)

lokiUserTable.insert(chars);
insertMany(blinkUserTable, chars);

export const bench = new Bench()
  .add("lokijs", () => {
    lokiUserTable.find({ m: { $in: ["m", "t"] } });
  })
  .add("blinkdb", async () => {
    await many(blinkUserTable, { where: {
      m: { in: ["m", "t"] },
    }});
  });
and there lokijs gets over twice the performance of blinkdb:
┌─────────┬───────────┬────────────────────┬────────────────────┬───────────┬─────────┐
│ (index) │ name │ ops/sec │ Average Time (ns) │ Margin │ Samples │
├─────────┼───────────┼────────────────────┼────────────────────┼───────────┼─────────┤
│ 0 │ 'lokijs' │ 4495.033687905337 │ 222467.74316523422 │ '±4.68%' │ 2248 │
│ 1 │ 'blinkdb' │ 1990.6891247726794 │ 502338.6060413586 │ '±12.03%' │ 996 │
└─────────┴───────────┴────────────────────┴────────────────────┴───────────┴─────────┘
with front-page.ts I get the expected
┌─────────┬───────────┬────────────────────┬────────────────────┬───────────┬─────────┐
│ (index) │ name │ ops/sec │ Average Time (ns) │ Margin │ Samples │
├─────────┼───────────┼────────────────────┼────────────────────┼───────────┼─────────┤
│ 0 │ 'map' │ 237.99901467945804 │ 4201698.067308473 │ '±1.89%' │ 119 │
│ 1 │ 'scan' │ 179.23980238634888 │ 5579117.956426408 │ '±3.32%' │ 90 │
│ 2 │ 'blinkdb' │ 86.9096681550792 │ 11506199.727004224 │ '±15.91%' │ 44 │
│ 3 │ 'lokijs' │ 47.70440453076051 │ 20962424.95502035 │ '±12.22%' │ 24 │
└─────────┴───────────┴────────────────────┴────────────────────┴───────────┴─────────┘
with blinkdb getting almost twice the performance of lokijs.
I'm looking to switch to blinkdb for its typescript support, but when I benchmark indexed search, blinkdb comes out substantially slower than lokijs:
(where mock.json has this content)
returns
and even when I turn off indexes in loki, it still comes out ahead:
I haven't tested more complex searches but search by single indexed field is something I'd be doing a fair bit of. Is this just not a good use-case for blinkdb?