ccorcos / tuple-database


Some questions about tuple-database implementation and design #13

Closed quolpr closed 1 year ago

quolpr commented 2 years ago
  1. For each field that I need to query, I must create an index, right?
  2. Also, indexes become expensive in terms of disk usage at some point. How are you going to deal with that? Or is it by design?
  3. How can I store values that are not meant to be part of an index? To keep the index size down, for example.

Also, I see that all examples are done in a synchronous way, so:

  1. Do you load the full db into memory at first startup? If so, aren't you afraid that the first startup time will keep growing as the dataset grows?
  2. How can I be sure that a transaction is actually persisted to disk (or IDB)?

Also, is it by design to look like a low-level db? If so, do you plan to make more abstractions on top of it?

Btw, at first glance it was hard for me to understand what tuple-database does πŸ˜… The README is a little hard to understand; it's rather engineer-specific and low-level for me. It's not a critique, just feedback. Maybe it will be helpful for you πŸ™‚

quolpr commented 2 years ago

Also, have you thought about writing the core in Rust? I think for such a low-level thing you would get a big performance gain (at the very least, no GC spikes). I was writing bindings to the WASM version of SQLite, and it was pretty easy to do.

You mentioned only the storage engine, but not the db itself πŸ™‚

Also, your db could then be used beyond the JS community.

quolpr commented 2 years ago

Also, regarding the storage engine: if you want, I can give you some advice on how it could be done with WASM. I am writing bindings to an Asyncify WASM build of SQLite with async IDB backend support, so I have some experience πŸ™‚

ccorcos commented 2 years ago

For each field that I need to query, I must create an index, right?

Sort of. I would say you want to have an index for every query. Translating from a SQL perspective, it's more like "an index is a query". SQL has a query planner which will intelligently figure out which indexes to use (and as we discussed in a previous thread, it's not entirely clear to people what it's doing ;). Tuple Database doesn't have a query planner -- you, the developer, are the query planner. So you construct indexes and choose how to fetch whatever information you want...
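
To make that concrete, here's a minimal sketch of what being your own query planner looks like (hypothetical "people" data, in-memory storage, index names made up):

import {
  InMemoryTupleStorage,
  TupleDatabase,
  TupleDatabaseClient,
} from "tuple-database"

const db = new TupleDatabaseClient(new TupleDatabase(new InMemoryTupleStorage()))

// One write, two "indexes": each index exists to serve a specific query.
const tx = db.transact()
tx.set(["personById", "p1"], { id: "p1", name: "Ada", age: 36 })
tx.set(["personByAge", 36, "p1"], null) // key-only entry for age queries
tx.commit()

// "Query planning" is just choosing which index prefix to scan.
const byAge = db.scan({ prefix: ["personByAge"] })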

Also, indexes become expensive in terms of disk usage at some point. How are you going to deal with that? Or is it by design?

That's the case with SQL as well, though. When I was at Notion, we had maybe 30 tables and 60-90 indexes. So in addition to primary keys, you're looking at about 2-3x the amount of data storage. Granted, these aren't all full covering indexes... But my point is that underneath SQL is a system very similar to tuple-database. SQL isn't magic -- it's doing all this same stuff, writing data multiple times in different places, etc. So the disk usage problem is the same as for any other database. The only difference is that I haven't implemented any compression or anything like that yet... But storage is cheap - I'm not worried about it yet.

How can I store values that are not meant to be part of an index? To keep the index size down, for example.

Set it as a value: tx.set(key, value)
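
For example (a sketch with made-up names): keep the big payload in the value of one primary record, and make every index entry key-only:

import {
  InMemoryTupleStorage,
  TupleDatabase,
  TupleDatabaseClient,
} from "tuple-database"

const db = new TupleDatabaseClient(new TupleDatabase(new InMemoryTupleStorage()))

const tx = db.transact()
// The large body lives only in the value of the primary record...
tx.set(["noteById", "n1"], { title: "Hello", body: "a very long note body..." })
// ...while the index entry stays small: key only, value null.
tx.set(["noteByTitle", "Hello", "n1"], null)
tx.commit()

// Find the id through the small index, then fetch the full record once.
const [hit] = db.scan({ prefix: ["noteByTitle", "Hello"] })
const note = db.get(["noteById", hit.key[2] as string])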

I see that all examples are done in a synchronous way

That's just for convenience. You can do async stuff too.

// Imports are a sketch; storage backends may live under separate
// import paths so optional peer deps stay optional.
import { AsyncTupleDatabase, AsyncTupleDatabaseClient } from "tuple-database"
import { LevelDbTupleStorage } from "tuple-database/storage/LevelDbTupleStorage"

const db = new AsyncTupleDatabaseClient(new AsyncTupleDatabase(new LevelDbTupleStorage("test.db")))
await db.scan(...)
const tx = db.transact()
tx.set(key, value)
await tx.commit() // resolves once the write reaches the storage layer

is it by design to look like a low-level db?

Yes, it is intended to be a database primitive. Similar to FoundationDb (or DynamoDb, but mostly inspired by FoundationDb).

do you plan to make more abstractions on top of it?

Sort of. The beauty of this database is you can easily build your own abstractions that are thin wrappers around the lowest level. The FoundationDb tutorial on class scheduling is a fairly simple example that walks you through how transactions work, and how to compose logic: https://apple.github.io/foundationdb/class-scheduling.html

And I've created a test that recreates this whole example in tuple-database: https://github.com/ccorcos/tuple-database/blob/master/src/examples/classScheduling.test.ts

For example, the ability to transactionally "switch classes" while enforcing constraints: the max number of students in a class, the max number of classes a student can take, that a student cannot sign up for the same class twice, etc... All of that logic can be composed into a single method that allows a student to transactionally switch classes. Implementing this kind of logic in SQL is a pain. Here, all I have to do is compose two functions that I've already written:

// Compose two transactional functions that already exist in the
// class-scheduling example; both run against the same transaction `tr`,
// so the drop and the signup commit (or fail) together.
const switchClasses = transactionalReadWrite<SchoolSchema>()(
    (tr, student: string, classes: { old: string; new: string }) => {
        drop(tr, student, classes.old)
        signup(tr, student, classes.new)
    }
)
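
And calling it is a single transactional operation (hypothetical student and class ids):

switchClasses(db, "alice", { old: "algebra", new: "music" })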

So my point there is that I can build a SQL layer on top of tuple-database, but why would I do that? I could make my own query language DSL, but why do that either? I find it much more powerful and flexible to use the underlying abstractions directly...

That said, I am keen on using this database to model triplestores. Similar to RDF / Datomic / Datalog, triplestores are a powerful abstraction for modeling very flexible data. And so I have built up some abstractions on top of this model. But notice those abstractions are all just functions for reading and writing...

https://github.com/ccorcos/tuple-database/blob/master/src/examples/triplestore.ts
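
The gist of it, as a sketch (orderings and names simplified from that file): store each triple under several key orderings, so any position can be the entry point for a query.

import {
  InMemoryTupleStorage,
  TupleDatabase,
  TupleDatabaseClient,
} from "tuple-database"

const db = new TupleDatabaseClient(new TupleDatabase(new InMemoryTupleStorage()))

// Write one fact under three orderings: entity-attribute-value and friends.
function writeTriple(e: string, a: string, v: string) {
  const tx = db.transact()
  tx.set(["eav", e, a, v], null)
  tx.set(["ave", a, v, e], null)
  tx.set(["vea", v, e, a], null)
  tx.commit()
}

writeTriple("chet", "color", "blue")

// "Which entities have color = blue?" -> scan the attribute-value ordering.
const blueThings = db.scan({ prefix: ["ave", "color", "blue"] })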

have you thought about writing the core in Rust? I think for such a low-level thing you would get a big performance gain (at the very least, no GC spikes).

Totally. I'd love to hire someone to do that. But it's not my main skillset, and at the end of the day, this is premature optimization... The thing is "pretty fast", or rather it's fast enough... There are also all kinds of weird issues, like "can I store a function callback as a value?" Right now you can, because it's all JavaScript. There's also the interop overhead -- right now all numbers are float64, which is pretty simple and convenient. In SQLite, all numbers come out as strings!

regarding the storage engine: if you want, I can give you some advice on how it could be done with WASM.

I'd love some help! In particular, a flat-file (ideally plaintext) storage engine would be great. Since all of the transactional guarantees live outside of the storage engine, it can be implemented fairly simply; see the sketch below. I also really want to implement a Generalized Search Tree (GiST) so that we can support spatial queries. In particular, segment trees would be really useful for reactivity performance, and interval trees could be used for date-range queries. http://gist.cs.berkeley.edu/gist1.html
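
To sketch what "fairly simply" means (simplified types and a naive key order for illustration, not the library's actual storage interface):

import * as fs from "fs"

// Simplified stand-ins; the real storage interface in tuple-database is
// richer (typed tuple ordering, scan bounds, removes).
type Tuple = (string | number)[]
type Pair = { key: Tuple; value: unknown }

// Naive order: JSON-encode and compare as strings. Real tuple ordering
// must be type-aware (e.g. numbers compared numerically).
const encode = (t: Tuple) => JSON.stringify(t)

class FlatFileStorage {
  private pairs: Pair[] = []

  constructor(private path: string) {
    // One JSON pair per line, kept sorted on disk.
    if (fs.existsSync(path)) {
      this.pairs = fs
        .readFileSync(path, "utf8")
        .split("\n")
        .filter(Boolean)
        .map((line) => JSON.parse(line) as Pair)
    }
  }

  scan(prefix: Tuple): Pair[] {
    return this.pairs.filter((p) => prefix.every((x, i) => p.key[i] === x))
  }

  commit(sets: Pair[]): void {
    for (const pair of sets) {
      this.pairs = this.pairs.filter((p) => encode(p.key) !== encode(pair.key))
      this.pairs.push(pair)
    }
    this.pairs.sort((a, b) => (encode(a.key) < encode(b.key) ? -1 : 1))
    // Durability: rewrite the whole file. Fine for a plaintext prototype.
    fs.writeFileSync(this.path, this.pairs.map((p) => JSON.stringify(p)).join("\n"))
  }
}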

I am writing bindings to an Asyncify WASM build of SQLite with async IDB backend support, so I have some experience πŸ™‚

There are reasons why async SQLite is not a great idea. SQLite is serialized and synchronous, so an async API doesn't gain you much. https://github.com/WiseLibs/better-sqlite3#why-should-i-use-this-instead-of-node-sqlite3

That said, if you're building your own thing on top of a SQL dialect, you can have different kinds of concurrency. I'm reminded of this interview about the engineering process at SpaceX: https://twitter.com/ccorcos/status/1423789077837479942

Optimize last! That's the biggest lesson... Why do you need async SQLite with an IDB backend??

quolpr commented 2 years ago

@ccorcos thanks for the answers! I like your ideas; they all make sense. Especially not needing SQL, but instead giving users the ability to build their own optimized (since you know what should be optimized, and indexes can be more flexible) and typesafe (that's very cool!) queries.

Optimize last! That's the biggest lesson... Why do you need async SQLite with an IDB backend??

That's a good question πŸ˜„ I needed a DB that would be fast and not in-memory. I used Dexie before (a wrapper around IDB), but its performance was bottlenecked by IDB itself, which is slow. And it crashes / has unexpected behavior very often.

After this post https://jlongster.com/future-sql-web I managed to use absurd-sql, and I got a 3-5x performance gain for the same queries! Reading blocks (ArrayBuffers) from IDB and letting SQLite consume them performs much better than storing business data in IDB itself. And since reading/writing an ArrayBuffer is a pretty simple operation, unexpected behavior rarely happens.

But absurd-sql has one downside: you need to serve the COOP/COEP headers ( https://web.dev/i18n/en/coop-coep/ ), which restrict embedding iframes from other origins (like YouTube videos). But I need the ability to embed iframes in the app. That's why I'm building an Asyncify version and trying to achieve the same performance that absurd-sql has.

Optimize last!

I agree πŸ™‚ But I'm working on a product that already has some performance problems, and IDB is the bottleneck.

quolpr commented 2 years ago

And, actually, if you need good performance on the web with persistence, don't rely on IDB's abstractions. It's better to build your own binary format that stores data in blocks (like SQLite does).

But only if you actually need it πŸ™‚

quolpr commented 2 years ago

Oh, and we expect our DB to grow fast (we are building a note-taking app), and for a note-taking app a fast startup time is critical. That's why an in-memory db is not an option.

ccorcos commented 2 years ago

And, actually, if you need good performance on the web with persistence, don't rely on IDB's abstractions. It's better to build your own binary format that stores data in blocks (like SQLite does).

What makes you think adding additional layers of abstraction will be more performant than using raw IDB abstractions?

quolpr commented 2 years ago

What makes you think adding additional layers of abstraction will be more performant than using raw IDB abstractions?

From the article https://jlongster.com/future-sql-web :

[benchmark charts from the article]

That performance is achieved by reading/writing raw blocks plus an abstraction over them (the abstraction being how SQLite, for example, lays data out in those blocks), which is faster than getting/putting rows in IDB itself (which is absurd!).

And the good thing is: once you have this low-level block abstraction, you can also easily port it to Node.js and write to an actual file block by block, instead of to IndexedDB.
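
The whole block layer is conceptually tiny, something like this (a sketch; the store name and block size are made up):

const BLOCK_SIZE = 4096 // fixed-size pages, like SQLite's

function openBlockStore(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open("block-store", 1)
    req.onupgradeneeded = () => req.result.createObjectStore("blocks")
    req.onsuccess = () => resolve(req.result)
    req.onerror = () => reject(req.error)
  })
}

// Key = block number, value = raw bytes. The engine above this layer
// (e.g. SQLite's pager) decides what the bytes mean.
function writeBlock(db: IDBDatabase, n: number, data: ArrayBuffer): Promise<void> {
  if (data.byteLength !== BLOCK_SIZE) throw new Error("blocks are fixed-size")
  return new Promise((resolve, reject) => {
    const tx = db.transaction("blocks", "readwrite")
    tx.objectStore("blocks").put(data, n)
    tx.oncomplete = () => resolve()
    tx.onerror = () => reject(tx.error)
  })
}

function readBlock(db: IDBDatabase, n: number): Promise<ArrayBuffer | undefined> {
  return new Promise((resolve, reject) => {
    const req = db.transaction("blocks").objectStore("blocks").get(n)
    req.onsuccess = () => resolve(req.result)
    req.onerror = () => reject(req.error)
  })
}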

But, of course, it's your decision how to do it. I just want to give you some hints πŸ™‚ And maybe at the current stage you don't need this, and it would be premature optimization. It just depends on what goals you want to achieve.

Or maybe I didn't get the question πŸ™‚

ccorcos commented 2 years ago

Interesting. So it all comes down to avoiding IDB's transactional read/write logic...