Automatic horizontal scaling

gedw99 commented 2 months ago

I just read in the repo that data changes are not spread out across nodes.

NATS JetStream would do this for you in a fault-tolerant way.

https://github.com/maxpert/marmot/ does this for SQLite using NATS JetStream.

I don't have time to work on this right now, but I hope it helps.

nuric commented 1 month ago

Hello, that's correct. Automatic horizontal scaling with replication is on the roadmap and needs to be designed with the vector search load in mind. If the shards are distributed, SemaDB can keep serving search requests even when a node is down, but writes to a shard are sequential on the node where that shard's data lives. Currently, keeping a production system healthy is left to an external orchestrator such as Kubernetes, as mentioned in the README. This simplifies the distributed system design so it can focus on the vector search aspect for now.

We are looking for usage data and performance insights before adding a large overhead for replication that may reduce performance and complicate the codebase. We did look at several replication schemes, from using Raft directly, to intermediate message brokers, to a service like the one you suggest. Systems such as marmot or rqlite don't balance many databases effectively. Ideally, we want different replication levels per shard, as opposed to replicating a single database across many nodes.
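To make "different replication levels for shards" concrete, here is a minimal sketch of one way per-shard placement could work. It is not SemaDB's design: the use of rendezvous hashing, the function names and the node and shard names are all assumptions made for illustration.

```python
import hashlib


def _score(shard_id: str, node: str) -> int:
    """Deterministic pseudo-random weight for a (shard, node) pair."""
    digest = hashlib.sha256(f"{shard_id}:{node}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place_shard(shard_id: str, nodes: list[str], replicas: int) -> list[str]:
    """Pick `replicas` nodes for a shard using rendezvous hashing (assumed scheme)."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    ranked = sorted(nodes, key=lambda n: _score(shard_id, n), reverse=True)
    # Never ask for more copies than there are nodes.
    return ranked[: min(replicas, len(nodes))]


# Hypothetical cluster: each shard carries its own replication level.
nodes = ["node-a", "node-b", "node-c", "node-d"]
replication = {"hot-shard": 3, "cold-shard": 1}
placement = {shard: place_shard(shard, nodes, r) for shard, r in replication.items()}
```

With this scheme a busy shard can live on three nodes while a rarely queried one lives on a single node, and removing a node that does not hold a shard leaves that shard's placement unchanged.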

SQLite is great but not needed for the key-value pairs SemaDB stores. You can read more about the datastore implementation in the corresponding package.

Finally, we want SemaDB to be complete without having to run external control services, and to keep a simpler internal design. You can read about the design choices in the Contributing guidelines. There is a lot of machinery one can add, especially in the distributed realm, before it becomes unmaintainable.

I'll keep the issue open for a while in case others want to comment.

gedw99 commented 1 month ago

Oh, I think I might have miscommunicated my intent.

I am not suggesting you use SQLite with Marmot. Marmot uses NATS to send the WAL data from SQLite to the other NATS servers, which then apply that data to their own SQLite databases. That's a very loose description, but it's just to illustrate.

I also know you can't use SQLite: you need vector storage and breve as your final storage place.

I am suggesting that you can use NATS as the facade in front of your storage layer API. NATS would then essentially repeat each payload to the storage on all your other servers: a producer/consumer pattern with many consumers. Of course, conflicting writes would need some consideration, for example last-write-wins (LWW).
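As a rough illustration of that fan-out idea with last-write-wins, here is a small in-memory sketch. It does not use the NATS client API at all: `Broker`, `Replica` and `Write` are hypothetical names, and a real JetStream setup would replace `Broker` with a stream and one consumer per storage node.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Write:
    key: str
    value: bytes
    timestamp: int  # logical or wall-clock time assigned by the producer
    origin: str     # producer id, used only to break timestamp ties


@dataclass
class Replica:
    """One storage node; keeps the winning write per key."""
    name: str
    store: dict = field(default_factory=dict)

    def apply(self, w: Write) -> bool:
        current = self.store.get(w.key)
        # Last-write-wins: keep the write with the larger (timestamp, origin).
        if current is not None and (current.timestamp, current.origin) >= (w.timestamp, w.origin):
            return False
        self.store[w.key] = w
        return True


class Broker:
    """Stand-in for the message layer: repeats every payload to all consumers."""

    def __init__(self) -> None:
        self.consumers: list[Replica] = []

    def subscribe(self, replica: Replica) -> None:
        self.consumers.append(replica)

    def publish(self, w: Write) -> None:
        for replica in self.consumers:
            replica.apply(w)


broker = Broker()
a, b = Replica("a"), Replica("b")
broker.subscribe(a)
broker.subscribe(b)

# The stale write arrives after the newer one and is ignored by every replica.
broker.publish(Write("vec-1", b"new", timestamp=2, origin="node-a"))
broker.publish(Write("vec-1", b"old", timestamp=1, origin="node-b"))
```

Because each replica compares the same `(timestamp, origin)` pair, replicas end up with the same value regardless of the order in which the writes are delivered.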

I saw you don't want to use Raft, and NATS does use Raft under the hood to get the job done.

Semafind / semadb

Automatic horizontal scaling #10