Async data sources - Githubissues

Gozala commented 5 months ago

Our current query interface is synchronous, which was an unfortunate oversight on my end, since clearly we can't block while fetching data over the wire.

@relves proposed changes in #22 that fixed it, but I was naive in thinking that the query planner could prefetch all the records ahead of time and then just run the query over them. I am realizing now that in practice it would issue a set of concurrent requests, run queries over the results to narrow down the possibilities, then issue another set of requests, and repeat.
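To make the request/narrow/repeat loop concrete, here is a minimal sketch. It is not datalogia's planner: `AsyncStore`, `scan`, `follow` and the in-memory `memoryStore` are hypothetical names invented for illustration. Each round issues all requests concurrently, uses the results to compute the next set of entities to ask about, and stops when the path is exhausted.

```typescript
type Fact = [string, string, string] // [entity, attribute, value]

// Hypothetical store interface: every lookup goes over the wire.
interface AsyncStore {
  scan(pattern: { entity?: string; attribute?: string }): Promise<Fact[]>
}

// Follows a chain of attributes starting from a set of entities.
async function follow(
  store: AsyncStore,
  start: string[],
  path: string[]
): Promise<string[]> {
  let frontier = start
  for (const attribute of path) {
    // One round: all requests for the current frontier run concurrently.
    const batches = await Promise.all(
      frontier.map((entity) => store.scan({ entity, attribute }))
    )
    // Narrow down: only the values found feed the next round of requests.
    const values = new Set<string>()
    for (const batch of batches) {
      for (const fact of batch) values.add(fact[2])
    }
    frontier = Array.from(values)
    if (frontier.length === 0) break
  }
  return frontier
}

// In-memory stand-in for a remote store (an assumption for the example).
const facts: Fact[] = [
  ["alice", "friend", "bob"],
  ["alice", "friend", "carol"],
  ["bob", "city", "Oslo"],
  ["carol", "city", "Oslo"],
]
const memoryStore: AsyncStore = {
  async scan({ entity, attribute }) {
    return facts.filter(
      ([e, a]) =>
        (entity === undefined || e === entity) &&
        (attribute === undefined || a === attribute)
    )
  },
}
```

Calling `follow(memoryStore, ["alice"], ["friend", "city"])` takes two rounds: one request for alice's friends, then one concurrent request per friend for their city.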

We could probably learn from datomic, datascript and datahike here, but we do need to get this rolling soon.

Gozala commented 5 months ago

@relves would you be interested in having a call to discuss how we could collaborate on this? I feel really bad about not being very helpful. I think it would be a good idea to converge on a list of constraints that stores need to work with, and from there decide how much the query planner could do without becoming too complicated.

relves commented 5 months ago

Hi @Gozala, all good, this is a learning/curiosity project not critical path. :) When you have the time I'd be interested in discussing over a virtual coffee. Connecting with you on discord.

Gozala commented 1 month ago

@relves here is the version I ended up implementing for now: https://github.com/Gozala/datalogia/pull/32

I think it offers a reasonable compromise in terms of:

Supporting async stores without figuring out a sophisticated query planner.
Supporting sync stores without forcing users onto an async interface.
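One way such a compromise can be achieved is with generator-based tasks, sketched below. This is an illustration only and not the Task implementation from the linked PR: `run`, `resume`, `Store` and `sum` are hypothetical names, and error propagation into the generator is omitted. The runner stays synchronous until a store actually yields a promise, and only then switches to returning a promise.

```typescript
type MaybePromise<T> = T | Promise<T>

function isPromise(value: unknown): value is Promise<unknown> {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as any).then === "function"
  )
}

// Drives a generator, staying synchronous until a promise is yielded.
function run<T>(task: Generator<unknown, T, unknown>): MaybePromise<T> {
  let step = task.next()
  while (!step.done) {
    const yielded = step.value
    if (isPromise(yielded)) return resume(task, yielded)
    step = task.next(yielded)
  }
  return step.value
}

// Finishes the task asynchronously once the first promise has been seen.
async function resume<T>(
  task: Generator<unknown, T, unknown>,
  pending: Promise<unknown>
): Promise<T> {
  let step = task.next(await pending)
  while (!step.done) {
    step = task.next(await step.value)
  }
  return step.value
}

// Hypothetical store: lookups may be sync or async.
interface Store {
  get(key: string): MaybePromise<number | undefined>
}

// The query logic is written once, regardless of the store flavour.
function* sum(
  store: Store,
  keys: string[]
): Generator<unknown, number, unknown> {
  let total = 0
  for (const key of keys) {
    const value = (yield store.get(key)) as number | undefined
    total += value ?? 0
  }
  return total
}

const table: Record<string, number> = { a: 1, b: 2 }
const syncStore: Store = { get: (key) => table[key] }
const asyncStore: Store = { get: async (key) => table[key] }
```

With these definitions, `run(sum(syncStore, ["a", "b"]))` returns a plain number, while `run(sum(asyncStore, ["a", "b"]))` returns a promise for the same number.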

I also recall you were interested in making an actual persistent store. I have implemented this proof of concept that I think could be generalized to most KV stores that support range scans.
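As a rough illustration of why range scans are enough, the sketch below encodes each fact under two ordered keys so that a query pattern becomes a prefix scan. It is not the linked proof of concept: `MemoryKV`, `addFact`, `select` and the key layout are assumptions made for the example, and components are assumed not to contain the separator character.

```typescript
const SEP = "\u0000"

type Fact = { entity: string; attribute: string; value: string }

// Stand-in for an ordered KV store. A real store would seek to `lower`
// instead of filtering every key.
class MemoryKV {
  private keys: string[] = []
  put(key: string): void {
    if (this.keys.indexOf(key) === -1) {
      this.keys.push(key)
      this.keys.sort()
    }
  }
  scan(lower: string, upper: string): string[] {
    return this.keys.filter((key) => key >= lower && key < upper)
  }
}

// Each fact is written twice: entity-first and attribute-first.
function addFact(store: MemoryKV, { entity, attribute, value }: Fact): void {
  store.put(["eav", entity, attribute, value].join(SEP))
  store.put(["aev", attribute, entity, value].join(SEP))
}

// Picks the index whose key order matches the known parts of the pattern,
// then turns those parts into a prefix range.
function select(
  store: MemoryKV,
  pattern: { entity?: string; attribute?: string }
): Fact[] {
  const parts: string[] = []
  if (pattern.entity !== undefined) {
    parts.push("eav", pattern.entity)
    if (pattern.attribute !== undefined) parts.push(pattern.attribute)
  } else {
    parts.push("aev")
    if (pattern.attribute !== undefined) parts.push(pattern.attribute)
  }
  const lower = parts.join(SEP) + SEP
  const upper = lower + "\uffff"
  return store.scan(lower, upper).map((key) => {
    const [index, first, second, value] = key.split(SEP)
    return index === "eav"
      ? { entity: first, attribute: second, value }
      : { entity: second, attribute: first, value }
  })
}
```

The same idea carries over to any store that exposes ordered iteration between two keys, which is the only capability `select` relies on.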

relves commented 1 month ago

Oh excellent! I'll take a deeper look in the morning but looks like we are thinking along the same lines as my Database poc also used okra. Very cool. Can't wait to take a look at how you achieved the async wrapper and your Synopsys implementation. Cheers

relves commented 1 month ago

Your Task implementation with generators is next level @Gozala. Brain is hurting. 🤯 Also had a look at how you wrapped okra in Synopsys. Still looking at 'cause' and its intended use, and will be curious to understand how query subscriptions work.

My guess is you were thinking the same, that stacking datalog query support over a prolly-tree backed db that can be efficiently synced across peers could be useful?? Anyway, when I get more time I want to pull in Synopsys and see what's involved in p2p sync.

Gozala commented 1 month ago

> Also had a look at how you wrapped okra in Synopsys. Still looking at 'cause' and its intended use

At the moment it simply points to the previous head of the db, somewhat like what datomic does with the transaction identifier, except that it uses merkle references instead of a vector clock to establish order.
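A minimal sketch of the idea, assuming a much simpler shape than Synopsys actually uses: each commit records a `cause` that is the content hash of the previous head, so order can be recovered by walking the chain. `Commit`, `Log` and `fnv1a` are hypothetical names, and the toy FNV-1a hash stands in for the cryptographic hash a real merkle reference would need.

```typescript
// Toy non-cryptographic hash, used only to keep the example dependency-free.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  return (hash >>> 0).toString(16)
}

type Commit = { cause: string | null; changes: string[] }

class Log {
  private commits = new Map<string, Commit>()
  head: string | null = null

  // The new commit points at the previous head through `cause`, and its own
  // reference is derived from its content (including that cause).
  transact(changes: string[]): string {
    const commit: Commit = { cause: this.head, changes }
    const reference = fnv1a(JSON.stringify(commit))
    this.commits.set(reference, commit)
    this.head = reference
    return reference
  }

  get(reference: string): Commit | undefined {
    return this.commits.get(reference)
  }

  // Walks the cause chain from the head, newest commit first.
  history(): string[][] {
    const result: string[][] = []
    let cursor = this.head
    while (cursor !== null) {
      const commit = this.commits.get(cursor)
      if (commit === undefined) break
      result.push(commit.changes)
      cursor = commit.cause
    }
    return result
  }
}
```

Because a reference covers the cause it contains, a head reference identifies the whole history behind it, not just the latest commit.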

> and will be curious to understand how query subscriptions work.

At the moment it's pretty basic: it just maintains a list of queries and, on each transaction, reruns every one of them, notifying subscribers if the results are different. I hope to implement something like DBSP in the future.
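The rerun-and-compare approach can be sketched as follows. This is an illustration under assumptions, not the Synopsys code: `Database`, `subscribe` and `transact` are hypothetical names, queries are plain functions over the fact list, and results are compared by their JSON serialization.

```typescript
type Fact = [string, string, string] // [entity, attribute, value]
type Query = (facts: readonly Fact[]) => string[]
type Subscription = {
  query: Query
  notify: (result: string[]) => void
  last: string // serialized result from the previous run
}

class Database {
  private facts: Fact[] = []
  private subscriptions: Subscription[] = []

  // Registers the query and remembers its current result for comparison.
  subscribe(query: Query, notify: (result: string[]) => void): void {
    const last = JSON.stringify(query(this.facts))
    this.subscriptions.push({ query, notify, last })
  }

  // Applies the transaction, then reruns every query and notifies only the
  // subscribers whose result changed.
  transact(added: Fact[]): void {
    this.facts.push(...added)
    for (const subscription of this.subscriptions) {
      const result = subscription.query(this.facts)
      const snapshot = JSON.stringify(result)
      if (snapshot !== subscription.last) {
        subscription.last = snapshot
        subscription.notify(result)
      }
    }
  }
}
```

The cost is one full rerun per subscribed query per transaction, which is the work an incremental approach would aim to avoid.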

Gozala commented 1 month ago

> My guess is you were thinking the same, that stacking datalog query support over a prolly-tree backed db that can be efficiently synced across peers could be useful??

I would really like to implement a datomic-inspired architecture, except using prolly trees instead of B-trees to be agnostic of insertion order. I also hope that datum segments could be efficiently encoded with fressian and represent branches of the prolly tree.

My hypothesis is that it could enable partial on-demand sync, but at this point it is a stack of hypotheses that I hope will work out.
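The insertion-order property comes from choosing node boundaries by key content rather than by when a node happened to fill up. The sketch below shows only that one idea for a single leaf level; it is not okra's algorithm. `chunk`, `fnv1a` and the boundary rule (`hash % 4 === 0`) are assumptions picked for the example.

```typescript
// Toy non-cryptographic hash, used only to pick boundaries deterministically.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  return hash >>> 0
}

// Splits a set of keys into leaf chunks. A chunk ends after any key whose
// hash meets the boundary rule, so the layout depends only on which keys
// are present, not on the order they were inserted in.
function chunk(keys: string[]): string[][] {
  const sorted = Array.from(new Set(keys)).sort()
  const chunks: string[][] = []
  let current: string[] = []
  for (const key of sorted) {
    current.push(key)
    if (fnv1a(key) % 4 === 0) {
      chunks.push(current)
      current = []
    }
  }
  if (current.length > 0) chunks.push(current)
  return chunks
}
```

Two peers holding the same keys therefore end up with identical chunks, and peers that differ by a few keys share most chunks, which is what makes comparing and syncing by chunk reference plausible.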

relves commented 1 month ago

I like the ideas, and I understood up until how fressian encoded datums for branches plays into this. I'll have to do my homework.

Gozala / datalogia

Async data sources #26