ccorcos / tuple-database


Couple of questions #15

Closed JakeCoxon closed 1 year ago

JakeCoxon commented 2 years ago

Firstly, this thing is great. This weekend I was able to make an offline-first Notion clone with reordering blocks etc., storing everything in memory. Persisting to disk/server was then super easy by just serialising WriteOps. Just a little side project.

One thing to note though is that useTupleDatabase doesn't work in React.StrictMode, because side effects aren't allowed in useMemo. There are some workarounds, but I think you either get extra re-renders or you make extra queries. Here's a codesandbox with it failing.

A question: if I were to query something that returned a bunch of userIds, what's the best way to query users from those userIds? I could scan 1 by 1, but I feel like there would be a better way. My thinking is that there could be a scan function that takes a list of keys that are already sorted and does a kind of grouped binary search with them. This should be faster than binary searching each one individually.

I was also wondering what kind of state this library is in. From my short testing things seem to work fine, but do you know of any major issues? Is it usable in production as it is? I'm having fun using it

ccorcos commented 2 years ago

Firstly, this thing is great. This weekend I was able to make an offline-first Notion clone with reordering blocks etc., storing everything in memory. Persisting to disk/server was then super easy by just serialising WriteOps. Just a little side project.

Awesome! That's what I was hoping people would be doing :)

One thing to note though is that useTupleDatabase doesn't work in React.StrictMode, because side effects aren't allowed in useMemo. There are some workarounds, but I think you either get extra re-renders or you make extra queries. Here's a codesandbox with it failing.

Hmm... This is weird. Where are you turning strict mode on there? Seems like this should work... It's adding it to the database and getAllUsers is getting called again and returning 3 elements... but the result of useTupleDatabase has the same value every time... Not quite sure where it's going wrong here: https://github.com/ccorcos/tuple-database/blob/79b6085911506dd95386cb7552e8993680cef2b3/src/useTupleDatabase.ts

I'm not calling setState within useMemo though... Basically, I want to be able to get an initial value synchronously while also subscribing. So that's what's going on there...

A question: if I were to query something that returned a bunch of userIds, what's the best way to query users from those userIds? I could scan 1 by 1, but I feel like there would be a better way. My thinking is that there could be a scan function that takes a list of keys that are already sorted and does a kind of grouped binary search with them. This should be faster than binary searching each one individually.

Hmm. That's interesting... You can basically order the keys and keep track of a min/max for the entire set when doing binary search... Here's a basic implementation -- I think there's a more efficient way of doing this by getting the first element, and then the last element, and then the items in between... Then you don't need to check and update the batchMax a bunch of times like you do in this implementation...

// Minimal type definitions, assuming the shapes of tuple-database's
// binary search helpers.
type Compare<V> = (a: V, b: V) => number
type BinarySearchResult = { found: number } | { closest: number }

function batchedBinarySearch<V>(cmp: Compare<V>) {
    return function (list: V[], items: V[]): BinarySearchResult[] {
        // Sort the items so the batch bounds stay valid across searches.
        items = [...items].sort(cmp)

        let batchMin = 0
        let batchMax = list.length - 1

        const results: BinarySearchResult[] = []
        nextItem: for (const item of items) {
            // Re-use min/max for this batch.
            let min = batchMin
            let max = batchMax
            while (min <= max) {
                const k = (max + min) >> 1
                const dir = cmp(item, list[k])
                if (dir > 0) {
                    min = k + 1
                } else if (dir < 0) {
                    max = k - 1
                    // If even the largest item in the batch sorts before
                    // list[k], shrink the upper bound for the whole batch.
                    if (cmp(items[items.length - 1], list[k]) < 0) {
                        batchMax = max
                    }
                } else {
                    results.push({ found: k })
                    continue nextItem
                }
            }
            results.push({ closest: min })
        }

        return results
    }
}
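
For example, with a plain number comparator, a single call finds matches and insertion points in one pass:

const search = batchedBinarySearch<number>((a, b) => a - b)
search([1, 3, 5, 7, 9, 11], [3, 8, 11])
// => [{ found: 1 }, { closest: 4 }, { found: 5 }]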

It would take a little bit more legwork, but we'd basically be looking at a function like db.scanBatch, which is a more general case of db.scan...

It's definitely going to be faster... but maybe not big-O faster. If you have 1000 users and a million posts, then fetching 20 users using this batched method should be much faster.

log(1,001,000) * 20 ≈ 120 versus log(1,001,000) + log(1,000) * 20 ≈ 66 (base-10 logs), so about 2x faster in that case: the batched method pays for one search over the whole dataset, and the remaining searches only span the ~1,000 users.

I was also wondering what kind of state this library is in. From my short testing things seem to work fine, but do you know of any major issues? Is it usable in production as it is? I'm having fun using it

Well I don't think anyone is seriously using this library except me right now. I've built some fairly complicated stuff with it and I think I've ironed out pretty much all of the bugs. There's a pretty solid test suite I've built for this thing...

I'm still making small breaking changes to the API here and there, but nothing too crazy. For example the last change I made was transactionalQuery was renamed to transactionalReadWrite. That stuff takes 2 seconds to fix in your codebase though.

I think the only stuff in here that's "unstable" in some sense is that I'm working on higher levels of abstraction to deal with some fairly practical problems. For example:

  1. I'm pretty annoyed by the fact that I have to write essentially duplicate code for sync/async database logic. For example, this is what I'm working with right now:
export const deleteTriples = readWrite((tx, [e, a]: [Fact[0], Fact[1]]) => {
    tx.subspace(["eaov"])
        .scan({ prefix: [e, a] })
        .forEach(({ key }) => deleteFact(tx, key))
})
export const deleteTriplesAsync = readWriteAsync(
    async (tx, [e, a]: [Fact[0], Fact[1]]) => {
        const results = await tx.subspace(["eaov"]).scan({ prefix: [e, a] })
        await Promise.all(results.map(({ key }) => deleteFact(tx, key)))
    }
)

It's really annoying... so I've been messing around with this monadic query builder abstraction: https://github.com/ccorcos/tuple-database/blob/master/src/helpers/queryBuilder.test.ts

Which would basically reduce something like this:

export const deleteTriples = ([e, a]: [Fact[0], Fact[1]]) => 
    q.subspace(["eaov"])
        .scan({ prefix: [e, a] })
        .chainAll(({ key }) => q.deleteFact(key))

execute(db, deleteTriples([x, y]))

I'm curious what you think about this actually. It feels like a bit much to introduce monadic stuff like this. chainAll is also not exactly a FantasyLand thing either... I'm not a functional programming wizard so I worry that I'm just going to hack things up... maybe it should be `.map().reduce(q.concat)` or something? Either way, that's one of the things I've been thinking about.

  2. I've also been working on a caching layer, similar to Relay actually, for asynchronously fetching ranges from a db, then keeping those subscriptions warm in memory and tracking them so that you can fetch synchronously as well as optimistically update... This kind of thing is super useful if you want to have an <input> that reads and writes directly to an async backend database... Even though I'm building local-first software, the browser runtime still wants input state to update synchronously... But I think I have a good solution in my head and I'm 80% of the way there currently, though the code is in another project (happy to share if you're interested in helping out :).

This points to using something like query builders as well, because you might want to subscribe to a range but then derive some value from it, and the component shouldn't have to re-render unless that derived value changes...

Also something that's interesting is that you can potentially declare these queries statically and compose them together, similar to Relay fragments with GraphQL (I'm not a GraphQL expert btw), which would allow you to fetch everything you need upfront in a single async request... However, that's not super important to me with local-first software, so I think React Suspense might actually be a really good fit for a solution here... I've been avoiding Suspense for a while though -- I'm not sold on async UIs being a good thing...


Well that was maybe more of a response than you were asking for haha. In terms of "Is it usable in production as it is?", I would say kind-of yes... If you're a big established company, it might be a bit risky. But if you're building something from scratch, especially something kind of like Notion, then I would say that this is a great foundation to use.

If I were to build Notion all over again (I was the first employee), then I would definitely use FoundationDb, but that's just because Notion is a SaaS product. If you're building a local-first app, I would use this library -- it's basically an embedded version of FoundationDb with reactivity...

That said, if you build a successful product on top of this, then no doubt you're going to hit its limits and probably start contributing to it (which would be great!)... And if you try to build something like Notion with anything else, I think you'll find that this is actually a pretty great abstraction haha.

Anyways, I'm eager for people to use it on greenfield projects and happy to help dogfood it 😄

Cheers,

Chet

JakeCoxon commented 2 years ago

Thanks for the thorough response!

Hmm... This is weird. Where are you turning strict mode on there? Seems like this should work... It's adding it to the database and getAllUsers is getting called again and returning 3 elements... but the result of useTupleDatabase has the same value every time... Not quite sure where it's going wrong here: https://github.com/ccorcos/tuple-database/blob/79b6085911506dd95386cb7552e8993680cef2b3/src/useTupleDatabase.ts

I'm not calling setState within useMemo though... Basically, I want to be able to get an initial value synchronously while also subscribing. So that's what's going on there...

Yeah this took me way too long to figure out. Apparently codesandbox automatically wraps your app in React.StrictMode inside index.js, and this causes every render and useMemo etc. to be called TWICE, to simulate the fact that in the next version of React you can't be sure how often React will render without committing those renders.

subscribeQuery is effectively a side-effect function because it registers listeners that need to be removed later. Therefore it needs to be in useEffect, which is not ideal because it runs after the first render. Honestly they are making React harder to use with these changes.

Here's one solution, but it means the query is run twice: once for the first render and once to register callbacks.

import { useEffect, useRef, useState } from "react";
// NOTE: import paths are assumed here; subscribeQuery comes from
// tuple-database, shallowEqual from wherever your helper lives.
import { KeyValuePair, TupleDatabaseClientApi, subscribeQuery } from "tuple-database";
import { shallowEqual } from "./shallowEqual";

export function useTupleDatabase2<S extends KeyValuePair, T, A extends any[]>(
  db: TupleDatabaseClientApi<S>,
  fn: (db: TupleDatabaseClientApi<S>, ...arg: A) => T,
  args: A
) {
  // Side-effect-free initial read so the first render has data.
  const [result, setResult] = useState(() => fn(db, ...args));

  const resultRef = useRef<T>();
  resultRef.current = result;

  useEffect(() => {
    const doResult = (result: T) => {
      if (!shallowEqual(resultRef.current, result)) {
        setResult(result);
      }
    };
    // Run the query a second time here, purely to register the subscription.
    const { result, destroy } = subscribeQuery(db, (db) => fn(db, ...args), doResult);
    doResult(result);
    return destroy;
  }, [db, fn, ...args]);

  return result;
}

Here's a thought: what if subscribeQuery didn't immediately register the callbacks, but instead returned the results and a function that does the subscribe? Then subscribeQuery would be side-effect-free (and therefore could be called multiple times without being cleaned up), and the actual subscribe could happen in a useEffect. Not sure if it's possible to defer the subscriptions, and the API is a bit weird here.
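
Roughly the shape I'm imagining (subscribeQueryDeferred is made up, just to show the idea):

import { useEffect, useMemo, useState } from "react"
import { KeyValuePair, TupleDatabaseClientApi } from "tuple-database"

// Made-up API: the call itself is pure, computing the result and recording
// what to listen to; `subscribe` performs the side effect and returns the
// cleanup function.
declare function subscribeQueryDeferred<S extends KeyValuePair, T>(
  db: TupleDatabaseClientApi<S>,
  fn: (db: TupleDatabaseClientApi<S>) => T,
  callback: (result: T) => void
): { result: T; subscribe: () => () => void }

function useQuery<S extends KeyValuePair, T>(
  db: TupleDatabaseClientApi<S>,
  fn: (db: TupleDatabaseClientApi<S>) => T
) {
  const [result, setResult] = useState<T>()
  // Pure during render, so StrictMode can invoke it twice harmlessly.
  const query = useMemo(() => subscribeQueryDeferred(db, fn, setResult), [db, fn])
  // The actual side effect is deferred to the effect.
  useEffect(() => query.subscribe(), [query])
  // Sketch only: ignores the edge case where T itself can be undefined.
  return result ?? query.result
}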


My idea for the batch function is a bit different actually. Firstly it only works with keys and not ranges of keys because it's important that they don't overlap (I imagine you could generalise this to ranges of keys somehow)

[Image: diagram of the batched search over a sorted list A–P, showing the right half I–P discarded immediately and E–F discarded further in.]

Here's the pseudocode

def findBatch(list, keys):
    results = []

    def findBatchInner(list, keys):
        # Base case: every remaining element is requested.
        if list.length == keys.length:
            results.push(...list)
            return

        [leftList, rightList] = list.splitAt(floor(list.length / 2))

        # Partition the keys at the first key of the right half.
        pivot = findPivotUsingBinarySearch(keys, rightList[0].key)
        [leftKeys, rightKeys] = keys.splitAt(pivot)

        if leftKeys.length:
            findBatchInner(leftList, leftKeys)
        if rightKeys.length:
            findBatchInner(rightList, rightKeys)

    findBatchInner(list, keys)
    return results

def findPivotUsingBinarySearch(keys, comparison):
    # Binary search for the point where everything in
    # the right half is greater than or equal to comparison.

It's a binary search on the database array, except you do both halves, discarding any halves that don't contain keys we're interested in. We know this because the keys input is in the same order as the entire database. As you can see in the diagram, half the entire dataset (I–P) was discarded immediately, and further in, E–F was discarded. You do have to do a binary search of a subset of keys at every step, but this shrinks at each level, some halves of the tree are discarded, and generally keys is small.

I think the best case scenario is you ask for the keys A–H and it's just the cost of a single binary search in keys, and the worst case is you ask for every other key in the entire data set, which I can't figure out the complexity of. I think it's good though, because generally you're asking: give me all the users with these IDs. So it quickly discards most things that aren't users and narrows in on the small portion of the database that is users, and it only does this once.
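
To make that concrete, here's a rough TypeScript version of the pseudocode, assuming string keys and that every requested key is actually present in the list:

type Row<V> = { key: string; value: V }

// Batched lookup: recursively split the sorted list, partitioning the
// (sorted) keys at each split and discarding halves with no keys.
// Assumes every requested key exists in `list`.
function findBatch<V>(list: Row<V>[], keys: string[]): Row<V>[] {
    const results: Row<V>[] = []

    function findBatchInner(list: Row<V>[], keys: string[]) {
        if (keys.length === 0) return
        // Base case: every remaining element is requested.
        if (list.length === keys.length) {
            results.push(...list)
            return
        }
        const mid = Math.floor(list.length / 2)
        const leftList = list.slice(0, mid)
        const rightList = list.slice(mid)
        // Partition the keys at the first key of the right half.
        const pivot = findPivot(keys, rightList[0].key)
        findBatchInner(leftList, keys.slice(0, pivot))
        findBatchInner(rightList, keys.slice(pivot))
    }

    findBatchInner(list, keys)
    return results
}

// Binary search for the first index whose key is >= comparison.
function findPivot(keys: string[], comparison: string): number {
    let min = 0
    let max = keys.length
    while (min < max) {
        const mid = (min + max) >> 1
        if (keys[mid] < comparison) min = mid + 1
        else max = mid
    }
    return min
}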

Curious to know your thoughts on this. I'm not really a database guy so I don't know if this is correct. If there are no glaring holes I'd be up for doing some benchmark tests.


I've built some fairly complicated stuff with it and I think I've ironed out pretty much all of the bugs. There's a pretty solid test suite I've built for this thing...

That's great to hear, I think it has great potential so it's good that it's progressing. Small breaking changes don't matter to me, that's understandable.

Higher levels of abstraction are something I've thought about as well. I do like that it's low-level enough that you can build your own abstractions on top, but I'm not sure what they'd look like for me yet.

The monadic builder looks pretty neat. So chain is like Promise.then except it works whether you are async or not, and chainAll is like Promise.all etc.? I can see that working; it's a safe bet to stick to promise-like functions.

In my code I've only been using synchronous updates. I like that store get/set are sync, and I can just persist the entire set of changes to an API completely separately in some debounced function after a few seconds or so. So it only fetches from storage when the app first starts. It's really dumb and I guess it might only work for single-user apps; I haven't thought too much about storage yet and I have no idea how it would handle conflicts.
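
Roughly the shape of what I'm doing (saveToServer is a stand-in for my own API, and the subscribe arguments might differ from what the library actually wants):

import { debounce } from "lodash"
import { TupleDatabaseClientApi, WriteOps } from "tuple-database"

declare const db: TupleDatabaseClientApi
declare function saveToServer(batch: WriteOps[]): Promise<void>

// Queue every committed write and flush to the server a few seconds later.
const pending: WriteOps[] = []
const flush = debounce(() => {
  saveToServer(pending.splice(0, pending.length))
}, 3000)

// Subscribe to the whole database and record each committed WriteOps.
db.subscribe({ prefix: [] }, (writes) => {
  pending.push(writes)
  flush()
})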

Here's something I'm missing though: how can a transaction be asynchronous? If a concurrent transaction happens at the same time, is it queued or does it throw an error? Either way, if you're reading/writing to external storage somewhere, how do you know that storage hasn't changed during the transaction?

I've also been working on a caching layer, similar to Relay actually

Is this like what I mentioned with keeping everything synchronous? Except I treat the local memory as the source of truth. I feel like this is what you want for local-first software. You just update your own local memory as you wish, and try and figure out how to merge that with external data later. Would like to know more about this if it's solving something I haven't thought about yet.

If I were to build Notion all over again

I think I read that Notion uses Postgres currently, is that right? I read that Reddit only uses two tables, which kind of blew my mind. It's like using your db as a document store so clients can change more easily. Is Notion something similar? I feel like this is almost the same as how tuple-database works.

That said, if you build a successful product on top of this, then no doubt you're going to hit its limits and probably start contributing to it (which would be great!)... And if you try to build something like Notion with anything else, I think you'll find that this is actually a pretty great abstraction haha.

Great answer :D I'll be happy to help out once I get my head around it more. I'm still experimenting but I really like the look of it right now. I like how it sort of removes levels of abstraction between the frontend and the storage. Instead of a service storing normalised data, which is then packaged up into objects in an API, just for the frontend to normalise it all over again and store it. Like, just use a database in the frontend. Simple.

ccorcos commented 2 years ago

Yeah this took me way too long to figure out. Apparently codesandbox automatically wraps your app in React.StrictMode inside index.js, and this causes every render and useMemo etc. to be called TWICE, to simulate the fact that in the next version of React you can't be sure how often React will render without committing those renders.

Yikes. When React becomes async by default, I think I'll be moving on to something else... all this bloat they're adding is over-optimization in my opinion.

Curious to know your thoughts on this. I'm not really a database guy so I don't know if this is correct. If there are no glaring holes I'd be up for doing some benchmark tests.

Yeah this makes sense. I bet you could modify it to handle overlapping ranges too, by basically converting them into non-overlapping ranges and then piecing the results back together.

This kind of thing definitely makes sense for the in-memory database. But you aren't going to get much of that performance gain with a SQLite or LevelDb backend, for example, because those storage engines don't support this kind of thing...

In my code I've only been using synchronous updates. I like that store get/set are sync, and I can just persist the entire set of changes to an API completely separately in some debounced function after a few seconds or so. So it only fetches from storage when the app first starts. It's really dumb and I guess it might only work for single-user apps; I haven't thought too much about storage yet and I have no idea how it would handle conflicts.

Yeah, it's nice to use sync when building an app, but permanent storage is almost always async. Also, it's not great if you have to load all your data into memory at once when you start the app... That approach definitely works for now, but once you end up with collaboration and syncing, it breaks down.

Here's something I'm missing though: how can a transaction be asynchronous? If a concurrent transaction happens at the same time, is it queued or does it throw an error? Either way, if you're reading/writing to external storage somewhere, how do you know that storage hasn't changed during the transaction?

My plan is to keep track of transactions in an append-only log (just an index like [deviceId, int, WriteOps]), and then when you want to sync between devices, you basically need to sync these logs to each other and resolve the writes.
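
Something like this, just to illustrate the shape (the names are hypothetical):

import { WriteOps } from "tuple-database"

// Hypothetical schema: each device appends its writes under an increasing
// sequence number, giving a per-device append-only log.
type SyncLogSchema = {
  key: [deviceId: string, seq: number]
  value: WriteOps
}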

I haven't spent much time actually building this yet and I think that once you have more than 2 devices syncing, you need to track causality more granularly and use a merkle tree similar to git -- that's what Automerge is doing at least... But my point is that the syncing abstraction is a different layer.

Also, you can transactionally read-write to an async database. So if you want to write to another remote database, just create a transaction. But if your goal is to sync a local database with a remote one, then transactions are certainly tricky...

Is this like what I mentioned with keeping everything synchronous? Except I treat the local memory as the source of truth. I feel like this is what you want for local-first software. You just update your own local memory as you wish, and try and figure out how to merge that with external data later. Would like to know more about this if it's solving something I haven't thought about yet.

Yes, exactly.

Here's the code I have so far. It's very much a work-in-progress but maybe you'll find it useful. This is a simplified version where subscriptions aren't re-used. But once this version works, it's just a matter of using some helper functions to re-use subscriptions...

import { isEqual } from "lodash"
import {
  AsyncTupleDatabaseApi,
  AsyncTupleDatabaseClient,
  InMemoryTupleStorage,
  KeyValuePair,
  transactionalReadWrite,
  transactionalWrite,
  Tuple,
  TupleDatabase,
  TupleDatabaseClient,
  TupleDatabaseClientApi,
  WriteOps,
} from "tuple-database"
import { isBoundsWithinBounds } from "tuple-database/helpers/isBoundsWithinBounds"
import { Bounds } from "tuple-database/helpers/sortedTupleArray"

type CacheSchema =
  | { key: ["active", string]; value: Bounds }
  | { key: ["destroy", Bounds]; value: () => void }
  | { key: ["ready", Bounds]; value: boolean }
  | { key: ["deps", { containerId: string }, { id: string }]; value: Bounds }
  | { key: ["data", ...Tuple]; value: any }

const db = new AsyncTupleDatabaseClient(
  new TupleDatabase(new InMemoryTupleStorage())
)
const cache = new TupleDatabaseClient(
  new TupleDatabase(new InMemoryTupleStorage())
)

type Sub = { id: string; bounds: Bounds }

function findContainingSubscription(bounds: Bounds): Sub | undefined {
  const container = cache
    .subspace(["active"])
    .scan()
    .map(({ key: [id], value: bounds }) => ({ id, bounds }))
    .find(({ bounds: container }) =>
      isBoundsWithinBounds({ container, bounds })
    )
  return container
}

const registerDependentSubscription = transactionalWrite()(
  (tx, containerId: string, sub: Sub) => {
    tx.set(["deps", { containerId }, { id: sub.id }], sub.bounds)
  }
)

const registerActiveSubscription = transactionalWrite()(
  (tx, sub: Sub, destroy: () => void) => {
    // TODO: better types for tx.set
    tx.set(["active", sub.id], sub.bounds)
    tx.set(["destroy", sub.bounds], destroy)
  }
)

function isSubscriptionReady(bounds: Bounds) {
  return Boolean(cache.get(["ready", bounds]))
}

const handleReady = transactionalWrite()(
  (tx, bounds: Bounds, results: KeyValuePair[]) => {
    tx.set(["ready", bounds], true)
    handleData(tx, { set: results })
  }
)

const handleData = transactionalWrite()((tx, writes: WriteOps) => {
  tx.subspace(["data"]).write(writes)
})

function startSubscription(sub: Sub) {
  const container = findContainingSubscription(sub.bounds)
  if (container) {
    registerDependentSubscription(cache, container.id, sub)
  } else {
    let destroyed = false
    let unsubscribe: (() => void) | undefined
    registerActiveSubscription(cache, sub, () => {
      destroyed = true
      if (unsubscribe) unsubscribe()
    })
    return db
      .subscribe(sub.bounds, (writes) => {
        if (destroyed) return
        if (!isSubscriptionReady(sub.bounds)) return
        handleData(cache, writes)
      })
      .then((destroy) => {
        if (destroyed) return destroy()
        unsubscribe = destroy
        return db.scan(sub.bounds).then((results) => {
          handleReady(cache, sub.bounds, results)
        })
      })
  }
}

function getActiveSubscription(id: string) {
  const result = cache.get(["active", id])
  if (!result) throw new Error("Subscription not found.")
  return result
}

const transferDeps = transactionalReadWrite()(
  (tx, args: { from: string; to: string }) => {
    const deps = tx.subspace(["deps"])
    deps
      .scan({ prefix: [{ containerId: args.from }] })
      .forEach(({ key, value }) => {
        deps.remove(key)
        deps.set([{ containerId: args.to }, key[1]], value)
      })
  }
)

const deleteDeps = transactionalReadWrite()((tx, id: string) => {
  const deps = tx.subspace(["deps", { containerId: id }])
  const result = deps.scan()
  result.forEach(({ key }) => {
    deps.remove(key)
  })
  return result
})

const deleteActiveSubscription = transactionalReadWrite()((tx, sub: Sub) => {
  tx.remove(["active", sub.id])
  tx.remove(["ready", sub.bounds])
  const destroy = tx.get(["destroy", sub.bounds])
  if (!destroy) throw new Error()
  tx.remove(["destroy", sub.bounds])
  return destroy
})

const swapWithContainerSubscription = transactionalReadWrite()(
  (tx, sub: Sub, containerId: string) => {
    transferDeps(tx, { from: sub.id, to: containerId })
    const destroy = deleteActiveSubscription(tx, sub)
    // NOTE: this is a side-effect, but we shouldn't have any transactional
    // conflicts here and it cleans up the code a bit so we're doing it anyways.
    destroy()
  }
)

function findEquivalentDependentSubscription(sub: Sub) {
  const dep = cache
    .subspace(["deps", { containerId: sub.id }])
    .scan()
    .map(({ key: [{ id }], value: bounds }) => ({ id, bounds }))
    .find(({ bounds }) => isEqual(bounds, sub.bounds))
  return dep
}

const swapWithDependentSubscription = transactionalReadWrite()(
  (tx, sub: Sub, depId: string) => {
    transferDeps(tx, { from: sub.id, to: depId })
    tx.remove(["active", sub.id])
    tx.set(["active", depId], sub.bounds)
  }
)

function teardownSubscription(id: string) {
  const bounds = getActiveSubscription(id)
  const sub: Sub = { id, bounds }

  // Look for an equivalent subscription in its deps and swap it out.
  const dep = findEquivalentDependentSubscription(sub)
  if (dep) {
    swapWithDependentSubscription(cache, sub, dep.id)
    return
  }

  // Look for an existing subscription that contains this one and merge deps.
  const container = findContainingSubscription(bounds)
  if (container) {
    swapWithContainerSubscription(cache, sub, container.id)
    return
  }

  // Start up dependent subscriptions, remove data from unused bounds.
  // TODO: all in one transaction.
  const destroy = deleteActiveSubscription(cache, sub)
  const deps = deleteDeps(cache, sub.id).map(
    ({ key: [{ id }], value: bounds }) => ({ id, bounds })
  )
  // TODO: no need to re-fetch either. just need to subscribe.
  // TODO: delete unused data from deps.
  Promise.all(deps.map((dep) => startSubscription(dep))).then(() => {
    destroy()
  })
}

// TODO: handle the whole write path.
function cacheLayer(
  db: AsyncTupleDatabaseApi
): [AsyncTupleDatabaseApi, TupleDatabaseClientApi] {
  const cache = new TupleDatabaseClient(
    new TupleDatabase(new InMemoryTupleStorage())
  )
  return [
    {
      ...db,
      commit: (writes, txId) => {
        // TODO:
        // - Add txId to ignore TxIds.
        // - Ignore subscription updates for those txIds.
        // - Check if each write is within a subscription before optimistically writing.
        return db.commit(writes, txId)
      },
    },
    cache,
  ]
}

I think I read that notion uses postgres currently is that right? I read that reddit only uses two tables which kind of blew my mind. It's like using your db as a document store so it lets clients change easier. Is notion something similar? I feel like this is almost the same as how tuple-database works

Ah yes, they basically have a triplestore.

I like how it sort of removes levels of abstraction between the frontend and the storage. Instead of a service storing normalised data, which is then packaged up into objects in an API, just for the frontend to normalise it all over again and store it. Like, just use a database in the frontend. Simple.

Oh yeah.. exactly!

Well let me know if I can help at all... Are you starting something new? Or just playing around in your free time?

JakeCoxon commented 2 years ago

Thanks for the response again, I was on holiday so a bit of a delay on my end.

The formatting is a bit bugged but I think I get it and it looks cool

I'm just going to try and get an app working with just offline storage for now. I'm mainly playing around in my free time; I always wanted to make a wiki-like notes app similar to Notion, but completely offline and with some extra specific features I wanted. Finding this library sort of kick-started it. I'm also looking to make my own native UI for it so it might get ambitious.

BTW I'm running into this bug, I've written a test case for it here - https://github.com/ccorcos/tuple-database/commit/7541989da67cd73257c7288c68785e4145eda06a

It's a problem with tx.scan where the initial db.scan uses the limit field, but then results may be removed per the writes.removes list. I believe the fix is to figure out the number of removes first, and then adjust the limit going into db.scan.
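
Something along these lines, as a sketch (isWithinBounds is a made-up helper, not the library's internals):

import { Tuple } from "tuple-database"

// Hypothetical types for illustration.
type ScanArgs = { limit?: number }
declare function isWithinBounds(key: Tuple, args: ScanArgs): boolean

// Widen the limit by the number of pending removes inside the scan bounds,
// so that after filtering them out we still return `limit` items.
function adjustedLimit(args: ScanArgs, removes: Tuple[]): number | undefined {
  if (args.limit === undefined) return undefined
  const removed = removes.filter((key) => isWithinBounds(key, args)).length
  return args.limit + removed
}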

ccorcos commented 2 years ago

Fixed the formatting, sorry.

with some extra specific features I wanted

Curious what features you're looking for.

make my own native UI for it so it might get ambitious

I looked into that... you'll spend all your time just getting text layout and text selection working lol

BTW I'm running into this bug, I've written a test case for it here - https://github.com/ccorcos/tuple-database/commit/7541989da67cd73257c7288c68785e4145eda06a

Good find!

https://github.com/ccorcos/tuple-database/blob/79b6085911506dd95386cb7552e8993680cef2b3/src/database/async/AsyncTupleDatabaseClient.ts#L152-L156

Looks like if we've removed 2 items from the range we're scanning, then we need to fetch two more items from storage.

Fixed it here https://github.com/ccorcos/tuple-database/commit/5d98cce81bdec0f17b6768eb77600d3a69b10fae and released a new version 2.1.17. Also discovered a really pesky IndexedDb bug 🙄