dexie / Dexie.js

A Minimalistic Wrapper for IndexedDB
https://dexie.org
Apache License 2.0
11.52k stars 642 forks source link

Filter collection by related-table's collection #666

Closed poltak closed 6 years ago

poltak commented 6 years ago

tl;dr: If I did two queries to get Collections for two different tables, is there some way to efficiently filter one of those collections on the other, based on some related properties - or am I just making life difficult for myself trying to use Dexie like this?

Hi there. First off, apologies if this has already been asked somewhere. TBH I am not really sure what I could search for this. Looked at previous discussion on table joins, but does not really solve my problem. May also be thinking about tables and relations in Dexie completely the wrong way; please let me know if so.

Been recently learning about Dexie and looking to port over existing data model in IndexedDB to Dexie's abstractions.

Take two simple tables:

A page represents a bare-bones web page with some extracted text terms to search on, a visit being an event in time where someone visited that web page - many visits to one page.

I am completely lost with how I could perform space-efficient query/s to do something like: "get all pages with matching term, filtered by those pages with visits between times A and B".

Thought of some ways like the following (assume some method on a class extends Dexie, and pretends it's inside a transaction):

async search(query, startTime = 0, endTime = Date.now()) { 
    const visits = await this.visits
                .where('[time+url]')
                .between(
                    [startTime, ''],
                    [endTime, String.fromCharCode(65535)],
                )
                .toArray()

       const visitUrls = new Set(visits.map(visit => visit.url))
       return this.pages
                .where('terms')
                .equals(query)
                .filter(page => visitUrls.has(page.url))
                .toArray()
}

Works okay but, from what I understand of Dexie.Collection, because of the toArray() call in the visits table query, we would end up having worst-case space linear to the size of the visits table, just so the Collection.filter call in the pages table query can filter out non-matching pages. Not really ideal.

I could add a hasVisitInRange method to a class representing pages, which queries the visits table on all visits with matching URL and derives a bool (where + between + count), but as it's async I don't think I can just add it as a Collection.filter like:

async search(query, startTime = 0, endTime = Date.now()) { 
       const matchingPagesColl = this.pages
                .where('terms')
                .equals(query)

      return matchingPagesColl.
                .filter(page => page.hasVisitInRange(startTime, endTime))
                .toArray()
}

Instead I would need to matchingPagesColl.toArray() to pull all matching pages into mem first, then do all the async page.hasVisitInRange lookups and use something like Array.prototype.filter on the results. This would just reverse the space issue, making it linear to matchingPagesColl.count(), which is also not ideal. It would also be doing matchingPagesColl.count() separate lookups on the visits table, which doesn't seem like a good thing either. (first example also has space issues, but at least # table queries is always 2)

Please let me know if this makes no sense. Thanks for taking the time to read as well!

dfahlander commented 6 years ago

Thanks for a very engaging question.

It's not obvious about how to solve these cases, but with the power of [Collection.keys()](http://dexie.org/docs/Collection/Collection.keys()) and [Collection.primaryKeys()](http://dexie.org/docs/Collection/Collection.primaryKeys()) can be good tools to load keys without loading the entire data.

With some limited time right now, I'll try to suggest a solution later. Without diving too deep into the problem, the strategy depends on the size and spread of each table related to the query. An SQL database would have these stats recorded and try to pick a reasonable query plan. With dexie, you have to do the query plan yourself.

For example you show a view in your app where your limit the results to the 100 most recent visits, it would scale over time if you start querying the visits table. You can construct a better paging than offset/limit that dexie provides, by altering your picking fromTime from the last result of last page. I'm also not sure the terms index is needed if so, as I assume terms are less spread than urls. But that's just a guess, and one idea would be something like the following:

async function search (query, startTime, endTime, limit) {
   const visits = await this.visits
                .where('[time+url]')
                .between(
                    [startTime, ''],
                    [endTime, String.fromCharCode(65535)],
                )
                .reverse() // Assuming you want the most recent visits first
                .limit(limit)
                .toArray();
  const urls = visits.map(visit => visit.url);
  const pages = await this.pages.where('url').anyOf(urls).toArray();
  return {
    result: pages.filter(page => page.terms.includes(query)),
    nextPageEndTime: visits[visits.length-1].time
  }
} // returns {result, nextPageEndTime}

The query might have to be re-executed in order to fill each page, since items where terms not match will be omitted.

poltak commented 6 years ago

Thanks for the in-depth response @dfahlander! Should have clarified the data a bit more, but terms are actually more spread (although not 100% sure what's meant by "spread" here). They are essentially all the terms that appear in some web page's text after putting it all through a text preprocessing pipeline - to afford full-text search for stored pages.

I wasn't aware of the Collection.keys/primaryKeys methods, but these seem to be really appropriate for my use-case! Also only really caring about latest visits for now, so that Collection.reverse is exactly what I need.

With that information and looking around at some other examples you've given, I've come up with the following, which seems a lot more efficient:

async search({
        query,
        startTime = 0,
        endTime = Date.now(),
        skip = 0,
        limit = 10,
}) {
            // Fetch all latest visits in time range, grouped by URL
            const latestVisitByUrl = new Map()
            await this.visits
                .where('[time+url]')
                .between(
                    [startTime, Storage.MIN_STR],
                    [endTime, Storage.MAX_STR],
                )
                .reverse() // Go through visits by most recent
                .eachPrimaryKey(([url, time]) => {
                    // Only ever record the first (latest) visit for each URL
                    if (!latestVisitByUrl.get(url)) {
                        latestVisitByUrl.set(url, time)
                    }
                })

            // Fetch all pages with terms matching query
            const matchingPageUrls = await this.pages
                .where('terms')
                .equals(query)
                .filter(page => latestVisitByUrl.has(page.url))
                .primaryKeys()

            // Paginate
            return matchingPageUrls
                .map(url => [latestVisitByUrl.get(url), url]) // shape: [time: number, url: string]
                .sort(([a], [b]) => a - b)
                .slice(skip, skip + limit)
}

Now simply returning the IDs (URLs in this case) - caller can do the constant size lookup. Query on visits table uses a Collection.eachPrimaryKey to only build up a grouped-by-URL Map in memory (could be hundreds of visits per URL). Then the query on pages table matches on the terms search, filtering out pages based on those visit URLs (on the Collection rather than an in-mem array), bringing into mem only the Array of page ID/URLs (the filter guarantees this to be of size <= the # of visit URLs in mem). Then just map the saved latest times, to act as a score to sort by and be able to perform pagination.

Do you see any obvious problems with that or things I may be overlooking? The latestVisitByUrl Map still has a worst case size linear to the visits table, although the severity would be reduced a lot by the group-by, depending on # of visits per page. Worst-case is only when there is a text query but no specified time bounds (so "get me the latest N pages with term: X"). Actually I think any time endDate is not specified, there should be a simpler path to take by just doing a lookback over that keyPath on an ordered-by-time index, until enough results are found (which I think can be done with a few Collection methods).

dfahlander commented 6 years ago

The line .filter(page => latestVisitByUrl.has(page.url)) will force dexie to load values even though you are selecting primaryKeys(). It also forces the use of openCursor() which is slower than getAll()/getAllKeys(), so I think you'd get better performance leaving that line out and doing the filtering on the results after awaited. I know this comes with a potential memory-scaling problem if number of page results would be huge for the particular query. But I suppose this is only theoretical as you filter by exact term.

dfahlander commented 6 years ago

Saw another issue: swap url with time in .eachPrimaryKey(([url, time]) to .eachPrimaryKey(([time, url])

poltak commented 6 years ago

Forgot about this issue. We've ended up going with Dexie as the main DB for our web extension, Memex, and have fleshed-out the search a lot more in the last few months since this issue was created. We have implemented full-text search for visited web pages, allowing filtering by things like visit time (the example in OP), domains, tags, other stuff, all using Dexie.

I will leave a link here to our search implementation for anyone who's reading this issue in the future and wants to see a bit more complex working example. Things probably still a bit messy and confusing right now, but will hopefully improve over time (PRs welcome): https://github.com/WorldBrain/Memex/tree/master/src/search/search-index-new/search

Thanks a lot once again for your help @dfahlander, and for the wonderful software!

dfahlander commented 6 years ago

I added a link to Memex on our derived work page. Really nice extension. Looking forward to using it.