Closed: poltak closed this issue 6 years ago
Thanks for a very engaging question.
It's not obvious how to solve these cases, but [Collection.keys()](http://dexie.org/docs/Collection/Collection.keys()) and [Collection.primaryKeys()](http://dexie.org/docs/Collection/Collection.primaryKeys()) can be good tools for loading keys without loading the entire data.
I have limited time right now, so I'll try to suggest a solution later. Without diving too deep into the problem, the strategy depends on the size and spread of each table related to the query. An SQL database would have these stats recorded and try to pick a reasonable query plan; with Dexie, you have to do the query planning yourself.
For example, if you show a view in your app that limits the results to the 100 most recent visits, it would scale over time if you start by querying the visits table. You can construct better paging than the offset/limit that Dexie provides by picking fromTime from the last result of the previous page. I'm also not sure the terms index is needed, as I assume terms are less spread than urls. But that's just a guess, and one idea would be something like the following:
```js
async function search (query, startTime, endTime, limit) {
  const visits = await this.visits
    .where('[time+url]')
    .between(
      [startTime, ''],
      [endTime, String.fromCharCode(65535)],
    )
    .reverse() // Assuming you want the most recent visits first
    .limit(limit)
    .toArray();
  const urls = visits.map(visit => visit.url);
  const pages = await this.pages.where('url').anyOf(urls).toArray();
  return {
    result: pages.filter(page => page.terms.includes(query)),
    nextPageEndTime: visits[visits.length - 1].time
  };
} // returns {result, nextPageEndTime}
```
The query might have to be re-executed in order to fill each page, since items whose terms don't match will be omitted.
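That re-execution step could be sketched like this. A minimal sketch, assuming the `search` function above and its `{ result, nextPageEndTime }` return shape; `searchFilled` is a hypothetical helper name, and the loop simply stops when the cursor makes no progress (the sketch above would also need a guard for an empty `visits` array):

```javascript
// Repeatedly re-execute the query, moving the end-time cursor back to the
// last visit of the previous batch, until `limit` matches are collected
// or no further progress can be made.
async function searchFilled(search, query, startTime, endTime, limit) {
  const results = [];
  let cursor = endTime;
  while (results.length < limit) {
    const { result, nextPageEndTime } = await search(query, startTime, cursor, limit);
    // Keep only as many results as are still needed to fill the page
    results.push(...result.slice(0, limit - results.length));
    if (nextPageEndTime == null || nextPageEndTime === cursor) break; // no progress: stop
    cursor = nextPageEndTime;
  }
  return results;
}
```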
Thanks for the in-depth response @dfahlander! I should have clarified the data a bit more, but terms are actually more spread (although I'm not 100% sure what's meant by "spread" here). They are essentially all the terms that appear in some web page's text after putting it through a text preprocessing pipeline, to afford full-text search over stored pages.
I wasn't aware of the `Collection.keys`/`primaryKeys` methods, but these seem to be really appropriate for my use case! Also, I only really care about the latest visits for now, so that `Collection.reverse` is exactly what I need.
With that information, and looking around at some other examples you've given, I've come up with the following, which seems a lot more efficient:
```js
async search({
  query,
  startTime = 0,
  endTime = Date.now(),
  skip = 0,
  limit = 10,
}) {
  // Fetch all latest visits in time range, grouped by URL
  const latestVisitByUrl = new Map()
  await this.visits
    .where('[time+url]')
    .between(
      [startTime, Storage.MIN_STR],
      [endTime, Storage.MAX_STR],
    )
    .reverse() // Go through visits by most recent
    .eachPrimaryKey(([url, time]) => {
      // Only ever record the first (latest) visit for each URL
      if (!latestVisitByUrl.get(url)) {
        latestVisitByUrl.set(url, time)
      }
    })

  // Fetch all pages with terms matching query
  const matchingPageUrls = await this.pages
    .where('terms')
    .equals(query)
    .filter(page => latestVisitByUrl.has(page.url))
    .primaryKeys()

  // Paginate
  return matchingPageUrls
    .map(url => [latestVisitByUrl.get(url), url]) // shape: [time: number, url: string]
    .sort(([a], [b]) => a - b)
    .slice(skip, skip + limit)
}
```
Now simply returning the IDs (URLs in this case); the caller can do the constant-size lookup.
The query on the visits table uses `Collection.eachPrimaryKey` to only build up a grouped-by-URL `Map` in memory (there could be hundreds of visits per URL).
Then the query on the pages table matches on the terms search, filtering out pages based on those visit URLs (on the `Collection` rather than an in-mem array), bringing into mem only the array of page IDs/URLs (the filter guarantees this to be of size <= the # of visit URLs in mem).
Then just map on the saved latest times, to act as a score to sort by and to be able to perform pagination.
Do you see any obvious problems with that, or things I may be overlooking?
The `latestVisitByUrl` Map still has a worst-case size linear in the size of the visits table, although the severity would be reduced a lot by the group-by, depending on the # of visits per page.
The worst case only happens when there is a text query but no specified time bounds (i.e. "get me the latest N pages with term: X"). Actually, I think any time `endDate` is not specified, there should be a simpler path to take by just doing a lookback over that keyPath on an ordered-by-time index, until enough results are found (which I think can be done with a few `Collection` methods).
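That lookback idea might be sketched as follows. The Dexie side is assumed here (e.g. walking `this.visits.orderBy('[time+url]').reverse()` and feeding its primary keys in); `lookback` and its parameters are hypothetical names. The point is to stop scanning as soon as enough distinct matching URLs have been seen, rather than touching the whole visits table:

```javascript
// visitKeysNewestFirst: iterable of [time, url] primary keys, newest first
// matchingUrls: Set of page URLs that matched the term query
// Returns up to `limit` [time, url] pairs, one per distinct URL.
function lookback(visitKeysNewestFirst, matchingUrls, limit) {
  const latestVisitByUrl = new Map();
  for (const [time, url] of visitKeysNewestFirst) {
    // Skip non-matching pages and URLs we've already seen (dedupe by URL)
    if (!matchingUrls.has(url) || latestVisitByUrl.has(url)) continue;
    latestVisitByUrl.set(url, time);
    if (latestVisitByUrl.size >= limit) break; // enough results: stop early
  }
  return [...latestVisitByUrl].map(([url, time]) => [time, url]);
}
```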
The line `.filter(page => latestVisitByUrl.has(page.url))` will force Dexie to load values even though you are selecting `primaryKeys()`. It also forces the use of openCursor(), which is slower than getAll()/getAllKeys(), so I think you'd get better performance leaving that line out and doing the filtering on the results after they are awaited. I know this comes with a potential memory-scaling problem if the number of page results were huge for a particular query, but I suppose this is only theoretical, as you filter by exact term.
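A sketch of that suggested change, assuming the variable names from the snippet above: the Dexie query keeps only `primaryKeys()` (no Collection-level `.filter()`), and the filtering/paging becomes a plain in-memory step on the awaited key array. `pageResults` is a hypothetical helper name:

```javascript
// The Dexie side would become something like:
//
//   const matchingPageUrls = await this.pages
//     .where('terms')
//     .equals(query)
//     .primaryKeys()   // no .filter(): lets Dexie use getAllKeys()
//
// …and the filtering/pagination then happens on the plain array:
function pageResults(matchingPageUrls, latestVisitByUrl, skip, limit) {
  return matchingPageUrls
    .filter(url => latestVisitByUrl.has(url))       // in-mem Map lookup
    .map(url => [latestVisitByUrl.get(url), url])   // [time, url] pairs
    .sort(([a], [b]) => a - b)
    .slice(skip, skip + limit);
}
```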
Saw another issue: swap `url` with `time` in `.eachPrimaryKey(([url, time])` so it reads `.eachPrimaryKey(([time, url])`, since the `[time+url]` index yields keys with time first.
Forgot about this issue. We've ended up going with Dexie as the main DB for our web extension, Memex, and have fleshed out the search a lot more in the months since this issue was created. We have implemented full-text search for visited web pages, allowing filtering by things like visit time (the example in the OP), domains, tags, and other things, all using Dexie.
I'll leave a link here to our search implementation for anyone reading this issue in the future who wants to see a more complex working example. Things are probably still a bit messy and confusing right now, but will hopefully improve over time (PRs welcome): https://github.com/WorldBrain/Memex/tree/master/src/search/search-index-new/search
Thanks a lot once again for your help @dfahlander, and for the wonderful software!
I added a link to Memex on our derived work page. Really nice extension. Looking forward to using it.
tl;dr: If I did two queries to get `Collection`s for two different tables, is there some way to efficiently filter one of those collections on the other, based on some related properties? Or am I just making life difficult for myself trying to use Dexie like this?
Hi there. First off, apologies if this has already been asked somewhere; TBH I am not really sure what I could search for on this. I looked at previous discussion on table joins, but it doesn't really solve my problem. I may also be thinking about tables and relations in Dexie completely the wrong way; please let me know if so.
I've recently been learning about Dexie and am looking to port an existing IndexedDB data model over to Dexie's abstractions.
Take two simple tables:
`'url, *terms'` (pages)
`'[time+url]'` (visits)
A page represents a bare-bones web page with some extracted text `terms` to search on; a visit is an event in time where someone visited that web page (many visits to one page).
I am completely lost as to how I could perform space-efficient query/s to do something like: "get all pages with matching terms, filtered to those pages with visits between times A and B".
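For readers less familiar with Dexie schema syntax, here is a minimal sketch of what those two schema strings declare. The table names `pages` and `visits` and the `Storage` class wrapper are my assumptions, not from the original post:

```javascript
// In Dexie, these strings would be passed to Version.stores():
//
//   class Storage extends Dexie {
//     constructor() {
//       super('storage')
//       this.version(1).stores(schema)
//     }
//   }
//
const schema = {
  pages: 'url, *terms',  // primary key: url; '*' = multi-entry index on the terms array
  visits: '[time+url]',  // compound primary key of [time, url], ordered by time first
};
```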
I thought of some ways, like the following (assume some method on a `class extends Dexie`, and pretend it's inside a transaction):

This works okay but, from what I understand of `Dexie.Collection`, because of the `toArray()` call in the visits table query, we would end up having worst-case space linear in the size of the visits table, just so the `Collection.filter` call in the pages table query can filter out non-matching pages. Not really ideal.

I could add a `hasVisitInRange` method to a class representing pages, which queries the visits table for all visits with a matching URL and derives a bool (where + between + count), but as it's async I don't think I can just add it as a `Collection.filter` like that.

Instead, I would need to `matchingPagesColl.toArray()` to pull all matching pages into mem first, then do all the async `page.hasVisitInRange` lookups and use something like `Array.prototype.filter` on the results. This would just reverse the space issue, making it linear in `matchingPagesColl.count()`, which is also not ideal. It would also be doing `matchingPagesColl.count()` separate lookups on the visits table, which doesn't seem like a good thing either. (The first example also has space issues, but at least its # of table queries is always 2.)

Please let me know if this makes no sense. Thanks for taking the time to read as well!