eugeneware / pairs

Turn a JSON object into a list of pairs. Useful for indexing.

index key name #1

Open rudylacrete opened 9 years ago

rudylacrete commented 9 years ago

Hi Mr Ware,

I have started using your module in a project I'm working on, and I have a question about how you handle index key names. I've noticed that nested object properties are indexed without any reference to their parent. For example, {prop1: {prop2: 'val'}} produces the index [['prop1', 'prop2'], ['prop2', 'val']]. If the object contains the same key-value pair at several nesting levels, only one index entry is created, and there is no way to tell which level it came from. As a result, querying the documents on this index returns multiple candidates, and the match step performed by jsonquery then has to remove the objects that don't have the requested property at the correct level. This behaviour comes from the way the query is transformed to match the pairs indexing strategy (path.split, takeTwo, ...).

I've created a modified version that uses the full property path both when creating indexes and when querying them. The previous example then generates the following index: [['prop1', 'prop2'], ['prop1.prop2', 'val']]. This can significantly improve performance on large datasets with complex objects, because there are fewer dataHits. For example, given the document {prop1: {prop2: 'val'}, prop2: 'val'} and the document {prop1: {prop3: 'val'}, prop2: 'val'}, a query with the filter {'prop1.prop2': 'val'} generates only one dataHit instead of two.
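To make the comparison concrete, here is a minimal sketch of the two strategies as they are described in this comment. It is not the actual pairs code: the function names (shortKeyPairs, fullPathPairs, countCandidates) are hypothetical, and the traversal is only assumed to reproduce the example outputs quoted above.

```javascript
// Illustrative sketch only; helper names are hypothetical, not the pairs API.
function isPlainObject(v) {
  return v !== null && typeof v === 'object' && !Array.isArray(v);
}

// Short-key strategy: every pair keeps only the immediate key name.
function shortKeyPairs(obj) {
  const out = [];
  for (const [key, value] of Object.entries(obj)) {
    if (isPlainObject(value)) {
      for (const childKey of Object.keys(value)) out.push([key, childKey]);
      out.push(...shortKeyPairs(value));
    } else {
      out.push([key, value]);
    }
  }
  return out;
}

// Full-path strategy: the key of each pair is the dotted path from the root.
function fullPathPairs(obj, prefix = '') {
  const out = [];
  for (const [key, value] of Object.entries(obj)) {
    const path = prefix ? prefix + '.' + key : key;
    if (isPlainObject(value)) {
      for (const childKey of Object.keys(value)) out.push([path, childKey]);
      out.push(...fullPathPairs(value, path));
    } else {
      out.push([path, value]);
    }
  }
  return out;
}

// Count the documents whose index contains the [key, value] pair.
function countCandidates(docs, toPairs, key, value) {
  return docs.filter((doc) =>
    toPairs(doc).some(([k, v]) => k === key && v === value)
  ).length;
}

const docA = { prop1: { prop2: 'val' }, prop2: 'val' };
const docB = { prop1: { prop3: 'val' }, prop2: 'val' };

// The short-key lookup for the filter prop1.prop2 = 'val' searches for ['prop2', 'val'].
const shortHits = countCandidates([docA, docB], shortKeyPairs, 'prop2', 'val');
// The full-path lookup searches for ['prop1.prop2', 'val'].
const fullHits = countCandidates([docA, docB], fullPathPairs, 'prop1.prop2', 'val');
console.log(shortHits, fullHits);
```

With the two example documents, the short-key lookup matches both and the full-path lookup matches only the first, which is the one-versus-two dataHits difference described above.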

Do you think this approach is sound, and could it break other features in jsonquery-engine?

Best regards.

eugeneware commented 9 years ago

The default "Property" indexing strategy for jsonquery-engine should give you what you want:

https://github.com/eugeneware/jsonquery-engine#indexing-strategy-support

https://github.com/eugeneware/jsonquery-engine/blob/master/test/level-plan.js#L201-L205

The pairs index is a bit of a cheat to help you quickly index EVERYTHING, get a good hit rate, and then filter from a smaller subset of results. Even if the index lookup pulls in incorrect candidates, they will not be returned as results.
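The lookup-then-filter behaviour described here can be sketched roughly as follows. This is an illustration under assumptions, not the jsonquery-engine implementation: leafPairs, buildIndex, getPath and query are hypothetical names, the index is a plain in-memory Map, and only exact-match filters on a single dotted path are handled.

```javascript
// Illustrative sketch only; none of these helpers belong to jsonquery-engine.

// Emit [key, value] for every leaf, keeping only the immediate key name.
function leafPairs(obj) {
  const out = [];
  for (const [key, value] of Object.entries(obj)) {
    if (value !== null && typeof value === 'object') out.push(...leafPairs(value));
    else out.push([key, value]);
  }
  return out;
}

// Map each serialized pair to the set of document ids that contain it.
function buildIndex(docs) {
  const index = new Map();
  docs.forEach((doc, id) => {
    for (const pair of leafPairs(doc)) {
      const k = JSON.stringify(pair);
      if (!index.has(k)) index.set(k, new Set());
      index.get(k).add(id);
    }
  });
  return index;
}

// Resolve a dotted path such as 'prop1.prop2' against a document.
function getPath(doc, path) {
  return path
    .split('.')
    .reduce((node, key) => (node == null ? undefined : node[key]), doc);
}

// Phase 1: cheap index lookup on the last path segment (may over-match).
// Phase 2: exact check against the full path, dropping false candidates.
function query(docs, index, path, value) {
  const lastKey = path.split('.').pop();
  const ids = [...(index.get(JSON.stringify([lastKey, value])) || [])];
  const results = ids
    .map((id) => docs[id])
    .filter((doc) => getPath(doc, path) === value);
  return { dataHits: ids.length, results };
}

const docs = [
  { prop1: { prop2: 'val' }, prop2: 'val' },
  { prop1: { prop3: 'val' }, prop2: 'val' },
];
const outcome = query(docs, buildIndex(docs), 'prop1.prop2', 'val');
console.log(outcome.dataHits, outcome.results.length);
```

In this sketch the index lookup touches both documents, but the filter step keeps only the first, so the extra candidate costs a read without affecting correctness.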

You certainly could index the full path for each and every path key (see https://github.com/eugeneware/path-engine for something that takes this approach but for a single property).

Though, then you have to wear the full index weight of slicing up the entire object into key paths and storing them (which is fine if indexing space is not a problem).

Using pairs provides a nice tradeoff of low index size overhead and fast performance.