Per-segment tracking of external/side-car data [LUCENE-4658]

asfimport commented 11 years ago

Spinoff from David's idea on #5328 (https://issues.apache.org/jira/browse/LUCENE-4258?focusedCommentId=13534352&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13534352 )

I made a prototype patch that allows custom per-segment "side-car data". It adds an abstract ExternalSegmentData class. The idea is the app implements this, and IndexWriter will pass each Document through to it, and call on it to do flushing/merging. I added a setter to IndexWriterConfig to enable it, but I think this would really belong in Codec ...

I haven't tackled the read-side yet, though this is already usable without that (ie, the app can just open its own files, read them, etc.).

The random test case passes.

I think for example this might make it easier for Solr/ElasticSearch to implement things like ExternalFileField.

Migrated from LUCENE-4658 by Michael McCandless (@mikemccand), 4 votes, updated Mar 17 2013 Attachments: LUCENE-4658.patch (versions: 2)

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Initial prototype patch ...

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, 1) moving this thing into Codec as (optional) getExternalDataFormat, and 2) adding basic read-time support. Test still passes (and I was able to remove the hack commit/waitForMerges necessary before because read-time support wasn't there)...

Now I'm not sure it really "belongs" on Codec: it's sort of awkward because it's an indexing chain PLUS codec part (on the write side). It's also ... heavyweight now because to use it you must make a Codec and name it (so SPI can find it at read time), so for Solr (eg) to use this for an ExternalFileField ... seems hardish.

Maybe it should be back on IndexWriterConfig (write side), but then we'd need something on the read side ... not sure.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Throwing a wild idea, I'm not sure if it can work. What if we support that by introducing a CustomField which takes an Object as its value, and IW's chain will forward those per-document values to Codec.CustomDataFormat? CustomField, I think, should not support FieldType settings, but hardcode to not stored, inedxed (?), not tokenized etc. Codec will support CustomDataFormat and IndexReader will expose Object getCustomData(field). Can this work?

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think this is basically the approach I took in the last patch? Except I didn't add the CustomField "sugar"; instead, the plugin sees the entire Doc and can grab the fields it's interested in. I'm not sure we need CustomField, since it could be an ordinary (existing) field that this plugin wants to look at. It could also be a field that is otherwise indexed/stored anyway, and this plugin is just doing something further with it.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Hmm ... I thought the idea is to allow the app to pass some data that is not a 'field' per se. I mean, if I wanted to take advantage of some data that's in the stored fields, I could make a Codec which wraps that particular format with whatever logic I want to do "on the side"?

I think, but you are definitely the expert here, that if we go the CustomField approach we allow the app to be as flexible as it needs, without tapping into IndexingChain. It can stuff into that field whatever combination of fields it added to the document already. That field, together with CustomDataFormat, gives the app all the freedom that it needs. We already force the app to add the same field twice if e.g. it wants to add a DV field and store it. I don't think this case should be treated differently.

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

As I understand it, Lucene's faceting module uses a side-car index. If so, then if the feature proposed here is a good API then the faceting module will use it. No?

Ultimately, it would be cool to be able to expose an externally managed field as if it were DocValues, and thus any code that uses DocValues could use it without changing its code. That would be awesome. I don't know if that would be a part of this issue or follow-on that would use the API in this issue to make that happen.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

if we go the CustomField approach we allow the app to be as flexible as it needs

OK, I see: this CustomField (ExternalField maybe?) would be totally opaque to existing Lucene indexing code, and would hold an arbitrary Object which the IW chain plugin could then grab. I agree this makes it more generic! I think this makes sense?

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

As I understand it, Lucene's faceting module uses a side-car index. If so, then if the feature proposed here is a good API then the faceting module will use it. No?

It does use a side-car (taxonomy) index, so that facet labels use global ords, which makes counting/NRT reopen fast.

But, that index is global, vs this patch which adds a per-segment side-car, so it wouldn't quite fit, until/unless we change taxonomy writer/reader to work per-segment.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't understand why lucene's facet module would need:

BinaryDocValues (its the only thing using this, this data type is really useless otherwise)
something additional here on this issue.

as far as I'm concerned, one is enough. so if this one is added, please remove BinaryDocValues.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

ExternalField maybe?

I think of ExternalField as something that resides outside the index, while CustomField is part of the index. Therefore I prefer custom vs external, but that's just naming.

First, this issue may not be used by facets at all. And I agree with Robert that there's no point making two implementations for a custom data format. Today we have the payloads and BDV as enablers to encode arbitrary data into a byte[] (BDV is faster). I think that should be enough, as long as what you want is a per-document custom data.

But if you want to encode per-segment global data (e.g. a taxonomy, a graph), then BDV (or payload) are not the right API as they are per-document. Rather, I think it will be good if we have this CustomDataFormat which is completely opaque to Lucene, yet gives the app a lot of flexibility: CustomField passed on Documents (at least in my scenarios these per-document datum comprise the larger per-segment data structure) takes an Object, CustomDataFormat encodes them however it needs, and is also responsible for merging across segments, IR gives you a CustomData back. That's it. You app can then cast and work with that data however it wants. We can have the getCustomData take a field, in case you want to encode two such structures, but we don't need to at first.

If for some reason the app needs custom data per-document and cannot work with neither payloads nor BDV, then it needs to have a CustomData type that exposes per-document API. In either case, Lucene should not care what's in that data except in the indexing chain (to call the right format's API) and during merge, to invoke CustomDataFormat.merge().

I hope that's enough?

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

You raise a good point there Rob; BinaryDocValues is pretty close and might be sufficient as-is. But do we need segment based tracking hooks? Perhaps it's useful for parallel / overlay indexes that maintain docid consistency (LUCENE-4258 ?), but I don't think that needs to be centered around any particular special field. Shai's issue description points to a comment I made but it was in turn a quote of Rob. Rob & I didn't call out a need for segment level tracking; it was commit level tracking. A couple use-cases I had in mind when I made the comment are:

Storing per-document data that changes often like the number of clicks/accesses to the search result – ultimately used to influence scoring. The application's backing store would probably be an in-memory cache with occasional syncs to disk.
Storing a large per-document body text in an external data source (e.g. a DB or file system). Lucene needlessly merges stored fields which I think is quite wasteful, not to mention putting it in Lucene is redundant if you already manage it somewhere else. It's ultimately needed via Lucene's API for highlighting.

Is per-segment tracking needed for this? Or is this really about hooks to enable a parallel segment level index? I dunno.

apache / lucene

Per-segment tracking of external/side-car data [LUCENE-4658] #5723