apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.7k stars 1.04k forks source link

Refactor SegmentInfo / FieldInfo to make them extensible [LUCENE-4055] #5127

Closed asfimport closed 12 years ago

asfimport commented 12 years ago

After #5122 is done the resulting SegmentInfo / FieldInfo classes should be made abstract so that they can be extended by Codec-s.


Migrated from LUCENE-4055 by Andrzej Bialecki (@sigram), resolved May 28 2012 Attachments: LUCENE-4055.patch Linked issues:

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

rough plan:

asfimport commented 12 years ago

Andrzej Bialecki (@sigram) (migrated from JIRA)

+1. Per commit information could be named CommitInfo (which it really is). I like SegmentMetadata - if we left SegmentInfo as a name it would be confusing with its current functionality.

Introspection could return Map<String,Object> to avoid converting e.g. numeric values back and forth.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Branch location for this issue and #5122: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene4055

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Just some updates from the work in the branch (scary changes but proceeding nicely since Mike jumped in and did a lot of it). Here's a list of the current progress:

not yet done:

I'm also probably missing a few other things. In general I'm pretty happy with the "metadata" key-value attributes api versus subclassing.

I tried to make subclassing work, but subclassing turned really ugly fast and made various codec components too tightly-coupled, e.g. if someone wants to combine a CompressedStoredFields with a PerFieldPostingsFormat and SpecialTermVectors, what would the impls be :).

So the overly simple Map<String,String> avoids these issues, and hey its just metadata after all so I don't think anything more complex is really needed.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The last nocommits are cleared up: I think the branch is ready to be merged.

Attached is a patch showing the differences between trunk and branch.

asfimport commented 12 years ago

Andrzej Bialecki (@sigram) (migrated from JIRA)

+1, this looks very good.

One comment re. SegmentInfoPerCommit. This class is not extensible and contains a fixed set of attributes. In #4910 this or similar place would be the ideal mechanism to carry info about stacked segments, since this information is specific to a commit point. Unfortunately, there are no Map<String,String> attributes on this level, so I guess for now this type of aux data will have to be put in SegmentInfos.userData even though it's not index global but segment-specific.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Right, but I think this is correct: the codec should be responsible for encode/decode of inverted index segments only (the whole problem here originally was trying to have it also look after commits).

So it really shouldn't be customizing things about the commit, as that creates a confusing impedance mismatch.

I think things like stacked segments in #4910 that need to do things other than implement encoding/decoding of segment should be above the codec level: since its a separate concern, if someone wants to have updatable fields thats unrelated to the integer compression algorithm used or what not.

asfimport commented 12 years ago

Andrzej Bialecki (@sigram) (migrated from JIRA)

stacked segments in #4910 that need to do things other than implement encoding/decoding of segment should be above the codec level ..

Certainly, that's why it would make sense to put this extended info in SegmentInfoPerCommit and not in any file handled by Codec. My comment was about the lack of easy extensibility of the codec-independent per-segment data (SegmentInfoPerCommit - info about stacked data is per-segment and per-commit), so #4910 will need to use for now the codec-independent index-global data (SegmentInfos). It's not ideal but not a deal breaker either, especially since we now have version info in both of these places.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

My comment was about the lack of easy extensibility of the codec-independent per-segment data (SegmentInfoPerCommit - info about stacked data is per-segment and per-commit), so #4910 will need to use for now the codec-independent index-global data (SegmentInfos).

Well SegmentInfos is just the list of SegmentInfoPerCommit, so I think in the case of #4910 we would just place this data alongside the only other per-segment-per-commit data: deletes (e.g. we would add something like updatesGen and updatesCount or whatever).

Sure, we have to bump the file header and what not, but the file is so simple now in the sense its just a list of segment names, the codec to decode them, with their deletes, that it wouldn't be a big deal.

And I think per-segment-per-commit data is pretty rare: we only have deletes, and in the future updates, but I can't imagine lots of other stuff belonging in this category (versions the per-segment metadata in SI which has been pretty volatile in the past).

asfimport commented 12 years ago

Renaud Delbru (migrated from JIRA)

Does this patch allows TermsHashConsumerPerField to be dependent to the Codec/Field ? I have a use case where I need my own TermsHashConsumerPerField for certain fields.

At the moment, the only extension of TermsHashConsumer is FreqProxTermsWriter which is creating a FreqProxTermsWriterPerField for every field (see FreqProxTermsWriter#addField()).

A possible solution of my problem would be to have a PostingsTermsWriter (extension of TermsHashConsumer) that will create a specific PostingsTermsWriterPerField (extension of TermsHashConsumerPerField) depending on the codec information found in the FieldInfo object.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Renaud, what you are talking about is not really related to this issue: this issue is about allowing codecs to add metadata to SegmentInfo and FieldInfo.

what you are talking about is the indexing chain, if you want to customize that, just set a custom one on IndexWriterConfig.

asfimport commented 12 years ago

Renaud Delbru (migrated from JIRA)

Hi Robert,

sorry if it seemed a bit out of context, but I am trying to understand how to properly do it.

Indeed, I can create my own indexing chain which includes my TermsHashConsumer customisation. However, I would still need the codec metadata for every field. But from what you told me, it seems that this codec specific metadata could be now added to FeildInfo. Is that correct ?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

However, I would still need the codec metadata for every field.

What codec metadata? this change only allows codecs to add additional things to fieldinfo/segmentinfo so they can later be read when the segment is opened. E.g. a CompressingStoredFieldsWriter could put in the segmentinfo an additional key-value like "CompressingStoredFieldsWriter.algorithm=deflate"

These are private to that component basically: i don't understand why your indexing chain would care about this? its at a level above all that.

asfimport commented 12 years ago

Renaud Delbru (migrated from JIRA)

What codec metadata?

Metadata that indicates which codec is used for a particular field.

Let say I want to have a specific TermsHashConsumerPerField depending on the codec used by a field. For example, for field A and field B which use the Lucen40 codec, we need to use the FreqProxTermsWriterPerField. And for field C that uses my own specific codec, I need to use the MyOwnTermsWriterPerField.

My current understanding tells me that to do this, the only way is to customise the IndexingChain with a new TermsHashConsumer that overrides the method TermsHashConsumer#addField(TermsHashPerField termsHashPerField, FieldInfo fieldInfo). This method addField will be able to instantiate the correct TermsHashConsumerPerField if and only if there is codec metadata in the FieldInfo parameter. That's why I am interested of using a customised FieldInfo to store codec-related metadata about a field.

Or is there a better way to get codec-related information about a field in the IndexingChain ?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Codecs are not per-field, they encode the entire inverted index segment.

asfimport commented 12 years ago

Renaud Delbru (migrated from JIRA)

Sorry, I meant PostingsFormat instead of Codec.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Well you can do postingsFormat instanceof PerFieldPostingsFormat + postingsFormat.getPostingsFormatForField if you really want.

But keep in mind PerFieldPostingsFormat is not really special and just one we provide for convenience, obviously one could write their own PostingsFormat that implements the same thing in a different way.

asfimport commented 11 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

@mikemccand what's the replacement strategy for IndexFileNameFilter? I'm updating some old code from 3.x to 4.3 (MAHOUT-944) and not sure on what the equivalent approach is, or whether I even need it. (Note, I'm still trying to figure out whether the patch for MAHOUT-944 is even the best way to do what it is trying to do, but I want to at least get it compiling first)

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hi @gsingers, I think you can use the IndexFileNames.CODEC_FILE_PATTERN in your filter? You may need to add in segments_N and segments.gen as well ...

asfimport commented 11 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Hmm, Mike, CODEC_FILE_PATTERN is package access only. Easy enough to replicate/fix, any reason not too?

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hmm looks like it's package private in 4.3 but is (will be) public in 4.x/trunk. Just replicate for now :)