apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.67k stars 1.03k forks source link

Rename AppendingCodec to Appending40Codec [LUCENE-4399] #5465

Closed asfimport closed 12 years ago

asfimport commented 12 years ago

In order AppendingCodec to follow Lucene codecs version, I think its name should include a version number (so that, for example, if we get to releave Lucene 4.3 with a new Lucene43Codec, there will also be a new Appending43Codec).


Migrated from LUCENE-4399 by Adrien Grand (@jpountz), resolved Oct 09 2012 Attachments: LUCENE-4399.patch (versions: 5)

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'm not sure of the need for this: we don't need to provide back compat for anything in the codecs/ module, just the default codec?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I wasn't saying i was totally against the idea btw, i was just trying to invite some discussion..., just to explain my rationale:

Personally I dont think its that great we have version numbers in our default codec, I was the one that proposed this because its simple on our end as Lucene developers to implement backwards compatibility this way: but it encourages some code duplication and stuff like that: which is in my mind, a lesser evil than conditionals inside one "mega-impl" like before, that must handle different binary formats.

And i think this is pretty contained to expert users: most people will just create an indexwriter and be oblivious to this.

As far as alternative things in codecs/, I feel it would be best to not add complication for back-compat, instead to try to keep these implementations simple. If we start adding versioning and back compat to them, then we are basically expanding our backwards compatibility commitments to N versions * M formats, which is something that requires a very big discussion.

asfimport commented 12 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

which is in my mind, a lesser evil than conditionals inside one "mega-impl" like before

Agreed.

As far as alternative things in codecs/, I feel it would be best to not add complication for back-compat, instead to try to keep these implementations simple.

I think we should do our best so that backwards-compatiblity issues in lucene-codecs don't happen because of changes in the default Lucene codec or postings format.

For example, if we release a new Lucene43PostingsFormat, we will either need to create a new Pulsing43PostingsFormat (to maintain backwards compatibility) or just replace the current Pulsing40PostingsFormat with Pulsing43PostingsFormat (if we don't care).

Maybe we could remove Pulsing40PostingsFormat and modify PulsingPostingsFormat so that it:

The only modification required if we release a new Lucene43PostingsFormat will be to change the default wrapped postings format.

I think we should be able to do this with AppendingCodec and DirectPostingsFormat too.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

It would be great to simplify Pulsing the way that you suggest!

But I have a tough time with it, it doesnt actually wrap any arbitrary postings, instead only ones that extend "PostingsBaseFormat" (and this extension is a little confusing, any ideas towards refactoring it to be cleaner I think would be helpful).

If you ask me right now, I don't remember the reason why Pulsing needs such a low-level interaction (versus just working on the higher-level e.g. termsenum apis).

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1 to fixing any wrapped PostingsFormats to save the name of what they wrapped into the index.

Pulsing currently wraps a PostingsBaseFormat because it needs to "intervene" on a term by term basis on the communication b/w the terms dict and the postings format it wraps. It would be really nice if we could wrap a PostingsFormat instead of a PostingsBaseFormat instead ... I'm just not sure how.

Separately I really don't like the name PostingsBaseFormat :) But I can't think of something better ... it's basically the PostingsFormat minus the terms dict.

asfimport commented 12 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

It would be really nice if we could wrap a PostingsFormat instead of a PostingsBaseFormat instead ... I'm just not sure how.

That would be great. This way we wouldn't need to create a SPI loader for PostingsBaseFormat in order to let Pulsing store/restore the wrapped PostingsBaseFormat.

But I can't think of something better ... it's basically the PostingsFormat minus the terms dict.

So maybe it is PostingsFormat that should be renamed (PostingsAndTermsFormat?) :-)

asfimport commented 12 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

This patch attemps to fix the problem for the Appending codec and the Direct postings format (BloomFilter already serialized the name of its wrapped postings format).

I am not fully satisfied with the way I handle AppendingCodec, suggestions are welcome.

I had a look at Pulsing but this looks complicated (to me at least)... For Pulsing, should we rather:

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I dont want to use an SPI loader for postingsbaseformat.

i dont even like postingsbaseformat :)

SPI is really heavy and we should avoid it, I think we just need to look at how we can refactor this thing so it works the way we want.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hmm I'm confused by the changes to AppendingCodec: its not really a codec-wrapper its just coded that way.

Actually the whole thing is very dependent on the implementation details of the current term dictionary right? thats the only place that currently seeks, where it changes the behavior.

so we have a number of alternatives:

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I also don't like the .dlg file etc. I think if you want to record the inner codec name you should just read/write a SegmentInfo attribute.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thinking about this more, in my opinion it could be overkill for e.g. AppendingPF to do this.

We should forget about AppendingPF and AppendingCodec and just look at what it really is: a wrapper around BlockTree.

In my opinion the easy win here is to ensure that if we change BlockTree, Appending correctly gets IndexTooOld/IndexTooNew exceptions.

asfimport commented 12 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

AppendingCodec extends Lucene40Codec and returns AppendingPF for every field.

But I think we would like it to extend Lucene43Codec when we release it and this will break back compat for Appending?

remove AppendingCodec entirely (just move to test-framework), because someone could do the above themselves.

I like this idea better. People who want an appending codec just need to extend the last Lucene4xCodec.getPostingsFormatForField and there will be no back compat issue. Maybe we should just advertise in Lucene4xCodec javadoc that this is how to obtain an appending codec.

+1 for your patch

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I like this idea better. People who want an appending codec just need to extend the last Lucene4xCodec.getPostingsFormatForField and there will be no back compat issue. Maybe we should just advertise in Lucene4xCodec javadoc that this is how to obtain an appending codec.

Right my only hesitation before here was 'exposing our guts' a bit too much. For example, this scheme breaks if in lucene 4.2 you write a fancy new default stored fields format that needs to seek-on-write.

But I think we shouldnt worry about this: we can just address it as it comes (and if a Codec makes sense then, lets deal with it).

Its like the analysis package, if we implement something as a TokenFilter thats ok. We don't necessarily need to hide ourselves by providing an Analyzer too. If we want to make it a Tokenizer later because that makes more sense, then we just do that :)

Otherwise we will have a lot of delegates and complex code for only theoretical future situations and I think we should avoid this.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1 for the patch (correctly detecting format changes & throwing TooNew/Old) and for moving AppendingCodec to test-framework.

asfimport commented 12 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I think we just need to look at how we can refactor this thing so it works the way we want.

What about having a public subclass of PostingsFormat that would enforce the decoupling of fields/terms and postings? Pulsing could only wrap instances of this subclass and would throw an exception when loading the segment if the declared delegate postings format does not extend this class. (Just thinking out loud...)

asfimport commented 12 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

@rmuir Maybe your last patch could/should be committed to 4.0?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I feel like the safety might be a good idea (even though I hate the idea of doing it now).

Would be good to see what anyone else thinks about this.

I don't like that the Codec check is currently "false sense of security" if we change BlockTree.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Updated patch for 4.1:

Personally I think i like this better.

asfimport commented 12 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+        if (indexVersion != version) {
+          throw new CorruptIndexException("mixmatched version files: " + in + "=" + version + "," + indexIn + "=" + indexVersion);
+        }

If we force the terms version and the terms index version to be the same, maybe we should merge TERMS_INDEX_VERSION_CURRENT and TERMS_VERSION_CURRENT into a single constant?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

We dont force them to be the same. But to date they have been, and its currently still the case, so i check it.

in the future its possible we might need to bump something about just the terms index file or whatever.

asfimport commented 12 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

If it makes sense to check that the versions of both files are consistent, maybe we should still merge the version numbers into a single one and bump it whenever any of the file formats changes? (similarly to Lucene40StoredFieldsWriter)

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I dont its a goal to check that they are really consistent? Really we can remove this check.

I think every file should have its own name and version number. If XYZ writer has 10 files, then it should have 10 independent codec names and version numbers.

This was a bug in this stuff before (#4695).

For the stored fields writer i'm not pedantic enough to go fix the fact it shares version numbers, but I'm just saying we really have to always look at this conceptually as a per-file thing.

Otherwise we defeat a lot of the purpose of having a codec header at all (verifying this file really is what you think it is, and the version you think it is).

asfimport commented 12 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I dont its a goal to check that they are really consistent? Really we can remove this check.

I actually started to like it. :-) Having two different version numbers in the headers of these files would mean that something went really wrong...

This was a bug in this stuff before (#4695).

I agree that codec names should never be shared, but I don't think this is a problem for version numbers.

I started this discussion because I was surprised of the indexVersion == version test, but this is a detail, I think it is nice not to seek when it is not necessary so I am +1 for this patch.

we can deprecate/remove appending codec, we should just keep its test.

+1 for moving AppendingCodec from lucene-codecs to test-framework too

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I started this discussion because I was surprised of the indexVersion == version test, but this is a detail, I think it is nice not to seek when it is not necessary so I am +1 for this patch.

Yeah, its a little funky, because of how there is only "one" seekDir, and this is protected so Appending can subclass it (we can remove this when Appending goes away).

So that one seekDir (used for both files) needs to know what should happen: I suppose it could be 'boolean append' or something to be more clear.

I didnt want to have an assert though, or it might invoke Uwe's wrath!

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

A more thorough patch for trunk. (Needs javadocs and minor cleanups etc etc, but all backwards compat and such).

I would deprecate this stuff in branch_4x still.

I would also, in branch_4x, include my original fix on this issue for not including blocktree's versioning. this itself can also be backwards compat (its just an if statement).

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

updated patch (just docs-only). If the .tip part looks confusing, its just because the current doc for the terms index format was missing any mention of DirOffset.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Any other opinions on this patch?

asfimport commented 12 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1 Maybe we should add a note in Lucene40Codec docs saying that this codec works on append-only file-systems?

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[branch_4x commit] Robert Muir http://svn.apache.org/viewvc?view=revision&revision=1396049

LUCENE-4399: Deprecate AppendingCodec

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Closed after release.