Break out StorableField from IndexableField [LUCENE-3312]

asfimport commented 13 years ago

In the field type branch we have strongly decoupled Document/Field/FieldType impl from the indexer, by having only a narrow API (IndexableField) passed to IndexWriter. This frees apps up use their own "documents" instead of the "user-space" impls we provide in oal.document.

Similarly, with #4382, we've done the same thing on the doc/field retrieval side (from IndexReader), with the StoredFieldsVisitor.

But, maybe we should break out StorableField from IndexableField, such that when you index a doc you provide two Iterables – one for the IndexableFields and one for the StorableFields. Either can be null.

One downside is possible perf hit for fields that are both indexed & stored (ie, we visit them twice, lookup their name in a hash twice, etc.). But the upside is a cleaner separation of concerns in API....

5702
- 5413

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

once we are on this here we should think about breaking out storing fields from the indexing chain. I think this could easily be a separated process such that stored fields are not written by the indexing chain but once all fields are indexed. This would make the indexing chain somewhat cleaner I think.

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

With #3384 out of the way, I've started looking into this more deeply. Changing the indexer code has not been especially difficult since there is already a clear separation in the handling of indexed and stored fields. The challenges are in the consumer / user code. So I have a couple of questions I'm hoping for some opinions on:

Due to the fact that FieldInfo is maintained per field name, if an IndexableField and StorableField are added to a Document separately but with the same name, a single FieldInfo will be created noting the field is both indexed and stored. This isn't a problem, however a lot of code used to leverage this fact to get metadata about indexed Fields using searcher.document(docId). They would retrieve all the stored fields and then see which were also indexed (and associated metadata). This seems like a bit of a hack, piggybacking stored fields to find out about their indexing attributes. So I guess it cannot continue to go forward? When you pull the StorableFields, you should only be able to access the stored value metadata?
By creating this separation, we will need some notion of a Document in index.* which provides Iterable access to both the IndexableFields and StorableFields. As such, Document itself is becoming more userland. However by letting it store Indexable and StorableFields separately, the functionality it provides (getBinaryValue for example) becomes quite verbose because it must provide an implementations of both kinds of fields. Given that Field is a userland implementation of both Indexable and StorableField, should Document work solely with Fields? or should we allow people to register both kinds of fields separately and just have a verbose set of functionality?

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Due to the fact that FieldInfo is maintained per field name, if an IndexableField and StorableField are added to a Document separately but with the same name, a single FieldInfo will be created noting the field is both indexed and stored. This isn't a problem, however a lot of code used to leverage this fact to get metadata about indexed Fields using searcher.document(docId). They would retrieve all the stored fields and then see which were also indexed (and associated metadata). This seems like a bit of a hack, piggybacking stored fields to find out about their indexing attributes. So I guess it cannot continue to go forward? When you pull the StorableFields, you should only be able to access the stored value metadata?

Right, this has been a long standing problem w/ the Document class you load at search time, ie the fields "pretend" to carry over the details from indexing. But it's buggy now, eg boost is not carried over, and the indexed bit is "global" (comes from field info) while the "tokenized" bit used to be per-doc, before #3384.

So I consider this (these indexing details are no longer available when you pull the document) a big benefit of cutting over to StorableField. Ie, its trappy today since it's buggy, so we'd be removing that trap.

By creating this separation, we will need some notion of a Document in index.* which provides Iterable access to both the IndexableFields and StorableFields. As such, Document itself is becoming more userland. However by letting it store Indexable and StorableFields separately, the functionality it provides (getBinaryValue for example) becomes quite verbose because it must provide an implementations of both kinds of fields. Given that Field is a userland implementation of both Indexable and StorableField, should Document work solely with Fields? or should we allow people to register both kinds of fields separately and just have a verbose set of functionality?

Good question... I think the userland "Field" (oal.document) should implement both IndexableField and StorableField? And then oal.document.Document holds Field instances?

Maybe we can name the new class oal.index.Indexable? It's a trivial class, just exposing .indexableFieldsIterator and .storableFieldsIterator?

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

Good question... I think the userland "Field" (oal.document) should implement both IndexableField and StorableField? And then oal.document.Document holds Field instances?

Yeah I have made Field implement both types. I have also left stored() and indexed() on Field while removing them from the respective interfaces.

Maybe we can name the new class oal.index.Indexable? It's a trivial class, just exposing .indexableFieldsIterator and .storableFieldsIterator?

Absolutely.

Thanks for your help Mike!

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

I'm almost done getting an initial patch for this, just one issue remaining - IndexDocValues. IndexDocValues can be both not indexed and not stored. Therefore when you retrieve the indexed fields and then the stored fields, you can miss some IndexDocValues. It seems to be that we might need a 3rd interface to cover these fields?

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I'm almost done getting an initial patch for this, just one issue remaining - IndexDocValues. IndexDocValues can be both not indexed and not stored. Therefore when you retrieve the indexed fields and then the stored fields, you can miss some IndexDocValues. It seems to be that we might need a 3rd interface to cover these fields?

To me it appears that we need some clarification what DocValues are. Actually, when you think about it Stored Fields and DocValues have a lot in common. A Stored Field is basically a DocValues DerefVarBytes type and maybe down the road we should think about merge those two types together. It would be nice to have only one typesafe API that can store whatever you want and based on the codec lucene would decide how to store it on disk ie. if it is a multi field container like Stored Fields are done today or if the values are split appart like DocValues does it today. For now we should try to differentiate between and InvertedField and a StoredField ie. everything which is not an InvertedField is a StoredField. The API could basically already reflect that DocValues and StoredFields are the same and simply specify a type like Store.Packed vs. Store.ColumnStride or something like that. If we do that we could also expose loading "Packed" Fields via PerDocValues and have one API for our users.

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

Just for clarification, Packed refers to the notion of a stored field? or am I lost?

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Just for clarification, Packed refers to the notion of a stored field? or am I lost?

yes, since we pack all fields together into one location and store only the offset to the first field per document. I just used that term here to differentiate, sorry for the confusion

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

Given what you say about the similarities between stored fields and DocValues and the direction we seem to be heading, I think its a good term to start using.

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

Good question... I think the userland "Field" (oal.document) should implement both IndexableField and StorableField? And then oal.document.Document holds Field instances?

Hm I'm going round in circles on this. For building and indexing a Document, having the class hold Field instances is easiest and the most clean option. However this then means we are once again providing Field instances in the Document returned by reader.document(), meaning we lose:

So I consider this (these indexing details are no longer available when you pull the document) a big benefit of cutting over to StorableField. Ie, its trappy today since it's buggy, so we'd be removing that trap.

Thoughts? :/

asfimport commented 13 years ago

Chris Male (migrated from JIRA)

So one thought would be to have a different class being returned by reader.document(), we could call it StoredDocument and it would only make access to StorableFields. I like this idea since it gets people over the hump of piggybacking Field, but it is a bw compat break. Any objections?

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I agree DocValues seem both indexed and stored, but they are closer to stored field so let's put them under StorableField.

And indeed we could impl stored fields as a DerefVarBytes doc values field, but I think we should hold off on unifying this in the APIs we are creating here?

Ie, StorableField should just have a .docValues() method and if that returns non-null value the indexer will index those doc values (likewise for .stringValue(), binaryValue(), etc.)?

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

So one thought would be to have a different class being returned by reader.document(), we could call it StoredDocument and it would only make access to StorableFields.

+1

asfimport commented 12 years ago

Chris Male (migrated from JIRA)

Back on this wagon for a bit.

Just wondering about whether we need a StorableFieldType to accompany StorableField.

At this stage I've struggling to identify candidate properties for a StorableFieldType. Options include moving the Numeric.DataType and DocValues' ValueType to the FieldType. While I sort of like this idea, it seems to have a couple of disadvantages:

Any FieldTypes passed into NumericField and IndexDocValuesField would have to have these properties set from the beginning. For both of these, this would mean it wouldn't be possible to simple initialize a field and then use one of the setters to define the Data/ValueType - they would need to be known at construction.
It separates the 'data type' away from the actual value.

If these properties were to stay on StorableField, I can't really see the need for a StorableFieldType.

asfimport commented 12 years ago

Nikola Tankovic (migrated from JIRA)

Hi folks,

I think this one could be a nice addition to my last year GSOC and I would like to take on it for this year GSOC. Mike also suggested using different class for reader.document() so this means I would also do #4964

What do you guys think?

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hi Nikola,

I think this plus #4964 sounds great! The challenge is... we need a mentor for this project... volunteers?

asfimport commented 12 years ago

Nikola Tankovic (migrated from JIRA)

Hi guys,

I submited first draft of proposal @ http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/ntankovic/30002#

I'm welcome for corrections and further proposals.

Ty!

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Michael, I would take the mentorship. We will have an IRC meeting for a short interview tomorrow (Sat, 14) on #lucene-dev at 13:00 UTC.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

We had the GSoC interview today: http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2012-04-14#l27

asfimport commented 12 years ago

Nikola Tankovic (migrated from JIRA)

I want to thank you guys for the opportunity to be part of GSOC 2012. Let the coding begin! You rock!

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Nikola, I am glad to work with you this year! I did not get the official confirmation from Google until now, but I will check out later! Once back at home (I am on business trip), we should meet this week maybe in IRC to make a plan for the following weeks!

Uwe

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Nikola, what's the plan to start with?

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

While at the Lucene Revolution Conference, Mike and me discussed about "hiding" the internal Lucene document IDs from the public search/TopDocs API. As this is very related to this issue, we may thing about implementing this.

My idea was to let TopDocs directly return the SearchDocument classes, only addressed by its slot in the TopDocs. This way, we prevent the ongoing problems with users "thinking" that Lucene internal document IDs should be stable (which they aren't).

Of course in the expert APIs (Scorer, DocIDSet, FoildCache...) we still have the internal DocIDs, but the public facing APIs (executing a query and getting TopDocs) should not use internal DocIds. Robert already opened a new issue to "resurrect the crazy Hits" class (LUCENE-4052, but with it's problems fixed).

asfimport commented 12 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I'd argue that TopDocs is also expert level and core (e.g. see TopFieldCollector.create()) hence we shouldn't hide the IDs as that would seriously be diminishing functionality.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

The TopDocs class is what IndexSearcher.search() returns, so why is it expert?

asfimport commented 12 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

The TopDocs class is what IndexSearcher.search() returns, so why is it expert?

Because it's not just a friendly wrapper around other expert level classes. Remove access to the internal ID there, and it gets extremely difficult to do some types of searches where you want the IDs. For example: the TopDocsCollector (expert level) classes return TopDocs.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

My intention wa snot to remove it completely, the idea was to make TopDocs internally hold the IndexSearcher, too, so it can return the Document instance (or what we have here after the refactoring). Returning the docId would then be "expert", but user-facing code could call "Document TopDocs.getDocument(slot)".

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

But its wrong now to put an IndexSearcher in TopDocs, as it can contain results from multiple searchers (TopDocs.merge)

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Robert, I agree about that! I am just discussing ideas, maybe create separate issue about that.

asfimport commented 12 years ago

Nikola Tankovic (migrated from JIRA)

Patch 01: Introduces StoredDocument, Patch 02: Continues along path in patch 01, introduces StorableField

asfimport commented 12 years ago

Nikola Tankovic (migrated from JIRA)

Patch 03: Core compiles, and almost all test pass.

The problem remains: should StoredDocument with StorableFields have typeInfos? So we can make new Document for another round of indexing from StoredDocuments?

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks great!

I'm not sure IndexableField should extend StorableField? Shouldn't they be separate classes? I think we should also split out a StorableFieldType from IndexableFieldType? Then, the Document class would expose two iterators (one for indexed fields, one for stored fields) and IW uses those iterators to know what fields to index/store?

should StoredDocument with StorableFields have typeInfos? So we can make new Document for another round of indexing from StoredDocuments?

I think ideally we would have only a StorableFieldType accessible in the loaded document? Ie, such that, it would in fact be safe to turn around and re-use that StoredFields instance for adding stored fields to another document your are about to index?

asfimport commented 12 years ago

Chris Male (migrated from JIRA)

I'm not sure IndexableField should extend StorableField? Shouldn't they be separate classes? I think we should also split out a StorableFieldType from IndexableFieldType? Then, the Document class would expose two iterators (one for indexed fields, one for stored fields) and IW uses those iterators to know what fields to index/store?

I agree fully with this. We really want to decouple Storable and Indexable concepts as much as possible.

asfimport commented 12 years ago

Nikola Tankovic (migrated from JIRA)

OK, I agree with decoupling. Can I make GeneralField interface with methods like "getName" and "getXXXValue"? Then, both Indexable and Storable fields can extend that?

asfimport commented 12 years ago

Chris Male (migrated from JIRA)

I'm not sure that's needed. In the end IndexableField shouldn't have any getXXXValue() methods, it should only return a TokenStream. Only StorableField would need those methods.

asfimport commented 12 years ago

Nikola Tankovic (migrated from JIRA)

OK, thank you! I was not sure about that.

asfimport commented 12 years ago

Nikola Tankovic (migrated from JIRA)

Patch 04: Status > core compiles.

This is an attempt to separate IndexableFields and StorebleFields in indexing.

I introduced oal.index.Document which holds both type of fields.

I also introduced StorableFieldType interface, StoredFieldType class.

Let me know what you think!

asfimport commented 12 years ago

Andrzej Bialecki (@sigram) (migrated from JIRA)

Comments to patch 04:

index.Document is an interface, I think for better extensibility in the future it could be an abstract class - who knows what we will want to put there in addition to the iterators...
as noted on IRC, this strong decoupling of stored and indexed content poses some interesting questions:
- since you can add multiple fields with the same name, you can now add an arbitrary sequence of Stored and Indexed fields (all with the same name). This means that you can now store parts of a field that are not indexed, and parts of a field that are indexed but not stored.
- previously, if a field was flagged as indexed but didn't have a tokenStream, its String or Reader value would be used to create a token stream. Now if you want a value to be stored and indexed you have to add two fields with the same name - one StoredField and the other an IndexedField for which you create a token stream from the value. My assumption is that StoredField-s will never be used anymore as potential sources of token streams?
maybe this is a good moment to change all getters that return arrays of fields or values to return List-s, since all the code is doing underneath is collecting them into lists and then converting to arrays?
previously we allowed one to remove fields from document by name, are we going to allow this now separately for indexed and stored fields?
minor nit: there's a grammar mistake in Field.setTokenStream(..): "TokenStream fields tokenized".

asfimport commented 12 years ago

Chris Male (migrated from JIRA)

index.Document is an interface, I think for better extensibility in the future it could be an abstract class - who knows what we will want to put there in addition to the iterators...

I'm not sure that is such a big deal. But I do think should think about the name here. We already have Document and it's going to become confusing with two different Document classes kind of doing the same thing and with document.Document implementing index.Document as well.

previously we allowed one to remove fields from document by name, are we going to allow this now separately for indexed and stored fields?

I think we need to simplify the document.Document API. I don't think it should hold Indexable/StorableField instances but instead should just hold Field instances. It is a userland kind of class and so is Field. We should make it easy for people to add the Fields that they want. If they want to have a Field which is both indexed and stored, then they can create it once and add it to Document. If they want to do it separately, then they can do that too. Since Field implements both IndexableField and StorableField, it can serve the dual purpose.

That way the API in document.Document is pretty simple and you can add and remove things as done in the past.

asfimport commented 12 years ago

Eks Dev (migrated from JIRA)

My assumption is that StoredField-s will never be used anymore as potential sources of token streams?

One case where it might make sense are scenarios where a user wants to store analyzed field (not original) and later to to read it as TokenStream. Kind of TermVector without tf. I think I remember seing great patch with indexable-storable field (with serialization and deserialization).

A user can do it in two passes, but sumetimes it is a not chep to analyze two times

asfimport commented 12 years ago

Andrzej Bialecki (@sigram) (migrated from JIRA)

We already have Document and it's going to become confusing with two different Document classes

+1 to use a better name (LuceneDocument? AbstractDocument?).

I don't think it should hold Indexable/StorableField instances but instead should just hold Field instances.

With the Field class implementing IndexableField and StorableField, and on retrieval returning a different class that implements only StorableField? Well, at least it would allow for expressing the association between consecutive stored/indexed values that we can express now when creating a Document for indexing. But the strong decoupling of stored/indexed parts of a field has its benefits too (arbitrary sequences of stored/indexed parts of fields)... and if you require a specific implementation at the level of (input) Document then you prevent users from using their own impls. of strongly decoupled sequences of StoredField/IndexedField.

I think I remember seing great patch with indexable-storable field (with serialization and deserialization).

SOLR-1535 .

asfimport commented 12 years ago

Chris Male (migrated from JIRA)

With the Field class implementing IndexableField and StorableField, and on retrieval returning a different class that implements only StorableField?

Yes, Nikola has included a StoredDocument class for that. This would prevent users from thinking they can just take a search result and pass it into being indexed. It creates a clear separation between indexing and search results.

But the strong decoupling of stored/indexed parts of a field has its benefits too (arbitrary sequences of stored/indexed parts of fields)... and if you require a specific implementation at the level of (input) Document then you prevent users from using their own impls. of strongly decoupled sequences of StoredField/IndexedField.

I agree that there are benefits to the decoupling. It's just that one of the important factors in this issue and other work in and around Document & Field is creating a cleaner API for users. I'm not sure bogging the document.Document API down with having to manage both Storable and IndexableField instances is worth it. Field is already basically a parent class with the extensive list of specializations we now have.

I'm wondering whether expert users who are using their own Storable/IndexableField impls will also want their own 'Document' impls as well, maybe to support direct streaming of fields or something. If we enforce this, then we're have a consistent policy that to use these expert interfaces, you're going to have to provide your own implementations for everything.

With all that said, I'm open to a clean API in Document that can do everything :)

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think I like this decoupling ... for normal users I don't think this makes the API harder? They still work with TextField, FloatField, StoredField, etc.? It's just that, under the hood, these sugar classes extend from the right base class (indexed or stored).

Document.add is just type overloaded, but Document.get* will get messier: we'll need getStored and getIndexed? I guess that would be simpler if Document could just store Field instances... hmm.

It would also be less invasive change for migrating from 4.0 -> 5.0 (assuming this issue is done only for 5.0...) if we didn't do the hard split.... else we need a back-compat story...

We already have Document and it's going to become confusing with two different Document classes

+1 to use a better name (LuceneDocument? AbstractDocument?).

Maybe IndexDocument? I think it's OK as an interface if we mark it @lucene.internal? This is the raw, super expert low-level that indexer uses to consume documents... it has only 2 methods, and I think for expert users it could be a hassle if we force the impl to inherit from our base class...

Should StoredDocument (returned from IR.document) be "read only"? Like you can iterate its fields, look them up, etc., but not eg remove them?

We should probably rename document.Field -> document.IndexedField and document.Field -> document.IndexedFieldType?

Also I think we should rename XXXField.TYPE_UNSTORED -> .TYPE, since in each case there's only 1 TYPE instance for that sugar field?

Separately, I think even for 4.0 we should remove XXXField.TYPE_STORED from all the sugar fields (TextField, StringField, etc.); expert users can always make a custom Indexable/Storable/Field/FieldType that both stores & indexes...

asfimport commented 12 years ago

Chris Male (migrated from JIRA)

I am all for the decoupling too, just want to thoroughly kick the tyres on this one :D I dont want another FieldType like discussion.

Document.add is just type overloaded, but Document.get* will get messier: we'll need getStored and getIndexed? I guess that would be simpler if Document could just store Field instances... hmm.

Perhaps if we just limit the API in Document we can handle this okay. We can provide the overloaded add methods, two get methods and 1 remove method.

Maybe IndexDocument? I think it's OK as an interface if we mark it @lucene.internal? This is the raw, super expert low-level that indexer uses to consume documents... it has only 2 methods, and I think for expert users it could be a hassle if we force the impl to inherit from our base class...

+1 to both the name and the handling of the interface.

Should StoredDocument (returned from IR.document) be "read only"? Like you can iterate its fields, look them up, etc., but not eg remove them?

+1 You shouldn't really need to remove fields, you can achieve that by not retrieving them in the first place

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Separately, I think even for 4.0 we should remove XXXField.TYPE_STORED from all the sugar fields (TextField, StringField, etc.); expert users can always make a custom Indexable/Storable/Field/FieldType that both stores & indexes...

I opened a new issue for this: #5173.

Nikola, before you modify any of the tests for this issue you might want to wait for me to commit & forward port to 5.0 else there will be a lot of conflicts!

asfimport commented 12 years ago

Nikola Tankovic (migrated from JIRA)

Nikola, before you modify any of the tests for this issue you might want to wait for me to commit & forward port to 5.0 else there will be a lot of conflicts!

OK, let me know when you finish!

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I am back @ office, trying to catch up.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think, given the discussions in #5173, that the document.FieldType class should implement both index.IndexableFieldType and index.StorableFieldType, so that users can add a single Field instance that's both stored and indexed ...

asfimport commented 12 years ago

Chris Male (migrated from JIRA)

Following on from that, I don't think we should expose the IndexableField / StorableField decoupling in document.Document. It should remain an easy to use class for the most common use cases. In which case I think it should only use Field instances. We can then do some work for it to meet the needs of index.IndexDocument. Having it this way means users can choose whether they want to add a single instance for both stored and indexed, or two instances.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I think as we have 2 interfaces, we should only allow the interface StorableField added to StoredDocument and vice versa. The user still can add combined fields, as both interfaces are implemented? Do I miss something?

apache / lucene

Break out StorableField from IndexableField [LUCENE-3312] #4385

5702

5413