
Column-stride fields (aka per-document Payloads) [LUCENE-1231] #2308

asfimport closed this issue 13 years ago

asfimport commented 16 years ago

This new feature has been proposed and discussed here: http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results

Currently it is possible in Lucene to store data as stored fields or as payloads. Stored fields provide good performance if you want to load all fields for one document, because this is a sequential I/O operation.

If, however, you want to load the data from one field for a large number of documents, then stored fields perform quite badly, because lots of I/O seeks might have to be performed.

A better way to do this is using payloads. By creating a "special" posting list that has one posting with a payload for each document, you can "simulate" a column-stride field. The performance is significantly better compared to stored fields, but still not optimal: for each document the freq value, which in this particular case is always 1, has to be decoded, and one position value, which is always 0, has to be loaded.
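For illustration, here is a minimal sketch of reading such a simulated column with the Lucene 2.x-era APIs; the field/term names and the fixed 4-byte value size are made-up assumptions:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermPositions;

    // Every document was indexed with a single token "$csf:v" whose payload
    // carries that document's value, so this posting list visits each doc once.
    // Note the wasted work described above: next() still decodes a freq that is
    // always 1, and nextPosition() a position that is always 0.
    static void readSimulatedColumn(IndexReader reader) throws IOException {
      TermPositions tp = reader.termPositions(new Term("$csf", "v"));
      byte[] buf = new byte[4];
      while (tp.next()) {
        tp.nextPosition();
        buf = tp.getPayload(buf, 0);
        // buf now holds the value for document tp.doc()
      }
      tp.close();
    }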

As a solution we want to add real column-stride fields to Lucene. A possible format for the new data structure could look like this (CSD stands for column-stride data; once we decide on a final name for this feature we can change this):

    CSDList            --> FixedLengthList | <VariableLengthList, SkipList>
    FixedLengthList    --> <Payload>^SegSize
    VariableLengthList --> <DocDelta, PayloadLength?, Payload>
    Payload            --> Byte^PayloadLength
    PayloadLength      --> VInt
    SkipList           --> see .frq file

We distinguish here between the fixed-length and the variable-length cases. To allow flexibility, Lucene could automatically pick the "right" data structure. This could work like this: when the DocumentsWriter writes a segment, it checks whether all values of a field have the same length. If yes, it stores them as a FixedLengthList; if not, as a VariableLengthList. When the SegmentMerger merges two or more segments, it checks whether all segments have a FixedLengthList with the same length for a column-stride field. If not, it writes a VariableLengthList to the new segment.
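A sketch of that flush-time decision; writeFixedLengthList and writeVariableLengthList are hypothetical helpers standing in for the two proposed formats:

    import java.io.IOException;

    // values[doc] holds the field's bytes for each document in the segment.
    static void writeColumn(byte[][] values) throws IOException {
      int length = values[0].length;
      boolean fixed = true;
      for (byte[] v : values) {
        if (v.length != length) {
          fixed = false;
          break;
        }
      }
      if (fixed) {
        writeFixedLengthList(values, length);  // <Payload>^SegSize; length stored once
      } else {
        writeVariableLengthList(values);       // <DocDelta, PayloadLength?, Payload> plus skip list
      }
    }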

Once this feature is implemented, we should think about making the column-stride fields updateable, similar to the norms. This will be a very powerful feature that can, for example, be used for low-latency tagging of documents.

Other use cases:

Things that need to be done here:

I would like to get this feature into 2.4. Feedback about the open questions is very welcome so that we can finalize the design soon and start implementing.


Migrated from LUCENE-1231 by Michael Busch, 8 votes, resolved Jun 09 2011. Linked issues:

asfimport commented 16 years ago

Doug Cutting (@cutting) (migrated from JIRA)

How would this compare to making the storing of position and freq optional for a field? Then one could have an indexed field with a payload or boost but with no freq (or positions, since freq is required for positions). Would that be equivalent?

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> How would this compare to making the storing of position and freq optional for a field? Then one could have an indexed field with a payload or boost but with no freq (or positions, since freq is required for positions). Would that be equivalent?

I think this would be very similar, except maybe:

This would tie into #1906, so that you could load these fields entirely in RAM, incrementally update them after a reopen, etc.

asfimport commented 16 years ago

Doug Cutting (@cutting) (migrated from JIRA)

So there are a number of features these fields would have that differ from other fields:

My question is whether it is best to bundle these together as a new kind of field, to add them as optional features of ordinary fields, or some combination. There are certain bundles that may work well together: e.g., a dense array of fixed-size, updateable binary values w/o freqs or positions. And not all combinations may be sensible or easy to implement. But most of these would also be useful à la carte, e.g., no-freqs, no-positions and (perhaps) updateable.

BTW, setTermPositions(TermPositions) and setTermDocs(TermDocs) might be a reasonable API for updating sparse fields.
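One possible reading of that suggestion, as an entirely hypothetical interface (neither method exists in Lucene):

    import java.io.IOException;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermPositions;

    // Hypothetical mutation API: replace the postings of a sparse field with
    // the supplied enumeration, loosely analogous to IndexReader.setNorm().
    public interface SparseFieldUpdater {
      void setTermDocs(TermDocs docs) throws IOException;
      void setTermPositions(TermPositions positions) throws IOException;
    }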

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Sorry, you're right: the payload is the binary data.

> So there are a number of features these fields would have that differ from other fields:

Maybe add "stored in its own file" or some such to that list. I.e., to efficiently update field X, I would think you want it stored in its own file. We would then fully write a new generation of that file whenever it had changes.

I agree it would be great to implement this as "flexible indexing", such that these are simply à-la-carte options for how the field is indexed, rather than making a new specialized kind of field that just does one of these "combinations". But I haven't wrapped my brain around what all this will entail... it's a biggie!

> BTW, setTermPositions(TermPositions) and setTermDocs(TermDocs) might be a reasonable API for updating sparse fields.

I like that!

asfimport commented 16 years ago

Michael Busch (migrated from JIRA)

I started working on this, but it shouldn't block the 2.4 release.

asfimport commented 15 years ago

Michael Busch (migrated from JIRA)

While working on this I got kind of stuck on the ingestion side. I think we need to make FieldInfos and Fields more flexible. If I introduce a column-stride field, then I also need to add something to FieldInfos. So I could add another bit indicating a column-stride field. But the other bits don't make much sense in combination with this.

Eventually we need more flexibility to utilize the flexible indexing chain anyway. We need to store which codec to use for a field. Then we could also just make a new codec for column-stride fields and maybe then we do not have to introduce a new Field API.

So I think we should try for 3.0 to overhaul the document/field/fieldinfos APIs. I have some ideas which I started hacking on during a long flight. I'll try to summarize the ideas/goals I'd have for such a new API and send them to java-dev.

asfimport commented 15 years ago

Michael Busch (migrated from JIRA)

For the search side we need an API similar to TermDocs and Payloads; let's call it ColumnStrideFieldAccessor (CSFA) for now. It should have next(), skipTo(), doc(), etc. methods. However, the way TermPositions#getPayload() currently works is that it always forces you to copy the bytes from the underlying IndexInput into the payload byte[] array. Since we usually use a BufferedIndexInput, this is then an arraycopy from BufferedIndexInput's buffer array into the byte array.

I think to improve this we could allow users to call methods like readVInt() directly on the CSFA. So I was thinking about adding DataInput and DataOutput as superclasses of IndexInput and IndexOutput. DataIn(Out)put would implement the different read and write methods, whereas IndexIn(Out)put would only implement methods like close(), seek(), getFilePointer(), length(), flush(), etc.
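A sketch of that split (names follow the comment; a DataInput/DataOutput pair along these lines did later land in org.apache.lucene.store). The readVInt() body shows Lucene's standard VInt decoding:

    import java.io.Closeable;
    import java.io.IOException;

    // DataInput owns the decoding logic...
    abstract class DataInput {
      public abstract byte readByte() throws IOException;

      public int readVInt() throws IOException {
        byte b = readByte();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
          b = readByte();
          i |= (b & 0x7F) << shift;
        }
        return i;
      }
    }

    // ...while IndexInput adds only the file-oriented operations.
    abstract class IndexInput extends DataInput implements Closeable {
      public abstract void close() throws IOException;
      public abstract long getFilePointer();
      public abstract void seek(long pos) throws IOException;
      public abstract long length();
    }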

So then CSFA would extend DataInput, or alternatively have a getDataInput() method. The danger here compared to the current payloads API would be that the user might read too few or too many bytes of a CSF, which would result in undefined and possibly hard-to-debug behavior. But we could offer e.g.:

    static ColumnStrideFieldsAccessor getAccessor(ColumnStrideFieldsAccessor in, Mode mode) {
      if (mode == Mode.Fast) {
        return in;
      } else if (mode == Mode.Safe) {
        return new SafeAccessor(in);
      }
      throw new IllegalArgumentException("unknown mode: " + mode);
    }

The SafeAccessor would count the number of bytes read for you and throw exceptions if you don't consume the number of bytes you should consume. This is of course overhead, but users could use the SafeAccessor until they're confident that everything works fine in their system, and then switch to the fast accessor for better performance.
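A hypothetical sketch of that wrapper; ColumnStrideFieldsAccessor and its payloadLength() method are assumptions based on the description above:

    import java.io.IOException;

    // Wraps the raw accessor and verifies that each document's payload is
    // consumed exactly before advancing to the next document.
    class SafeAccessor extends ColumnStrideFieldsAccessor {
      private final ColumnStrideFieldsAccessor in;
      private long consumed;

      SafeAccessor(ColumnStrideFieldsAccessor in) {
        this.in = in;
      }

      public byte readByte() throws IOException {
        consumed++;
        return in.readByte();
      }

      public boolean next() throws IOException {
        if (consumed != in.payloadLength()) {  // payloadLength(): bytes stored for the current doc
          throw new IllegalStateException(
              "consumed " + consumed + " of " + in.payloadLength() + " payload bytes");
        }
        consumed = 0;
        return in.next();
      }
    }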

If there are no objections I will open a separate JIRA issue for the DataInput/Output patch.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Eventually we need more flexibility to utilize the flexible indexing chain anyway. We need to store which codec to use for a field. Then we could also just make a new codec for column-stride fields and maybe then we do not have to introduce a new Field API.

By creating a custom indexing chain you could actually write CSFs today.

But the lack of extensibility of Field needs to be addressed: you need some way to store something arbitrary & opaque in a field such that your indexing chain can pick it up and act on it.

And FieldInfos also needs a "store this opaque thing for me" API.

One of the big changes in #2532 is to strongly separate different fields in the read APIs. E.g., there is a separate FieldsEnum from TermsEnum, meaning you first seek to the field you want, then request a TermsEnum from it, which iterates through the terms only for that field. It's the codec's job to return the right TermsEnum for a given field.
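A sketch of that shape, given some flex-enabled reader; the names follow the comment, and the exact signatures are illustrative assumptions, not a final API:

    // Seek to the field first, then enumerate only that field's terms; the
    // codec behind fields() decides which TermsEnum implementation comes back.
    FieldsEnum fields = reader.fields();
    String field;
    while ((field = fields.next()) != null) {
      if (field.equals("price")) {
        TermsEnum terms = fields.terms();  // terms for "price" only
        while (terms.next() != null) {
          // per-term work for this field
        }
      }
    }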

Not to delay 2.9 further, but... I also wonder how it would simplify things here if Lucene had NumericField (say). E.g., today, if I have a field "weight" that is a float, I'm going to have to set something to tell the CSF writer (man, the similarity of that to CFS is going to cause problems!) to cast-it-and-save-it-as-float-array to disk; I'm going to have to tell the TrieRangeUtil to do the same, etc. It'd be much better if that field stored a float (not a String), and if it defaulted "naturally" to using these two special indexers...

> DataIn(Out)put would implement the different read and write methods, whereas IndexIn(Out)put would only implement methods like close(), seek(), getFilePointer(), length(), flush(), etc.

What is the fastest way in Java to slurp in a bunch of bytes as an int[], short[], float[], etc.? Seems that we need to answer that first and then work out how to fix our store APIs to enable it. (Maybe it's an IntBuffer wrapping a ByteBuffer, instead of an int[]?)
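For the int[] case, the IntBuffer-wrapping-ByteBuffer option mentioned above looks like this with plain java.nio (a sketch, assuming big-endian bytes):

    import java.nio.ByteBuffer;
    import java.nio.IntBuffer;

    // Bulk-decode bytes into an int[] without a per-value method call.
    static int[] bytesToInts(byte[] raw) {
      IntBuffer src = ByteBuffer.wrap(raw).asIntBuffer();
      int[] dst = new int[src.remaining()];
      src.get(dst);  // single bulk transfer
      return dst;
    }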

> The danger here compared to the current payloads API would be that the user might read too few or too many bytes of a CSF, which would result in undefined and possibly hard-to-debug behavior.

I think it's better to have good performance with an added risk of danger than forced handholding always.

> The SafeAccessor would count the number of bytes read for you and throw exceptions if you don't consume the number of bytes you should consume.

I generally prefer liberal use of asserts to trip bugs like this, instead of explicit, strongly divorced code paths / classes / modes, etc., containing real if statements at production runtime.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

One interesting idea, from Earwin's comment here:

https://issues.apache.org/jira/browse/LUCENE-1584?focusedCommentId=12696583&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12696583

is for column-stride fields to be a possible source for IndexReader.document() (or whatever it becomes). I.e., when possible, that method should maybe pull values from CSFs.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

+1 Making it automatic makes sense.

asfimport commented 15 years ago

Michael Busch (migrated from JIRA)

> is for column-stride fields to be a possible source for IndexReader.document() (or whatever it becomes). I.e., when possible, that method should maybe pull values from CSFs.

Hmm, I'm not sure about this. The Lucene store is optimized for the case where ALL stored fields for a document need to be loaded. E.g., to render search results you usually need to load multiple fields; then only one disk seek is needed to get to the right document, and loading the fields is just one sequential I/O.

If you e.g. want to show 5 fields per search result, you need to perform 1 seek per doc using stored fields, but will need 5 seeks with CSFs.

My general rule of thumb is to use payloads/CSFs for data you need during hit collection (e.g. a UID field, or the norms, which are essentially a CSF of fixed size), but ordinary stored fields to load data for rendering search results.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> If you e.g. want to show 5 fields per search result, you need to perform 1 seek per doc using stored fields, but will need 5 seeks with CSFs.

If the CSFs have been loaded into the FieldCache then it's fast. If they haven't, I agree (one seek per doc).

Anyway, this would just be "sugar": a single "retrieve doc" API that'd pull from the FieldCache and from stored fields, and merge. E.g., if I store body text as a stored field, but then have a bunch of small "metadata" fields (price, weight, manufacturer, etc.) stored in the FieldCache (loaded via CSF), it'd be convenient to call document() somewhere and get all fields back.

asfimport commented 15 years ago

Earwin Burrfoot (migrated from JIRA)

I can share my design for doc loading, if anybody needs it:

    public interface FieldCache {
      DocLoader loader(FieldInfo<?>... fields);
      // ...
    }

    public interface DocLoader {
      void load(Doc doc);
      <T> T value(FieldInfo<T> field);
    }

Doc is my analog of ScoreDoc; for these purposes it is completely identical. FieldInfos are constants defined like UserFields.EMAIL; they hold the field's type, its name, its indexing method, whether it is cached or not, and the way it is cached. Two synthetic fields exist, LUCENE_ID and SCORE, which allow the same API to be used for anything field-related.
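For illustration, such constants might look like the following; all names and the Index/Cache enums are hypothetical, and FieldInfo here is Earwin's typed descriptor, not Lucene's org.apache.lucene.index.FieldInfo:

    public final class UserFields {
      // name, value type, indexing method, caching policy (all hypothetical)
      public static final FieldInfo<String> EMAIL =
          new FieldInfo<String>("email", String.class, Index.NOT_ANALYZED, Cache.EAGER);
      public static final FieldInfo<Long> ID =
          new FieldInfo<Long>("id", Long.class, Index.NOT_ANALYZED, Cache.EAGER);

      private UserFields() {}
    }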

Workflow looks like this:

    // I create a loader. Fields are checked against the cache; for those that
    // aren't cached I create a FieldSelector.
    loader = searcher.fieldCache().loader(concat(payloadFields, ID, DOCUMENT_TYPE, sortBy.field));

    // Then, for each document I'm going to send in response to the search request,
    // I select this document. An indexReader.document(fieldSelector) happens here
    // if there are any uncached fields.
    loader.load(doc);

    // Then I extract the values I need. Cached ones arrive from the cache;
    // uncached ones are decoded from the Document retrieved in the previous step.
    hit = new Hit(loader.value(ID), loader.value(DOCUMENT_TYPE), loader.value(sortBy.field)); // etc, etc

Having a single API to retrieve values regardless of the way they are stored/cached is very handy. Loading a mix of stored/column-stride (if I correctly understand what they are) fields is pointless; you're more likely to lose performance than to gain it. Loading a mix of cached/uncached fields is a massive win, and it becomes even more massive if all required fields happen to be cached.

asfimport commented 15 years ago

Marvin Humphrey (migrated from JIRA)

FWIW, I think the priority for document fetching should be to optimize for search clusters where you have separate index servers and document/excerpt servers. Under such a system, it's clear that you'd want to use the standard stored fields.

To my mind, column-stride fields are more of a search tool – an extension of the field caches and the "flexible indexing" concept – and I don't think the design should be encumbered or deployment slowed so that they can perform double duty.

However, I think column-stride fields may be more of an imperative for Lucy than Lucene. In Lucy, our sort caches will be either column-stride or variable-width and will be written out at index time – mmap a one-value-per-doc column-stride file, and voila: instant sort cache.

> Loading a mix of cached/uncached fields is massive win

I don't quite understand why that would be the case. Presuming a cold OS cache, the big cost is the .fdt file disk seek. Once you're there, how much of a difference is it to read the field off disk as opposed to reading it from the cache?

> it becomes even more massive if all required fields happen to be cached.

I can see that. However, the gain won't be all that significant for systems where the index fits into RAM, or when the persistent storage device is an SSD. And of course a different caching strategy altogether (popular document caching) is best for dedicated doc servers.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> To my mind, column-stride fields are more of a search tool - an extension of the field caches and the "flexible indexing" concept

I don't think of CSFs as an extension of FieldCache; I think of them as an alternate (much more efficient than uninversion) underlying store that the FieldCache can use to retrieve values. I.e., the way you'll access a CSF in Lucene will be through the [new in LUCENE-831] FieldCache API.
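In code, the access pattern would stay exactly what FieldCache offered at the time; only the backing source changes. A sketch, given an open IndexReader named reader:

    import org.apache.lucene.search.FieldCache;

    // Same call either way: today this uninverts the "weight" field's postings;
    // backed by a CSF it could load a pre-built column straight off disk.
    float[] weights = FieldCache.DEFAULT.getFloats(reader, "weight");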

> However, I think column-stride fields may be more of an imperative for Lucy than Lucene. In Lucy, our sort caches will be either column-stride or variable-width and will be written out at index time - mmap a one-value-per-doc column-stride file, and voila: instant sort cache.

I think column vs. row stride, and fixed vs. variable width storage, are orthogonal issues?

E.g., column-stride fields (in Lucene) could very well use variable-width storage (e.g., for storing String values).

> I can see that. However, the gain won't be all that significant for systems where the index fits into RAM, or when the persistent storage device is an SSD.

SSDs, as fast as they are, are still orders of magnitude slower than RAM (assuming of course the OS hasn't swapped your RAM out to your SSD ;).

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

This has been implemented in #4181, #4009, and #3244; moving LUCENE-1231 out.