apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Make CompressingStoredFieldsFormat the new default StoredFieldsFormat impl [LUCENE-4509] #5575

Closed: asfimport closed this issue 11 years ago

asfimport commented 11 years ago

What would you think of making CompressingStoredFieldsFormat the new default StoredFieldsFormat?

Stored fields compression has many benefits:

Things to know:

I think we could use the same default parameters as in CompressingCodec:

Any concerns?


Migrated from LUCENE-4509 by Adrien Grand (@jpountz), 1 vote, resolved Nov 13 2012. Attachments: LUCENE-4509.patch (versions: 2)

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I am a strong +1 for this idea.

I only have one concern, about the defaults. How would this work with laaaarge documents (e.g., those massive HathiTrust book documents) that might be > 16KB in size?

Does this mean that with the default CompressingStoredFieldsIndex setting one now pays 12 bytes/doc in RAM (because docsize > blocksize)? If so, let's think of ways to optimize that case.

asfimport commented 11 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Nice timing Adrien... I was just going to ask how we could most easily enable this in Solr (or whether it should in fact be the default).

One data point: 100GB of compressed stored fields == 6.25M index entries == 75MB RAM. That seems decent for a default.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think it's OK too. I just didn't know if we could do something trivial like storing the offsets-within-the-blocks as packed ints, so that it optimizes for this case anyway (offset=0) and only takes 8 bytes + 1 bit instead of 12 bytes.

But I don't have a real understanding of what this thing does when docsize > blocksize; I haven't dug in that much.

In any case I think it should be the default: it's fast and also works for tiny documents with lots of fields. I think people expect the index to be compressed in some way, and the stored fields are really wasteful today.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'd say to make progress for the default we want to look at:

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

How would this work with laaaarge documents that might be > 16KB in size?

Actually 16 kB is the minimum size of an uncompressed chunk of documents. CompressingStoredFieldsWriter fills a buffer with documents until its size is >= 16 kB, compresses it, and then flushes it to disk. If all documents are greater than 16 kB then all chunks will contain exactly one document.

It also means you could end up having a chunk that is made of 15 documents of 1 kB and 1 document of 256 kB. (And in this case there is no performance problem for the first 15 documents, given that uncompression stops as soon as enough data has been uncompressed.)
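For illustration, here is a minimal sketch of that buffering behaviour. It is not the actual CompressingStoredFieldsWriter code; the class, buffer, and flushChunk() are made up, and only the 16 kB trigger comes from the description above:

```java
// A simplified illustration of the chunking behaviour described above; this is
// not CompressingStoredFieldsWriter, just the flush trigger it describes.
import java.io.ByteArrayOutputStream;
import java.io.IOException;

class ChunkingSketch {
  static final int CHUNK_SIZE = 16 * 1024; // flush once the uncompressed buffer reaches 16 kB

  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private int docsInChunk = 0;

  void addDocument(byte[] serializedDoc) throws IOException {
    buffer.write(serializedDoc);
    docsInChunk++;
    if (buffer.size() >= CHUNK_SIZE) { // a single document larger than 16 kB forms a chunk on its own
      flushChunk();
    }
  }

  private void flushChunk() {
    // Here the real writer would LZ4-compress buffer.toByteArray(), append it to
    // the fields data (.fdt) file, and record the chunk's first doc ID and start
    // pointer in the chunk index.
    buffer.reset();
    docsInChunk = 0;
  }
}
```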

Does this mean with the default CompressingStoredFieldsIndex setting that now he pays 12-bytes/doc in RAM (because docsize > blocksize)? If so, lets think of ways to optimize that case.

Probably less than 12. The default CompressingStoredFieldsIndex impl uses two packed ints arrays of size numChunks (the number of chunks, <= numDocs). The first array stores the doc ID of the first document of the chunk while the second array stores the start offset of the chunk of documents in the fields data file.

So if your fields data file is fdtBytes bytes, the actual memory usage is ~ numChunks * (ceil(log2(numDocs)) + ceil(log2(fdtBytes))) / 8 bytes.

For example, if there are 10M documents of 16 kB each (fdtBytes ~= 160GB), we'll have numChunks == numDocs and a memory usage per document of (24 + 38) / 8 = 7.75 bytes => ~77.5 MB of memory overall.
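To make that arithmetic concrete, a tiny sketch of the estimate (just the formula from this comment, not actual Lucene code; the class and method names are made up):

```java
// Rough RAM estimate for the chunk index, following the formula above:
// numChunks * (ceil(log2(numDocs)) + ceil(log2(fdtBytes))) / 8 bytes.
final class ChunkIndexRamEstimate {
  static long estimateBytes(long numChunks, long numDocs, long fdtBytes) {
    int docIdBits = 64 - Long.numberOfLeadingZeros(numDocs - 1);   // ceil(log2(numDocs))
    int offsetBits = 64 - Long.numberOfLeadingZeros(fdtBytes - 1); // ceil(log2(fdtBytes))
    return numChunks * (docIdBits + offsetBits) / 8;
  }

  public static void main(String[] args) {
    // 10M documents of 16 kB each, fdtBytes ~= 160 GB, numChunks == numDocs:
    long bytes = estimateBytes(10_000_000L, 10_000_000L, 160L * 1024 * 1024 * 1024);
    System.out.println(bytes); // prints 77500000, i.e. ~77.5 MB as in the example above
  }
}
```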

100GB of compressed stored fields == 6.25M index entries == 75MB RAM

Thanks for the figures, Yonik! Did you use RamUsageEstimator to compute the amount of used memory?

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

But if we worry about this worst case (numDocs == numChunks), maybe we should just increase the chunk size (for example, ElasticSearch uses 65 kB by default).

(Another option would be to change the compress+flush trigger to something like: chunk size >= 16 kB AND number of documents in the chunk >= 4.)
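In terms of the writer sketch shown earlier, that alternative trigger would just be a combined condition (hypothetical, using the threshold of 4 documents suggested here):

```java
// Hypothetical alternative flush trigger: require both enough buffered bytes and
// at least 4 documents, so that very large documents no longer produce one chunk
// per document (which is what inflates the number of chunk index entries).
if (buffer.size() >= CHUNK_SIZE && docsInChunk >= 4) {
  flushChunk();
}
```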

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Well, you say you use a separate packed ints structure for the offsets, right? So these would all be zero?

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Should we s/uncompression/decompression/ across the board?

If decompression sounds better, let's do this!

Here is some scary stuff (literal decompressions etc.) uncovered by the Clover report. We should make sure any special cases are tested.

I can work on it next week.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Well, you say you use a separate packed ints structure for the offsets, right? So these would all be zero?

These are absolute offsets in the fields data file. For example, when looking up a document, it first performs a binary search in the first array (the one that contains the first document IDs of every chunk). The resulting index is used to find the start offset of the chunk of compressed documents thanks to the second array. When you read data starting at this offset in the fields data file, there is first a packed ints array that stores the uncompressed length of every document in the chunk, and then the compressed data. I'll add file format docs soon...
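To visualize that lookup, a rough sketch of the two-array structure described above (simplified; the class and array names are made up, and the real implementation stores both arrays as packed ints rather than plain long[]):

```java
// Simplified view of the lookup described above: binary-search the array of
// first doc IDs to find the chunk containing a document, then use the second
// array to get the chunk's start offset in the fields data file.
final class ChunkLookupSketch {
  private final long[] firstDocIds;   // doc ID of the first document of each chunk
  private final long[] startPointers; // start offset of each chunk in the fields data file

  ChunkLookupSketch(long[] firstDocIds, long[] startPointers) {
    this.firstDocIds = firstDocIds;
    this.startPointers = startPointers;
  }

  long chunkStartPointer(int docId) {
    int lo = 0, hi = firstDocIds.length - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (firstDocIds[mid] <= docId) {
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    // hi is now the last chunk whose first doc ID is <= docId.
    return startPointers[hi];
    // At this offset the chunk stores a packed ints array with the uncompressed
    // length of every document, followed by the compressed documents.
  }
}
```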

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

No, I'm referring to the second packed ints structure (start offset within a block)

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Committed:

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think Adrien accidentally resolved the wrong issue :)

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Here is a patch that adds a new Lucene41StoredFieldsFormat class with file format docs.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I forgot to say: oal.codecs.compressing needs to be moved to lucene-core before applying this patch.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Do we know what's happening with the recent test failure?

ant test -Dtestcase=TestCompressingStoredFieldsFormat -Dtests.method=testBigDocuments -Dtests.seed=37812FE503010D20 -Dtests.multiplier=3 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/hudson/lucene-data/enwiki.random.lines.txt -Dtests.locale=es_PR -Dtests.timezone=America/Sitka

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I think I abuse atLeast to generate document sizes, and because the test ran with tests.multiplier=3 and tests.nightly=true, documents got too big, hence the OOME. I'll commit a fix shortly.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

In the fdt we write the docBase of the first document in the chunk: can you explain why this is needed?

We already redundantly write this in the fdx, right? (Or in the DISK_DOC case it's implicit.)

It seems to me in visitDocument() we should be getting the docBase and startPointer too from the index, since it knows both.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Actually I guess we don't know it for DISK_DOC. But it seems unnecessary for MEMORY_CHUNK?

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Right, the docBase could be known from the index with MEMORY_CHUNK, but on the other hand duplicating the information helps validate that we are at the right place in the fields data file (there are corruption tests that use this docBase). Given that the chunk starts with a docBase and the number of docs in the chunk, it gives the range of documents it contains. The overhead should be very small given that this VInt is repeated at most once per chunk (the compressed form of at least 16 kB of documents). But I have no strong feeling about it; if you think we should remove it, then let's do it.
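As a concrete illustration of that check, reading a chunk would start with something along these lines (a hedged sketch; the class, the method, and the plain readInt calls are illustrative, not the actual format code):

```java
import java.io.DataInput;
import java.io.IOException;

// Sketch of the per-chunk sanity check discussed above: the chunk header in the
// fields data file repeats the docBase and the number of documents, so the
// reader can verify it is positioned at the expected chunk.
final class ChunkHeaderCheck {
  static int readAndCheckHeader(DataInput fieldsData, int expectedDocBase) throws IOException {
    int docBase = fieldsData.readInt();   // the real format writes these as VInts
    int chunkDocs = fieldsData.readInt();
    if (docBase != expectedDocBase || chunkDocs < 0) {
      throw new IOException("corrupted chunk: expected docBase=" + expectedDocBase
          + " but got docBase=" + docBase + ", chunkDocs=" + chunkDocs);
    }
    return chunkDocs; // the chunk covers documents [docBase, docBase + chunkDocs)
  }
}
```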

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't feel strongly about it either... was just reading the docs and noticed the redundancy.

But you are right: it's just per-chunk anyway. And I like the corruption check...!

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Updated file format docs, you need to move lucene/codecs/src/java/org/apache/lucene/codecs/compressing to lucene/core/src/java/org/apache/lucene/codecs in addition to applying the patch.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Docs look good, +1 to commit.

A few suggestions:

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Thanks Robert for your comments, I replaced "documents" with "individual documents" and added a link to the protobuf docs.

Committed:

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[branch_4x commit] Adrien Grand http://svn.apache.org/viewvc?view=revision&revision=1416082

Move oal.codec.compressing tests from lucene/codecs to lucene/core (should have been done as part of LUCENE-4509 when I moved the src folder).

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[branch_4x commit] Adrien Grand http://svn.apache.org/viewvc?view=revision&revision=1408796

LUCENE-4509: Enable stored fields compression in Lucene41Codec (merged from r1408762).

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[branch_4x commit] Adrien Grand http://svn.apache.org/viewvc?view=revision&revision=1404276

LUCENE-4509: New tests to try to break CompressingStoredFieldsFormat... (merged from r1404275)

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[branch_4x commit] Adrien Grand http://svn.apache.org/viewvc?view=revision&revision=1403032

LUCENE-4509: improve test coverage of CompressingStoredFieldsFormat (merged from r1403027).