apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.6k stars 1.02k forks source link

Should we revisit the default numeric precision step? [LUCENE-5609] #6671

Closed asfimport closed 10 years ago

asfimport commented 10 years ago

Right now it's 4, for both 8 (long/double) and 4 byte (int/float) numeric fields, but this is a pretty big hit on indexing speed and disk usage, especially for tiny documents, because it creates many (8 or 16) terms for each value.

Since we originally set these defaults, a lot has changed... e.g. we now rewrite MTQs per-segment, we have a faster (BlockTree) terms dict, a faster postings format, etc.

Index size is important because it limits how much of the index will be hot (fit in the OS's IO cache). And more apps are using Lucene for tiny docs where the overhead of individual fields is sizable.

I used the Geonames corpus to run a simple benchmark (all sources are committed to luceneutil). It has 8.6 M tiny docs, each with 23 fields, with these numeric fields:

I tested 4, 8 and 16 precision steps:

indexing:

PrecStep        Size        IndexTime
       4   1812.7 MB        651.4 sec
       8   1203.0 MB        443.2 sec
      16    894.3 MB        361.6 sec

searching:

     Field  PrecStep   QueryTime   TermCount
 geoNameID         4   2872.5 ms       20306
 geoNameID         8   2903.3 ms      104856
 geoNameID        16   3371.9 ms     5871427
  latitude         4   2160.1 ms       36805
  latitude         8   2249.0 ms      240655
  latitude        16   2725.9 ms     4649273
  modified         4   2038.3 ms       13311
  modified         8   2029.6 ms       58344
  modified        16   2060.5 ms       77763
 longitude         4   3468.5 ms       33818
 longitude         8   3629.9 ms      214863
 longitude        16   4060.9 ms     4532032

Index time is with 1 thread (for identical index structure).

The query time is time to run 100 random ranges for that field, averaged over 20 iterations. TermCount is the total number of terms the MTQ rewrote to across all 100 queries / segments, and it gets higher as expected as precStep gets higher, but the search time is not that heavily impacted ... negligible going from 4 to 8, and then some impact from 8 to 16.

Maybe we should increase the int/float default precision step to 8 and long/double to 16? Or both to 16?


Migrated from LUCENE-5609 by Michael McCandless (@mikemccand), resolved May 05 2014 Attachments: LUCENE-5609.patch

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Another test, this time on a biggish (\~1B docs) simulated timestamps over a 1 day range with msec precision (benchmark sources are in luceneutil):

indexing:

PrecStep        Size    IndexTime
       4      6.7 GB     2482 sec
       8      4.0 GB     2229 sec
      16      2.8 GB     2213 sec

Curiously index speed did not change much (this test only indexes the one LongField), but I did re-use the LongField instance across addDocument calls, and I used 12 indexing threads.

searching:

PrecStep     QueryTime         TermCount
       4      29.7 sec        1285 terms
       8      29.9 sec       11216 terms
      16      30.6 sec     1410453 terms

Much smaller slowdown in the queries as precStep is increased ... I suspect because the numbers are so "dense" and many docs have the timestamp, but still a big reduction in index size.

Net/net I think we should increase the default precStep?

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

+1, I think the 4.0 MTQ rework made a lot of difference here.

One day we should followup with re-examination of AUTOREWRITE too, but this one is much more important because of index size, etc.

asfimport commented 10 years ago

Paul Elschot (migrated from JIRA)

When the current implementation can only handle precision steps that are powers of 2, going to default 8 looks good to me.

Anyway these results make me curious about precision steps 6 as searching speed default, and 11 as indexing speed default.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

To be simple, i think a change we make has to be a multiple of 4: the current default.

This way we have easy backwards compat with existing indexes for users that just go with the default (since NumericRangeQuery has ctors that use the default precisionStep, and multiples of the index-time work at query-time).

Otherwise, it could cause a lot of confusion.

asfimport commented 10 years ago

Paul Elschot (migrated from JIRA)

Backward compat is indeed a strong point. It's also nice to return to 8.

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I would use precStep 8 for ints (Solr does this already by default). As we need a multiple of 4: Mike: Can you check number of terms and index size for precStep=12? 16 is way too big in my opinion and my tests in the past.. The overhead is not soo big as you might think. The problem is if you have an index solely of numerics. To have a real comparison, you should use something like Wikipedia and maybe add something like the lastmod date as long field. And then test Also we have lots of queries with at least up to 8 different numeric fields in parallel (half open ranges). For that there is still ahuge improvement with lower prec steps. I found out that 8 is best. 16 hurts very much, if you query multiple numeric fields anded/ored together. Also not everybody has the index completely in memory! If you have a pure in-memory index, you could theoretically also disable tries completely :-) The numeric fields are made for indexes with lots of disk/ssd IO (because you have many numeric fields combined with simple full text queries and some facets.

So please also check complex queries on really large indexes, not just simple range filters on small indexes with solely numeric fields.

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

To just explain, why you might have mutiple numeric fields and multiple queries: I have a customer with date ranges (they use precStep 8, 16 hurted much with ElasticSearch for a 100 GB index) and also for geo search here in-house (PANGAEA). If you have something like overlapping ranges, you need 2 queries with half open ranges. For example you have a date range on each document (start/end date of validity). The query on the index is also a date range and you want to find all documents that have overlapping ranges (validity range of document overlaps date range of query). In that case you need 2 half open queries (which are expensive with large precision steps). For stuff like bounding boxes in geo you might need if the bounding box of the document overlaps the bounding box of the query (Google Maps like query). Here you have 4 half open ranges, which almost always hit half of all your documents). With large precsteps this takes looooooooooong. So 8 is a good default, for my customer 16 took like 4 times as long as 8 (becausde of the half open ranges). With smaller precSteps half open ranges are very simple.

With geonames you can check this: geonames have in most cases bounding boxes assigned and you want to search with bounding boxes, too. This is my example above. And those ranges (unless you want to find all documents completely inside the query range) are always 4 half open ones each hitting half of all documents. By anding them together, you later get the real results (conjunctionscorer).

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I see your point Uwe, however we should remember that these are the defaults for all numeric fields. In other words, the field is named IntField and LongField and so on.

This is a very general thing to the user, like a primitive data type. In fact the user may not use ranges at all, forget about complex intensive geospatial half-open ones. They might just have a numeric field for some identifier, or a simple count, or whatever.

So I feel the default precisionStep should reflect this: it should make the right tradeoffs of index time and space for range query performance, keeping in mind that its just a general numeric type and the user may not even be interested in ranges at all.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

ok, here is an idea for a compromise:

This patch sets a default of 16 for 64-bit types, and 8 for 32-bit types. Its pretty simple, because there is type safety everywhere.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1 for 8/16.

asfimport commented 10 years ago

Paul Elschot (migrated from JIRA)

Going from 4 to 16 for the 64 bit types is a very large step. Wouldn't it be better to do that in more steps and only take a step from 4 to 8 now?

I think 11 is better than 12. Both have an indexing cost of 3 indexed terms for 32 bits (10/11/11 and 8/12/12 precision bits per term). 11 should be faster at searching because it involves less terms. For a single ended range, the expected number of terms for these cases is about half of:

 (2**10 + 2**11 + 2**11) < (2**8 + 2**12 + 2**12) 

Whether that difference is actually noticeable remains to be seen.

Independent of the precision step, geohashes from the spatial module might help to avoid range subqueries that have large results.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Its not a large step, its that 4 was ridiculously small before.

asfimport commented 10 years ago

Paul Elschot (migrated from JIRA)

Have a look at #2544, even 2 was considered then.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The old discussions and benchmarks are irrelevant: the execution of multiterm queries and index encoding has changed substantially since then. That's the point of changing the defaults to reflect reality, we need not quadruple users indexes needlessly anymore.

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Have a look at #2544, even 2 was considered then.

That was not really useable even at that time! The improvements in contrast to 4 were zero. It was even worse (because the term dictionary got larger, which had impact in 2.x and 3.x. At that time, I was always using 8 as precisionStep for longs and ints. The same applied for Solr. Lucene was the only one using 4 as default. ElasticSearch was cloning Lucene's standards.

I would really prefer to use 8 for both ints and longs. The change from 8 to 16 is increasing the number of terms while range query immense and the index size between 8 and 16 is not really a problem. To me it has also shown that because of the way how floats/doubles are encoded, the precision step of 8 is really good for longs. In most cases stuff never changes (like exponent), so there is exactly one term indexed for that.

With a precision step of 16 I would imagine the differences between 16 and 64 would be neglectible, too :-) The main reason for having lower precision steps are indexes were the values are equally distributed. For stuff like values clustered around some numbers, the precisionstep is irrelevant! In most cases because the way how it works, for larger shifts the indexed value is constant, so you have one or 2 terms that hit all documents and are never used by the range query..

So before changing the default, I would suggest to have a test with an index that has equally distributed numbers of the full 64 bit range.

I think 11 is better than 12

...because the last term is better used. The number of terms indexed is the same for 11 and 12 (6*11=66, 6*12=72, but 5*12=60 is too small). But unfortunately this is not a multiple of 4, so would not be backwards compatible.

I think the main problem of this issue is, that we only have one default. Sombeody never doing any ranges does not need the additional terms at all. That's the main problem. Solr is better here, as it provided 2 predefined field types, but Lucene only has one - and that is the bug.

So my proposal: Provide a 2nd field type as a 2nd default with correct documnetation, suggesting it to users, only wanting to index numeric identifiers or non-docvalues fields they want to sort on.

And second, we should do #6667 - I started with it last week, but was interrupted by NativeFSIndexCorrumpter :-) The problem is the precisionStep alltogether! We should make it an implementation detail. When constructing a NRQ, you should not need to pass it. Because of this I opened #6667, so anybody creating a NRQ/NRF should pass the FieldType to the NRQ ctor, not an arbitrary number. Then its ensured that the people use the same settings for indexing and querying.

Together with this, we should provide 2 predfined field types per data type and remove the constant from NumericUtils completely. The 2 field types per data type might be named like DEFAULT_INT_FOR_RANGEQUERY_FILEDTYPE and DEFAULT_INT_OTHERIWSE_FIELDTYPE (please choose better names and javadocs). And we should make 8 the new default, which is fully backwards compatible. And hide the precision step completely! 16 is really too large for lots of queries. And difference in index size is neglectibale, unless you have a purely numeric index (in which case you should use a RDBMS instead of an Lucene index to query your data :-) ![). Indexing time is also, as Mike discovered not a problem at all. If people don't reuse the IntField instance, its always as slow, because the TokenStream has to be recreated on every number. The number of terms is not the issue at all, sorry](). Indexing time is also, as Mike discovered not a problem at all. If people don't reuse the IntField instance, its always as slow, because the TokenStream has to be recreated on every number. The number of terms is not the issue at all, sorry)

About ElasticSearch: Unfortunately the schemaless mode of ElasticSearch always uses 4 as precStep if it detects a numeric or date type. ES should change this, but maybe have a bit more intelligent "guessing". E.g., If you index the "_id" field as an integer, it should automatically use infinite (DEFAULT_INT_OTHERIWSE_TYPE) precStep - nobody would do range queries on the "_id" field. For all standard numeric fields it should use precstep=8.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think the main problem of this issue is, that we only have one default. Sombeody never doing any ranges does not need the additional terms at all. That's the main problem. Solr is better here, as it provided 2 predefined field types, but Lucene only has one - and that is the bug.

Well, I kind of agree, but in a different way.

In my opinion the default numeric types (intfield, longfield, floatfield, doublefield) should have good defaults for general-purpose use. This includes range queries: they should work "reasonably" well out of box. Users that dont need range queries can optimize by changing to Infinity. Along the same lines, they also dont need to be super-optimized for "hardcore" esoteric uses of range queries. Thats what defaults are, just making the right tradeoffs for out-of-box use.

I would not be happy if these fields default to precisionStep=Infinity either, because thats also a bad default for general purpose use, just in the opposite direction of precisionStep=4.

I am fine with precisionStep=8 as the new default for both, but I don't think its the best idea. I think 16 for the 64-bit types are nice because its easy to understand "4 terms for each value". Today its 8 terms for each value (32-bit field), and 16 terms for each value (64-bit field).

I also think we should be able to add new types in the future (e.g. 16-bit short and half-float) and give them different defaults too. So, I don't understand the need for a "one-size-fits-all" default.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think 8/16 (4 terms for int/float and also 4 terms for long/double) is a better general purpose default than 8/8.

I think testing on randomly distributed longs is too synthetic? Most real-world data is much more restricted in practice, and those exception cases can re-tune precisionStep to meet their cases.

Indexing time is also, as Mike discovered not a problem at all. If people don't reuse the IntField instance, its always as slow, because the TokenStream has to be recreated on every number. The number of terms is not the issue at all, sorry!

Really apps should not have to re-use Field instances to get good indexing performance. In #6673 I saw big gains by "specializing" how untokenized an numeric fields are indexed, and I think we should somehow do this (separately).

But the number of terms is a problem: this increases indexing time, size, and doesn't buy that much of a speedup for searching.

asfimport commented 10 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I think testing on randomly distributed longs is too synthetic? Most real-world data is much more restricted in practice, and those exception cases can re-tune precisionStep to meet their cases.

Agreed – real-world data is definitely much more restricted in practice.

I wish the precisionStep was variable. If it were, I'd usually configure the precisionIncrement step to be 16,8,8,8,8,16 for doubles & longs. Variable prefix-tree precision is definitely a goal of #5987 in the spatial module. At the very high level, it's extremely rare to do gigantic continent-spanning queries, so at that level I'd like many cells (corresponds to a high precision step in trie numeric fields). And at the bottom levels, it's fastest to scan() instead of seek() because there is a limited amount of data once you get down low enough. So preferably fewer intermediate aggregate cells down there.

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think we should do something here for 4.9; poor defaults just hurt our users.

I'd like to do 8/16, but Uwe are you completely against this?

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I plan to commit 8/16 soon ...

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1592485 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1592485

LUCENE-5609: increase default NumericField precStep

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1592521 from @mikemccand in branch 'dev/branches/branch_4x' https://svn.apache.org/r1592521

LUCENE-5609: increase default NumericField precStep