apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Allow composite readers to have more than 2B documents [LUCENE-8321] #9368

Open asfimport opened 6 years ago

asfimport commented 6 years ago

I would like to start discussing removing the limit of ~2B documents that we have for indices, while still enforcing it at the segment level for practical reasons.

Postings, stored fields, and all other codec APIs would keep working on integers to represent doc IDs. Only top-level doc IDs and numbers of documents would need to move to a long. I say "only" because we now mostly consume indices per-segment, but there are still a number of places where we identify documents by their top-level doc ID, such as IndexReader#document, top-docs collectors, etc.
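To make the idea concrete, here is a minimal, hypothetical Java sketch of how a composite reader could map a long-valued top-level doc ID onto a segment index and a per-segment int doc ID, similar in spirit to how BaseCompositeReader handles int doc bases today. The class and member names (LongDocIdMapper, segmentStarts) are illustrative only, not Lucene APIs.

```java
import java.util.Arrays;

// Hypothetical sketch: top-level doc IDs become longs, per-segment doc IDs stay ints.
final class LongDocIdMapper {
  // segmentStarts[i] is the top-level ID of the first document in segment i, sorted ascending.
  private final long[] segmentStarts;

  LongDocIdMapper(long[] segmentStarts) {
    this.segmentStarts = segmentStarts;
  }

  /** Index of the segment that contains the given top-level doc ID. */
  int segmentIndex(long topLevelDocId) {
    int idx = Arrays.binarySearch(segmentStarts, topLevelDocId);
    return idx >= 0 ? idx : -idx - 2; // insertion point minus one
  }

  /** Per-segment doc ID, which still fits in an int because segments remain limited to ~2B docs. */
  int segmentLocalDocId(long topLevelDocId) {
    int seg = segmentIndex(topLevelDocId);
    // Would throw if a single segment ever exceeded Integer.MAX_VALUE docs.
    return Math.toIntExact(topLevelDocId - segmentStarts[seg]);
  }
}
```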


Migrated from LUCENE-8321 by Adrien Grand (@jpountz), updated Feb 14 2020

asfimport commented 6 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I like this idea a lot. It's progress over perfection, and it would simplify the accounting in IW dramatically (on the other hand, I think it's nice to have this accounting for assertion purposes, i.e. just to make sure we have correct counts)!

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I have thought about this, and I am personally against the idea because we won't be able to merge segments that large, hence creating a really big trap.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Also, I think the IW accounting needs to stay. Considering we can reasonably merge segments of ~1B docs, I think it makes sense to up the limit to 16B or so, but any higher gets into trappy territory. I strongly feel it can't be "unlimited" as long as a single segment is limited.

But I'm not sure this small increase is worth the complexity cost, both on users and on the code: it certainly won't make things any simpler. Also, I can see people complaining about what seems like an "arbitrary" limit in the code, even though it's no more arbitrary than 2B. But we could try it out and see what it looks like?
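For illustration only, here is a minimal sketch (not the actual IndexWriter code) of the kind of accounting being discussed: a long-valued pending document count in the writer, checked against a higher cap. The 16B value and the class/method names are assumptions made for the example.

```java
// Hedged sketch of long-based writer accounting with a raised cap; not Lucene code.
final class DocCountAccounting {
  // Illustrative 16B cap from the discussion above, not a real Lucene constant.
  private static final long MAX_DOCS = 16_000_000_000L;

  private long pendingNumDocs;

  /** Reserve room for numDocs new documents, failing if the index-wide cap would be exceeded. */
  synchronized void reserveDocs(int numDocs) {
    long newCount = pendingNumDocs + numDocs;
    if (newCount > MAX_DOCS) {
      throw new IllegalArgumentException(
          "number of documents in the index cannot exceed " + MAX_DOCS);
    }
    pendingNumDocs = newCount;
  }
}
```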

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

"we could try it out and see what it looks like?"

+1 I'd be curious to know how much of a rabbit hole this change would end up being.

asfimport commented 4 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Part of the rabbit hole would be the number of segments. TMP (TieredMergePolicy) has a default merged segment size cap of 5 GB, for instance. We could certainly raise that or create a new merge policy for indexes with lots of docs...
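For reference, the knob referred to here is TieredMergePolicy's cap on merged segment size, which defaults to 5 GB. A minimal sketch of raising it is below; the 20 GB value is purely illustrative.

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class RaiseSegmentCap {
  public static IndexWriterConfig configWithBiggerSegments() {
    TieredMergePolicy tmp = new TieredMergePolicy();
    // Default cap on a merged segment is 5 GB (5 * 1024 MB); raise it so
    // fewer, larger segments can hold many more documents. Value is illustrative.
    tmp.setMaxMergedSegmentMB(20 * 1024);

    IndexWriterConfig iwc = new IndexWriterConfig();
    iwc.setMergePolicy(tmp);
    return iwc; // pass to new IndexWriter(directory, iwc) as usual
  }
}
```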

On a separate note, I've seen terabyte-scale indexes on disk. Allowing those to grow by a factor of 8 would be another part of the rabbit hole.

That said, I'm not against the idea at all. I'm pretty sure operational issues would pop up, but that's progress...