apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.69k stars 1.04k forks source link

remove fieldcache weakmap or at least see what relies on GC for purging today [LUCENE-5177] #6241

Open asfimport opened 11 years ago

asfimport commented 11 years ago

If we are registering close listeners why does this need to be weak?

But really i dont care about that, here is what i said to Hoss on the solr mailing list:

> (In any case: it looks like a WeakHashMap is still used in case the > listeners never get called, correct?) >

I think it might be the other way around: i think it was weakmap before always, the close listeners were then added sometime in 3.x series, so we registered purge events "as an optimization".

But one way to look at it is: readers should really get closed, so why have the weak map and not just a regular hashmap.

Even if we want to keep the weak map (seriously i dont care, and i dont want to be the guy fielding complaints on this), I'm going to open with an issue with a patch that removes it and fails tests in @afterclass if there is any entries. This way its totally clear if/when/where anything is "relying on GC" today here and we can at least look at that.


Migrated from LUCENE-5177 by Robert Muir (@rmuir), updated Aug 16 2013 Attachments: LUCENE-5177.patch

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

patch that

some tests fail. we should see what they are doing (I suspect test-utility classes making crazy readers with their own cache keys that never get closed, but maybe something else interesting)

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1 to cutover to non-weak HashMap.

asfimport commented 11 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I understand the value of the improved testing – but is there any tangible benefit to users to converting the WeakHashMap -> HashMap?

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

there need not be tangible benefits to users for us to make a change.

we can make a change because its refactoring our codebase and making it cleaner or many other reasons.

asfimport commented 11 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

its refactoring our codebase and making it cleaner or many other reasons.

How does this do that?

I'm not trying to be confrontational, i'm genuinely not understanding what is improved by switching away from a WeakHashMap and i just want to make sure i'm not missunderstanding something about the big picture.

(If you proposed to get rid of the Map completely and have the Caches hang directly off the readers (something i remember discussing a LOOOONG time ago that people seemed to think was a good idea but no one seemd to have bandwidth for) then i could totally understand arguments that doing so would be making the codebase cleaner – but i'm not understanding what's clearner/better about using a global static HashMap instead of a WeakHashMap.)

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't think we should hang "fieldcaches" directly off readers.

Fieldcache is a broken design: users who want to sort/facet on a field should index their content correctly with docvalues instead.

Its fine for it to be wrapped underneath an UninvertingFilterReader, that also takes a Map<String,DocValuesType> up front though, but by no means should we have this broken shit tightly coupled with stuff in lucene/core

asfimport commented 11 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

by no means should we have this broken shit tightly coupled with stuff in lucene/core

ok fine, fieldcaches as a concept is fundementally broken, and the suggestion of hanging the caches of index readers is stupid in this day and age of docvalues – forget it, then. It was simply an aside comment.

Can you (or anyone else) please help me understand why keeping these "broken" caches in global static HashMaps is cleaner/better then keeping them in global static WeakHashMaps?

asfimport commented 11 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

ok fine, fieldcaches as a concept is fundementally broken, and the suggestion of hanging the caches of index readers is stupid in this day and age of docvalues

Meh... I happen to disagree with both of those assertions.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thats fine, users who want to have slow reopen times and inefficient use of memory can use an FilterReader that uninverts and exposes stuff via the AtomicReader docValues APIs.

Such a FilterReader is useful as a transition mechanism anyway: users could pass it to addIndexes to 'upgrade' to docvalues.

asfimport commented 11 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

users who want to have slow reopen times and inefficient use of memory can use an FilterReader that uninverts and exposes stuff via the AtomicReader docValues APIs

As long as it's not slower / more inefficient than the field cache stuff we have today, that's fine. Just a different API to access it.

asfimport commented 11 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Hoss - wrt WeakHashMap, I guess if it's not needed, lookups could be very slightly faster, debugging maybe slightly easier (if you're looking at weak references on the heap for example), and the code easier to understand (since we are not in fact relying on a weak reference to clean anything up anymore).

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yes, i dont think we should make it slower. but with such a filterreader, it could be implemented cleaner/differently more easily, because people can access it thru DV apis rather than thru FieldCache.DEFAULT.xxx

asfimport commented 11 years ago

Shawn Heisey (@elyograg) (migrated from JIRA)

How is Solr affected by the idea of not using fieldcache? From what I understand, docValues have the same data that would go into a stored field, not the indexed field ... so they might not behave exactly the same when it comes to sorting. Although it doesn't make any sense to sort on a fully tokenized field, it can make sense to sort (or facet) on a field that uses the keyword tokenizer, the lowercase filter, and the trim filter. I don't think that's possible with docValues.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thats all solr issues, its not related here.

DocValues fields take a byte[]. you can put whatever you want in there, it doesnt have to be the same as what goes in a stored field, you can run it thru an analysis chain yourself in some solr document processor or something like that if you really want to do that... you don't need indexwriter's help.

asfimport commented 11 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

How is Solr affected by the idea of not using fieldcache?

AFAIK, fieldcache-type functionality isn't going away, and any Lucene API changes (like FieldCache.DEFAULT, etc) will be hidden by Solr. DocValues adds additional functionality, and does not take away from anything we already have.

There are currently one or two disadvantages to using docvalues in solr currently. The major disadvantage is being forced to specify a default value in Solr, thus making them impossible to use from dynamic fields. Ideally we'd be able to specify a "missing" value (i.e. the value used when there is no value for that document... or rather the "default" at the DocValues layer), and that would allow us to remove the currently mandated default value at the Solr layer.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

But really this should just be an option on the solr fieldtype, you know when the value is missing, solr just puts that missing value in the dv field for that document (nowhere else like stored fields or anything like that).

same with if you want the fieldtype to run stuff thru an analyzer or anything like that. I dont think this stuff really has to do with lucene, it can just be options in solrs update process/fieldtypes.

asfimport commented 11 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

But really this should just be an option on the solr fieldtype, you know when the value is missing, solr just puts that missing value in the dv field for that document.

But doing it at the solr level means that you can't define a field without using it. Defining a non-docvalues field in solr and not using it currently incurs no overhead. This isn't the case with docvalues, and I don't believe docvalues allows one to currently specify a default (it always defaults to 0 for missing values?)

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

why would have fields defined you arent using? I dont understnad the use case here.

and to say there is no overhead with fieldcache is wrong: it makes huge arrays even if one document has the field. so I'm really missing something: in both cases its a column-stride field, just one is built at index-time.

I dont understand why docvalues needs to allow you to specify a default, when you can just specify your own for the document with missing value (if you specify 0, its no different than if it fills that for you).

asfimport commented 11 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

why would have fields defined you arent using?

Why not? Other fields take no resources if you don't use them, but docvalues do. Dynamic fields represent an almost infinite number of fields you aren't using until you do. BTW, this is why the only uses of docvalues in the example schema are commented out. Who want's to incur index overhead for fields you aren't even using?

If we want anything other than 0 for a value for "missing", we must do it in DocValues.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thats so wrong: if you want a value other than 0 (say 20) you just set the missing value yourself by setting it in the o.a.l.Document you add to indexwriter.

There is absolutely no reason for indexwriter to do things for you that you can do yourself.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

By the way: just to explain how it works in FieldCache when you supply these "sort missing/first/last stuff". The way that works is with a separate fieldcache "field" (a bitset) which marks documents that have a value. So its really a separate fieldcache'd boolean[maxdoc] telling you if there is anything there or not for the original field.

You can easily do the exact same parallel with docvalues yourself (e.g. in a solr fieldtype) if you want to support those options (a numericdvfield only 1 or 0, takes 1 bit per document, same situation).

asfimport commented 11 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Thats so wrong: if you want a value other than 0 (say 20) you just set the missing value yourself by setting it in the o.a.l.Document you add to indexwriter.

And how exactly could we do that with dynamic fields? Face it - if we want any other defaults than 0 (for numerics) it needs to somehow be configurable.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

It doesnt need to be configurable. its not configurable in fieldcache today! It has a separate bitset cached on fieldcache indicating 1 or 0.

so you just do the same thing with docvalues, as I explained earlier.

asfimport commented 11 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I've opened #6242 for this. It's not strictly related to the FieldCache.