asfimport opened 11 years ago
Robert Muir (@rmuir) (migrated from JIRA)
patch that
Some tests fail. We should see what they are doing (I suspect test-utility classes making crazy readers with their own cache keys that never get closed, but maybe something else interesting).
Michael McCandless (@mikemccand) (migrated from JIRA)
+1 to cutting over to a non-weak HashMap.
Chris M. Hostetter (@hossman) (migrated from JIRA)
I understand the value of the improved testing, but is there any tangible benefit to users in converting the WeakHashMap to a HashMap?
Robert Muir (@rmuir) (migrated from JIRA)
There need not be tangible benefits to users for us to make a change.
We can make a change because it refactors our codebase and makes it cleaner, or for many other reasons.
Chris M. Hostetter (@hossman) (migrated from JIRA)
it refactors our codebase and makes it cleaner, or for many other reasons.
How does this do that?
I'm not trying to be confrontational; I'm genuinely not understanding what is improved by switching away from a WeakHashMap, and I just want to make sure I'm not misunderstanding something about the big picture.
(If you proposed to get rid of the Map completely and have the caches hang directly off the readers (something I remember discussing a LOOOONG time ago that people seemed to think was a good idea, but no one seemed to have bandwidth for), then I could totally understand arguments that doing so would make the codebase cleaner; but I'm not understanding what's cleaner/better about using a global static HashMap instead of a WeakHashMap.)
Robert Muir (@rmuir) (migrated from JIRA)
I don't think we should hang "fieldcaches" directly off readers.
FieldCache is a broken design: users who want to sort/facet on a field should index their content correctly with docvalues instead.
It's fine for it to be wrapped underneath an UninvertingFilterReader that also takes a Map&lt;String,DocValuesType&gt; up front, but by no means should we have this broken shit tightly coupled with stuff in lucene/core.
Chris M. Hostetter (@hossman) (migrated from JIRA)
by no means should we have this broken shit tightly coupled with stuff in lucene/core
OK, fine: FieldCache as a concept is fundamentally broken, and the suggestion of hanging the caches off index readers is stupid in this day and age of docvalues; forget it, then. It was simply an aside comment.
Can you (or anyone else) please help me understand why keeping these "broken" caches in global static HashMaps is cleaner/better than keeping them in global static WeakHashMaps?
Yonik Seeley (@yonik) (migrated from JIRA)
OK, fine: FieldCache as a concept is fundamentally broken, and the suggestion of hanging the caches off index readers is stupid in this day and age of docvalues
Meh... I happen to disagree with both of those assertions.
Robert Muir (@rmuir) (migrated from JIRA)
That's fine: users who want slow reopen times and inefficient use of memory can use a FilterReader that uninverts and exposes stuff via the AtomicReader docValues APIs.
Such a FilterReader is useful as a transition mechanism anyway: users could pass it to addIndexes to 'upgrade' to docvalues.
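To make that transition path concrete, here is a minimal sketch. The UninvertingFilterReader class and its wrap() method are assumptions standing in for the reader being proposed here (nothing with that name existed in lucene/core at the time); only the Map&lt;String,DocValuesType&gt; registration and IndexWriter.addIndexes reflect the comments above.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo.DocValuesType;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;

class DocValuesUpgradeSketch {
  void upgrade(Directory oldDir, Directory newDir, IndexWriterConfig conf) throws Exception {
    // Declare up front which fields to uninvert and which DV type to expose them as.
    Map<String, DocValuesType> mapping = new HashMap<String, DocValuesType>();
    mapping.put("popularity", DocValuesType.NUMERIC);
    mapping.put("category", DocValuesType.SORTED);

    DirectoryReader plain = DirectoryReader.open(oldDir);
    // Hypothetical: wraps the reader and serves the listed fields through the
    // AtomicReader docValues APIs by uninverting them on the fly.
    DirectoryReader uninverted = UninvertingFilterReader.wrap(plain, mapping);

    IndexWriter writer = new IndexWriter(newDir, conf);
    try {
      // The writer only sees the docValues view, so the new index ends up
      // with real docvalues for "popularity" and "category": the 'upgrade'.
      writer.addIndexes(uninverted);
    } finally {
      writer.close();
      uninverted.close();
    }
  }
}
```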
Yonik Seeley (@yonik) (migrated from JIRA)
users who want slow reopen times and inefficient use of memory can use a FilterReader that uninverts and exposes stuff via the AtomicReader docValues APIs
As long as it's not slower / more inefficient than the field cache stuff we have today, that's fine. Just a different API to access it.
Yonik Seeley (@yonik) (migrated from JIRA)
Hoss - wrt WeakHashMap, I guess if it's not needed, lookups could be very slightly faster, debugging maybe slightly easier (if you're looking at weak references on the heap for example), and the code easier to understand (since we are not in fact relying on a weak reference to clean anything up anymore).
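For the "close listeners instead of weak references" point, here is a minimal sketch of the non-weak shape, assuming the Lucene 4.x IndexReader.addReaderClosedListener and getCoreCacheKey APIs; the cache contents and the uninvert step are placeholders, not FieldCache's actual implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.IndexReader;

// Entries are removed by an explicit close listener instead of waiting for the
// GC to clear a WeakReference.
final class ExplicitEvictionCache {
  // Keyed by the reader's core cache key, as FieldCache does.
  private final Map<Object, int[]> cache = new ConcurrentHashMap<Object, int[]>();

  int[] get(AtomicReader reader, String field) {
    final Object key = reader.getCoreCacheKey();
    int[] values = cache.get(key);
    if (values == null) {
      values = uninvert(reader, field);  // build the per-segment array
      cache.put(key, values);
      // Evict eagerly when the reader is closed; no weak references involved.
      reader.addReaderClosedListener(new IndexReader.ReaderClosedListener() {
        @Override
        public void onClose(IndexReader r) {
          cache.remove(key);
        }
      });
    }
    return values;
  }

  private int[] uninvert(AtomicReader reader, String field) {
    return new int[reader.maxDoc()]; // placeholder for the real uninversion
  }
}
```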
Robert Muir (@rmuir) (migrated from JIRA)
Yes, I don't think we should make it slower. But with such a FilterReader it could more easily be implemented differently/more cleanly, because people would access it through the DV APIs rather than through FieldCache.DEFAULT.xxx.
Shawn Heisey (@elyograg) (migrated from JIRA)
How is Solr affected by the idea of not using fieldcache? From what I understand, docValues have the same data that would go into a stored field, not the indexed field ... so they might not behave exactly the same when it comes to sorting. Although it doesn't make any sense to sort on a fully tokenized field, it can make sense to sort (or facet) on a field that uses the keyword tokenizer, the lowercase filter, and the trim filter. I don't think that's possible with docValues.
Robert Muir (@rmuir) (migrated from JIRA)
That's all Solr issues; it's not related here.
DocValues fields take a byte[]. You can put whatever you want in there; it doesn't have to be the same as what goes in a stored field. You can run it through an analysis chain yourself in some Solr document processor or something like that if you really want to do that... you don't need IndexWriter's help.
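As a concrete illustration of "you can put whatever you want in there", here is a hedged sketch: the field names and the trim/lowercase normalization are made up for the example, and the point is only that the DV bytes are produced by your own code rather than by IndexWriter or the analysis chain.

```java
import java.util.Locale;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.util.BytesRef;

class DocValuesBytesSketch {
  Document build(String rawTitle) {
    Document doc = new Document();

    // Indexed (analyzed) and stored copy, as usual.
    doc.add(new TextField("title", rawTitle, Field.Store.YES));

    // Sort/facet copy: any bytes you like, produced by your own processing
    // (a stand-in here for a keyword/lowercase/trim chain run outside IndexWriter).
    String sortKey = rawTitle.trim().toLowerCase(Locale.ROOT);
    doc.add(new SortedDocValuesField("title_sort", new BytesRef(sortKey)));

    return doc;
  }
}
```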
Yonik Seeley (@yonik) (migrated from JIRA)
How is Solr affected by the idea of not using fieldcache?
AFAIK, fieldcache-type functionality isn't going away, and any Lucene API changes (like FieldCache.DEFAULT, etc) will be hidden by Solr. DocValues adds additional functionality, and does not take away from anything we already have.
There are currently one or two disadvantages to using docvalues in Solr. The major disadvantage is being forced to specify a default value in Solr, thus making them impossible to use from dynamic fields. Ideally we'd be able to specify a "missing" value (i.e. the value used when there is no value for that document... or rather the "default" at the DocValues layer), and that would allow us to remove the currently mandated default value at the Solr layer.
Robert Muir (@rmuir) (migrated from JIRA)
But really this should just be an option on the Solr fieldtype: when the value is missing, Solr just puts that missing value in the DV field for that document (nowhere else, like stored fields or anything like that).
Same if you want the fieldtype to run stuff through an analyzer or anything like that. I don't think this stuff really has to do with Lucene; it can just be options in Solr's update process/fieldtypes.
Yonik Seeley (@yonik) (migrated from JIRA)
But really this should just be an option on the Solr fieldtype: when the value is missing, Solr just puts that missing value in the DV field for that document.
But doing it at the Solr level means that you can't define a field without using it. Defining a non-docvalues field in Solr and not using it currently incurs no overhead. This isn't the case with docvalues, and I don't believe docvalues currently allows one to specify a default (it always defaults to 0 for missing values?).
Robert Muir (@rmuir) (migrated from JIRA)
Why would you have fields defined that you aren't using? I don't understand the use case here.
And to say there is no overhead with FieldCache is wrong: it makes huge arrays even if only one document has the field. So I'm really missing something: in both cases it's a column-stride field, just one is built at index time.
I don't understand why docvalues needs to allow you to specify a default, when you can just specify your own for the document with a missing value (if you specify 0, it's no different than if it fills that in for you).
Yonik Seeley (@yonik) (migrated from JIRA)
Why would you have fields defined that you aren't using?
Why not? Other fields take no resources if you don't use them, but docvalues do. Dynamic fields represent an almost infinite number of fields you aren't using until you do. BTW, this is why the only uses of docvalues in the example schema are commented out. Who wants to incur index overhead for fields you aren't even using?
If we want anything other than 0 as the value for "missing", we must do it in DocValues.
Robert Muir (@rmuir) (migrated from JIRA)
That's so wrong: if you want a value other than 0 (say 20), you just set the missing value yourself by setting it in the o.a.l.Document you add to IndexWriter.
There is absolutely no reason for IndexWriter to do things for you that you can do yourself.
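A minimal sketch of "set the missing value yourself": the nullable popularity argument is a stand-in for whatever your source document provides, and 20 is just the example default from the comment above; the document simply gets that value indexed whenever the source has none.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;

class ExplicitDefaultSketch {
  Document build(Long popularity /* may be null: hypothetical source value */) {
    Document doc = new Document();
    // Your own default (say 20), filled in by your indexing code, not by IndexWriter.
    long value = (popularity != null) ? popularity.longValue() : 20L;
    doc.add(new NumericDocValuesField("popularity", value));
    return doc;
  }
}
```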
Robert Muir (@rmuir) (migrated from JIRA)
By the way, just to explain how it works in FieldCache when you supply this "sort missing first/last" stuff: the way that works is with a separate FieldCache "field" (a bitset) which marks documents that have a value. So it's really a separate fieldcache'd boolean[maxdoc] telling you whether there is anything there or not for the original field.
You can easily do the exact same parallel with docvalues yourself (e.g. in a Solr fieldtype) if you want to support those options (a numeric DV field holding only 1 or 0 takes 1 bit per document; same situation). A sketch follows below.
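Here is that parallel "has a value" field at index time; the field names are illustrative only. At search time the 1/0 field tells you whether a 0 in the main field is a real value or a missing one, which is exactly what sort-missing-first/last needs.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;

class MissingBitSketch {
  Document build(Long price /* may be null: hypothetical source value */) {
    Document doc = new Document();
    // Main value field; pick any filler for missing docs (0 here).
    doc.add(new NumericDocValuesField("price", price != null ? price.longValue() : 0L));
    // Parallel 1/0 field: the docvalues analogue of FieldCache's "docs with field" bitset.
    doc.add(new NumericDocValuesField("price_exists", price != null ? 1L : 0L));
    return doc;
  }
}
```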
Yonik Seeley (@yonik) (migrated from JIRA)
That's so wrong: if you want a value other than 0 (say 20), you just set the missing value yourself by setting it in the o.a.l.Document you add to IndexWriter.
And how exactly could we do that with dynamic fields? Face it: if we want any default other than 0 (for numerics), it needs to somehow be configurable.
Robert Muir (@rmuir) (migrated from JIRA)
It doesn't need to be configurable. It's not configurable in FieldCache today! It has a separate bitset cached in FieldCache indicating 1 or 0.
So you just do the same thing with docvalues, as I explained earlier.
Yonik Seeley (@yonik) (migrated from JIRA)
I've opened #6242 for this. It's not strictly related to the FieldCache.
If we are registering close listeners, why does this need to be weak?
But really I don't care about that; here is what I said to Hoss on the Solr mailing list:
Migrated from LUCENE-5177 by Robert Muir (@rmuir), updated Aug 16 2013. Attachments: LUCENE-5177.patch