apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.6k stars 1.01k forks source link

Support numeric value in Field class [LUCENE-8870] #9913

Open asfimport opened 5 years ago

asfimport commented 5 years ago

I checked the following comment in Field class.

// TODO: allow direct construction of int, long, float, double value too..?

We already have some fields like IntPoint and StoredField, but I think it's okay.

The test cases are set in the TestField class.


Migrated from LUCENE-8870 by Namgyu Kim (@danmuzi), updated Jun 26 2019 Attachments: LUCENE-8870.patch

asfimport commented 5 years ago

Namgyu Kim (@danmuzi) (migrated from JIRA)

About the TestField class, I think its class structure needs to be changed slightly.

It is no direct connection with this issue. But I plan to modify it like TestIntPoint, TestDoublePoint, ... Should not the test class name depend on the class name?

What do you think about it?

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Personally I find the Field type-facade kind of annoying; it imposes this artificial type safety: in the end we store the value as an Object and then later cast it. Callers that also handle values generically, as Objects then need an adapter to detect  the type of a value, cast it properly, only to have Lucene throw away all the type info and do that dance all over again internally!

Having said that, it's not really relevant to this change, which seems helpful. Maybe use Objects.requireNonNull for the null checks?

asfimport commented 5 years ago

Namgyu Kim (@danmuzi) (migrated from JIRA)

Thank you for your reply! @msokolov :D

I want to make sure that I understand your opinion well.

in the end we store the value as an Object and then later cast it. Callers that also handle values generically, as Objects then need an adapter to detect the type of a value, cast it properly, only to have Lucene throw away all the type info and do that dance all over again internally!

I think this is a structure for providing various constructors to users. (Reader, CharSequence, BytesRef, ...) Of course we can only provide it in Object form and handle it in the constructor. But isn't it unfriendly to API users? And I'm not sure about it because the IndexableFieldType check logic is different depending on the value type. ex) BytesRef -> there is no check logic. CharSequence -> (!IndexableFieldType#stored() && IndexableFieldType#indexOptions() == IndexOptions.NONE) should be false. Reader -> (IndexableFieldType#indexOptions() == IndexOptions.NONE || !IndexableFieldType#tokenized()) and (IndexableFieldType#stored()) should be false.

In fact, I worried about using the Number class when writing this patch. I think API users may prefer int, float, double, ... rather than the Number class. What do you think about this?

Maybe use Objects.requireNonNull for the null checks?

I made it by referring to the current code structure. That method generates the NullPointerException, and current structure is the IllegalArgumentException.

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I think this new constructor would be misleading. For instance it might be tempting to use if you wanted to index doubles. But the only thing you can index with this constructor is the string representation of the double values, which is unlikely to be helpful.

I wonder whether we should make this class abstract instead so that it can't be instantiated directly, and potentially enhance some of its sub classes to address use-cases that were only doable with this Field class until now, such as having a text field with term vectors enabled.

asfimport commented 5 years ago

Namgyu Kim (@danmuzi) (migrated from JIRA)

Thank you for your reply! @jpountz :D

When I wrote a patch, the biggest advantage that I think is the FieldType conversion for Numeric types. Of course, it is not recommended way(it is already mentioned Expert in Javadoc), but it can give users FieldType customization.

ex) Currently NumericDocValuesField does not support the option for stored. So users need to add a separate StoredField. If we provide this patch, the user can get the characteristics of NumericDocValuesField and StoredField in a single field.

FieldType type = new FieldType();
type.setStored(true);
type.setDocValuesType(DocValuesType.NUMERIC);
type.freeze();

Document doc = new Document();
Field field = new Field("number", 1234, type);
doc.add(field);
indexWriter.addDocument(doc);

After that, we can use methods such as

Sort sort = new Sort();
sort.setSort(new SortField("number", SortField.Type.INT));

and

doc.get("number");

in the "number" field.

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Currently NumericDocValuesField does not support the option for stored. So users need to add a separate StoredField.

I'm viewing this one as a feature. :) Having the same field indexed and stored often makes sense because the analyzer takes care of converting the text into the internal representation that the index cares about. But for doc values, we expect users to do this work of converting the data to what works for Lucene. For instance if you are indexing doubles, you would likely index as a byte[] in points, use the long bits of the double in doc values, and use the double directly for storing, these are 3 different representations for the same data. My gut feeling is that trying to fold everything into a single field would make things more complicated rather than simpler.

asfimport commented 5 years ago

Namgyu Kim (@danmuzi) (migrated from JIRA)

Thank you for your reply! @jpountz :D

My gut feeling is that trying to fold everything into a single field would make things more complicated rather than simpler.

Yeah, I agree with some parts. I thought it would be nice to give more options to user. But if you think it can cause some confusions to users, I think remaining the current status is better.

About the TestField class, I think its class structure needs to be changed slightly.

It is no direct connection with this issue. But I plan to modify it like TestIntPoint, TestDoublePoint, ... Should not the test class name depend on the class name?

By the way, what do you think about my first comment in this issue? It looks a little bit ambiguous, but I'm curious about your opinion.