blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for go
Apache License 2.0

allow indexing a numeric value as text #1154

Open mschoch opened 5 years ago

mschoch commented 5 years ago

Currently, we don't offer users any way to index a numeric value as text. But this would be very useful for many cases. Consider a JSON document containing numeric values that are identifiers. Users would never want to run a range search (though they might do so as a workaround). Instead users would prefer to do an exact match.

The complication is that numbers don't have a canonical text representation. Would you want to search for 0.000000314 or 3.14e-7? I think we could offer a reasonable default that would probably work well for integers out of the box, and possibly add a new optional format string to the field mapping. If it is present, it would be used in circumstances like this.
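To make the ambiguity concrete, here is a small sketch using only the Go standard library. It renders one float64 value in several ways; the helper name `textForms` is made up for illustration and is not part of bleve.

```go
package main

import (
	"fmt"
	"strconv"
)

// textForms returns several text renderings of the same float64 value.
// Each rendering would produce a different term if indexed as text.
func textForms(v float64) []string {
	return []string{
		fmt.Sprintf("%f", v),                // fixed notation, default precision 6
		fmt.Sprintf("%e", v),                // exponent notation, default precision 6
		fmt.Sprintf("%g", v),                // shortest of %e or %f
		strconv.FormatFloat(v, 'f', -1, 64), // fixed notation, minimal digits
	}
}

func main() {
	for _, s := range textForms(3.14e-7) {
		fmt.Println(s)
	}
	// Prints:
	// 0.000000
	// 3.140000e-07
	// 3.14e-07
	// 0.000000314
}
```

None of these is "the" text form of the number, which is why an explicit format seems necessary for anything other than integers.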

mschoch commented 5 years ago

I had been thinking about this before, but was reminded again by this thread:

https://forums.couchbase.com/t/keyword-analyzer-with-type-text-on-integer-fields/20364

And after a conversation with @sreekanth-cb we decided to log this issue.

mschoch commented 5 years ago

First, let's document which conversions we already support. For purposes of this discussion "conversion" means that the source field type does not directly correspond to the indexed field type.

| Source Field | Explicitly Mapped as... | Indexed Field | When... |
| --- | --- | --- | --- |
| String | GeoPoint | GeoPoint | Comma Separated Lon,Lat or GeoHash |
| String | DateTime | DateTime | Parses with configured date format without error |
| String | N/A | DateTime | Parses with configured date format without error |
| Struct | GeoPoint | GeoPoint | has exported fields case-insensitive match on lat and lon or lng OR satisfies Later and Loner or Lnger interface |
| Struct | Text | Text | Satisfies encoding.TextMarshaler interface and NOT time.Time |
| Pointer | Text | Text | Satisfies encoding.TextMarshaler and all mapped fields are type text |
| Map | GeoPoint | GeoPoint | exact keys lat and lon or lng |
| Slice | GeoPoint | GeoPoint | slice length 2 (GeoJSON) |

One interesting observation here is that String->DateTime is the only conversion we support without an explicit mapping guiding us as to what the user wants. This is partly historical, but also related to the idea that we generally want to avoid ambiguous situations where a source value could be indexed different ways, and our decision is arbitrary. We now tend to prefer relying on an explicit mapping to guide us in the choice.
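As an illustration of the Struct->Text row, here is a sketch of a struct that satisfies encoding.TextMarshaler. The type name `ProductID` and its field are hypothetical; the sketch only shows the interface being satisfied and does not exercise bleve itself.

```go
package main

import (
	"encoding"
	"fmt"
	"strconv"
)

// ProductID is a hypothetical wrapper around a numeric identifier.
type ProductID struct {
	N int
}

// MarshalText renders the identifier as decimal text.
func (p ProductID) MarshalText() ([]byte, error) {
	return []byte(strconv.Itoa(p.N)), nil
}

// Compile-time check that ProductID satisfies encoding.TextMarshaler.
var _ encoding.TextMarshaler = ProductID{}

func main() {
	b, err := ProductID{N: 289}.MarshalText()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b)) // 289
}
```

This works today only if the application controls the Go types being indexed, which is not the case for plain JSON documents.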

The main proposal here is to add the following conversions:

| Source Field | Explicitly Mapped as... | Indexed Field | When... |
| --- | --- | --- | --- |
| int, int8, int16, int32, int64, uint, uint8, uint16, uint64, float32, float64 | Text | Text | field has non-empty attribute format interpreted as fmt.Sprintf format specifier |

My initial thought is that we don't overcomplicate this the way we did date formats. There is no default format and no inherited hierarchy, only an explicit format specified for each field you want converted.

This proposal would support a few common use cases:

Source document has numeric identifier like:

"productID": 289

The user only ever wants to do exact lookups on product IDs, so the cost incurred to index this as a number provides no additional value. In this case, they'd like to use a simple conversion format like %d. And then when searching they can use the string "289" to find a match.
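A minimal sketch of this use case, using a toy in-memory map in place of a real index. The names `termFor` and `buildIndex` are invented for illustration, and the `format` argument stands in for the proposed field mapping attribute, which does not exist in bleve.

```go
package main

import "fmt"

// termFor renders a numeric value as the text term to index.
// format plays the role of the proposed per-field "format" attribute.
func termFor(format string, v interface{}) string {
	return fmt.Sprintf(format, v)
}

// buildIndex builds a toy inverted index: term -> document IDs.
func buildIndex(format string, docs map[string]int) map[string][]string {
	index := map[string][]string{}
	for docID, productID := range docs {
		t := termFor(format, productID)
		index[t] = append(index[t], docID)
	}
	return index
}

func main() {
	index := buildIndex("%d", map[string]int{"doc1": 289, "doc2": 2890})
	// An exact lookup on the string "289" finds doc1 but not doc2.
	fmt.Println(index["289"]) // [doc1]
}
```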

A variation on this use case is one where users have a small number of possible values. Let's say they have a field containing the age of human beings in years. Practically this means that the values will all be in the range [0,125]. In this case the user would still like to be able to do range searches, but because the total number of possible values is small, the cost of numeric field indexing is still too high. If they could instead index this field as text, they could use term range searches. Here the user would choose a format that uses 0-padding, so that a term range search will still see properly ordered values. A format of %03d might be an appropriate solution.
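The reason padding matters is that terms compare lexicographically. The following sketch sorts the same ages formatted with and without padding; `sortedTerms` and `inTermRange` are illustrative helpers, not bleve functions.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedTerms formats each value and returns the terms in lexicographic order,
// which is the order a term range search would see.
func sortedTerms(format string, values []int) []string {
	terms := make([]string, len(values))
	for i, v := range values {
		terms[i] = fmt.Sprintf(format, v)
	}
	sort.Strings(terms)
	return terms
}

// inTermRange reports whether min <= term < max under string comparison.
func inTermRange(term, min, max string) bool {
	return term >= min && term < max
}

func main() {
	ages := []int{9, 100, 25}
	fmt.Println(sortedTerms("%d", ages))   // [100 25 9]
	fmt.Println(sortedTerms("%03d", ages)) // [009 025 100]

	// With padding, a term range of ["018", "065") behaves like 18 <= age < 65.
	fmt.Println(inTermRange("025", "018", "065")) // true
	fmt.Println(inTermRange("100", "018", "065")) // false
}
```

Without padding, "100" sorts before "25", so a term range would return the wrong documents.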

Another use case is one where users simply want to index a text version of a value in addition to the regular numeric indexing. Perhaps a user wants to index the field as a number for range search, but also allow exact matching using exponential notation. In this case they could map it first as a number, but then also a second time as text, and use the format specifier of their choosing.

For this proposal, I suggest there be no explicit error handling or reporting to users. The value returned by fmt.Sprintf gets indexed, and any errors are simply ignored.
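One consequence worth seeing: fmt.Sprintf never returns an error, it embeds the problem in the output string. Since encoding/json decodes numbers into float64 when the target is an interface{}, a %d format applied to a decoded JSON value would index an odd-looking term. The sketch below assumes the value is passed to fmt.Sprintf unchanged, which the proposal does not specify; `render` and `termFromJSON` are hypothetical helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// render applies a user-supplied format to a value, ignoring any mismatch.
func render(format string, v interface{}) string {
	return fmt.Sprintf(format, v)
}

// termFromJSON decodes a JSON object and renders one of its fields.
func termFromJSON(format, raw, key string) (string, error) {
	var doc map[string]interface{}
	if err := json.Unmarshal([]byte(raw), &doc); err != nil {
		return "", err
	}
	return render(format, doc[key]), nil
}

func main() {
	fmt.Println(render("%d", 289)) // 289

	// JSON numbers decode to float64, so %d does not apply cleanly.
	term, err := termFromJSON("%d", `{"productID": 289}`, "productID")
	if err != nil {
		panic(err)
	}
	fmt.Println(term) // %!d(float64=289)

	// A float verb with zero precision gives the intended term.
	term, _ = termFromJSON("%.0f", `{"productID": 289}`, "productID")
	fmt.Println(term) // 289
}
```

If errors are silently ignored, a term like `%!d(float64=289)` would end up in the index, so the implementation may want to convert integral float64 values before formatting.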

Other cases proposed:

| Source Field | Explicitly Mapped as... | Indexed Field | When... |
| --- | --- | --- | --- |
| int, int8, int16, int32, int64, uint, uint8, uint16, uint64, float32, float64 | DateTime | DateTime | field has non-empty attribute unitSinceEpoch??? Values like "second", "millisecond", "nanosecond"??? |

I think there needs to be some refinement on the attribute name/values, but this too seems straightforward to implement.
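A rough sketch of what the Number->DateTime conversion could look like with the standard time package. The unit strings mirror the tentative values above and are not a settled API; `epochToTime` is a hypothetical name.

```go
package main

import (
	"fmt"
	"time"
)

// epochToTime interprets n as a count of the given unit since the Unix epoch.
// The unit names are the tentative ones from the proposal, not a bleve API.
func epochToTime(n int64, unit string) (time.Time, error) {
	switch unit {
	case "second":
		return time.Unix(n, 0).UTC(), nil
	case "millisecond":
		return time.UnixMilli(n).UTC(), nil // requires Go 1.17 or later
	case "nanosecond":
		return time.Unix(0, n).UTC(), nil
	default:
		return time.Time{}, fmt.Errorf("unknown unit %q", unit)
	}
}

func main() {
	t, err := epochToTime(1500000000, "second")
	if err != nil {
		panic(err)
	}
	fmt.Println(t.Year()) // 2017
}
```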

| Source Field | Explicitly Mapped as... | Indexed Field | When... |
| --- | --- | --- | --- |
| int, int8, int16, int32, int64, uint, uint8, uint16, uint64, float32, float64 | Text | Text | Same as above, but the user wants to index a text formatted date, having started with a numeric unit since epoch??? |

This can be seen as doing 2 conversions, Number->Date then Date->Text. I can see this being useful, but it starts to stretch the configuration complexity.

At this point, it's fair to ask where this ends. What if a user has a date as text in their source document, but they want to index it as text with the date formatted differently? This seems more clearly out of bounds to me, but it's hard to articulate why. Obviously we want to support reasonable ways of interpreting values for indexing, but not encourage doing computation inside of bleve.

Ping @steveyen @sreekanth-cb @abhinavdangeti for feedback

sreekanth-cb commented 5 years ago

It looks good to me; no further points come to mind. One follow-up question: along with the "unitSinceEpoch" format, what would be the default date format? Does it make sense to take a date/time format as well as the epoch unit? (Maybe that contradicts your point about bringing computations/conversions into bleve.)

mschoch commented 5 years ago

@sreekanth-cb in the case where I referenced unitSinceEpoch the source field was numeric, so it presumably contained seconds/ms/ns since 1970-whatever, and we turn that into a date/time inside bleve. So in that case there is no need for a date format.

It's that final case where you're right that you'd need a date->string format. I guess we could look at the date format we already have, but I can already imagine people wanting a different format for parsing vs outputting it here.

oderwat commented 1 year ago

I wonder if that would help with my problem. I need to do a search like "floor(number/100)=123 or number=123" which could also be done using a regular expression like "/123([0-9]{2}|)". But if I understand it right, I can't do regular expression searches on indexed numbers, right? I now use a text field to index it and that works, and for this case, I do not need number ranges.

Is there something else that I just don't see on how to solve this?
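For reference, the equivalence described in this comment can be checked outside of bleve with Go's regexp package. The sketch anchors the pattern explicitly and assumes non-negative integers rendered as plain decimal text; `matchesPrefixRule` is an illustrative name.

```go
package main

import (
	"fmt"
	"regexp"
)

// pattern matches "123" alone or "123" followed by exactly two digits,
// which corresponds to number == 123 or floor(number/100) == 123.
var pattern = regexp.MustCompile(`^123([0-9]{2})?$`)

// matchesPrefixRule reports whether the decimal text of a number satisfies the rule.
func matchesPrefixRule(term string) bool {
	return pattern.MatchString(term)
}

func main() {
	for _, term := range []string{"123", "12345", "1234", "123456"} {
		fmt.Println(term, matchesPrefixRule(term))
	}
	// Prints:
	// 123 true
	// 12345 true
	// 1234 false
	// 123456 false
}
```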