blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for go
Apache License 2.0
10.09k stars 686 forks source link

Support encoding.TextMarshaler interface when indexing Go types #281

Closed mmindenhall closed 7 years ago

mmindenhall commented 8 years ago

The standard library includes a set of Marshaler/Unmarshaler interfaces that can be implemented by types to convert to/from different data encodings.

Since bleve is for "full-text search and indexing", it makes sense (at least IMHO) for bleve to support the encoding.TextMarshaler interface when indexing Go types (i.e., structs, fields within structs, map values, slice/array elements, dereferenced pointers). This provides some flexibility in mapping from an internal representation within a defined type, to a canonical text representation that should be used for indexing.

I have a specific use case where this is needed. I've been working on creating indexed document types from structs that use bson.ObjectId as an ID. For example:

type BaseModel struct {
    Id        bson.ObjectId `json:"id"`
    CreatedAt *time.Time    `json:"created_at"`
    UpdatedAt *time.Time    `json:"updated_at"`
}

I'm having difficulty indexing this as a struct because what gets indexed is the internal 12-byte representation cast as a string rather than the hex-encoded representation that's used when encoding to/from JSON. I need to query using the hex string representation, but this returns "No matches". There's an outstanding pull request for bson.ObjectId to implement the encoding.TextMarshaler and encoding.TextUnmarshaler interfaces to convert the internal 12-byte representation to/from a hex string. Combining that pull request with TextMarshaler support in bleve, it would be possible to correctly index this type.

Of course, I'm able to work around this issue by marshaling my struct to JSON and indexing that representation, but it would be nice to avoid the extra step!

mschoch commented 8 years ago

My first thought is that supporting encoding.TextMarshaler sounds like a good idea. The only reservation I have is that sometimes these really generic interfaces get used for different/multiple purposes. So, today it solves your problem, but then tomorrow someone has a structure that uses TextMarshaler in a different way, and they really do want to index the contents of the structure itself. We already encountered a similar problem by using the json struct tags to set field names, but in some cases users wanted the Bleve field names to be different than their JSON field names.

Another question I wonder, what is special about TextMarshaler, why not Stringer? I can certainly think of a lot more counter-examples where Stringer is not what you want to index, but to me that just emphasizes the point. Why do we think the output of TextMarshaler is suitable for indexing?

mmindenhall commented 8 years ago

A few thoughts.

Sure, these interfaces can get used in all sorts of strange ways, but I think the package descriptions shed some light.

TextMarshaler is within the encoding package, intended to "convert data to and from byte-level and textual representations". So it seems that when implementing this interface, more care should be taken to define a correct and concise text encoding for the type.

Stringer is in the fmt package, intended for "formatted I/O with functions analogous to C's printf and scanf". I think Stringer is most commonly used to produce human-readable output (e.g., for debugging), and less thought goes into the format/representation of the data. Usually (at least for me), it's good enough if I can use it to determine something about the state of the system at some point in execution.

Maybe someone out there is using TextMarshaler already in a different way, and introducing this support would break their search index. In that case, perhaps it could be treated similar to an Analyzer, which has a default that can be overridden at the IndexMapping, DocumentMapping, or FieldMapping? For the reasons given above, I think TextMarshaler should be the default, unless explicitly overridden. Also, if TextMarshaler isn't implemented, obviously the current Stringer behavior should apply.

mschoch commented 8 years ago

Yeah, I'm inclined to agree, but I think we'll need some way to override it in the mapping as you suggest.

Just to be clear, there is no current behavior for Stringer, in fact I think it would break most use cases if we did. A lot of users have String() methods that they mainly rely on for debug printing, but they want to index the underlying fields, not a single string value.

Adding support for TextMarshaler should't be too hard, we just have to check for that case while walking the structure using reflection. The trickier part will be finding a clean way to override it in the mapping if thats not what you want.

jackspirou commented 8 years ago

FYI - I submitted a new PR for encoding.TextMarshaler at https://github.com/go-mgo/mgo/pull/212 against the unstable branch.

jackspirou commented 8 years ago

@mmindenhall and @mschoch the patch was accepted in this latest MGO release http://blog.labix.org/2016/02/04/mgo-r2016-02-04. I think this issue can be closed, unless you guys want a different implementation.

mmindenhall commented 8 years ago

@jackspirou, thanks for getting TextMarshaler support into mgo! @mschoch, please don't close this, as we still need the support within bleve to match.