blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for go
Apache License 2.0

handling non-utf8 bytes #186

Open mschoch opened 9 years ago

mschoch commented 9 years ago

Current Status:

Bleve made the decision early on to focus on utf-8 data. Not necessarily to the exclusion of other encodings, but if savings, optimizations, or internal formats came with utf-8, that was the direction we would go.

Two issues come up:

  1. User is OK with the utf-8 requirement, but sometimes bad data comes in that isn't valid utf-8.
  2. User wants to index data that is not utf-8.

For now, I'm not trying to solve item 2 above, but if a solution to 1 opens the door for 2, then that should be considered as well.

These are the options I see:

  1. Blindly assume all data is already valid utf-8 (fast, unsafe in the general case)
  2. Check bytes for utf-8 validity and return an error on invalid input, with no conversion (safe; slower because we check every []byte, but probably the fastest safe option; a single bad character prevents processing of the whole document)
  3. Check bytes for utf-8 validity, remove invalid utf-8 byte sequences, and continue processing the converted bytes (safe, slow; a single invalid character in a field sends the whole field through the slower conversion process)
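
As a rough sketch of how the three options differ, using only the Go standard library. This is not Bleve code: `InvalidPolicy`, `prepareField`, and `ErrInvalidUTF8` are hypothetical names made up for illustration, and a real implementation might hook in at a different layer.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"unicode/utf8"
)

// InvalidPolicy is a hypothetical setting, not part of Bleve's API.
type InvalidPolicy int

const (
	AssumeValid   InvalidPolicy = iota // option 1: no check at all
	RejectInvalid                      // option 2: check, error on invalid bytes
	StripInvalid                       // option 3: check, drop invalid sequences
)

// ErrInvalidUTF8 is a hypothetical error returned under option 2.
var ErrInvalidUTF8 = errors.New("field contains invalid utf-8")

// prepareField applies one of the three options to a field's raw bytes.
func prepareField(b []byte, p InvalidPolicy) ([]byte, error) {
	switch p {
	case AssumeValid:
		// Fast path: the bytes are passed through untouched.
		return b, nil
	case RejectInvalid:
		// One pass over the bytes; any invalid sequence rejects the field.
		if !utf8.Valid(b) {
			return nil, ErrInvalidUTF8
		}
		return b, nil
	case StripInvalid:
		// Valid input is returned as-is; only invalid input pays for
		// the conversion, which removes each run of invalid bytes.
		if utf8.Valid(b) {
			return b, nil
		}
		return bytes.ToValidUTF8(b, nil), nil
	default:
		return nil, fmt.Errorf("unknown policy %d", p)
	}
}

func main() {
	bad := []byte("caf\xff au lait")
	for _, p := range []InvalidPolicy{AssumeValid, RejectInvalid, StripInvalid} {
		out, err := prepareField(bad, p)
		fmt.Printf("policy=%d out=%q err=%v\n", p, out, err)
	}
}
```

Under option 2 the caller decides what to do with the error (skip the document, fix the data, and so on), which is why it leaves the burden on the application.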

Currently Bleve follows 1.

My preference is for defaulting to 2, but allowing advanced users to switch back to 1 if they know what they're doing.

I'm not a big fan of 3 right now because it prescribes a solution that may be slow and incorrect. Going with 2 means the burden would still be on the application to ensure correct utf-8 bytes are input, but Bleve would be safe on all inputs.
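
One possible reading of the "incorrect" concern with 3, as a small standard-library illustration (not Bleve code; `stripInvalid` and `replaceInvalid` are hypothetical helpers): how the invalid bytes are handled changes the text that reaches the analyzer, and no single choice is right for every application.

```go
package main

import (
	"bytes"
	"fmt"
)

// stripInvalid removes each run of invalid utf-8 bytes entirely.
func stripInvalid(b []byte) []byte {
	return bytes.ToValidUTF8(b, nil)
}

// replaceInvalid substitutes U+FFFD for each run of invalid utf-8 bytes.
func replaceInvalid(b []byte) []byte {
	return bytes.ToValidUTF8(b, []byte("\uFFFD"))
}

func main() {
	raw := []byte("new\xffyork")

	// Stripping joins the text on either side of the bad byte.
	fmt.Printf("%q\n", stripInvalid(raw))

	// Replacing keeps the two sides apart but adds a character
	// that was never in the source data.
	fmt.Printf("%q\n", replaceInvalid(raw))
}
```

Either result may be wrong for a given data set, which is part of why returning an error and letting the application decide looks safer as a default.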

I welcome other thoughts on this issue.

CC @steveyen

steveyen commented 9 years ago

Seems like a good plan (defaulting to 2, with the option for advanced users to choose 1). I'm assuming something like this would one day be a setting in the index mapping.

mschoch commented 7 years ago

Moving this to 2.0. For 1.0 we'll simply document that we only support UTF-8.