blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for go
Apache License 2.0

GopherJS and storing the index as JSON #667

Open ChrisTrenkamp opened 6 years ago

ChrisTrenkamp commented 6 years ago

We generate HTML documentation for our customers and need to provide a search solution for the ones that do not have internet access. Long story short, we're investigating Bleve + GopherJS as a candidate for providing an in-browser, plugin-less search solution. We've looked at other solutions like Lunr.js, but they were insufficient.

I pulled Bleve and removed the disk-backed storage interfaces under github.com/blevesearch/bleve/tree/master/index/store/, and fixed any compilation errors that tried to reference those packages. It is now only using the gtreap package.

Then I compiled a simple example:

//main.go
package main

import (
    "fmt"

    "github.com/blevesearch/bleve"
)

func main() {
    message := struct {
        Id   string
        From string
        Body string
    }{
        Id:   "example",
        From: "marty.schoch@gmail.com",
        Body: "bleve indexing is easy",
    }

    mapping := bleve.NewIndexMapping()
    index, err := bleve.NewMemOnly(mapping)
    if err != nil {
        panic(err)
    }
    index.Index(message.Id, message)

    query := bleve.NewQueryStringQuery("bleve")
    searchRequest := bleve.NewSearchRequest(query)
    fmt.Println(index.Search(searchRequest))
}
$ go build main.go && ./main
1 matches, showing 1 through 1, took 0s
    1. example (0.125272)
 <nil>

Then threw it at GopherJS:

$ gopherjs build main.go

$ ls
main.go  main.js  main.js.map

Then created an index.html file:

<!doctype html>

<html>
    <head>
        <script src="main.js"></script>
    </head>
</html>

Opened up index.html in a browser, and the console displayed:

syscall.go:43 1 matches, showing 1 through 1, took 15ms
syscall.go:43     1. example (0.125272)
syscall.go:43  <nil>

It works!

Now, the index needs to be backed by some storage, which is what I'm stuck on. I've tried using encoding/json on mapping.IndexMappingImpl, but it did not work. Is there a way to encode IndexMappingImpl to some kind of plain string so that it can be read back in as pure in-memory storage?
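
For what it's worth, the mapping object is designed to round-trip through encoding/json; what has no plain-text form out of the box is the stored index data. A minimal sketch of the mapping round-trip (not the code from this thread, assuming the v1 import path):

    package main

    import (
        "encoding/json"
        "fmt"

        "github.com/blevesearch/bleve"
    )

    func main() {
        // Serialize the mapping to a plain JSON string.
        m := bleve.NewIndexMapping()
        data, err := json.Marshal(m)
        if err != nil {
            panic(err)
        }
        fmt.Println(string(data))

        // Read it back into a fresh mapping and open a new in-memory index.
        // Note: this only restores the mapping; the documents still have to
        // be re-indexed (or the underlying key/value data shipped separately).
        restored := bleve.NewIndexMapping()
        if err := json.Unmarshal(data, restored); err != nil {
            panic(err)
        }
        if _, err := bleve.NewMemOnly(restored); err != nil {
            panic(err)
        }
    }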

mschoch commented 6 years ago

Sorry for the long delay, I forgot to reply to this one.

It's cool that you got this working, and it's something we've wanted to see working for a while. We have wanted a way for users to build indexes on static sites and still offer search to their users. The compromise we settled on was hosting a small index in app-engine or something like that.

But the path you're going down would take it further and allow a fully self-hosted index, using JavaScript on the client to do the searching. There's nothing to prevent defining an index format that uses JSON, but for realistic data sizes it's probably not a very good choice. Much of the indexing process is about creating small, compact representations, and JSON goes the other way.

The other big problem would be that you want the JavaScript to be able to access portions of the index without downloading the entire thing. I've always thought maybe you could do this by breaking the file into pieces, or possibly by using HTTP Range headers.

I don't think mapping.IndexMappingImpl should be your concern, that is related to the mapping, not the actual storage. What I think you want is an implementation of index/store.KVStore interface which just works with a flat JSON file. You also need to support seeking through ranges, so just serializing it all in a map won't be sufficient. You'll have to experiment with different ideas and see what works best for you here.
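
As a rough sketch of the data layer such a store could sit on (illustrative names, not bleve's API): keep the entries sorted by key so that both point lookups and ordered range scans are possible, which a plain map can't give you. []byte fields conveniently marshal as base64 in encoding/json, so arbitrary binary keys and values survive a flat JSON file.

    //kvdata.go -- hypothetical ordered data layer for a flat-file store
    package kvdata

    import (
        "bytes"
        "encoding/json"
        "sort"
    )

    // Entry is one key/value pair; []byte marshals as base64 in JSON.
    type Entry struct {
        K []byte `json:"k"`
        V []byte `json:"v"`
    }

    // Table holds all entries sorted by key.
    type Table struct {
        entries []Entry
    }

    // Load parses a flat JSON array of entries and sorts it by key.
    func Load(data []byte) (*Table, error) {
        var entries []Entry
        if err := json.Unmarshal(data, &entries); err != nil {
            return nil, err
        }
        sort.Slice(entries, func(i, j int) bool {
            return bytes.Compare(entries[i].K, entries[j].K) < 0
        })
        return &Table{entries: entries}, nil
    }

    // Get returns the value for an exact key, or nil if it is absent.
    func (t *Table) Get(key []byte) []byte {
        i := sort.Search(len(t.entries), func(i int) bool {
            return bytes.Compare(t.entries[i].K, key) >= 0
        })
        if i < len(t.entries) && bytes.Equal(t.entries[i].K, key) {
            return t.entries[i].V
        }
        return nil
    }

    // RangeScan visits entries with start <= key < end, in key order.
    func (t *Table) RangeScan(start, end []byte, visit func(k, v []byte) bool) {
        i := sort.Search(len(t.entries), func(i int) bool {
            return bytes.Compare(t.entries[i].K, start) >= 0
        })
        for ; i < len(t.entries); i++ {
            if end != nil && bytes.Compare(t.entries[i].K, end) >= 0 {
                return
            }
            if !visit(t.entries[i].K, t.entries[i].V) {
                return
            }
        }
    }

The KVStore/KVReader/KVWriter plumbing from index/store would then wrap something like this; that's the part that needs the experimentation.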

Let me know if I can be of more help.

ghost commented 6 years ago

Interesting.

Chris, what store approach did you take? There are a few GopherJS stores out there that use the browser's key/value store types. Or did you go a different direction?

Wondering about updates: if new content hits the server, can the index data be built server-side and pushed to the client, with the client merging it in? Or do you attempt to do the indexing client-side in GopherJS itself? From your initial post it looks like you got actual indexing working fully client-side? Do facets work client-side, by any chance?

ChrisTrenkamp commented 6 years ago

After some long discussions, we decided to no longer support offline users, so this is no longer needed. We're using Solr.

ghost commented 6 years ago

@ChrisTrenkamp
I would like to finish it and make it a public repo. I think it's really useful to use GopherJS for this.

Do you have a repo for it?

ChrisTrenkamp commented 6 years ago

I didn't make a public repository for it since it was a quick experiment. I tried to make one a little while ago, but something happened between when this issue was first created and the Bleve release that came after it. There's a new dependency on one of the databases in the core library and I wasn't able to compile it with GopherJS. That's when I gave up and quit.

However, WebAssembly support (https://go-review.googlesource.com/c/go/+/102835) is coming. That's a much better approach than the hack I came up with.

ghost commented 6 years ago

Indeed WASM is officially supported.

Well, if you happen to find the code, I would be happy to give it a whirl and publish it on GitHub.

ChrisTrenkamp commented 6 years ago

Another long story short, we're reconsidering some kind of offline search again. Here's the tweak I made to get things working with the latest code.

Here's a short demo.

go get -u github.com/gopherjs/gopherjs
go get -u github.com/ChrisTrenkamp/bleve

cat > bleve_js_search.go << EOF
package main

import (
    "fmt"

    "github.com/ChrisTrenkamp/bleve"
)

func main() {
    message := struct {
        Id   string
        From string
        Body string
    }{
        Id:   "example",
        From: "marty.schoch@gmail.com",
        Body: "bleve indexing is easy",
    }

    mapping := bleve.NewIndexMapping()
    index, err := bleve.NewMemOnly(mapping)
    if err != nil {
        panic(err)
    }
    index.Index(message.Id, message)

    query := bleve.NewQueryStringQuery("bleve")
    searchRequest := bleve.NewSearchRequest(query)
    fmt.Println(index.Search(searchRequest))
}
EOF
# GopherJS currently has a bug with GOOS=windows.  Set it to darwin to avoid it.
GOOS=darwin gopherjs run bleve_js_search.go

I also tested this with the latest wasm-wip branch and it works there. Unfortunately it only works on Firefox. Chrome crashed for some reason. However, I won't be focusing on WASM because Internet Explorer doesn't support it.

For some clarity, when I say "offline", I mean the browser is directly opening the files off of the disk (the URL is "file:///usr/doc/..."). mschoch, you mentioned breaking up the index so you don't have to download the whole thing. That is not a concern of ours because the documentation is being loaded directly from the user's disk. However, splitting the database into multiple files would still be beneficial because you can run multiple Web Workers in the background, which will utilize multiple CPU cores, decreasing the time to parse the indices and speeding up queries.

And before you ask why we're not just providing a web server for running on localhost or Electron, the environments we have to support are EXTREMELY locked down. Users are not allowed to install anything; not even Firefox.

In the meantime, I'm going to start reading bleve's documentation to see if I can take a stab at implementing an index/store interface for Javascript. Thoughts? Comments?

ghost commented 6 years ago

This works for me too. The wasm-wip branch fails for me.

So "directly opening the files off of the disk" (the URL is "file:///usr/doc/...") means the files to be indexed? But what about the scorch index? Do you want to store that in browser local storage?

ChrisTrenkamp commented 6 years ago

The indices need to be created beforehand and dumped into a database, which will be some kind of optimized JavaScript file(s). The browser will then include them with a script tag.

I don't know what scorch does. Does it apply mass updates to an index? Either way, it was pulling in boltdb, which GopherJS cannot handle, which is why it was removed. For my use-case, these indices will be read-only. If there's an update, we'll be supplying the new search indices in our update process.

For situations where the database will be constantly updated, a server-side solution will be more appropriate.

mschoch commented 6 years ago

Can we get more specific about what in particular the problem with BoltDB is?

ChrisTrenkamp commented 6 years ago

If you try to compile BoltDB with GopherJS, this is what you'll get. Remove the compiler block in config_scorch.js to reproduce.

$ GOOS=darwin gopherjs run blevetest.go
gopherjs: Source maps disabled. Install source-map-support module for nice stack traces. See https://github.com/gopherjs/gopherjs#gopherjs-run-gopherjs-test.
C:\cygwin64\home\ct\go\bin\blevetest.go.539523515:1480
        throw err;
        ^

Error: runtime error: native function not implemented: syscall.Getpagesize
    at $callDeferred (C:\cygwin64\home\ct\go\bin\blevetest.go.539523515:1412:17)
    at $panic (C:\cygwin64\home\ct\go\bin\blevetest.go.539523515:1451:3)
    at throw$1 (C:\cygwin64\home\ct\go\bin\blevetest.go.539523515:2428:3)
    at Object.Getpagesize (C:\cygwin64\home\ct\go\bin\blevetest.go.539523515:5435:3)
    at Object.Getpagesize (C:\cygwin64\home\ct\go\bin\blevetest.go.539523515:12553:18)
    at Object.$init (C:\cygwin64\home\ct\go\bin\blevetest.go.539523515:147289:24)
    at Object.$init (C:\cygwin64\home\ct\go\bin\blevetest.go.539523515:152939:13)
    at Object.$init (C:\cygwin64\home\ct\go\bin\blevetest.go.539523515:192875:15)
    at $init (C:\cygwin64\home\ct\go\bin\blevetest.go.539523515:192951:14)
    at $goroutine (C:\cygwin64\home\ct\go\bin\blevetest.go.539523515:1471:19)
    at $runScheduled (C:\cygwin64\home\ct\go\bin\blevetest.go.539523515:1511:7)
    at $schedule (C:\cygwin64\home\ct\go\bin\blevetest.go.539523515:1527:5)
    at $go (C:\cygwin64\home\ct\go\bin\blevetest.go.539523515:1503:3)
    at Object.<anonymous> (C:\cygwin64\home\ct\go\bin\blevetest.go.539523515:192966:1)
    at Object.<anonymous> (C:\cygwin64\home\ct\go\bin\blevetest.go.539523515:192969:4)
    at Module._compile (module.js:643:30)
    at Object.Module._extensions..js (module.js:654:10)
    at Module.load (module.js:556:32)
    at tryModuleLoad (module.js:499:12)
    at Function.Module._load (module.js:491:3)
    at Function.Module.runMain (module.js:684:10)
    at startup (bootstrap_node.js:187:16)
    at bootstrap_node.js:608:3

EDIT: I forgot I made a tweak to BoltDB:

//bolt_js.go
package bolt

// maxMapSize represents the largest mmap size supported by Bolt.
const maxMapSize = 0x7FFFFFFF // 2GB

// maxAllocSize is the size used when creating array pointers.
const maxAllocSize = 0xFFFFFFF

// Are unaligned load/stores broken on this arch?
var brokenUnaligned = false

mschoch commented 6 years ago

Thanks, so usually syscall availability is best worked around with build tags. Does gopherjs use or introduce any build tags that can be used to cordon off such things?

I know bolt is basically closed to changes at this point; has anyone tried bbolt to see if it has the same issue? Or could we possibly get some build tag changes introduced there to make it work?

ChrisTrenkamp commented 6 years ago

Yes, you can separate GopherJS and WASM with a js build tag.
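
For reference, a sketch of how such a file can be cordoned off (file and package names are illustrative). GopherJS satisfies the js tag but not wasm, while Go's WebAssembly port (GOOS=js GOARCH=wasm) satisfies both, so the two can be split like this:

    //pagesize_gopherjs.go -- built only by GopherJS
    // +build js,!wasm

    package mystore

    // Avoid syscall.Getpagesize, which GopherJS does not implement.
    const pageSize = 4096

The WebAssembly counterpart would carry // +build js,wasm, and the native file // +build !js, each providing its own pageSize.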

ChrisTrenkamp commented 6 years ago

bbolt will have the same issue. BoltDB is designed around memory maps, and pure Javascript does not have that kind of capability. However, WASM, even though it's not planned any time soon, will eventually support memory maps.

mschoch commented 6 years ago

Right, so that's what I was trying to get at. If mmap is a non-starter, there is no point in worrying about boltdb or bbolt; scorch relies on mmap as well, so the whole effort is a dead-end.

ChrisTrenkamp commented 6 years ago

Do you mean a gopherjs/bleve build is a dead-end, or just the scorch portion of it?

ChrisTrenkamp commented 6 years ago

As I understand it, scorch is going to be a replacement for the pluggable stores. Will bleve's API continue to use interfaces when accessing the index, or will they be deprecated in favor of accessing the scorch API directly? Will the stores under index/store be deprecated?

mschoch commented 6 years ago

I don't think a gopherjs/bleve build is a dead-end; I'd love to see this working.

The plans are coming together still, and I expect some official communication this week, but roughly:

However, through all these releases, the index remains an interface, and alternate js-frontend-friendly implementations are possible (though likely would be maintained outside the main tree). Beyond the 1.1 release, we'll begin looking at 2.0 improvements which require API changes.

ChrisTrenkamp commented 6 years ago

I can see a Javascript build serving two purposes:

  1. A pre-built database to serve on a static website (like, say, Hugo).
  2. Running alongside a Node.JS service or application, with the disclaimer that it's nowhere near as robust as the BoltDB implementation.

Other than this patch, I don't think anything else is required. The gtreap implementation should be good enough for a JS build.

The one problem I'm trying to solve is serializing the store to a file so it can be read back in. Anyone with some experience with GopherJS have some ideas?

ChrisTrenkamp commented 6 years ago

I haven't found a good way to embed a prebuilt index with the compiled bleve code, so for now it's saving the index in a separate file that the user can load.

The index file is written out line-by-line, with the key on one line and the value on the next. When saving the index, it descends the tree and writes out the values. The keys and values are encoded as base64 and gzipped.
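
A rough sketch of that layout (the exact layering of gzip and base64 here is an assumption, and the function names are made up):

    //indexfile.go -- hypothetical save/load for the line-oriented format
    package indexfile

    import (
        "compress/gzip"
        "encoding/base64"
        "io"
        "io/ioutil"
        "strings"
    )

    // SaveKV writes each key and value as a base64 line, through gzip.
    func SaveKV(w io.Writer, pairs [][2][]byte) error {
        gz := gzip.NewWriter(w)
        for _, kv := range pairs {
            line := base64.StdEncoding.EncodeToString(kv[0]) + "\n" +
                base64.StdEncoding.EncodeToString(kv[1]) + "\n"
            if _, err := gz.Write([]byte(line)); err != nil {
                return err
            }
        }
        return gz.Close()
    }

    // LoadKV reads the same format back into memory.
    func LoadKV(r io.Reader) ([][2][]byte, error) {
        gz, err := gzip.NewReader(r)
        if err != nil {
            return nil, err
        }
        defer gz.Close()
        raw, err := ioutil.ReadAll(gz)
        if err != nil {
            return nil, err
        }
        lines := strings.Split(strings.TrimRight(string(raw), "\n"), "\n")
        pairs := make([][2][]byte, 0, len(lines)/2)
        for i := 0; i+1 < len(lines); i += 2 {
            key, err := base64.StdEncoding.DecodeString(lines[i])
            if err != nil {
                return nil, err
            }
            val, err := base64.StdEncoding.DecodeString(lines[i+1])
            if err != nil {
                return nil, err
            }
            pairs = append(pairs, [2][]byte{key, val})
        }
        return pairs, nil
    }

One nice property of this scheme is that the base64 alphabet never contains a newline, so the line delimiter is safe no matter what bytes appear in the keys or values.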

The good news is the index can be created on any platform in any way you want. The bad news is it couldn't handle my worst-case scenario, which is an index file that's about a gigabyte. Compiled natively, it takes about a minute to load. In the browser, I don't know how long it would have taken because I killed it after I was done eating lunch.

I'm not sure where the bottleneck is yet, but my gut feeling is that it's the gzip decompression. GopherJS doesn't seem to handle CPU-intensive operations very well, or at least operations that rely on bitwise computations.

I ran a smaller test case, and what normally would have taken a second to load ended up taking 33 seconds in the browser. Using that same test case, a query that took 1 millisecond natively ended up taking 135 milliseconds, though I don't know for sure if this will scale linearly with larger indices.

I'll keep experimenting with this. In the meantime, are there any characters that cannot be put into the keys or values in a bleve store? Maybe instead of storing the indices as a line-delimited base64 file, it can use a character that cannot be used in keys or values as a delimiter.

ChrisTrenkamp commented 6 years ago

I changed the gtreap package to use gogo/protobuf. Things are much better. In the browser, the same small test case now loads in 6 seconds.

However, it still can't handle my worst-case scenario because the Javascript runtime can't allocate a buffer large enough to read in the protobuf message.

I'll try splitting the index into multiple files and aggregate the results using Aliases.
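
bleve.NewIndexAlias already does the aggregation: an alias implements the same Index interface and scatter-gathers searches across its members. A minimal sketch with two in-memory shards (how the data actually gets split is the open question):

    package main

    import (
        "fmt"

        "github.com/blevesearch/bleve"
    )

    func main() {
        mapping := bleve.NewIndexMapping()

        // Two in-memory "shards"; in practice each would be loaded from its
        // own serialized file.
        shard0, err := bleve.NewMemOnly(mapping)
        if err != nil {
            panic(err)
        }
        shard1, err := bleve.NewMemOnly(mapping)
        if err != nil {
            panic(err)
        }
        shard0.Index("doc1", map[string]string{"Body": "bleve indexing is easy"})
        shard1.Index("doc2", map[string]string{"Body": "bleve searching is easy too"})

        // The alias fans the query out to both shards and merges the results.
        alias := bleve.NewIndexAlias(shard0, shard1)
        query := bleve.NewQueryStringQuery("bleve")
        result, err := alias.Search(bleve.NewSearchRequest(query))
        if err != nil {
            panic(err)
        }
        fmt.Println(result)
    }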

ghost commented 6 years ago

This is really awesome. I wonder if FlatBuffers is worth trying. Are you holding the inverted index in memory in JS land?

ChrisTrenkamp commented 6 years ago

I'm sure FlatBuffers would be worth it, but the API is too cumbersome and I need a working proof-of-concept ASAP (before I get pulled off onto other projects).

EDIT: Everything's in memory. It has to be in order to work in the browser. If you're running on Node.js, maybe wrappers could be made around boltdb or LMDB.

ChrisTrenkamp commented 6 years ago

I need to shelve this for now because some higher-priority projects are in my pipeline that I was supposed to start last week. I'll upload my experiments soon. Here are some notes:

  • Turns out FlatBuffers isn't a nice-to-have, it's a necessity. Protobuf is fast, but when you get upwards of a gigabyte of data, no matter how you break it down, it brings Firefox and Chrome to their knees. Note that it's not loading a gigabyte of data that's the problem; Firefox and Chrome can do that just fine. It's unmarshaling that gigabyte of data that will cause the browser window to crash. FlatBuffers should alleviate this, but I imagine it'll slow down the query time.
  • Speaking of loading the database, I had to resort to loading the files through the HTML File API. I tried encoding the files into a JavaScript file that looked like window.db0 = new Uint8Array([34, 255, 0...]);. The browser eventually hung when trying to parse these files. Surely I'm not the only one who ran into this problem, and there has to be a library out there that can encode binary data into a script that can just be stuck into the HTML page.
  • Since the target use-case is single-user or read-only database applications, can a different store using a b-tree structure be used?

ghost commented 6 years ago

Yeah, upload the code.

  1. I have been using FlatBuffers and can try it.
  2. WASM and Go are now best friends. There are examples using WASM with gRPC, with protocol buffers and flat buffers. So that is the next thing to try.

Also, I expect a service worker / web worker approach will help with the blocking issues. But that's after 1 and 2, I feel.

ChrisTrenkamp commented 6 years ago

Here are the gtreap changes.

Here's the code I ran against GopherJS.

ChrisTrenkamp commented 6 years ago

Reflecting on this, I made a mistake trying to retrofit the gtreap store for this. There are too many browser limitations when working with non-JavaScript code, and the size of the store is too big. A completely different approach is needed.

I like Lunr.js's approach to creating the store: it's an array of plain JSON objects. Can a store be written in Bleve that uses the same approach? My bad, that was setting up the documents to be indexed, not the index itself. I still need a way to decrease the size of the index.

ghost commented 6 years ago

Maybe you can change the problem, and then the solution changes too.

Do you need this to run in web browsers? Can't you use a fat client with a webview inside it?

zellyn commented 2 years ago

Sorry to necro this discussion, but would the techniques used in Hosting SQLite databases on github pages work on Bleve indexes?