
SIT (Streaming Index Toy)

SIT is a lightweight TCP server that provides real-time full-text search over streams of JSON documents. It's also usable as a C library, where it can parse any stream using custom parsers.

Why?

Protocol

SIT speaks a simple line-based pipelineable protocol. Any line that starts with a curly brace ({) is interpreted as a JSON document to add to the indexed dataset. Other lines are interpreted as commands.
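The dispatch rule above can be sketched in a few lines of Python. This only illustrates the rule as described here and is not SIT's C implementation; classify_line is a hypothetical name.

```python
def classify_line(line: str) -> str:
    """Return "document" for lines beginning with '{', else "command"."""
    # The rule as stated: a leading curly brace marks a JSON document.
    return "document" if line.startswith("{") else "command"


print(classify_line('{"hello":"world"}'))
print(classify_line("query hello ~ world;"))
```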

Every response is a single line of JSON with a status key whose value is either ok or error. The remaining keys vary by command.
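A client might decode each response line roughly as follows. This is a minimal sketch that assumes every response is a JSON object; parse_response is a hypothetical helper, not part of SIT.

```python
import json


def parse_response(line: str) -> dict:
    """Decode one response line and check the documented status key."""
    response = json.loads(line)
    status = response.get("status")
    # Only "ok" and "error" are documented; anything else is treated as a bug.
    if status not in ("ok", "error"):
        raise ValueError(f"unexpected status: {status!r}")
    return response


reply = parse_response('{"status": "ok", "message": "added", "doc_id": 0}')
print(reply["status"], reply["message"])
```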

Search & Percolation

SIT provides traditional search, where you give a query and get a resultset back. It also provides percolation, where you can register a query that will notify you when matching documents are added to the index.
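The difference between the two modes can be shown with a toy in-memory model. This is not how SIT is implemented; ToyIndex and its methods are hypothetical, and queries are modelled as plain Python predicates rather than SIT query strings.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]
Predicate = Callable[[Doc], bool]


class ToyIndex:
    def __init__(self) -> None:
        self.docs: List[Doc] = []
        self.percolators: Dict[int, Predicate] = {}

    def search(self, predicate: Predicate) -> List[int]:
        """Traditional search: scan documents that were already added."""
        return [i for i, doc in enumerate(self.docs) if predicate(doc)]

    def register(self, predicate: Predicate) -> int:
        """Percolation: store a query to run against future documents."""
        query_id = len(self.percolators)
        self.percolators[query_id] = predicate
        return query_id

    def add(self, doc: Doc) -> List[int]:
        """Add a document and return the ids of registered queries it matches."""
        self.docs.append(doc)
        return [qid for qid, pred in self.percolators.items() if pred(doc)]


def hello_world(doc: Doc) -> bool:
    return "world" in str(doc.get("hello", "")).split()


index = ToyIndex()
query_id = index.register(hello_world)
notified = index.add({"hello": "world"})   # the registered query fires
ignored = index.add({"hello": "there"})    # no registered query matches
found = index.search(hello_world)          # search looks backwards instead
```

Search evaluates one query against many stored documents; percolation evaluates one new document against many stored queries.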

Commands

Query Language

SIT has a simple query language, composed of boolean operations (AND, OR, NOT) over clauses. You can append LIMIT N to the end of a query. Queries are terminated with either a newline or a semicolon. The following are valid clauses:

  field_name ~ string 
  field_name > integer
  field_name = integer
  field_name < integer
  field_name >= integer
  field_name <= integer
  field_name != integer
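The intended meaning of the integer clauses can be sketched against a single document. This is an illustration only: matches_integer_clause is a hypothetical helper, and the choice that a missing or non-integer field never matches is an assumption, not documented SIT behavior.

```python
import operator

# The six integer comparison operators listed above.
COMPARATORS = {
    ">": operator.gt,
    "=": operator.eq,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "!=": operator.ne,
}


def matches_integer_clause(doc: dict, field: str, op: str, value: int) -> bool:
    """Evaluate `field op value` against one document."""
    actual = doc.get(field)
    # Assumption: a missing or non-integer field never matches.
    if isinstance(actual, bool) or not isinstance(actual, int):
        return False
    return COMPARATORS[op](actual, value)


doc = {"retweets": 12}
print(matches_integer_clause(doc, "retweets", ">=", 10))
print(matches_integer_clause(doc, "retweets", "!=", 12))
```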

Tokenization

SIT is designed to use pluggable tokenization strategies.

TODO: Describe tokenization 101 basics, concepts, examples.

Supported tokenization strategies

Pull requests are welcome.

What is the tilde?

The tilde indicates a full-text search. A full-text search identifies documents where a given term is present in the specified field.

The full-text search title ~ hello matches JSON documents that have a field named title containing the token hello. For example, it matches the document {"title":"hello full text search"} when the field is tokenized with a simple whitespace tokenizer.
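As a rough model of that example, assuming a whitespace tokenizer and exact token comparison (both assumptions; matches_term and whitespace_tokens are hypothetical names):

```python
from typing import List


def whitespace_tokens(text: str) -> List[str]:
    """Split text on runs of whitespace."""
    return text.split()


def matches_term(doc: dict, field: str, term: str) -> bool:
    """True when `term` is one of the tokens of the named string field."""
    value = doc.get(field)
    return isinstance(value, str) and term in whitespace_tokens(value)


doc = {"title": "hello full text search"}
print(matches_term(doc, "title", "hello"))
print(matches_term(doc, "title", "hell"))
```

The second call does not match because hell is only a prefix of a token, not a whole token.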

A quoted value for the tilde operator will match documents where all the terms of the quoted text are present in the named field. For example, title ~ "hello world" will be transformed into (title ~ hello AND title ~ world).
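The rewrite of a quoted value can be sketched as a string transformation. This assumes the quoted text is split on whitespace, which in SIT would depend on the tokenizer in use; expand_phrase is a hypothetical helper.

```python
def expand_phrase(field: str, phrase: str) -> str:
    """Rewrite `field ~ "a b"` as a conjunction of single-term clauses."""
    terms = phrase.split()  # assumption: whitespace tokenization
    if not terms:
        raise ValueError("phrase contains no terms")
    clauses = [f"{field} ~ {term}" for term in terms]
    if len(clauses) == 1:
        return clauses[0]
    return "(" + " AND ".join(clauses) + ")"


print(expand_phrase("title", "hello world"))
```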

TODO

Quick Start

Feeling adventurous? Run SIT on your own system and try some of these demos.

Downloading and building SIT

git clone https://github.com/fizx/sit.git
cd sit
bundle install
bundle exec rake

Demo: Running SIT with simple inputs and searches

Start the server with ./sit. In this mode it reads documents and commands from standard input and prints its responses to standard output.

./sit
{"hello":"world"}
# {"status": "ok", "message": "added", "doc_id": 0}
query hello ~ world;
# {"status": "ok", "message": "querying", "id": 0}
# {"status": "ok", "message": "found", "query_id": 0, "doc_id": 0, "doc":
# {"hello":"world"}}
# {"status": "ok", "message": "complete", "id": 0}

Demo: Twitter Streaming API

Reminder: this is pre-release software. We use the Twitter Streaming API as test input to find new edge cases that cause crashes.

You should assume that this demo has a 50/50 chance to format your hard drive, and proceed accordingly (you're piping raw internet data through a pre-release C program).

First, install and authenticate twurl.

Next, start a server listening to the network.

./sit --port 4000
[INFO] [2013:03:0221:15:53] Successfully started server.

Now stream documents from Twitter, via netcat, to your running server.

twurl -t -H stream.twitter.com /1/statuses/sample.json | nc localhost 4000
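Documents from other sources could be fed to the server in the same way. The sketch below writes each document as one JSON line to any binary stream; write_documents is a hypothetical helper, and the commented usage assumes a server started with --port 4000 as above. It does not read the server's responses.

```python
import json
from typing import BinaryIO, Iterable


def write_documents(stream: BinaryIO, docs: Iterable[dict]) -> int:
    """Write each document as a single JSON line; return how many were sent."""
    count = 0
    for doc in docs:
        # json.dumps emits one line that starts with '{' for a dict,
        # which is how the protocol recognises a document.
        stream.write(json.dumps(doc).encode("utf-8") + b"\n")
        count += 1
    stream.flush()
    return count


# Usage against a running server (not executed here):
#
#   import socket
#   with socket.create_connection(("localhost", 4000)) as conn:
#       with conn.makefile("wb") as stream:
#           write_documents(stream, [{"hello": "world"}])
```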