NeowayLabs / neosearch

Full Text Search Library

PROPOSAL: Handling types #27

Closed i4ki closed 8 years ago

i4ki commented 9 years ago

NeoSearch's development has simplicity in mind, and therefore we don't have a configuration mapping for indices yet. But this freedom has a cost, and it has now become hard to implement important new features without a good definition of how we will handle types.

I think that a schema definition sucks in so many ways that it should never even be an option in NeoSearch. If one is needed, I think it should be hidden from the user. The Solr and ES projects both have a schema, and the latter uses a dynamically generated one, which has shown itself to be more evil than good. I haven't read any good review on the internet of ES's fake schemaless approach. I don't know yet why a schema is a requirement for writing a search engine on top of the Lucene library. But... now, thinking about how to implement the required NeoSearch features, I can see that the existence of a schema simplifies the problem a lot on our side. I think we need to stress the available options more before inserting this beast inside NS.

Goals

1. Avoid automatically discovering the field types;
2. Avoid re-indexing all of the documents when a field type changes;
3. Avoid documents failing to be indexed when a new type arrives for a specific field; e.g. an id field first indexed as a long but later arriving as a UUID string.

Problem 1

The problem of mapping first appeared in NS when I started to fix issue #25. We still don't have date types; the implementation is easy, but the usage is impossible without some type definition at index creation or at document addition.

For example, how can we index the document below without prior information about the type of the updatedAt field?

{
     "id": 1,
     "name": "factotum",
     "updatedAt": "2015-05-31 00:36:54.739040784 -0300 BRT"
}

It's obvious that our problem comes from the inexpressiveness of JSON: it lacks information about the type of each field. The problem above can be solved by passing type information for the fields along with the document data.

Solution 1: Use document mapping instead of index mapping

Instead of creating a general mapping for the entire index, we can pass the mapping of the current document being indexed. Something like this:

{
    "mapping": {
        "id": "uint",
        "name": "string",
        "updatedAt": {
            "type": "date",
            "format": "2006-01-02 15:04:05 -0700 MST"
        }
    },
    "doc": {
        "id": 1,
        "name": "factotum",
        "updatedAt": "2015-05-31 00:36:54.739040784 -0300 BRT"
    }
}

We can omit the mappings for the obvious fields and guess the type much like ES does, but for the "updatedAt" field to be indexed as a date instead of a string, its mapping specification is required.
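
As an aside, the "format" value above is Go's reference-time layout, so the server could parse such dates with the standard library alone. A minimal sketch of that idea (not actual NeoSearch code):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Layout taken from the document mapping above (Go reference time).
	layout := "2006-01-02 15:04:05 -0700 MST"

	t, err := time.Parse(layout, "2015-05-31 00:36:54.739040784 -0300 BRT")
	if err != nil {
		panic(err)
	}

	// The parsed date can then be stored as a UTC epoch, as proposed
	// for the storage commands further below.
	fmt.Println(t.UTC().Unix())
}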

Pros: expressive syntax; easy to implement.
Cons: hard to index JSON documents already available on the internet.

I like the JSON specification above, with "mapping" and "doc" separated, to ease the process of indexing documents from other sources, but another idea could be the document below:

{
    "id": {
        "$value": "ee7cbc2b-1f26-42e4-ad94-a4d81929bcc5",
        "$type": "string"
    },
    "name": "factotum",
    "updatedAt": {
        "$value": "2015-05-31 00:36:54.739040784 -0300 BRT",
        "$type": "date",
        "$format": "2006-01-02 15:04:05 -0700 MST"
    }
}

The format above is nice, but it complicates the client interface much more. The dollar sign '$' in front of the field properties is needed to avoid confusion with objects/nested documents.
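
If we went this way, the indexer would need to distinguish a typed scalar from a real nested document. A minimal sketch of that check, assuming the '$'-prefixed convention above (isTypedValue is a hypothetical helper):

package main

import (
	"encoding/json"
	"fmt"
)

// isTypedValue reports whether a decoded JSON value is a "$"-annotated
// typed scalar rather than an ordinary nested document.
func isTypedValue(v interface{}) bool {
	obj, ok := v.(map[string]interface{})
	if !ok {
		return false
	}
	_, hasValue := obj["$value"]
	return hasValue
}

func main() {
	raw := []byte(`{
		"name": "factotum",
		"updatedAt": {"$value": "2015-05-31 00:36:54.739040784 -0300 BRT", "$type": "date"}
	}`)

	var doc map[string]interface{}
	if err := json.Unmarshal(raw, &doc); err != nil {
		panic(err)
	}

	for field, v := range doc {
		fmt.Printf("%s: typed value? %v\n", field, isTypedValue(v))
	}
}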

Pros: expressive syntax; easy to implement.
Cons: hard to index JSON documents already available on the internet; hard to index JSON generated by other sources (e.g. logstash).

The third option is to pass the mapping for the document being indexed through another channel, for example the URL query parameters:

POST http://$host:$port/<index>/100?id=uint&updatedAt=type:date,format:"2006-01-02 15:04:05 -0700 MST"
{
    "id": 1,
    "name": "factotum",
    "updatedAt": "2015-05-31 00:36:54.739040784 -0300 BRT"
}

This approach doesn't mess up the document's JSON with type information and is compatible with ES and Solr.

Pros: easy to implement.
Cons: very ugly; HTTP URL length limitations.
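
For completeness, the server side of this variant would have to parse the query string back into a mapping. A rough sketch, assuming the illustrative field=type:...,format:... grammar used above (parseMapping is a hypothetical helper):

package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseMapping turns the query string into a per-document mapping:
// field -> {property -> value}.
func parseMapping(rawQuery string) (map[string]map[string]string, error) {
	values, err := url.ParseQuery(rawQuery)
	if err != nil {
		return nil, err
	}
	mapping := make(map[string]map[string]string)
	for field, specs := range values {
		props := make(map[string]string)
		for _, part := range strings.Split(specs[0], ",") {
			kv := strings.SplitN(part, ":", 2)
			if len(kv) == 1 {
				props["type"] = kv[0] // shorthand form: id=uint
			} else {
				props[kv[0]] = kv[1]
			}
		}
		mapping[field] = props
	}
	return mapping, nil
}

func main() {
	query := `id=uint&updatedAt=type:date,format:2006-01-02 15:04:05 -0700 MST`
	mapping, err := parseMapping(query)
	if err != nil {
		panic(err)
	}
	fmt.Println(mapping) // map[id:map[type:uint] updatedAt:map[format:... type:date]]
}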

Does anyone have another idea for document-mapping?

The three ways of providing document-mappings described above aren't a problem for Neoway's use cases, so we are free to choose the best one.

Solution 2: Switch from JSON to BSON

MongoDB solves JSON's lack of types with BSON, which has properly typed fields. I think this could be a good way of solving this and other problems, but we would have to drop our REST service and start using a simple TCP server that handles connections from clients speaking BSON. Maybe we can start by using the available MongoDB clients.

Pros: no need for a library refactor; our REST service is still a proof of concept.
Cons: impossible to keep an HTTP REST service.
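
To illustrate the point about types: with a BSON codec (the gopkg.in/mgo.v2/bson package is used here only as an example; the client library choice is open), a time.Time survives a round trip with its type intact, which plain JSON cannot give us:

package main

import (
	"fmt"
	"time"

	"gopkg.in/mgo.v2/bson"
)

func main() {
	doc := bson.M{
		"id":        1,
		"name":      "factotum",
		"updatedAt": time.Now(),
	}

	// Encode: the time.Time becomes a native BSON Date, not a string.
	data, err := bson.Marshal(doc)
	if err != nil {
		panic(err)
	}

	var decoded bson.M
	if err := bson.Unmarshal(data, &decoded); err != nil {
		panic(err)
	}

	// The type survives the round trip.
	fmt.Printf("%T\n", decoded["updatedAt"]) // time.Time
}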

Problem 2: Field type changes

The second problem is: how will we handle already-indexed documents if a field's type changes in the future?

Today we have different approaches for storing integers, strings and floats. If the first document has an integer for some field, subsequent documents are forbidden from having another type for that field. This is a pain in the ass for data that changes frequently.

Solution 1: Index by types

If Problem 1 is solved, we will know exactly the type of each field at indexing time, and we can store the data for each field type separately in the storage. For this to work, we need some changes in NS internals: instead of one reverse index per field, we will have one reverse index per data type, per field.

For example, to index the document below:

{
     "id": 1,
     "name": "factotum",
     "updatedAt": "2015-05-31 00:36:54.739040784 -0300 BRT"
}

Knowing that "updatedAt" is of type "date", the commands below need to be executed against the storage engine:

USING "<index-name>.document.db" SET uint(1) '<body-of-document>';
USING "<index-name>.id_uint.idx" MERGESET uint(1) uint(1);
USING "<index-name>.name_str.idx" MERGESET "factotum" uint(1);
USING "<index-name>.updatedAt_date.idx" MERGESET uint(<UTC epoch>) uint(1);

The commands above will create the filesystem hierarchy below:

/data/<index-name>/
/data/<index-name>/document.db
/data/<index-name>/id_uint.idx
/data/<index-name>/name_str.idx
/data/<index-name>/updatedAt_date.idx

If we now index another document with a different field type for id, this isn't a problem:

{
    "id": "ee7cbc2b-1f26-42e4-ad94-a4d81929bcc5",
    "name": "factotum",
    "updatedAt": "2015-05-31 00:36:54.739040784 -0300 BRT"
}

And now the commands and the resulting filesystem hierarchy will be:

USING "<index-name>.document.db" SET uint(2) '<body-of-document>';
USING "<index-name>.id_str.idx" MERGESET "ee7cbc2b-1f26-42e4-ad94-a4d81929bcc5" uint(2);
USING "<index-name>.name_str.idx" MERGESET "factotum" uint(2);
USING "<index-name>.updatedAt_date.idx" MERGESET uint(<UTC epoch>) uint(2);

/data/<index-name>/
/data/<index-name>/document.db
/data/<index-name>/id_uint.idx
/data/<index-name>/id_str.idx
/data/<index-name>/name_str.idx
/data/<index-name>/updatedAt_date.idx
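
The per-type file naming above could be derived by a helper as small as the sketch below (indexFile is hypothetical, not existing NS code):

package main

import "fmt"

// indexFile derives the index file for a (field, type) pair, so the
// same field indexed with different types lands in different files.
func indexFile(index, field, typ string) string {
	return fmt.Sprintf("/data/%s/%s_%s.idx", index, field, typ)
}

func main() {
	fmt.Println(indexFile("companies", "id", "uint")) // /data/companies/id_uint.idx
	fmt.Println(indexFile("companies", "id", "str"))  // /data/companies/id_str.idx
}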

More problems? Other ideas?

cc @katcipis @richard-ps @rzanato @lucas-depaula @ppizarro

katcipis commented 9 years ago

I'll try to help, but be patient with my ignorance :-). Here we go...

To be or not to be schemaless

Reading the post you shared about ES, I understood that ES's problem with being schemaless is not that a schemaless search engine is impossible, but that Lucene is not schemaless... they tried to build schemaless features on top of something that is not schemaless.

My question:

Besides the date problem, do we need a schema? (Do some of our storage backends enforce a schema, like Lucene does?)

If the problem is only dates, couldn't we find a solution just for that problem? (It seems excessive to create a full mapping/schema thing just because of dates.)

BSON thing

I don't think this would make it impossible to use JSON (and a REST API); MongoDB actually lets you work with JSON and does the BSON conversion for you. But you would end up with the initial schemaless/date problem :-)

And I have the feeling it's kind of excessive if the problem is only dates.

Indexing by types

The solution seems OK to me, but we end up with the problem of inferring the type of each field, which works for all types except dates :-).

After we solve the date problem, it seems that indexing by types would be a good idea.

ppizarro commented 9 years ago

Problem 1

Data in MongoDB has a flexible schema. MongoDB's collections do not enforce document structure. This flexibility facilitates the mapping of documents to an entity or an object. Each document can match the data fields of the represented entity, even if the data has substantial variation. In practice, however, the documents in a collection share a similar structure.

Mongo fields can vary from document to document; there is no need to declare the structure of documents to the system, since documents are self-describing. If a new field needs to be added to a document, the field can be created without affecting all other documents in the system, without updating a central system catalog, and without taking the system offline. With the intuitive document data model, dynamic schema and idiomatic drivers, you can build applications and get to market faster with MongoDB.

Despite this, the data in Mongo has types:

http://docs.mongodb.org/manual/reference/bson-types/

I like the idea of using bson.

BSON also contains extensions that allow representation of data types that are not part of the JSON spec. For example, BSON has a Date type and a BinData type. BSON is lightweight: keeping spatial overhead to a minimum is important for any data representation format, especially when used over the network. It is also efficient: encoding data to BSON and decoding from BSON can be performed very quickly in the Go language. And we could avoid the overhead of establishing a new connection for each REST request.

We need smarter clients. This could be a good way to solve other problems too, such as sharding. The client would be responsible for converting the data to a BSON type.

Problem 2

I don't get why you have to create a separate index for each field type.

Cheers,

Paulo


katcipis commented 9 years ago

Answering @ppizarro

Problem 1

I like the idea of smart clients; it's exactly the same idea used by Aerospike. You can do a lot of neat tricks and optimizations, even reducing latencies when you write/read data. One of the disadvantages is that writing new clients becomes harder (the smarter the client, the worse).

Perhaps this won't be a problem we have to handle... but since we are converging to a more service-oriented architecture... loosely coupled... tech-agnostic... the chances of people wanting to use different languages to solve problems will be greater.

But thinking about it... at least at Neoway we already have a search service that provides more domain-specific search features, hiding neosearch, so the impact of having a smart client does not seem so horrible :-)

Although, by abandoning REST we lose all the infrastructure and tooling related to it (the traditional tradeoff between a simple text protocol and a more optimized binary one).

Problem 2

I'm not sure... but I think the backend APIs require that (leveldb... etc.). Also, it seems it will make joins easier to implement (I think we should not join an index if the type is different; that would be type coercion... and in my opinion that is usually not a good idea :-)

i4ki commented 9 years ago

I started the implementation of document-mapping in the library API at c57dfbb59de97e73cbadf1a3cd2729c88919ff75. I changed the name from mapping to metadata, as @katcipis suggested. I will leave the question regarding the client API for metadata open until we find a good solution, but for now I think we should continue with the REST API until release v0.1.

The metadata per document is optional, and the field types will be inferred by reflection if it is not supplied. If the metadata is supplied and a value isn't of the type the metadata specifies, NS will try to convert the value to the appropriate type.

The usage now looks like this:

    idx := index.New("companies", index.Config{}, true)

    docMetadata := index.Metadata{
        "id": index.Metadata{
            "type": "uint",
        },
        "name": index.Metadata{
            "type": "string",
        },
        "employees": index.Metadata{
            "type": "uint",
        },
    }

    // Metadata is passed per document, alongside the raw JSON.
    idx.Add(1, []byte(`{"id": 1, "name": "Neoway", "employees": "150"}`), docMetadata)

    // You can omit the metadata by passing nil.
    idx.Add(2, []byte(`{"id": 2, "name": "Others", "employees": 0}`), nil)

In the first document, note that the employees field has a string value in the document while uint is specified in the metadata. In that case, the value "150" will be converted to the unsigned integer 150.
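
The coercion step could look roughly like the sketch below (convertToUint is a hypothetical helper for illustration, not the actual code from c57dfbb):

package main

import (
	"fmt"
	"strconv"
)

// convertToUint coerces a JSON-decoded value to uint64 when the
// metadata declares the field as uint: strings are parsed, numbers
// are cast (encoding/json decodes JSON numbers as float64).
func convertToUint(v interface{}) (uint64, error) {
	switch val := v.(type) {
	case string:
		return strconv.ParseUint(val, 10, 64)
	case float64:
		return uint64(val), nil
	default:
		return 0, fmt.Errorf("cannot convert %T to uint", v)
	}
}

func main() {
	n, err := convertToUint("150")
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // 150
}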

katcipis commented 9 years ago

@tiago4orion Awesome work man :D

katcipis commented 8 years ago

Do we have full documentation of the available types?

i4ki commented 8 years ago

Nope...
