Consensys / Mahuta

IPFS Storage service with search capability
Apache License 2.0

Strange behavior on field sorting #46

Closed jollycar closed 5 years ago

jollycar commented 5 years ago

Situation

gjeanmart commented 5 years ago

Hi, thanks for reporting this issue. After investigation, this happens because ElasticSearch analyzes text fields with the Standard Analyzer by default (Standard Tokenizer, followed by the Standard Token Filter, Lower Case Token Filter and Stop Token Filter). So when you index a document { "title": "my title 1"}, the index creates three references to this document (my, title, 1) to support full-text search natively.

But when sorting, it orders all of these references alphabetically and only removes duplicates afterwards, so a document is positioned by one of its tokens rather than by its full title.

That's how ElasticSearch works by default and this can be tweaked a little bit but not really through IPFS-Store.
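To make the effect concrete, here is a small Python sketch of the behavior described above. It is a simplified model, not ElasticSearch code: `tokenize` is a rough stand-in for the default analyzer (it only splits on non-alphanumeric characters and lowercases), and the function names and sample titles are made up for illustration.

```python
import re


def tokenize(text):
    # Rough stand-in for the default analyzer (assumption):
    # split on non-alphanumeric characters and lowercase everything.
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]


def sort_by_tokens(titles):
    # One (token, title) reference per token, sorted alphabetically...
    refs = sorted((tok, title) for title in titles for tok in tokenize(title))
    # ...then duplicates are dropped, keeping each title's first appearance.
    seen, ordered = set(), []
    for _, title in refs:
        if title not in seen:
            seen.add(title)
            ordered.append(title)
    return ordered


def sort_by_keyword(titles):
    # What a keyword (unanalyzed) field gives: sort on the whole string.
    return sorted(titles)


titles = ["zulu alpha", "bravo", "charlie delta"]
print(sort_by_tokens(titles))   # ['zulu alpha', 'bravo', 'charlie delta']
print(sort_by_keyword(titles))  # ['bravo', 'charlie delta', 'zulu alpha']
```

"zulu alpha" comes first in the token-based ordering because its token "alpha" sorts before every other token, which is the kind of surprising result reported in this issue.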

As a short term solution, you can configure manually the index field mapping in ElasticSearch like this:

POST http://127.0.0.1:9200/documents/documents/_mapping
{
    "documents": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "raw": { 
              "type":  "keyword"
            }
          }
        }
      }
    }
}
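If you prefer to script this step, the request above could be prepared from Python roughly as follows. This is only a sketch using the standard library; the helper name `build_mapping_request` is hypothetical, and the URL, index and type names are simply the ones from the example above.

```python
import json
from urllib.request import Request


def build_mapping_request(es_url, index, doc_type, field):
    # Same body as the manual request: a text field with a "raw" keyword sub-field.
    body = {
        doc_type: {
            "properties": {
                field: {
                    "type": "text",
                    "fields": {"raw": {"type": "keyword"}},
                }
            }
        }
    }
    return Request(
        url=f"{es_url}/{index}/{doc_type}/_mapping",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_mapping_request("http://127.0.0.1:9200", "documents", "documents", "title")
print(request.get_method(), request.full_url)
# The request is only built here; sending it would be urllib.request.urlopen(request).
```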

Run the query (with sort=title.raw):

GET query/search?index=documents&page=0&size=1000&sort=title.raw&dir=ASC
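For reference, the same query string can be assembled programmatically. In this sketch the base URL `http://localhost:8040` is a placeholder for wherever your API is running (an assumption, not something stated in this thread), and `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_search_url(base_url, index, sort_field, direction="ASC", page=0, size=1000):
    # Parameter names are the ones used in the query above.
    params = {
        "index": index,
        "page": page,
        "size": size,
        "sort": sort_field,
        "dir": direction,
    }
    return f"{base_url.rstrip('/')}/query/search?{urlencode(params)}"


# Placeholder base URL: replace with the address of your own API instance.
url = build_search_url("http://localhost:8040", "documents", "title.raw")
print(url)
```

Sorting on `title.raw` instead of `title` is the only change needed on the query side.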

In the future, I will try to make IPFS-Store more configurable for this.

Thanks again for raising this issue!

Greg

gjeanmart commented 5 years ago

As per the recent refactoring, it is now possible to pre-configure an ElasticSearch index mapping in the API in order to pre-create the index on startup with the necessary index fields mapping.

In your case, you could create a mapping file index_mapping.json like this:

{
  "mappings": {
    "_doc": {
      "properties": {
        "__hash": {
          "type": "keyword"
        },
        "__content_type": {
          "type": "keyword"
        },
        "title": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

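The index field title is indexed in two ways: as title (type text), which is analyzed into tokens for full-text search, and as title.keyword (type keyword), which is stored as a single unanalyzed value and can be used for sorting.

(__hash and __content_type are two fields required by Mahuta)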
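If several text fields need to be sortable, a mapping of this shape could be generated instead of written by hand. The sketch below is an illustration only: `build_index_mapping` is a hypothetical helper, and it assumes the same `_doc` type and the two required fields shown in the mapping above.

```python
import json


def build_index_mapping(sortable_text_fields):
    # Fields shown as required in the mapping above.
    properties = {
        "__hash": {"type": "keyword"},
        "__content_type": {"type": "keyword"},
    }
    for name in sortable_text_fields:
        # Full-text search on <name>, exact-value sorting on <name>.keyword.
        properties[name] = {
            "type": "text",
            "fields": {"keyword": {"type": "keyword"}},
        }
    return {"mappings": {"_doc": {"properties": properties}}}


mapping = build_index_mapping(["title"])
print(json.dumps(mapping, indent=2))
```

The printed JSON can then be saved as the mapping file.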

Once the file is in place on the server, we need to pass the following arguments to the API:

-Dspring-boot.run.arguments=--mahuta.elasticsearch.host=localhost, --mahuta.elasticsearch.port=9300, --mahuta.elasticsearch.clusterName=docker-cluster, --mahuta.elasticsearch.indexConfigs={"name":"document", "map":"index_mapping"}

indexConfigs takes an array of index name / mapping config pairs, so on startup the API creates each of these indexes with its configured mapping.