Strange behavior on field sorting

jollycar commented 5 years ago

Situation

I am using the method IFSStore.index(String indexName, String hash, String id, String contentType, List indexFields) with several index fields, including "title".
I upload and index several new files with different titles: "my title 1", "my title 2", "my title 3", "my title 4" (I also tried inserting them in a different order: no difference)
When I query for the data with "query/search?index=documents&page=0&size=1000&sort=title&dir=ASC", the order is as expected ("my title 1", "my title 2", "my title 3", "my title 4"). But if I upload new files with the titles "2018-10-30 16:58:47", "2018-10-30 16:59:11", "2018-10-30 16:19:48", the order after querying is completely wrong ("2018-10-30 16:58:47", "2018-10-30 16:19:48", "2018-10-30 16:59:11").

Other behavior:
When I only use hex-strings as titles (0-9a-f), the order is also correct
When I mix the previous entries ("my title 1", "my title 2", "my title 3", "my title 4") with the other entries ("2018-10-30 16:58:47", "2018-10-30 16:19:48", "2018-10-30 16:59:11"), the order is also wrong: ("my title 1", "2018-10-30 16:58:47", "2018-10-30 16:19:48", "2018-10-30 16:59:11", "my title 2", "my title 3", "my title 4")

gjeanmart commented 5 years ago

Hi, thanks for reporting this issue. After investigation, this happens because ElasticSearch analyzes text fields with the Standard Analyzer by default (Standard Tokenizer, Lower Case Token Filter, Stop Token Filter). So when you index a document { "title": "my title 1" }, the index creates three references to this document (my, title, 1) to allow full-text search natively.

But when sorting, it orders all references alphabetically and removes duplicates afterwards:

1
2
3
4
2018-10-30
my
title

That's how ElasticSearch works by default and this can be tweaked a little bit but not really through IPFS-Store.
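
The effect can be reproduced outside ElasticSearch with a rough simulation. The sketch below is an approximation, not ElasticSearch's implementation: `tokenize` is a hypothetical stand-in for the default analyzer (lowercase, split on anything that is not a letter or digit, so the real tokenizer may split some values differently), and `analyzed_sort_key` assumes that an ascending sort on a multi-valued field compares the smallest token of each document.

```python
import re

def tokenize(title):
    """Rough stand-in for the default analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", title.lower())

def analyzed_sort_key(title):
    """Assumed ascending sort key on an analyzed field: the smallest token."""
    return min(tokenize(title))

titles = [
    "my title 1", "my title 2", "my title 3", "my title 4",
    "2018-10-30 16:58:47", "2018-10-30 16:59:11", "2018-10-30 16:19:48",
]

# Print each title next to the token that decides its position.
for title in sorted(titles, key=analyzed_sort_key):
    print(analyzed_sort_key(title), "<-", title)
```

Under these assumptions every date-like title collapses to the same key, which sorts between the keys of "my title 1" and "my title 2". That would explain why the three date entries end up after "my title 1" and in no meaningful order among themselves.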

As a short-term solution, you can manually configure the index field mapping in ElasticSearch like this:

POST http://127.0.0.1:9200/documents/documents/_mapping
{
  "documents": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Run the query (with sort=title.raw):

GET query/search?index=documents&page=0&size=1000&sort=title.raw&dir=ASC
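
For anyone scripting this workaround, here is a minimal sketch that builds the same mapping body and search path programmatically. The helper names `keyword_subfield_mapping` and `search_path` are hypothetical and not part of IPFS-Store; the sketch only constructs the payload and the URL, and sending them is left to whichever HTTP client you use.

```python
import json
from urllib.parse import urlencode

def keyword_subfield_mapping(doc_type, field, subfield="raw"):
    """Build a mapping body that adds a keyword sub-field to a text field."""
    return {
        doc_type: {
            "properties": {
                field: {
                    "type": "text",
                    "fields": {subfield: {"type": "keyword"}},
                }
            }
        }
    }

def search_path(index, sort_field, direction="ASC", page=0, size=1000):
    """Build the relative search URL, sorting on the given field."""
    params = {
        "index": index,
        "page": page,
        "size": size,
        "sort": sort_field,
        "dir": direction,
    }
    return "query/search?" + urlencode(params)

# Body for the _mapping request, then the query that sorts on the sub-field.
print(json.dumps(keyword_subfield_mapping("documents", "title"), indent=2))
print(search_path("documents", "title.raw"))
```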

In the future, I will try to make IPFS-Store more configurable for this.

Thanks again for raising this issue!

Greg

gjeanmart commented 5 years ago

Following the recent refactoring, it is now possible to pre-configure an ElasticSearch index mapping in the API so that the index is created on startup with the necessary field mappings.

In your case, you could create a mapping file index_mapping.json like this:

{
  "mappings": {
    "_doc": {
      "properties": {
        "__hash": {
          "type": "keyword"
        },
        "__content_type": {
          "type": "keyword"
        },
        "title": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

The index field title is indexed in two ways:

text: used for full-text and approximate search
keyword: used for exact search and sorting

(__hash and __content_type are two fields required by Mahuta)
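
To illustrate why the keyword sub-field gives the expected order, here is a small sketch. It assumes a keyword field stores each title as one unanalyzed term and that terms are compared character by character, which for plain ASCII titles matches Python's default string ordering. `keyword_sort_key` is a hypothetical name used for illustration only.

```python
def keyword_sort_key(title):
    """A keyword field keeps the whole value as a single term, so it is its own key."""
    return title

titles = [
    "my title 2", "2018-10-30 16:58:47", "my title 1",
    "2018-10-30 16:59:11", "2018-10-30 16:19:48",
]

# Whole-string comparison: timestamps sort chronologically, digits before letters.
for title in sorted(titles, key=keyword_sort_key):
    print(title)
```

Note that a comparison like this is case-sensitive, so titles starting with an uppercase letter would sort before lowercase ones.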

Once the file is in place on the server, pass the following arguments to the API:

-Dspring-boot.run.arguments=--mahuta.elasticsearch.host=localhost, --mahuta.elasticsearch.port=9300, --mahuta.elasticsearch.clusterName=docker-cluster, --mahuta.elasticsearch.indexConfigs={"name":"document", "map":"index_mapping"}

indexConfigs takes an array of index name / mapping pairs, so the API creates each of these indexes with its config on startup.

Consensys / Mahuta

Strange behavior on field sorting #46