buda-base / public-digital-library

http://library.bdrc.io

fix etext highlight #903

Open eroux opened 3 months ago

eroux commented 3 months ago

The current way OpenSearch highlights etext fields (and probably other fields too) is to highlight every token in the result that also appears in the query. If the query contains "pa", every "pa" in the etext result gets highlighted, which makes the result very noisy. It should do its best to highlight only the tokens that are part of an actual match for the query.
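To make the difference concrete, here is a toy model in Python of the two behaviours described above. It is not how OpenSearch implements highlighting; the function names and the transliterated tokens are made up for illustration, and `<em>` is simply used as the marker.

```python
def highlight_any_token(tokens, query_tokens):
    """Mark every token that appears anywhere in the query (the noisy behaviour)."""
    wanted = set(query_tokens)
    return [f"<em>{t}</em>" if t in wanted else t for t in tokens]


def highlight_phrase_only(tokens, query_tokens):
    """Mark only tokens inside a contiguous occurrence of the whole query."""
    n = len(query_tokens)
    marked = [False] * len(tokens)
    if n:
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == query_tokens:
                for j in range(i, i + n):
                    marked[j] = True
    return [f"<em>{t}</em>" if m else t for t, m in zip(tokens, marked)]


# Hypothetical transliterated tokens, used only for illustration.
text = ["pa", "bya", "ngang", "pa", "ser", "po", "pa"]
query = ["ngang", "pa"]

noisy = highlight_any_token(text, query)      # every "pa" is marked
precise = highlight_phrase_only(text, query)  # only the "ngang pa" run is marked
```

In this model the first function marks all three "pa" tokens, while the second marks only the one that follows "ngang".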

Two possible strategies are:

eroux commented 3 months ago

this query seems to be working:

{
  "query": {
    "bool": {
      "should": [
        {
          "has_child": {
            "type": "etext",
            "query": {
              "nested": {
                "path": "chunks",
                "query": {
                  "match_phrase": {
                    "chunks.text_bo": "བྱ་ངང་བ་སེར་བོ་མཚོ་"
                  }
                },
                "inner_hits": {
                  "highlight": {
                    "fields": {
                      "chunks.text_bo": {
                        "highlight_query": {
                          "match_phrase": {
                            "chunks.text_bo": "བྱ་ངང་བ་སེར་བོ་མཚོ་"
                          }
                        }
                      }
                    }
                  }
                }
              }
            },
            "inner_hits": {
              "_source": {
                "includes": ["id"]
              },
              "highlight": {
                "fields": {
                  "chunks.text_bo": {
                    "highlight_query": {
                      "match_phrase": {
                        "chunks.text_bo": "བྱ་ངང་བ་སེར་བོ་མཚོ་"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      ]
    }
  }
}
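
Since the phrase appears three times in the query above, a small builder may help keep the copies in sync. This is only a sketch in Python: `build_etext_phrase_query` is a hypothetical helper, not part of any client library, and it just reproduces the structure of the query above for an arbitrary phrase.

```python
import json


def build_etext_phrase_query(phrase, field="chunks.text_bo", path="chunks"):
    """Build the has_child/nested query above, reusing the phrase for highlight_query."""

    def phrase_query():
        # Fresh dict on every call so the copies do not alias each other.
        return {"match_phrase": {field: phrase}}

    def highlight():
        return {"fields": {field: {"highlight_query": phrase_query()}}}

    return {
        "query": {
            "bool": {
                "should": [
                    {
                        "has_child": {
                            "type": "etext",
                            "query": {
                                "nested": {
                                    "path": path,
                                    "query": phrase_query(),
                                    "inner_hits": {"highlight": highlight()},
                                }
                            },
                            "inner_hits": {
                                "_source": {"includes": ["id"]},
                                "highlight": highlight(),
                            },
                        }
                    }
                ]
            }
        }
    }


body = build_etext_phrase_query("བྱ་ངང་བ་སེར་བོ་མཚོ་")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The resulting dict can be sent as the search request body by whatever client is in use.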

berger-n commented 3 months ago

just found out that the highlighted text is sometimes at the edge of the returned fragment: link
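
To check how often this happens, something like the following Python sketch could be run over returned fragments. It assumes the default `<em>` / `</em>` highlight tags; the helper names and the `min_context` threshold of 20 characters are arbitrary choices for illustration.

```python
def highlight_edge_distance(fragment, pre_tag="<em>", post_tag="</em>"):
    """Return (chars before first highlight, chars after last highlight), or None."""
    start = fragment.find(pre_tag)
    end = fragment.rfind(post_tag)
    if start == -1 or end == -1:
        return None  # no highlight in this fragment
    after = len(fragment) - (end + len(post_tag))
    return start, after


def is_at_edge(fragment, min_context=20):
    """True if a highlight has fewer than min_context characters of context on one side."""
    distances = highlight_edge_distance(fragment)
    return distances is not None and min(distances) < min_context
```

Counting the fragments for which `is_at_edge` returns True would give a rough idea of how common the problem is.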
