OrchardCMS / OrchardCore

Orchard Core is an open-source modular and multi-tenant application framework built with ASP.NET Core, and a content management system (CMS) built on top of that framework.
https://orchardcore.net
BSD 3-Clause "New" or "Revised" License

Correctly create custom analyzer in Elasticsearch #17013

Closed MikeAlhayek closed 1 week ago

MikeAlhayek commented 1 week ago

Fix #17012

/cc: @denispetrische

denispetrische commented 1 week ago

Let me see... This PR will disable adding custom token filters, but in the version on the main branch that works perfectly:

"OrchardCore_Elasticsearch": {
  "ConnectionType": "SingleNodeConnectionPool",
  "Url": "http://localhost",
  "Ports": [ 9200 ],
  "Analyzers": {
    "default": {
      "type": "standard",
      "filter": [
        "lowercase",
        "english_stop",
        "english_stemmer"
      ]
    },
    "custom_analyzer": {
      "type": "standard",
      "filter": [
        "lowercase",
        "english_stop",
        "english_stemmer",
        "english_synonyms"
      ]
    }
  },
  "Filter": {
    "english_stop": {
      "type": "stop",
      "stopwords": "_english_"
    },
    "english_stemmer": {
      "type": "stemmer",
      "language": "english"
    },
    "english_synonyms": {
      "type": "synonym_graph",
      "synonyms": ["page,cage"]
    }
  }
}

With this PR, the analyzer is created without token filters when they are specified as in the configuration above.

If I use your configuration from https://github.com/OrchardCMS/OrchardCore/issues/17012, yes, it works, but if I only add my custom token filter, the Elasticsearch index is not created:

"OrchardCore_Elasticsearch": {
  "ConnectionType": "SingleNodeConnectionPool",
  "Url": "http://localhost",
  "Ports": [ 9200 ],
  "Analyzers": {
    "standard": {
      "type": "standard"
    },
    "keyword": {
      "type": "keyword"
    },
    "whitespace": {
      "type": "whitespace"
    },
    "knowledgebase_analyzer": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "stop",
        "english_synonyms"
      ],
      "char_filter": [
        "html_strip"
      ]
    }
  },
  "Filter": {
    "english_stop": {
      "type": "stop",
      "stopwords": "_english_"
    },
    "english_stemmer": {
      "type": "stemmer",
      "language": "english"
    },
    "english_synonyms": {
      "type": "synonym_graph",
      "synonyms": [ "page,cage" ]
    }
  }
}

I need to understand what exactly no longer works, because everything on the main branch seems to work for me.

Anyway, we need to be able to add custom token filters that are listed in the "Filter" section to any analyzer.
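
For context, with the second configuration above the index that Elasticsearch creates should end up with analysis settings roughly like the following (a sketch that follows the standard index-settings layout; only the relevant parts are shown):

"analysis": {
  "filter": {
    "english_synonyms": {
      "type": "synonym_graph",
      "synonyms": [ "page,cage" ]
    }
  },
  "analyzer": {
    "knowledgebase_analyzer": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [ "lowercase", "stop", "english_synonyms" ],
      "char_filter": [ "html_strip" ]
    }
  }
}

That is, the english_synonyms definition from the "Filter" section lands under analysis.filter and is referenced by name from the analyzer's filter array.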

denispetrische commented 1 week ago

I think I found the reason; I will suggest changes soon.

MikeAlhayek commented 1 week ago

@denispetrische can you check out this branch and test the filters? Note that I renamed "Filter" to "TokenFilters".

So in your appsettings, use "TokenFilters" instead of "Filter" to define the custom token filters.
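
For example, with the rename, the english_stop definition from the earlier configuration moves under the new key; a sketch of the relevant appsettings fragment:

"TokenFilters": {
  "english_stop": {
    "type": "stop",
    "stopwords": "_english_"
  }
}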

denispetrische commented 1 week ago

@denispetrische can you check out this branch and test the filters? Note that I renamed "Filter" to "TokenFilters".

So in your appsettings, use "TokenFilters" instead of "Filter" to define the custom token filters.

I tested "fix logs message" locally and made my own pull request based on it: https://github.com/OrchardCMS/OrchardCore/pull/17019. Check it out. It works with every analyzer and with custom token filters.

I don't think it's a good idea to rename the "Filter" section to "TokenFilters", because "Filter" is what it is called in Elasticsearch terminology.

MikeAlhayek commented 1 week ago

@denispetrische

I tested "fix logs message" locally

Did you test this branch?

made my own pull request based

It's better to suggest a fix here instead of creating a new PR, to avoid redundant work.

I don't think it's a good idea to rename the "Filter" section to "TokenFilters", because "Filter" is what it is called in Elasticsearch terminology.

The Elastic docs refer to them as token filters: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html

It's better not to name them "Filters", so they are not confused with the "filter" property used when defining a custom analyzer.
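
For reference, in Elasticsearch's own index settings the word shows up in two places: analysis.filter holds the token filter definitions, while each custom analyzer has a filter array that references them by name. A minimal sketch using the names from this thread:

"analysis": {
  "filter": {
    "english_stop": { "type": "stop", "stopwords": "_english_" }
  },
  "analyzer": {
    "knowledgebase_analyzer": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [ "lowercase", "english_stop" ]
    }
  }
}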

denispetrische commented 1 week ago

Did you test this branch?

  • Yeah, it still doesn't work when I create my own analyzer with custom token filters

Ok, I'll suggest them here

MikeAlhayek commented 1 week ago

@denispetrische I have cleaned up the code quite a bit. Please revisit your suggestions and see whether any of them still apply. The code in its current state is much cleaner.

I just tested creating a new token filter and using it in my custom analyzer.

This is what I specified in the appsettings:

  "Analyzers": {
    "standard": {
      "type": "standard"
    },
    "keyword": {
      "type": "keyword"
    },
    "whitespace": {
      "type": "whitespace"
    },
    "knowledgebase_analyzer": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "stop",
        "english_stop"
      ],
      "char_filter": [
        "html_strip"
      ]
    }
  },
  "TokenFilters": {
    "english_stop": {
      "type": "stop",
      "stopwords": "_english_"
    }
  }

Then I created an index called tf_test using my custom knowledgebase_analyzer analyzer. When I query the index settings directly in Kibana, I get the expected structure:
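
The request itself isn't shown in the thread, but in Kibana Dev Tools it would be something like the following, which returns the settings shown below:

GET ptest_blog2_tf_test/_settings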

{
  "ptest_blog2_tf_test": {
    "settings": {
      "index": {
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_content"
            }
          }
        },
        "number_of_shards": "1",
        "provided_name": "ptest_blog2_tf_test",
        "creation_date": "1731606374163",
        "analysis": {
          "filter": {
            "english_stop": {
              "type": "stop",
              "stopwords": "_english_"
            }
          },
          "analyzer": {
            "knowledgebase_analyzer": {
              "filter": [
                "lowercase",
                "stop",
                "english_stop"
              ],
              "char_filter": [
                "html_strip"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "lDrk5TeTTaiD9lU27YUumg",
        "version": {
          "created": "8060199"
        }
      }
    }
  }
}

So the changes work with no problem.
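
As an extra sanity check (not part of the thread, and the sample text is made up), Elasticsearch's _analyze API can be run against the index to confirm the custom analyzer applies the char filter and token filters:

GET ptest_blog2_tf_test/_analyze
{
  "analyzer": "knowledgebase_analyzer",
  "text": "<p>The Quick Foxes</p>" // illustrative sample text, not from the thread
}

The returned tokens should come back lowercased, with the HTML markup stripped and the stopword "the" removed.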

denispetrische commented 1 week ago

"OrchardCore_Elasticsearch": {
  "ConnectionType": "SingleNodeConnectionPool",
  "Url": "http://localhost",
  "Ports": [ 9200 ],
  "Analyzers": {
    "standard": {
      "type": "standard"
    },
    "keyword": {
      "type": "keyword"
    },
    "whitespace": {
      "type": "whitespace"
    },
    "knowledgebase_analyzer": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "stop",
        "english_synonyms"
      ],
      "char_filter": [
        "html_strip"
      ]
    }
  },
  "Filter": {
    "english_stop": {
      "type": "stop",
      "stopwords": "_english_"
    },
    "english_stemmer": {
      "type": "stemmer",
      "language": "english"
    },
    "english_synonyms": {
      "type": "synonym_graph",
      "synonyms": [ "page,cage" ]
    }
  }
}

Yes, now it works for me as well, and the code quality is much better now. Good job :)