amundsen-io / amundsen

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.
https://www.amundsen.io/amundsen/
Apache License 2.0

Column filter doesn't work #2120

Closed beauttie closed 1 year ago

beauttie commented 1 year ago

Expected Behavior

If I search for a column by the exact name or a string with a wildcard, I would expect to see search results listing tables containing that column.

Current Behavior

Currently, no search results are returned when applying the column filter as shown in the screenshot below.

[Screenshot: Screen Shot 2023-03-20 at 10 01 20 PM]

I also see this same error message when searching for a column by the exact name.

Possible Solution

I was able to resolve this bug by changing the value of the column field in this line to column_names.keyword as shown in the screenshot below.

[Screenshot: Screen Shot 2023-03-20 at 9 58 27 PM]

Steps to Reproduce

I ran the search, metadata, and frontend services locally on my computer. I used this AwsSearchConfig module for the search service, a DevConfig module that connects to a Neo4j proxy client for the metadata service, and this LocalConfig module for the frontend.

Context

I couldn't deploy a version of the frontend with the column filter, as it would incorrectly return no search results. As instructed in these docs, I ran the configured SearchMetadatatoElasticsearchTask and confirmed that the new table index has the new mappings. When I fetch the new index via the Kibana console, I notice that there is a column_names property in addition to the columns property.

"mappings" : {
  "_meta" : {
    "version" : 2
  },
  "properties" : {
    ...,
    "column_names" : {
      "type" : "text",
      "fields" : {
        "keyword" : {
          "type" : "keyword",
          "ignore_above" : 256
        }
      }
    },
    "columns" : {
      "type" : "text",
      "fields" : {
        "general" : {
          "type" : "text",
          "term_vector" : "with_positions_offsets",
          "analyzer" : "general_analyzer"
        },
        "keyword" : {
          "type" : "keyword"
        },
        "ngram" : {
          "type" : "text",
          "term_vector" : "with_positions_offsets",
          "analyzer" : "ngram_analyzer_table_columns"
        }
      },
      "term_vector" : "with_positions_offsets",
      "analyzer" : "stemming_analyzer"
    },
    ...
  }
}
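
The presence of both properties can also be checked programmatically. The following is a minimal sketch, assuming the mapping's `properties` object has already been loaded into a Python dict (for example, from the index's mapping response). The helper name `keyword_subfields` is hypothetical and not part of Amundsen.

```python
from typing import Dict, List


def keyword_subfields(properties: Dict[str, dict]) -> List[str]:
    """Return the dotted path of every `keyword` sub-field found in `properties`."""
    paths = []
    for name, spec in properties.items():
        subfields = spec.get("fields", {})
        if subfields.get("keyword", {}).get("type") == "keyword":
            paths.append(f"{name}.keyword")
    return sorted(paths)


# Trimmed copy of the two properties shown in the mapping above.
properties = {
    "column_names": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    },
    "columns": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword"}},
        "analyzer": "stemming_analyzer",
    },
}

print(keyword_subfields(properties))
```

If both `column_names.keyword` and `columns.keyword` are listed, the index exposes two candidate fields for the column filter, which is the situation described here.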

I also called the Elasticsearch REST APIs directly, using the request body logged by es_proxy_v2_1, to confirm what I was seeing above. For example, when I submit this GET request, I get no hits:

GET table_search_index_v2_1/_search
{"query": {"bool": {"filter": [{"bool": {"should": [{"wildcard": {"columns.keyword": "*key"}}], "minimum_should_match": 1}}]}}, "from": 0, "size": 10, "highlight": {"fields": {"name": {"type": "fvh", "number_of_fragments": 0}, "description": {"type": "fvh", "number_of_fragments": 0}, "columns.general": {"type": "fvh", "number_of_fragments": 10, "order": "score"}, "column_descriptions": {"type": "fvh", "number_of_fragments": 5, "order": "score"}}}}

whereas this GET request (identical except for the field name under the wildcard key) returns hits:

GET table_search_index_v2_1/_search
{"query": {"bool": {"filter": [{"bool": {"should": [{"wildcard": {"column_names.keyword": "*key"}}], "minimum_should_match": 1}}]}}, "from": 0, "size": 10, "highlight": {"fields": {"name": {"type": "fvh", "number_of_fragments": 0}, "description": {"type": "fvh", "number_of_fragments": 0}, "columns.general": {"type": "fvh", "number_of_fragments": 10, "order": "score"}, "column_descriptions": {"type": "fvh", "number_of_fragments": 5, "order": "score"}}}}
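
The two request bodies differ only in the field targeted by the wildcard clause. As a rough sketch, the filter part of the body can be built from a field name and a pattern. The function name `column_wildcard_filter` is hypothetical, and the `highlight` section is omitted for brevity.

```python
import json


def column_wildcard_filter(field: str, pattern: str) -> dict:
    """Build the bool/filter/wildcard portion of the search body for one field."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {
                        "bool": {
                            "should": [{"wildcard": {field: pattern}}],
                            "minimum_should_match": 1,
                        }
                    }
                ]
            }
        },
        "from": 0,
        "size": 10,
    }


# The body that returned no hits and the body that returned hits in this issue.
failing = column_wildcard_filter("columns.keyword", "*key")
working = column_wildcard_filter("column_names.keyword", "*key")

print(json.dumps(working))
```

Either body can be pasted after `GET table_search_index_v2_1/_search` in the Kibana console to compare hit counts for the two fields.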

Your Environment

Amundsen version used:

I also have an AWS OpenSearch Service (OSS) domain using Elasticsearch 7.10.

boring-cyborg[bot] commented 1 year ago

Thanks for opening your first issue here!

allisonsuarez commented 1 year ago

You should not have both a columns and a column_names field in your mappings. We only have columns, and the existing filtering functionality works for the column filter. Do you periodically drop and recreate your indices? If so, does it create both fields every time?

allisonsuarez commented 1 year ago

I just made a small fix. This is a Python inheritance issue that causes the mapping of the parent class to be used rather than that of the newest search class. Thanks for raising this! https://github.com/amundsen-io/amundsen/pull/2121
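
For readers unfamiliar with this class of bug, here is a generic illustration of how a parent class's value can be used even when a subclass overrides it. This is not the Amundsen code or the change in the linked PR; the class names, attribute name, and values are all hypothetical.

```python
class BaseSearchProxy:
    # Hypothetical class attribute standing in for an index mapping.
    MAPPING = "parent-mapping"

    def mapping_hardcoded(self) -> str:
        # Looks the attribute up on the parent class by name,
        # so an override in a subclass is silently ignored.
        return BaseSearchProxy.MAPPING

    def mapping_via_self(self) -> str:
        # Looks the attribute up through the instance,
        # so the most-derived class's value is used.
        return self.MAPPING


class NewestSearchProxy(BaseSearchProxy):
    MAPPING = "newest-mapping"


proxy = NewestSearchProxy()
print(proxy.mapping_hardcoded())
print(proxy.mapping_via_self())
```

The first call returns the parent's value and the second returns the subclass's, which shows how an index could end up being created with an older mapping than the one the query code expects.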

beauttie commented 1 year ago

> You should not have both a columns and column_names field in your mappings. For us we only have columns and the existing filtering functionality works for the column filter. Do you periodically drop and recreate your indices? If so does it create both fields every time?

We recreate the index daily, and it creates both fields every time.

beauttie commented 1 year ago

@allisonsuarez I just tried building with your fix in this line, but it doesn't resolve the issue. Could this instead be an issue with how the mappings are created in amundsendatabuilder?

allisonsuarez commented 1 year ago

@beauttie are you using the packages? The release for that fix was just made.

beauttie commented 1 year ago

@allisonsuarez I updated the metadata, search, and frontend services to what was in the search-4.1.1 release, and I still see the same bug. Again, I think this is an issue with how the mappings are created in amundsendatabuilder. We currently use the latest version, 7.4.3.

kristenarmes commented 1 year ago

Hi @beauttie, I'm going to close this issue in a triaged state. People can upvote it, or someone can choose to work on it from there.