elastic / connectors

Official Elastic connectors for third-party data sources
https://www.elastic.co/guide/en/elasticsearch/reference/master/es-connectors.html
Other
18 stars 136 forks source link

ensure_content_index_mappings doesn't ensure mappings #2581

Open sakurai-youhei opened 5 months ago

sakurai-youhei commented 5 months ago

Bug Description

ensure_content_index_mappings does nothing actually because connector setup sets some mappings to the index on creation. This issue makes id and other fields (code) rely on dynamic mapping, which may cause mapping problems.

For example, network_drive.py will probably type id as (signed) long, but it can index documents with unsigned long values, which will end up with document_parsing_exception in Elasticsearch.

[FMWK][04:24:31][ERROR] 
  [Connector id: XH_9rY8BHCAv43wWFIhC, index name: test, Sync job id: C63-rY8B_7UHF08UH-5L] 
  operation index failed, {'type': 'document_parsing_exception', 'reason': "[1:108] failed to parse field [id] of type [long] in document with id '12368218145543858784'. 
  Preview of field's value: '12368218145543858784'", 'caused_by': {'type': 'x_content_parse_exception', 'reason': '[1:128] Numeric value (12368218145543858784) out of range of long (-9223372036854775808 - 9223372036854775807)\n at ...

Reproducer

  1. Spin up Elastic stack 8.13.4.
  2. Go to Kibana > Search > Content > Connectors.
  3. Create a new connector with Network drive.
  4. Click Create and attach an index named XXX.
  5. Click Convert connector on the right.
  6. Click Generate API key.
  7. Click Edit configuration, fill in the settings, and click Save configuration.
  8. Run elastic-ingest -c /path/to/config.yaml --debug with the config given on the UI.
  9. Click Sync > Full Content.

Mappings before the sync

{
  "mappings": {
    "dynamic": "true",
    "dynamic_templates": [
      {
        "all_text_fields": {
          "match_mapping_type": "string",
          "mapping": {
            "analyzer": "iq_text_base",
            "fields": {
              "delimiter": {
                "analyzer": "iq_text_delimiter",
                "type": "text",
                "index_options": "freqs"
              },
              "joined": {
                "search_analyzer": "q_text_bigram",
                "analyzer": "i_text_bigram",
                "type": "text",
                "index_options": "freqs"
              },
              "prefix": {
                "search_analyzer": "q_prefix",
                "analyzer": "i_prefix",
                "type": "text",
                "index_options": "docs"
              },
              "enum": {
                "ignore_above": 2048,
                "type": "keyword"
              },
              "stem": {
                "analyzer": "iq_text_stem",
                "type": "text"
              }
            }
          }
        }
      }
    ]
  }
}

Mappings after the sync

{
  "mappings": {
    "dynamic": "true",
    "dynamic_templates": [
      {
        "all_text_fields": {
          "match_mapping_type": "string",
          "mapping": {
            "analyzer": "iq_text_base",
            "fields": {
              "delimiter": {
                "analyzer": "iq_text_delimiter",
                "type": "text",
                "index_options": "freqs"
              },
              "joined": {
                "search_analyzer": "q_text_bigram",
                "analyzer": "i_text_bigram",
                "type": "text",
                "index_options": "freqs"
              },
              "prefix": {
                "search_analyzer": "q_prefix",
                "analyzer": "i_prefix",
                "type": "text",
                "index_options": "docs"
              },
              "enum": {
                "ignore_above": 2048,
                "type": "keyword"
              },
              "stem": {
                "analyzer": "iq_text_stem",
                "type": "text"
              }
            }
          }
        }
      }
    ],
    "properties": {
      "_timestamp": {
        "type": "date"
      },
      "created_at": {
        "type": "date"
      },
      "id": {
        "type": "long"
      },
      "path": {
        "type": "text",
        "fields": {
          "delimiter": {
            "type": "text",
            "index_options": "freqs",
            "analyzer": "iq_text_delimiter"
          },
          "enum": {
            "type": "keyword",
            "ignore_above": 2048
          },
          "joined": {
            "type": "text",
            "index_options": "freqs",
            "analyzer": "i_text_bigram",
            "search_analyzer": "q_text_bigram"
          },
          "prefix": {
            "type": "text",
            "index_options": "docs",
            "analyzer": "i_prefix",
            "search_analyzer": "q_prefix"
          },
          "stem": {
            "type": "text",
            "analyzer": "iq_text_stem"
          }
        },
        "analyzer": "iq_text_base"
      },
      "size": {
        "type": "long"
      },
      "title": {
        "type": "text",
        "fields": {
          "delimiter": {
            "type": "text",
            "index_options": "freqs",
            "analyzer": "iq_text_delimiter"
          },
          "enum": {
            "type": "keyword",
            "ignore_above": 2048
          },
          "joined": {
            "type": "text",
            "index_options": "freqs",
            "analyzer": "i_text_bigram",
            "search_analyzer": "q_text_bigram"
          },
          "prefix": {
            "type": "text",
            "index_options": "docs",
            "analyzer": "i_prefix",
            "search_analyzer": "q_prefix"
          },
          "stem": {
            "type": "text",
            "analyzer": "iq_text_stem"
          }
        },
        "analyzer": "iq_text_base"
      },
      "type": {
        "type": "text",
        "fields": {
          "delimiter": {
            "type": "text",
            "index_options": "freqs",
            "analyzer": "iq_text_delimiter"
          },
          "enum": {
            "type": "keyword",
            "ignore_above": 2048
          },
          "joined": {
            "type": "text",
            "index_options": "freqs",
            "analyzer": "i_text_bigram",
            "search_analyzer": "q_text_bigram"
          },
          "prefix": {
            "type": "text",
            "index_options": "docs",
            "analyzer": "i_prefix",
            "search_analyzer": "q_prefix"
          },
          "stem": {
            "type": "text",
            "analyzer": "iq_text_stem"
          }
        },
        "analyzer": "iq_text_base"
      }
    }
  }
}

Expected behavior

ensure_content_index_mappings should add missing mappings.

Workaround

Recreate the index by not using the Connectors Configuration page, and re-run sync.

Environment

Additional context

I will open a PR to fix address this issue.

seanstory commented 5 months ago

Hi @sakurai-youhei, thanks for filing.

As you've identified, there are a few different places where mapping definitions can come from. We don't really want to enforce one specific place to be the "source of truth", because that would limit our users to requiring that mechanism for creating their connectors. For instance, if connectors python code is the source of truth, that prevents folks from setting up their explicit mappings with Elasticsearch APIs. If Kibana is the source of truth, it prevents folks from using the CLI and bypassing Kibana.

For your issue, I'd suggest an iterative approach. The connector has given you a reasonable first-pass at mappings, but there is plenty of room for improvement. For you, "long" may actually not be a good type for ID - I'd suggest keyword instead, since. you won't need to do mathematical operations based on the ID of your docs, but instead want to use it as an exact-match identifier. And I expect you don't care about how many characters it has.

You could either use Elasticsearch Reindex APIs, or just use a text editor to make changes to the current mapping JSON and create a new index with your manually edited mappings. Then set up a new connector to attach to your new index.

In future versions, we hope to make this kind of mapping iteration more simple, but for now it will require some manual effort on your part.

sakurai-youhei commented 5 months ago

@seanstory Thank you for your comment. It's good to know the preference for the no-single "source of truth" concept. I understand and have no doubt about it.

On the other hand, I think elastic-ingest shouldn't left the id field mapping undefined in this case because I, as a user, don't request the field explicitly, and I was not aware of the field until I encountered the trouble. A user requests a new connector with an expectation that details will be well-organized implicitly.

So, I opened https://github.com/elastic/connectors/pull/2588 to make elastic-ingest responsible for the mapping in case that it was left undefined until the very last moment. But please let me know if some other component, such as Kibana, should be responsible for the mapping in this case, and I will change my direction accordingly. :) Thanks again.

seanstory commented 5 months ago

@sakurai-youhei ok, I think I'd originally misunderstood your concern as being wider than it may be. Let's make sure we're on the same page where it comes to scope.

Your linked PR would solve your specific issue for id fields. It wouldn't solve a similar issue for a foo field that should be a keyword, but the first doc's value looks numeric, so Elasticsearch dynamically gives the field a long type, and then subsequent documents hit errors. In that scenario, you'd still need to manually craft your mapping before your sync, or go through a reindex process. This is unlikely to change in the near future, as its pretty difficult to generically have connectors guess about the fields they'll produce without actually running a sync to be sure.

As long as you're not looking to solve this more generic class of problem, and are planning on only fixing this niche case where the problematic field is id, I think we can move forward. But before we review your PR, I want to be sure that we're aligned on what you're expecting to solve.

seanstory commented 5 months ago

CC @sphilipse @efegurkan. If the goal is just to add id to the mappings, I'd be more tempted to do that in Kibana directly. But it's unclear to me if we'd have a good way to do that only for indices that are intended to be used for connectors. Curious to have your thoughts.

sakurai-youhei commented 5 months ago

@seanstory My ultimate aim is to save users (like me) from troubles in the Elasticsearch onboarding process with connectors like the one I experienced. My expectation was I’d get documents indexed by connectors after following instructions but the actual observation is most of documents are not indexed due to the mapping issue on the id field. Then, I found the ensure_content_index_mappings doesn’t ensure mappings so I thought this is a bug and started working on fixing it.

I think the mapping issue should be resolved somehow anyway. On the other hand, what to be fixed is not limited to connectors, as you mentioned the possible fix in Kibana, but I’d say as generic approach is preferable to prevent not only this specific id mapping issue but similar mapping issues on other fields. IMHO, one advantage fixing this issue in connectors is encapsulation like connectors should be responsible for preparing mappings to be used by connectors.

My contribution is just a trigger. I just hope this and similar mapping issues would be resolved somehow. ;-)

seanstory commented 5 months ago

I’d say as generic approach is preferable to prevent not only this specific id mapping issue but similar mapping issues on other fields.

That's my point though - this will only fix the id field. Individual connectors don't currently advertize/suggest mappings for their data. This is only done at the framework level.

connectors should be responsible for preparing mappings to be used by connectors

I don't disagree, that's just not where we are now, and I'm not sure we'll invest the time to make that kind of change, since there's not a good generic solution for all connectors. Sharepoint has pretty cut-and-dry output schemas that could be turned into a good explicit mapping. Salesforce has some predictable outputs, but custom fields add a layer of uncertainty. Database connectors (SQL/Mongo) have no reliable output fields or types.

Right now, it's more likely that we'll invest in making iterating on your mappings easier, since that can benefit all our connectors and therefore all our customers. Even though everyone would prefer to have their mappings "just work" the first time round, the next best thing is to have them work on a retry, after a low-effort change.

sakurai-youhei commented 5 months ago

@seanstory Thank you very much! All clear to me, and I noticed I didn't know and think of the real and the actual circumstances of connectors other than the specific Network drive. I leave #2588 and will close it once anything else is decided. Thanks again.