elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[Transform] Transforms with unattended flag don't create destination index unless all conditions/fields exist in source index #104146

Open susan-shu-c opened 8 months ago

susan-shu-c commented 8 months ago

Description

We added the unattended flag to transforms shipped in integration packages (example: https://github.com/elastic/integrations/pull/8320).

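In the past, without the unattended flag, once the package is installed on a fresh cluster: the transform is installed, the destination index is created on install, and the .latest and .all aliases for the destination index are created.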

Now, with the unattended flag, on a fresh cluster:

After testing on v8.11.1 (so that the fix in https://github.com/elastic/elasticsearch/pull/101627 would be included), transforms with the unattended flag don't seem to create the destination index the way transforms without the flag do.

It turns out that the destination index is only created when the source contains data matching the transform's criteria (e.g. the fields host.name, destination.ip, etc. exist in logs-*). Before, the destination index was created regardless. This gives the impression that the package hasn't been fully installed.

What we want to clarify is: Is this expected behavior with the unattended flag? If so, can it be implemented so the behavior is the same as before (create destination index regardless of available data) so that it's clearer to users when the transform and associated indices have been created?

Related links

elasticsearchmachine commented 8 months ago

Pinging @elastic/ml-core (Team:ML)

przemekwitek commented 8 months ago

Hi,

> In the past, without the unattended flag, once the package is installed on a fresh cluster: the transform is installed, the destination index is created on install, and the .latest and .all aliases for the destination index are created.

Confirmed, that's the correct behavior with unattended set to false.

> After testing on v8.11.1 (so that this fix https://github.com/elastic/elasticsearch/pull/101627 would be there), transforms with the unattended flag don't seem to create the destination index like without the unattended flag.

That's true. For an unattended transform, we explicitly skip destination index creation on the _start call (which is part of package install).

> It turns out, the destination index is only created when there is exact data that matches the criteria (e.g. fields host.name, destination.ip, etc. exist in logs-*) for the transform to run

With unattended, the destination index is created when the first document is written to it. So, if there is suitable data in the source index, you'll eventually see the destination index created and the first results written to it.

> This gives the impression that the package hasn't fully been installed.

I understand your concern here. Without explicit destination index creation, it is less predictable when exactly the destination index (and its aliases) will be set up.
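Since the timing is less predictable, one way to see whether the destination index has appeared yet is to probe it with the index exists API (`HEAD /<index>`), which answers 200 when the index exists and 404 when it does not. The following is only a minimal sketch using the Python standard library: the function name, the `opener` parameter, and the example URL and index name are assumptions for illustration, and authentication/TLS handling is left out.

```python
import urllib.error
import urllib.request


def destination_index_exists(base_url, index, opener=urllib.request.urlopen):
    """Return True if HEAD /<index> answers 200, False if it answers 404.

    `opener` is a hypothetical injection point so the call can be tested
    without a running cluster; by default it is urllib's urlopen.
    """
    request = urllib.request.Request(f"{base_url}/{index}", method="HEAD")
    try:
        with opener(request) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; 404 means "no such index".
        if err.code == 404:
            return False
        raise
```

For example, `destination_index_exists("http://localhost:9200", "example-dest-index")` could be called periodically after package install (both values are placeholders).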

> Is this expected behavior with the unattended flag?

Confirmed, working as intended.

> If so, can it be implemented so the behavior is the same as before (create destination index regardless of available data) so that it's clearer to users when the transform and associated indices have been created?

I'll need to think about how such a change would fit the current codebase. I'll update this issue soon.

przemekwitek commented 7 months ago

@susan-shu-c, it seems you can achieve what you need by reverting the transform to non-unattended. Specifically, you want these two settings in your transform config:

  "settings": {
    "unattended": false,
    "num_failure_retries": -1
  }

This way the transform will not be unattended (so it will create the destination index just like it used to), but it will still retry most failures indefinitely. That said, some failures will not be retried (such as a script exception), so the transform will not be fully unattended.
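To show where those two settings sit in a complete transform definition, here is a minimal sketch that builds a request body of the kind sent to `PUT _transform/<transform_id>`. Only the `settings` values come from this thread; the function name, the index names, and the `latest` section (unique key `host.name`, sort field `@timestamp`) are assumptions chosen for illustration, not the configuration shipped in the integration package.

```python
import json


def build_transform_body(source_index, dest_index):
    """Build a transform body that is not unattended but retries without limit."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        # Hypothetical transform logic: keep the latest document per host.
        "latest": {"unique_key": ["host.name"], "sort": "@timestamp"},
        # The two settings suggested above.
        "settings": {"unattended": False, "num_failure_retries": -1},
    }


body = build_transform_body("logs-*", "example-dest-index")
print(json.dumps(body, indent=2))
```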

Are there any reasons (other than the unlimited retries) that made you switch to unattended?

susan-shu-c commented 7 months ago

Pasting our Slack conversation for reference:

We added unattended: true so that the install would work on Serverless

(as requested by Sophie Chang, not going to link it here as it was an internal GitHub discussion)

przemekwitek commented 6 months ago

I was able to reproduce the issue locally. The problem is that if the transform's destination index is created dynamically (not on _start but later, during indexing), we do not set up the index's aliases. This is a bug that we need to fix in our backend code.
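A quick way to observe this symptom is to compare the aliases you expect with what the get alias API (`GET /<index>/_alias`) reports for the destination index. That API returns a JSON object keyed by index name, each with an `aliases` object. The helper below is a sketch that works on an already-parsed response; the function name and the example index and alias names are hypothetical.

```python
def missing_aliases(alias_response, index, expected):
    """Return the expected alias names that are absent for `index`.

    `alias_response` is the parsed JSON of GET /<index>/_alias, e.g.
    {"example-dest-index": {"aliases": {"example-dest.all": {}}}}.
    """
    actual = alias_response.get(index, {}).get("aliases", {})
    return sorted(set(expected) - set(actual))
```

An empty list means every expected alias is present; with the bug described above, a dynamically created destination index would report its aliases as missing.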

przemekwitek commented 6 months ago

FYI: I have opened a PR with the fix (https://github.com/elastic/elasticsearch/pull/105499).

susan-shu-c commented 6 months ago

Awesome, thank you! So with #105499 we can install packages with unattended: true or unattended: false and in both cases, the destination index will be created on package install?

przemekwitek commented 6 months ago

> Awesome, thank you! So with https://github.com/elastic/elasticsearch/pull/105499 we can install packages with unattended: true or unattended: false and in both cases, the destination index will be created on package install?

Not exactly. With this bugfix, the destination index and its aliases are set up correctly once the transform sees the source indices and is ready to start processing them. This should solve your immediate problem of missing aliases and should be enough for your setup to work correctly (but of course let us know if that is not the case and there are further issues).

Creating the destination index before the source indices are ready is a more complex topic that we want to tackle too, but we won't have a solution for it in 8.13. We need to redesign the transform's workflow to accommodate this change, which is why we don't want to rush it before feature freeze.