elastic / kibana


[Logstash Centralized Config Management] Ancillary Configs #18119

Open elasticmachine opened 6 years ago

elasticmachine commented 6 years ago

Original comment by @ycombinator:

Motivation

Certain Logstash plugins in a pipeline configuration can accept references to ancillary configuration files. The plugins read these files and use their contents as part of the plugin's execution in the pipeline.

For example, users may define a custom grok pattern named FOO in a file named postfix placed under the folder /tmp/custom_grok_patterns/. They can then reference this folder and pattern in the grok filter like so:

grok {
  patterns_dir => [ "/tmp/custom_grok_patterns" ]
  match => { "message" => "... %{FOO} ..." }
}
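
For reference, a custom patterns file is just one NAME-plus-regex pair per line; the pattern bodies below are invented purely for illustration:

# /tmp/custom_grok_patterns/postfix
FOO foo-%{WORD}
BAR bar-%{NUMBER}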

X-Pack Basic and above license users have the ability to centrally manage their Logstash pipeline configurations. Users can CRUD pipeline configurations in a Kibana Management UI and these configurations are (effectively) pushed out to Logstash nodes for execution.

However, users of centralized configuration management are unable to also centrally manage ancillary configuration files like custom grok patterns today. This proposal details how we might provide that capability.

User stories and corresponding UX

  1. Sysadmin Sally wants to centrally manage some custom grok patterns useful for Postfix log processing

    1. Sally visits Kibana Management.
    2. Under Logstash, she clicks the (new) Ancillary Pipeline Configs (naming TBD) link.
    3. She is presented with a listing of various Ancillary Pipeline Config objects of different types (custom grok pattern collections, translate filter lookup collections, etc.).
    4. She clicks the New button on the page and chooses Custom Grok Pattern collection from the New button dropdown menu. [NOTE: Given the rich diversity of ancillary config file types (see Appendix below) we might want to offer a generic file upload option as well].
    5. She is presented with a form where she gives her Custom Grok Pattern collection an ID, say postfix_grok_patterns, and populates the actual custom patterns as well, say FOO and BAR. She saves the form, thereby creating the centrally-managed custom grok pattern collection.
  2. Data Analyst Dan wants to use the FOO custom grok pattern from the postfix_grok_patterns collection in his centrally-managed pipeline configuration

    1. Dan visits Kibana Management.
    2. Under Logstash, he clicks the Pipelines link.
    3. He is presented with a listing of various Pipelines.
    4. He clicks the New button on the page and starts to create his pipeline configuration.
    5. When he reaches the grok filter definition, he references a centrally-managed custom pattern collection like so (exact syntax might need discussion; see open questions below):

      grok {
       patterns_dir => [ "ccm://postfix_grok_patterns" ]
       match => { "message" => "... %{FOO} ..." }
      }

Technical design

The current .logstash index was designed to hold pipeline config documents. The document IDs correspond to user-defined pipeline IDs. The mapping has top-level pipeline-specific fields, pipeline and pipeline_metadata.

We could try to store ancillary configs in the same .logstash index with some mapping changes. Or we could introduce a new .logstash-ancillary-configs (or better/shorter-named :)) index. Details of both options, including pros and cons, are listed below.

Option 1: Reuse .logstash index

First, we will need to "make room" for other types of documents in the .logstash index. This means adding a few new fields to the mapping. The new mapping would then look like this:

{
  "dynamic": "strict", // same as before
  "properties": {
    "description": { "type": "text" }, // same as before
    "last_modified": { "type": "date" }, // same as before
    "metadata": { "type": "object", "dynamic": "false" }, // same as before
    "pipeline": { "type": "text" }, // same as before
    "pipeline_metadata": { // same as before
      "properties": {
        "type": { "type": "keyword" },
        "version": { "type": "short" },
        "username": { "type": "keyword" }
      }
    },
    "id": { "type": "keyword" }, // NEW
    "type": { "type": "keyword" }, // NEW
    "ancillary_config": { "type": "object", "dynamic": "false" } // NEW
  }
}
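
For illustration, documents of both kinds could then coexist in .logstash along these lines (the field values, the type name, and the shape of ancillary_config are assumptions, not settled):

// existing pipeline document (unchanged)
{
  "description": "Postfix pipeline",
  "last_modified": "2018-04-06T00:00:00.000Z",
  "pipeline": "input { ... } filter { grok { ... } } output { ... }"
}

// ancillary config document (uses the NEW fields)
{
  "id": "postfix_grok_patterns",
  "type": "grok_patterns",            // hypothetical type value
  "description": "Custom grok patterns for Postfix logs",
  "last_modified": "2018-04-06T00:00:00.000Z",
  "ancillary_config": { "patterns": "FOO ...\nBAR ..." }
}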

Additionally, we'd also update the logstash-index-template index template with the above mapping.

When creating/updating pipeline objects we do everything the same as now, notably:

Additionally, we:

When creating ancillary objects, we:

Current versions of Logstash (x-pack-logstash) perform a GET .logstash/<pipeline-id> to retrieve a pipeline definition. This can continue to work as before. For ancillary objects, however, Logstash will need to perform a search query based on type and id.
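
For example, a lookup of the postfix_grok_patterns collection might look roughly like this (the type value is a placeholder; exact values are TBD):

POST .logstash/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "grok_patterns" } },
        { "term": { "id": "postfix_grok_patterns" } }
      ]
    }
  }
}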

Pros

Cons

Option 2: Create new .logstash-ancillary-configs index

We leave the .logstash index as-is and continue to use it as we do currently for storing pipeline configs. Additionally we create a .logstash-ancillary-configs (or better/shorter-named) index to hold ancillary config documents. This new index will have the following mapping:

{
  "dynamic": "strict",
  "properties": {
    "id": { "type": "keyword" },
    "description": { "type": "text" },
    "last_modified": { "type": "date" },
    "metadata": { "type": "object", "dynamic": "false" },
    "type": { "type": "keyword" },
    "ancillary_config": { "type": "object", "dynamic": "false" }
  }
}

Pros

Cons

Open Questions

  1. How to reference centrally-managed ancillary pipeline configs in pipeline definitions while keeping backwards compatibility for referencing locally-managed ancillary pipeline configs? Some ideas:

    1. Pseudo-protocol prefix like ccm://. Given that centralized config management is x-pack and many of the plugins that reference ancillary configs are open-source, where would the parsing and resolution of such references live?
    2. New options alongside existing ones, e.g. in the grok plugin, ccm_patterns alongside patterns_dir. Again, would the knowledge of this live in open-source plugins even though CCM is x-pack?
    3. Fake paths. When CCM users create ancillary pipeline configs, they provide the ID in the form of a fake filesystem path, e.g. /tmp/ccm/patterns_dir/postfix. Logstash
    4. Inlining. Is this always possible for all plugins?
    5. Env var prefix. ${CCM_ANC_CONFS}/postfix_grok_patterns. X-pack-logstash then places files under CCM_ANC_CONFS and sets this env var (see the sketch after this list).
  2. What about binary files like GeoIP databases?

    1. Base-64 encode them? Always, regardless of whether the file is binary or not?
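
To illustrate option 1.v: a centrally-managed pipeline would reference the collection via the environment variable, and ${VAR} substitution would resolve it to wherever x-pack-logstash wrote the files (a rough sketch only, assuming Logstash expands such variables before handing values to plugins, which is discussed further below):

# pipeline config as stored in central config management
grok {
  patterns_dir => [ "${CCM_ANC_CONFS}/postfix_grok_patterns" ]
  match => { "message" => "... %{FOO} ..." }
}

# if x-pack-logstash sets CCM_ANC_CONFS=/some/local/dir and writes the collection
# there, the plugin effectively receives:
#   patterns_dir => [ "/some/local/dir/postfix_grok_patterns" ]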

Appendix

List of plugins that take options of type path

Thanks, Joao, for generating this list.

Plugin type | Plugin name | Config name | Description | Comment
--- | --- | --- | --- | ---
codec | netflow | cache_save_path | Netflow template cache directory | Writeable path, cannot be centrally managed
codec | netflow | netflow_definitions | Override YAML file containing Netflow field definitions | Lookup file, YAML
codec | netflow | ipfix_definitions | Override YAML file containing IPFIX field definitions | Lookup file, YAML
filter | cidr | network_path | List of networks | Lookup file, separator delimited
filter | elasticsearch | ca_file | SSL Certificate Authority file | Is this safe to centrally manage?
filter | geoip | database | Path to Maxmind's database file | Lookup file, ?? format
filter | jdbc_static | jdbc_driver_library | JDBC driver library path to third party driver library. In case of multiple libraries being required you can pass them separated by a comma. | JAR? file
filter | jdbc_streaming | jdbc_driver_library | JDBC driver library path to third party driver library. In case of multiple libraries being required you can pass them separated by a comma. | JAR? file
filter | ruby | path | The path of the ruby script file that implements the filter method. | Ruby script file
filter | translate | dictionary_path | The full path of the external dictionary file. | YAML, JSON, or CSV file
input | beats | ssl_certificate | |
input | beats | ssl_key | |
input | couchdb_changes | ca_file | |
input | dead_letter_queue | path | |
input | elasticsearch | ca_file | |
input | google_pubsub | json_key_file | GCE Service Account JSON key file | JSON
input | http | keystore | |
input | jdbc | statement_filepath | |
input | jdbc | jdbc_password_filepath | |
input | kafka | ssl_truststore_location | |
input | kafka | ssl_keystore_location | |
input | kafka | jaas_path | |
input | kafka | kerberos_config | |
input | lumberjack | ssl_certificate | |
input | lumberjack | ssl_key | |
input | puppet_facter | public_key | |
input | puppet_facter | private_key | |
input | relp | ssl_cacert | |
input | relp | ssl_cert | |
input | relp | ssl_key | |
input | tcp | ssl_cacert | |
input | tcp | ssl_cert | |
input | tcp | ssl_key | |
filter | elasticsearch | ca_file | |
input | elasticsearch | ca_file | |
output | elasticsearch | template | |
output | elasticsearch | cacert | |
output | elasticsearch | truststore | |
output | elasticsearch | keystore | |
mixin | http_client | cacert | |
mixin | http_client | client_cert | |
mixin | http_client | client_key | |
mixin | http_client | keystore | |
mixin | http_client | truststore | |
mixin | rabbitmq_connection | ssl_certificate_path | |
mixin | rabbitmq_connection | tls_certificate_path | |
output | elasticsearch | template | |
output | elasticsearch | cacert | |
output | elasticsearch | truststore | |
output | elasticsearch | keystore | |
output | email | template_file | |
output | icinga | ca_file | |
output | kafka | ssl_truststore_location | |
output | kafka | ssl_keystore_location | |
output | kafka | jaas_path | |
output | kafka | kerberos_config | |
output | lumberjack | ssl_certificate | |
output | nagios_nsca | send_nsca_config | |
output | syslog | ssl_cacert | |
output | syslog | ssl_cert | |
output | syslog | ssl_key | |
output | tcp | ssl_cacert | |
output | tcp | ssl_cert | |
output | tcp | ssl_key | |
output | timber | cacert | |
output | timber | client_cert | |
output | timber | client_key | |
output | timber | keystore | |
output | timber | truststore | |
elasticmachine commented 6 years ago

Original comment by @ycombinator:

@andrewvc @yaauie @original-brownbear As this proposal impacts Logstash core functionality, would you mind looking it over and providing feedback? Thank you!

elasticmachine commented 6 years ago

Original comment by @ycombinator:

/cc @acchen97 @jordansissel

elasticmachine commented 6 years ago

Original comment by @yaauie:

@ycombinator I appreciate your desire to come up with a solution for this; it's obviously one of the limiting factors in the current centralised-config setup.

🤔

Presently, each and every plugin that loads ancillary configuration files does so on its own, using either the Java or Ruby standard libraries, or by passing the given paths to their dependencies, which do so; they are in effect bypassing Logstash-core and interacting with the filesystem directly.

Since that is the current state of the world, any solution that doesn't put the files on the filesystem will effectively need to be applied to each and every plugin individually as hand-crafted, bespoke, non-GMO patches. That's a pretty big cost.

That's every option except 1.v from the above "Open Questions" (assuming that Logstash expands those environment variables before handing the values off to the plugin for initialisation; I'm unsure that it does, but that at least would be a single place to make a change):

1.v: Env var prefix. ${CCM_ANC_CONFS}/postfix_grok_patterns. X-pack-logstash then places files under CCM_ANC_CONFS and sets this env var.

Doing so would require that each Logstash, upon connecting to centralised management (and perhaps periodically thereafter), download all files to a directory on the local filesystem, since the plugins are working with the filesystem directly (we have no opportunity for just-in-time retrieval unless we (A) provide that facility in Logstash-core and (B) apply the above-mentioned bespoke patches to each and every plugin).


While I understand the desire for a seamless UX, I'm a little concerned that the proposal to use Elasticsearch to hold the binary data from all the necessary files is a bit like Maslow's Hammer:

I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.

To me, it sounds like a good use-case for LINK REDACTED or LINK REDACTED 😩

elasticmachine commented 6 years ago

Original comment by @ycombinator:

Thanks for the feedback, @yaauie.

assuming that Logstash expands those environment variables before handing the values off to the plugin for initialisation; I'm unsure that it does, but that at least would be a single-place to make a change

I didn't follow why Logstash would need to expand the CCM_ANC_CONF environment variable. I was thinking Logstash would set this variable in the environment with a value that makes sense to Logstash --- i.e. a temporary folder perhaps?

Doing so would require that each Logstash, upon connecting to centralised management (and perhaps periodically thereafter), download all files to a directory on the local filesystem;

Yes, this is what I was thinking too. Logstash would need to download all the files locally upon initially connecting to centralized management, but also when a pipeline is to be restarted. The latter case would require Logstash to parse the pipeline config to determine which files might need re-downloading (in case they have changed).

While I understand the desire for a seamless UX, I'm a little concerned that the proposal to use Elasticsearch to hold the binary data from all the necessary files is a bit like Maslow's Hammer... To me, it sounds like a good use-case for rsync on cron or nfs 😩

It's more the case that we need to somehow get the ancillary files from the end-user to all (or some subset of) Logstash nodes. Since Kibana is the UI for centralized config management, and it talks to LS via docs in ES for pipeline configs, I thought it made sense to use the same mechanism for ancillary configs as well. What is the concern around using ES as a binary store for this purpose? Also, if we were to use rsync on cron or nfs how would that fit in with the centralized config UI being the starting point for the user to upload/create ancillary config files?

elasticmachine commented 6 years ago

Original comment by @yaauie:

assuming that Logstash expands those environment variables before handing the values off to the plugin for initialisation; I'm unsure that it does, but that at least would be a single-place to make a change

I didn't follow why Logstash would need to expand the CCM_ANC_CONF environment variable.

I think we're arguing the same point here, just using different terms.

Variable Expansion (or Parameter Expansion) is the process of taking a string and replacing references to parameters with the values of those parameters; that is, given a parameter CCM whose value is, say, /tmp/ccm, the string "${CCM}/foo" would expand to "/tmp/ccm/foo".

If Logstash doesn't expand the variable before instantiating the plugins (that is, replace the variable name reference in the given string with the variable's value), then the plugins would attempt to load a path with the literal string "${CCM}/foo", which would fail either because $, {, and } aren't legal in a filename path or because there is no file at that literal path (moreover, it would fail in unpredictable ways, because each plugin interacts with the filesystem in its own ways, e.g., a glob-type reference may simply silently return zero matches, while a File.open would likely fail more noisily).


Doing so would require that each Logstash, upon connecting to centralised management (and perhaps periodically thereafter), download all files to a directory on the local filesystem;

Yes, this is what I was thinking too. Logstash would need to download all the files locally upon initially connecting to centralized management, but also when a pipeline is to be restarted. The latter case would require Logstash to parse the pipeline config to determine which files might need re-downloading (in case they have changed).

Logstash has no knowledge about the use of individual config parameters for any of the hundreds of plugins; all that Logstash knows is that it is handing off a String, but how this String is used (e.g., as a file's path) is entirely up to the plugin.

This prevents us from being able to auto-detect which files need to get re-downloaded when we load a plugin. It would be an all-or-nothing event, with a possible optimisation that we could set a time-based marker and only get Elasticsearch documents modified after that timestamp (but even this gets tricky and prone to race-conditions).
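
A rough sketch of that time-based fetch, assuming the last_modified field from the proposed mappings (the index name and timestamp marker are illustrative):

POST .logstash/_search
{
  "query": {
    "range": { "last_modified": { "gt": "2018-04-01T00:00:00Z" } }
  }
}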


What is the concern around using ES as a binary store for [keeping these files in sync]?

Keeping files in sync across multiple machines isn't trivial; there will be a lot of overhead in keeping track of who has what version of what, and a lot of opportunities for race conditions and security vulnerabilities. The *.jar files, e.g., (a) contain executable code and (b) can be tens of megabytes in size; what security and performance implications would we need to address? We would definitely need checksums, but would the IO impact Elasticsearch performance? rsync, nfs, and friends are the cumulative effort of decades of work to solve the complexities of just this one set of problems; do we really think we'll nail it perfectly and quickly on our first go?

That said, I don't have answers or suggestions on how to make this seamless. Creating an NFS volume, and mounting it on all hosts requires configuration, orchestration, and firewall profiles. Rsync too. I'm just a bit wary of reinventing distributed filesystems from first principles.

elasticmachine commented 6 years ago

Original comment by @ycombinator:

I think we're arguing the same point here, just using different terms.

++. As long as Logstash can set CCM in the environment at a certain point before it expands that variable (along with others), I think this could work?

Logstash has no knowledge about the use of individual config parameters for any of the hundreds of plugins; all that Logstash knows is that it is handing off a String, but how this String is used (e.g., as a file's path) is entirely up to the plugin.

Makes sense. What if the document that LS pulls from ES for a pipeline config also contained a field that listed all the CCM file references in that pipeline? Would LS then be able to pull down these files before (re-)starting the pipeline?
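
Something like the following, perhaps, where the extra field name is purely hypothetical:

{
  "description": "Postfix pipeline",
  "pipeline": "input { ... } filter { grok { ... } } output { ... }",
  "last_modified": "2018-04-06T00:00:00.000Z",
  "ancillary_config_references": [ "postfix_grok_patterns" ]   // hypothetical field
}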

Keeping files in sync across multiple machines isn't trivial; there will be a lot of overhead in keeping track of who has what version of what, and a lot of opportunities for race conditions and security vulnerabilities.

Fair point but isn't this an issue with the centralized pipeline configs today as well?

the *.jar files, e.g., (a) contain executable code and (b) can be tens of megabytes in size; what security and performance implications would we need to address? we would definitely need checksums, but would the IO impact Elasticsearch performance?

Yeah, I'm tempted to draw the line at executable code -- meaning, we don't allow generic file uploads as part of this feature, at least in an initial release. Instead we restrict ourselves to a few specific types of non-executable ancillary configs like custom grok patterns and translate plugin dictionaries. Would that help mitigate some of the security and performance concerns?

rsync, nfs, and friends are the cumulative effort of decades of work to solve the complexities of just this one set of problems; do we really think we'll nail it perfectly and quickly on our first go? That said, I don't have answers or suggestions on how to make this seamless. Creating an NFS volume, and mounting it on all hosts requires configuration, orchestration, and firewall profiles. Rsync too. I'm just a bit wary of reinventing distributed filesystems from first principles.

Yeah, this is a tough one -- On one hand I agree it would be unwise to try and reinvent a distributed filesystem AND get it right the first time. On the other hand we've already dipped our toes into this space with distributing pipeline config docs across many LS nodes. Ultimately I'll defer to your judgement on this one as it impacts LS core more than the UI.

I appreciate your detailed thoughts here. I certainly wouldn't want to invest time in a UI for this feature until we have a certain level of comfort and confidence on the core part of it. Thanks much!

elasticmachine commented 6 years ago

Original comment by @pickypg:

I think that we should ignore any storage of executable binaries as part of this effort. As far as I am aware, there is no desire to store and transmit code on behalf of Logstash; it was just brought up because of the listing of paths that happened to be jars (e.g., for JDBC). The idea was always to simply orchestrate Logstash nodes from what I had heard.

This includes simpler ideas, like allowing users to add extra Grok patterns and Netflow definitions. I think we're getting sidetracked by discussing binary stores, including the GeoIP database. If the user wants to add a custom database, then I think it is fair that they install it with Logstash. On that note, it also seems remarkably dangerous to consider deploying Ruby code from Elasticsearch to any Logstash node that will listen, which is as bad as deploying arbitrary jar files.

My vote:

Index option 1 to extend and reuse the same index. I would just use the name "config".

As I have thought about this more, my other vote might be to tie this information to the pipeline itself and separately, which is how visualizations and dashboards are separated in Kibana. From there, any change to the configuration would be apply-able to the associated pipelines, but the user could just create a new pipeline and test it out without impacting an existing pipeline (the UI could then show, based on the config's hash, which pipelines were using it). I think that this would also simplify the Logstash side of the implementation by continuing to allow it to only fetch a single document.

Finally, I would ignore any option that requires the storage of binaries intended for some type of execution as well as strings intended for arbitrary execution; as far as I know, Ruby is not a safe scripting language and such an opening would allow a wide variety of critical paths to be vulnerable. Binary data is a different beast and Elasticsearch is not a bad place to store that (as a non-indexed field of course, which is what both options show). At the very least, that can be a future phase.

elasticmachine commented 6 years ago

Original comment by @ycombinator:

I think that we should ignore any storage of executable binaries as part of this effort. As far as I am aware, there is no desire to store and transmit code on behalf of Logstash; it was just brought up because of the listing of paths that happened to be jars (e.g., for JDBC).

Ahh, yes, sorry -- I should've been clearer: the appendix is more of an audit of what paths we have in Logstash plugins today, just to get an idea of what types of files we're looking at. It wasn't necessarily meant to be all the types of files we should support with this feature [EDIT: in an initial release anyway]!

As I have thought about this more, my other vote might be to tie this information to the pipeline itself and separately, which is how visualizations and dashboards are separated in Kibana. From there, any change to the configuration would be apply-able to the associated pipelines, but the user could just create a new pipeline and test it out without impacting an existing pipeline (the UI could then show, based on the config's hash, which pipelines were using it). I think that this would also simplify the Logstash side of the implementation by continuing to allow it to only fetch a single document.

++ I like this idea! [EDIT: This will still require (in my mind):

  • Logstash to choose a filesystem location where centrally-managed ancillary files should live,
  • Set CCM (or some other name) env. var to this filesystem location,
  • Grab the ancillary config content from the pipeline doc and write it to the filesystem location,
  • Expand CCM (as it does with any env vars) before handing it off to the plugin for execution. -- LINK REDACTED]

Thoughts on this reduced-scope proposal, @yaauie?

elasticmachine commented 6 years ago

Original comment by @yaauie:

I like the idea of ancillary configs being a part of an individual pipeline; while it does slightly reduce the reusability of those ancillary configs, it significantly limits the scope of what needs to be synchronised.

This will still require (in my mind):

  • Logstash to choose a filesystem location where centrally-managed ancillary files should live,
  • Set CCM (or some other name) env. var to this filesystem location,
  • Grab the ancillary config content from the pipeline doc and write it to the filesystem location,
  • Expand CCM (as it does with any env vars) before handing it off to the plugin for execution. -- LINK REDACTED

In general, this makes sense to me; within Logstash, we could create a temporary directory upon each managed-pipeline reload, and populate it with ancillary config files from the config document we already fetched from Elasticsearch before registering the plugins. We'll also need to consider a cleanup phase (and the ability to opt-out of cleanup so we can debug troublesome setups).

Potential points for confusion:


To make its use/intent clear and easy to debug, both the environment variable name and resulting temporary data path should be tightly-linked; if/when a plugin has an issue, it will raise/log with the expanded path, so we need to have an obvious connection to the environment variable name as seen in the config. It should also indicate that it is for managed pipelines (which hopefully directs people to managed pipeline documentation).

To reduce the likelihood that a user attempts to "cross the streams" and use one pipeline's ancillary config from another pipeline, the environment variable name (and resulting path) should also clearly indicate that it represents a file-store for this pipeline.

With this in mind, what about the following?

elasticmachine commented 6 years ago

Original comment by @ycombinator:

@yaauie That proposal makes sense to me. I like the scoping per-pipeline, per-ephemeral-ID, so we can debug which running instance of a pipeline used what ancillary state. 👍

I'll defer to the LS core folks on details, but does it make sense to use the data folder (generally whatever path.data points to) instead of ${TMP} --- mostly so we have one place to look for any LS-managed state on disk? Again, that's a detail that is hidden from the UI code's POV so I'm good with whatever you folks think is best there.


With this, I think I have enough information now to start designing/prototyping a UI. I'm thinking:

elasticmachine commented 6 years ago

Original comment by @andrewvc:

I love the way this discussion has gone so far, and it lines up with my expectations. Some thoughts

  • I would prefer Index Option 1 with a modification. I don't think it's ideal to rely on search to get these new documents vs the GET API. I propose that we namespace future configs with logstash-config-{config_name} and have logstash x-pack try both paths for backward compatibility. Editing a config in Kibana would migrate it.

elasticmachine commented 6 years ago

Original comment by @ycombinator:

I don't think it's ideal to rely on search to get these new documents vs the GET API.

I'm curious: why is it less ideal to rely on _search with a term query on the new id field than to continue using the GET API like x-pack-logstash does today?

I propose that we namespace future configs with logstash-config-{config_name} and have logstash x-pack try both paths for backward compatibility. Editing a config in Kibana would migrate it.

++ to this. I assume you are okay with auto-generated _ids for ancillary config file documents, however?

elasticmachine commented 6 years ago

Original comment by @andrewvc:

@ycombinator the primary problem with the _search API is that it isn't always instantly right (it usually is); GET is always realtime. You can force it via a _refresh, but better to avoid that complexity, right?

So there's a possibility of a race: if an LS gets a new config, it could pull a different doc. It's small, but I'd rather avoid it. I also think it'll make the code easier to maintain going forward, because document IDs are the only thing in ES that are like a primary key.

elasticmachine commented 6 years ago

Original comment by @ycombinator:

the primary problem with the _search API is that it isn't always instantly right (it usually is); GET is always realtime. You can force it via a _refresh, but better to avoid that complexity, right?

Ah yes, of course, that buffer! Thanks for refreshing my memory on this 😄

So there's a possibility of a race: if an LS gets a new config, it could pull a different doc. It's small, but I'd rather avoid it. I also think it'll make the code easier to maintain going forward, because document IDs are the only thing in ES that are like a primary key.

Makes sense.

elasticmachine commented 6 years ago

Original comment by @ycombinator:

I was just reminded that the translate filter has a refresh_interval option. Logstash re-reads the translate dictionary every refresh_interval seconds (defaulting to 300 seconds = 5 minutes) from disk into memory. I see some implications of this for centrally-managed ancillary configs, like a translate filter dictionary.

If a user updates the translate dictionary object in Kibana, should Kibana also update all pipeline documents using this object? If so, Logstash could do one of two things:

  1. Write the translate dictionary object to disk (perhaps by first writing to a temporary file and then mv-ing it into place so the write is atomic and the translate dictionary file is never empty while being read by Logstash), or

  2. Create an entirely new pipeline instance (i.e. new ephemeral_id) that uses the new translate dictionary object, thereby reloading the pipeline.

Personally, I'd think option 2 is preferable but I'll defer to the Logstash core folks on this. Just wanted to raise this user story so we have a solution for it. Thoughts, @andrewvc @yaauie?

elasticmachine commented 6 years ago

Original comment by @andrewvc:

@ycombinator great point! I actually think 1 is preferable. Reloading a pipeline can affect performance.

This also makes me think we should checksum all files that are uploaded as a way of checking what's local vs. remote.

We could possibly use document versions as well, but checksumming seems more foolproof.
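
For example (the field name and hash format are assumptions), each uploaded file's document could carry a checksum that a Logstash node compares against the hash of its local copy before deciding whether to re-download:

{
  "id": "postfix_grok_patterns",
  "type": "grok_patterns",                         // hypothetical type value
  "ancillary_config": { "patterns": "FOO ...\nBAR ..." },
  "checksum": "sha256:3b4f9a..."                   // hypothetical field
}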

elasticmachine commented 6 years ago

Original comment by @ycombinator:

Between Elastic{ON}, EAH, my VTO, and other projects taking priority, this issue got moved to the back burner. I'm ready to work on it again now, so I want to summarize the discussion and make a concrete proposal again for a (hopefully final) review. After that we can break this proposal into individual issues in various repos and start working on them. So here goes...

Proposal

Kibana UI

Persistence in Elasticsearch

Logstash

Backwards compatibility

It is possible for users to get into a situation where they have older Logstashes (e.g. version 6.2.0) running against a newer version of Elasticsearch (e.g. version 6.4.0) that has the updated .logstash mapping and potentially documents in .logstash representing ancillary configs.

Such older Logstashes should continue to function without error unless a centrally-managed pipeline they're responsible for executing is updated to reference ancillary configs. This will cause the older Logstash to download this pipeline, which would contain ${MANAGED_PIPELINE_FILES} references in some of its plugins' settings, and then try to execute this pipeline. At that time, the MANAGED_PIPELINE_FILES environment variable would not be initialized and the pipeline would likely fail when the plugin in question tries to resolve the path in the relevant setting.
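
For example, a plugin setting in such a pipeline might look like the sketch below; on the 6.2.0 node MANAGED_PIPELINE_FILES is never set, so the reference cannot be resolved to a real path (sketch only, exact syntax TBD):

grok {
  patterns_dir => [ "${MANAGED_PIPELINE_FILES}/postfix_grok_patterns" ]
  match => { "message" => "... %{FOO} ..." }
}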


@andrewvc @yaauie @pickypg What do you think? Is this a fair summary of the discussion so far or did I miss something?

elasticmachine commented 6 years ago

Original comment by @ycombinator:

Related but largely orthogonal (IMO) issue: LINK REDACTED

elasticmachine commented 6 years ago

Original comment by @andrewvc:

@ycombinator this looks great!

Question: the need to support directories seems to be complicating the design quite a bit. If grok just took an array of file paths OR directories (and we deprecated patterns_dir) would that simplify things? That'd be a pretty easy change to make, I think.

elasticmachine commented 6 years ago

Original comment by @ycombinator:

@andrewvc It would definitely simplify things but I also don't know how bad directories would be, both in the UI and for Logstash. I'm also thinking that users will eventually want to organize all their files somehow and having directories for that purpose might be useful too. Let me take a crack at it and if it turns out to be a beast, we can consider the option you brought up.

richard-mauri commented 3 years ago

As an alternative approach, can we consider leveraging the Elasticsearch cluster settings API? I am exploring a custom Logstash filter that will call that API to access a section below the metadata, cache the results, and periodically refresh the cache to avoid too-frequent remote calls to ES. This plugin will set metadata fields in the event based on the discovered response. As a start, we will use the ES cluster PUT API and/or the Kibana dev UI to define the config settings. It's a rough idea, but we need a central config for Logstash now.