elastic / kibana


[Logstash Centralized Config Management] Ancillary Configs #18119

Open elasticmachine opened 6 years ago

elasticmachine commented 6 years ago

Original comment by @ycombinator:

Motivation

Certain Logstash plugins in a pipeline configuration can accept references to ancillary configuration files. The plugins read these files and use their contents as part of the plugin's execution in the pipeline.

For example, users may define a custom grok pattern named FOO in a file named postfix placed under the folder /tmp/custom_grok_patterns/. They can then reference this folder and pattern in the grok filter like so:

grok {
  patterns_dir => [ "/tmp/custom_grok_patterns" ]
  match => { "message" => "... %{FOO} ..." }
}
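
For reference, a custom patterns file is just one NAME-plus-regex pair per line; the pattern bodies below are invented purely for illustration:

# /tmp/custom_grok_patterns/postfix
FOO foo-%{WORD}
BAR bar-%{NUMBER}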

X-Pack Basic and above license users have the ability to centrally manage their Logstash pipeline configurations. Users can CRUD pipeline configurations in a Kibana Management UI and these configurations are (effectively) pushed out to Logstash nodes for execution.

However, users of centralized configuration management are unable to also centrally manage ancillary configuration files like custom grok patterns today. This proposal details how we might provide that capability.

User stories and corresponding UX

  1. Sysadmin Sally wants to centrally manage some custom grok patterns useful for Postfix log processing

    1. Sally visits Kibana Management.
    2. Under Logstash, she clicks the (new) Ancillary Pipeline Configs (naming TBD) link.
    3. She is presented with a listing of various Ancillary Pipeline Config objects of different types (custom grok pattern collections, translate filter lookup collections, etc.).
    4. She clicks the New button on the page and chooses Custom Grok Pattern collection from the New button dropdown menu. [NOTE: Given the rich diversity of ancillary config file types (see Appendix below) we might want to offer a generic file upload option as well].
    5. She is presented with a form where she gives her Custom Grok Pattern collection an ID, say postfix_grok_patterns, and populates the actual custom patterns as well, say FOO and BAR. She saves the form, thereby creating the centrally-managed custom grok pattern collection.
  2. Data Analyst Dan wants to use the FOO custom grok pattern from the postfix_grok_patterns collection in his centrally-managed pipeline configuration

    1. Dan visits Kibana Management.
    2. Under Logstash, he clicks the Pipelines link.
    3. He is presented with a listing of various Pipelines.
    4. He clicks the New button on the page and starts to create his pipeline configuration.
    5. When he reaches the grok filter definition, he references a centrally-managed custom pattern collection like so (exact syntax might need discussion; see open questions below):

      grok {
       patterns_dir => [ "ccm://postfix_grok_patterns" ]
       match => { "message" => "... %{FOO} ..." }
      }

Technical design

The current .logstash index was designed to hold pipeline config documents. The document IDs correspond to user-defined pipeline IDs. The mapping has top-level pipeline-specific fields, pipeline and pipeline_metadata.

We could try to store ancillary configs in the same .logstash index with some mapping changes. Or we could introduce a new .logstash-ancillary-configs (or better/shorter-named :)) index. Details of both options, including pros and cons, are listed below.

Option 1: Reuse .logstash index

First, we will need to "make room" for other types of documents in the .logstash index. This means adding a few new fields to the mapping. The new mapping would then look like this:

{
  "dynamic": "strict", // same as before
  "properties": {
    "description": { "type": "text" }, // same as before
    "last_modified": { "type": "date" }, // same as before
    "metadata": { "type": "object", "dynamic": "false" }, // same as before
    "pipeline": { "type": "text" }, // same as before
    "pipeline_metadata": { // same as before
      "properties": {
        "type": { "type": "keyword" },
        "version": { "type": "short" },
        "username": { "type": "keyword" }
      }
    },
    "id": { "type": "keyword" }, // NEW
    "type": { "type": "keyword" }, // NEW
    "ancillary_config": { "type": "object", "dynamic": "false" } // NEW
  }
}
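
For illustration, documents of both kinds could then coexist in .logstash along these lines (the field values, the type name, and the shape of ancillary_config are assumptions, not settled):

// existing pipeline document (unchanged)
{
  "description": "Postfix pipeline",
  "last_modified": "2018-04-06T00:00:00.000Z",
  "pipeline": "input { ... } filter { grok { ... } } output { ... }"
}

// ancillary config document (uses the NEW fields)
{
  "id": "postfix_grok_patterns",
  "type": "grok_patterns",            // hypothetical type value
  "description": "Custom grok patterns for Postfix logs",
  "last_modified": "2018-04-06T00:00:00.000Z",
  "ancillary_config": { "patterns": "FOO ...\nBAR ..." }
}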

Additionally, we'd also update the logstash-index-template index template with the above mapping.

When creating/updating pipeline objects we do everything the same as now, notably:

Additionally, we:

When creating ancillary objects, we:

Current versions of Logstash (x-pack-logstash) perform a GET .logstash/<pipeline-id> to retrieve a pipeline definition. This can continue to work as before. For ancillary objects, however, Logstash will need to perform a search query based on type and id.
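
For example, a lookup of the postfix_grok_patterns collection might look roughly like this (the type value is a placeholder; exact values are TBD):

POST .logstash/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "grok_patterns" } },
        { "term": { "id": "postfix_grok_patterns" } }
      ]
    }
  }
}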

Pros

Cons

Option 2: Create new .logstash-ancillary-configs index

We leave the .logstash index as-is and continue to use it as we do currently for storing pipeline configs. Additionally we create a .logstash-ancillary-configs (or better/shorter-named) index to hold ancillary config documents. This new index will have the following mapping:

{
  "dynamic": "strict",
  "properties": {
    "id": { "type": "keyword" },
    "description": { "type": "text" },
    "last_modified": { "type": "date" },
    "metadata": { "type": "object", "dynamic": "false" },
    "type": { "type": "keyword" },
    "ancillary_config": { "type": "object", "dynamic": "false" }
  }
}

Pros

Cons

Open Questions

  1. How to reference centrally-managed ancillary pipeline configs in pipeline definitions while keeping backwards compatibility for referencing locally-managed ancillary pipeline configs? Some ideas:

    1. Pseudo-protocol prefix like ccm://. Given that centralized config management is x-pack and many of the plugins that reference ancillary configs are open-source, where would the parsing and resolution of such references live?
    2. New options alongside existing ones, e.g. in the grok plugin, ccm_patterns alongside patterns_dir. Again, would the knowledge of this live in open-source plugins even though CCM is x-pack?
    3. Fake paths. When CCM users create ancillary pipeline configs, they provide the ID in the form of a fake filesystem path, e.g. /tmp/ccm/patterns_dir/postfix. Logstash
    4. Inlining. Is this always possible for all plugins?
    5. Env var prefix. ${CCM_ANC_CONFS}/postfix_grok_patterns. X-pack-logstash then places files under CCM_ANC_CONFS and sets this env var (see the sketch after this list).
  2. What about binary files like GeoIP databases?

    1. Base-64 encode them? Always, regardless of whether the file is binary or not?
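
To illustrate option 1.v: a centrally-managed pipeline would reference the collection via the environment variable, and ${VAR} substitution would resolve it to wherever x-pack-logstash wrote the files (a rough sketch only, assuming Logstash expands such variables before handing values to plugins, which is discussed further below):

# pipeline config as stored in central config management
grok {
  patterns_dir => [ "${CCM_ANC_CONFS}/postfix_grok_patterns" ]
  match => { "message" => "... %{FOO} ..." }
}

# if x-pack-logstash sets CCM_ANC_CONFS=/some/local/dir and writes the collection
# there, the plugin effectively receives:
#   patterns_dir => [ "/some/local/dir/postfix_grok_patterns" ]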

Appendix

List of plugins that take options of type path

Thanks, Joao, for generating this list.

Plugin type | Plugin name | Config name | Description | Comment
--- | --- | --- | --- | ---
codec | netflow | cache_save_path | Netflow template cache directory | Writeable path, cannot be centrally managed
codec | netflow | netflow_definitions | Override YAML file containing Netflow field definitions | Lookup file, YAML
codec | netflow | ipfix_definitions | Override YAML file containing IPFIX field definitions | Lookup file, YAML
filter | cidr | network_path | List of networks | Lookup file, separator delimited
filter | elasticsearch | ca_file | SSL Certificate Authority file | Is this safe to centrally manage?
filter | geoip | database | Path to Maxmind's database file | Lookup file, ?? format
filter | jdbc_static | jdbc_driver_library | JDBC driver library path to third party driver library. In case of multiple libraries being required you can pass them separated by a comma. | JAR? file
filter | jdbc_streaming | jdbc_driver_library | JDBC driver library path to third party driver library. In case of multiple libraries being required you can pass them separated by a comma. | JAR? file
filter | ruby | path | The path of the ruby script file that implements the filter method. | Ruby script file
filter | translate | dictionary_path | The full path of the external dictionary file. | YAML, JSON, or CSV file
input | beats | ssl_certificate | |
input | beats | ssl_key | |
input | couchdb_changes | ca_file | |
input | dead_letter_queue | path | |
input | elasticsearch | ca_file | |
input | google_pubsub | json_key_file | GCE Service Account JSON key file | JSON
input | http | keystore | |
input | jdbc | statement_filepath | |
input | jdbc | jdbc_password_filepath | |
input | kafka | ssl_truststore_location | |
input | kafka | ssl_keystore_location | |
input | kafka | jaas_path | |
input | kafka | kerberos_config | |
input | lumberjack | ssl_certificate | |
input | lumberjack | ssl_key | |
input | puppet_facter | public_key | |
input | puppet_facter | private_key | |
input | relp | ssl_cacert | |
input | relp | ssl_cert | |
input | relp | ssl_key | |
input | tcp | ssl_cacert | |
input | tcp | ssl_cert | |
input | tcp | ssl_key | |
filter | elasticsearch | ca_file | |
input | elasticsearch | ca_file | |
output | elasticsearch | template | |
output | elasticsearch | cacert | |
output | elasticsearch | truststore | |
output | elasticsearch | keystore | |
mixin | http_client | cacert | |
mixin | http_client | client_cert | |
mixin | http_client | client_key | |
mixin | http_client | keystore | |
mixin | http_client | truststore | |
mixin | rabbitmq_connection | ssl_certificate_path | |
mixin | rabbitmq_connection | tls_certificate_path | |
output | elasticsearch | template | |
output | elasticsearch | cacert | |
output | elasticsearch | truststore | |
output | elasticsearch | keystore | |
output | email | template_file | |
output | icinga | ca_file | |
output | kafka | ssl_truststore_location | |
output | kafka | ssl_keystore_location | |
output | kafka | jaas_path | |
output | kafka | kerberos_config | |
output | lumberjack | ssl_certificate | |
output | nagios_nsca | send_nsca_config | |
output | syslog | ssl_cacert | |
output | syslog | ssl_cert | |
output | syslog | ssl_key | |
output | tcp | ssl_cacert | |
output | tcp | ssl_cert | |
output | tcp | ssl_key | |
output | timber | cacert | |
output | timber | client_cert | |
output | timber | client_key | |
output | timber | keystore | |
output | timber | truststore | |
elasticmachine commented 6 years ago

Original comment by @ycombinator:

@andrewvc @yaauie @original-brownbear As this proposal impacts Logstash core functionality, would you mind looking it over and providing feedback? Thank you!

elasticmachine commented 6 years ago

Original comment by @ycombinator:

/cc @acchen97 @jordansissel

elasticmachine commented 6 years ago

Original comment by @yaauie:

@ycombinator I appreciate your desire to come up with a solution for this; it's obviously one of the limiting factors in the current centralised-config setup.

🤔

Presently, each and every plugin that loads ancillary configuration files does so on its own, using either the Java or Ruby standard libraries, or by passing the given paths to their dependencies, which do so; they are in effect bypassing Logstash-core and interacting with the filesystem directly.

Since that is the current state of the world, any solution that doesn't put the files on the filesystem will effectively need to be applied to each and every plugin individually as hand-crafted, bespoke, non-GMO patches. That's a pretty big cost.

That's every option except 1.v from the above "Open Questions" (assuming that Logstash expands those environment variables before handing the values off to the plugin for initialisation; I'm unsure that it does, but that at least would be a single place to make a change):

1.v: Env var prefix. ${CCM_ANC_CONFS}/postfix_grok_patterns. X-pack-logstash then places files under CCM_ANC_CONFS and sets this env var.

Doing so would require that each Logstash, upon connecting to centralised management (and perhaps periodically thereafter), download all files to a directory on the local filesystem, since the plugins are working with the filesystem directly (we have no opportunity for just-in-time retrieval unless we (A) provide that facility in Logstash-core and (B) apply the above-mentioned bespoke patches to each and every plugin).


While I understand the desire for a seamless UX, I'm a little concerned that the proposal to use Elasticsearch to hold the binary data from all the necessary files is a bit like Maslow's Hammer:

I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.

To me, it sounds like a good use-case for LINK REDACTED or LINK REDACTED 😩

elasticmachine commented 6 years ago

Original comment by @ycombinator:

Thanks for the feedback, @yaauie.

assuming that Logstash expands those environment variables before handing the values off to the plugin for initialisation; I'm unsure that it does, but that at least would be a single-place to make a change

I didn't follow why Logstash would need to expand the CCM_ANC_CONF environment variable. I was thinking Logstash would set this variable in the environment with a value that makes sense to Logstash --- i.e. a temporary folder perhaps?

Doing so would require that each Logstash, upon connecting to centralised management (and perhaps periodically thereafter), download all files to a directory on the local filesystem;

Yes, this is what I was thinking too. Logstash would need to download all the files locally upon initially connecting to centralized management, but also when a pipeline is to be restarted. The latter case would require Logstash to parse the pipeline config to determine which files might need re-downloading (in case they have changed).

While I understand the desire for a seamless UX, I'm a little concerned that the proposal to use Elasticsearch to hold the binary data from all the necessary files is a bit like Maslow's Hammer... To me, it sounds like a good use-case for rsync on cron or nfs 😩

It's more the case that we need to somehow get the ancillary files from the end-user to all (or some subset of) Logstash nodes. Since Kibana is the UI for centralized config management, and it talks to LS via docs in ES for pipeline configs, I thought it made sense to use the same mechanism for ancillary configs as well. What is the concern around using ES as a binary store for this purpose? Also, if we were to use rsync on cron or nfs how would that fit in with the centralized config UI being the starting point for the user to upload/create ancillary config files?

elasticmachine commented 6 years ago

Original comment by @yaauie:

assuming that Logstash expands those environment variables before handing the values off to the plugin for initialisation; I'm unsure that it does, but that at least would be a single-place to make a change

I didn't follow why Logstash would need to expand the CCM_ANC_CONF environment variable.

I think we're arguing the same point here, just using different terms.

Variable Expansion (or Parameter Expansion) is the process of taking a string and replacing references to parameters with the values of those parameters; that is, given a parameter CCM whose value is, say, /tmp/ccm, the string "${CCM}/foo" would expand to "/tmp/ccm/foo".

If Logstash doesn't expand the variable before instantiating the plugins (that is, replace the variable name reference in the given string with the variable's value), then the plugins would attempt to load a path with the literal string "${CCM}/foo", which would fail either because $, {, and } aren't legal in a filename path or because there is no file at that literal path (moreover, it would fail in unpredictable ways, because each plugin interacts with the filesystem in its own ways, e.g., a glob-type reference may simply silently return zero matches, while a File.open would likely fail more noisily).


Doing so would require that each Logstash, upon connecting to centralised management (and perhaps periodically thereafter), download all files to a directory on the local filesystem;

Yes, this is what I was thinking too. Logstash would need to download all the files locally upon initially connecting to centralized management, but also when a pipeline is to be restarted. The latter case would require Logstash to parse the pipeline config to determine which files might need re-downloading (in case they have changed).

Logstash has no knowledge about the use of individual config parameters for any of the hundreds of plugins; all that Logstash knows is that it is handing off a String, but how this String is used (e.g., as a file's path) is entirely up to the plugin.

This prevents us from being able to auto-detect which files need to get re-downloaded when we load a plugin. It would be an all-or-nothing event, with a possible optimisation that we could set a time-based marker and only get Elasticsearch documents modified after that timestamp (but even this gets tricky and prone to race-conditions).
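
A rough sketch of that time-based fetch, assuming the last_modified field from the proposed mappings (the index name and timestamp marker are illustrative):

POST .logstash/_search
{
  "query": {
    "range": { "last_modified": { "gt": "2018-04-01T00:00:00Z" } }
  }
}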


What is the concern around using ES as a binary store for [keeping these files in sync]?

Keeping files in sync across multiple machines isn't trivial; there will be a lot of overhead in keeping track of who has what version of what, and a lot of opportunities for race conditions and security vulnerabilities. The *.jar files, e.g., (a) contain executable code and (b) can be tens of megabytes in size; what security and performance implications would we need to address? We would definitely need checksums, but would the IO impact Elasticsearch performance? rsync, nfs, and friends are the cumulative effort of decades of work to solve the complexities of just this one set of problems; do we really think we'll nail it perfectly and quickly on our first go?

That said, I don't have answers or suggestions on how to make this seamless. Creating an NFS volume, and mounting it on all hosts requires configuration, orchestration, and firewall profiles. Rsync too. I'm just a bit wary of reinventing distributed filesystems from first principles.

elasticmachine commented 6 years ago

Original comment by @ycombinator:

I think we're arguing the same point here, just using different terms.

++. As long as Logstash can set CCM in the environment at a certain point before it expands that variable (along with others), I think this could work?

Logstash has no knowledge about the use of individual config parameters for any of the hundreds of plugins; all that Logstash knows is that it is handing off a String, but how this String is used (e.g., as a file's path) is entirely up to the plugin.

Makes sense. What if the document that LS pulls from ES for a pipeline config also contained a field that listed all the CCM file references in that pipeline? Would LS then be able to pull down these files before (re-)starting the pipeline?
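
Something like the following, perhaps, where the extra field name is purely hypothetical:

{
  "description": "Postfix pipeline",
  "pipeline": "input { ... } filter { grok { ... } } output { ... }",
  "last_modified": "2018-04-06T00:00:00.000Z",
  "ancillary_config_references": [ "postfix_grok_patterns" ]   // hypothetical field
}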

Keeping files in sync across multiple machines isn't trivial; there will be a lot of overhead in keeping track of who has what version of what, and a lot of opportunities for race conditions and security vulnerabilities.

Fair point but isn't this an issue with the centralized pipeline configs today as well?

the *.jar files, e.g., (a) contain executable code and (b) can be tens of megabytes in size; what security and performance implications would we need to address? we would definitely need checksums, but would the IO impact Elasticsearch performance?

Yeah, I'm tempted to draw the line at executable code -- meaning, we don't allow generic file uploads as part of this feature, at least in an initial release. Instead we restrict ourselves to a few specific types of non-executable ancillary configs like custom grok patterns and translate plugin dictionaries. Would that help mitigate some of the security and performance concerns?

rsync, nfs, and friends are the cumulative effort of decades of work to solve the complexities of just this one set of problems; do we really think we'll nail it perfectly and quickly on our first go? That said, I don't have answers or suggestions on how to make this seamless. Creating an NFS volume, and mounting it on all hosts requires configuration, orchestration, and firewall profiles. Rsync too. I'm just a bit wary of reinventing distributed filesystems from first principles.

Yeah, this is a tough one -- On one hand I agree it would be unwise to try and reinvent a distributed filesystem AND get it right the first time. On the other hand we've already dipped our toes into this space with distributing pipeline config docs across many LS nodes. Ultimately I'll defer to your judgement on this one as it impacts LS core more than the UI.

I appreciate your detailed thoughts here. I certainly wouldn't want to invest time in a UI for this feature until we have a certain level of comfort and confidence on the core part of it. Thanks much!

elasticmachine commented 6 years ago

Original comment by @pickypg:

I think that we should ignore any storage of executable binaries as part of this effort. As far as I am aware, there is no desire to store and transmit code on behalf of Logstash; it was just brought up because of the listing of paths that happened to be jars (e.g., for JDBC). The idea was always to simply orchestrate Logstash nodes from what I had heard.

This includes simpler ideas, like allowing users to add extra Grok patterns and Netflow definitions. I think we're getting sidetracked by discussing binary stores, including the GeoIP database. If the user wants to add a custom database, then I think it is fair that they install it with Logstash. On that note, it also seems remarkably dangerous to consider deploying Ruby code from Elasticsearch to any Logstash node that will listen, which is as bad as deploying arbitrary jar files.

My vote:

Index option 1 to extend and reuse the same index. I would just use the name "config".

As I have thought about this more, my other vote might be to tie this information to the pipeline itself and separately, which is how visualizations and dashboards are separated in Kibana. From there, any change to the configuration would be apply-able to the associated pipelines, but the user could just create a new pipeline and test it out without impacting an existing pipeline (the UI could then show, based on the config's hash, which pipelines were using it). I think that this would also simplify the Logstash side of the implementation by continuing to allow it to only fetch a single document.

Finally, I would ignore any option that requires the storage of binaries intended for some type of execution as well as strings intended for arbitrary execution; as far as I know, Ruby is not a safe scripting language and such an opening would allow a wide variety of critical paths to be vulnerable. Binary data is a different beast and Elasticsearch is not a bad place to store that (as a non-indexed field of course, which is what both options show). At the very least, that can be a future phase.

elasticmachine commented 6 years ago

Original comment by @ycombinator:

I think that we should ignore any storage of executable binaries as part of this effort. As far as I am aware, there is no desire to store and transmit code on behalf of Logstash; it was just brought up because of the listing of paths that happened to be jars (e.g., for JDBC).

Ahh, yes, sorry -- I should've been clearer: the appendix is more of an audit of what paths we have in Logstash plugins today, just to get an idea of what types of files we're looking at. It wasn't necessarily meant to be all the types of files we should support with this feature [EDIT: in an initial release anyway]!

As I have thought about this more, my other vote might be to tie this information to the pipeline itself and separately, which is how visualizations and dashboards are separated in Kibana. From there, any change to the configuration would be apply-able to the associated pipelines, but the user could just create a new pipeline and test it out without impacting an existing pipeline (the UI could then show, based on the config's hash, which pipelines were using it). I think that this would also simplify the Logstash side of the implementation by continuing to allow it to only fetch a single document.

++ I like this idea! [EDIT: This will still require (in my mind):

  • Logstash to choose a filesystem location where centrally-managed ancillary files should live,
  • Set CCM (or some other name) env. var to this filesystem location,
  • Grab the ancillary config content from the pipeline doc and write it to the filesystem location,
  • Expand CCM (as it does with any env vars) before handing it off to the plugin for execution. -- LINK REDACTED]

Thoughts on this reduced-scope proposal, @yaauie?

elasticmachine commented 6 years ago

Original comment by @yaauie:

I like the idea of ancillary configs being a part of an individual pipeline; while it does slightly reduce the reusability of those ancillary configs, it significantly limits the scope of what needs to be synchronised.

This will still require (in my mind):

  • Logstash to choose a filesystem location where centrally-managed ancillary files should live,
  • Set CCM (or some other name) env. var to this filesystem location,
  • Grab the ancillary config content from the pipeline doc and write it to the filesystem location,
  • Expand CCM (as it does with any env vars) before handing it off to the plugin for execution. -- LINK REDACTED

In general, this makes sense to me; within Logstash, we could create a temporary directory upon each managed-pipeline reload, and populate it with ancillary config files from the config document we already fetched from Elasticsearch before registering the plugins. We'll also need to consider a cleanup phase (and the ability to opt-out of cleanup so we can debug troublesome setups).

Potential points for confusion:


To make its use/intent clear and easy to debug, both the environment variable name and resulting temporary data path should be tightly-linked; if/when a plugin has an issue, it will raise/log with the expanded path, so we need to have an obvious connection to the environment variable name as seen in the config. It should also indicate that it is for managed pipelines (which hopefully directs people to managed pipeline documentation).

To reduce the likelihood that a user attempts to "cross the streams" and use one pipeline's ancillary config from another pipeline, the environment variable name (and resulting path) should also clearly indicate that it represents a file-store for this pipeline.

With this in mind, what about the following?

elasticmachine commented 6 years ago

Original comment by @ycombinator:

@yaauie That proposal makes sense to me. I like the scoping per-pipeline, per-ephemeral-ID, so we can debug which running instance of a pipeline used what ancillary state. 👍

I'll defer to the LS core folks on details, but does it make sense to use the data folder (generally whatever path.data points to) instead of ${TMP} --- mostly so we have one place to look for any LS-managed state on disk? Again, that's a detail that is hidden from the UI code's POV so I'm good with whatever you folks think is best there.


With this, I think I have enough information now to start designing/prototyping a UI. I'm thinking:

elasticmachine commented 6 years ago

Original comment by @andrewvc:

I love the way this discussion has gone so far, and it lines up with my expectations. Some thoughts

  • I would prefer Index Option 1 with a modification. I don't think it's ideal to rely on search to get these new documents vs the GET API. I propose that we namespace future configs with logstash-config-{config_name} and have logstash x-pack try both paths for backward compatibility. Editing a config in Kibana would migrate it.

elasticmachine commented 6 years ago

Original comment by @ycombinator:

I don't think it's ideal to rely on search to get these new documents vs the GET API.

I'm curious: why is it less ideal to rely on _search with a term query on the new id field than to continue using the GET API like x-pack-logstash does today?

I propose that we namespace future configs with logstash-config-{config_name} and have logstash x-pack try both paths for backward compatibility. Editing a config in Kibana would migrate it.

++ to this. I assume you are okay with auto-generated _ids for ancillary config file documents, however?

elasticmachine commented 6 years ago

Original comment by @andrewvc:

@ycombinator the primary problem with the _search API is that it isn't always instantly right (it usually is); GET is always realtime. You can force it via a _refresh, but better to avoid that complexity, right?

So there's a possibility of a race: if an LS gets a new config, it could pull a different doc. It's small, but I'd rather avoid it. I also think it'll make the code easier to maintain going forward, because document IDs are the only thing in ES that are like a primary key.

elasticmachine commented 6 years ago

Original comment by @ycombinator:

the primary problem with the _search API is that it isn't always instantly right (it usually is); GET is always realtime. You can force it via a _refresh, but better to avoid that complexity, right?

Ah yes, of course, that buffer! Thanks for refreshing my memory on this 😄

So there's a possibility of a race: if an LS gets a new config, it could pull a different doc. It's small, but I'd rather avoid it. I also think it'll make the code easier to maintain going forward, because document IDs are the only thing in ES that are like a primary key.

Makes sense.

elasticmachine commented 6 years ago

Original comment by @ycombinator:

I was just reminded that the translate filter has a refresh_interval option. Logstash re-reads the translate dictionary every refresh_interval seconds (defaulting to 300 seconds = 5 minutes) from disk into memory. I see some implications of this for centrally-managed ancillary configs, like a translate filter dictionary.

If a user updates the translate dictionary object in Kibana, should Kibana also update all pipeline documents using this object? If so, Logstash could do one of two things:

  1. Write the translate dictionary object to disk (perhaps by first writing to a temporary file and then mv-ing it into place so the write is atomic and the translate dictionary file is never empty while being read by Logstash), or

  2. Create an entirely new pipeline instance (i.e. new ephemeral_id) that uses the new translate dictionary object, thereby reloading the pipeline.

Personally, I'd think option 2 is preferable but I'll defer to the Logstash core folks on this. Just wanted to raise this user story so we have a solution for it. Thoughts, @andrewvc @yaauie?

elasticmachine commented 6 years ago

Original comment by @andrewvc:

@ycombinator great point! I actually think 1 is preferable. Reloading a pipeline can affect performance.

This also makes me think we should checksum all files that are uploaded as a way of checking what's local vs. remote.

We could possibly use document versions as well, but checksumming seems more foolproof.
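
For example (the field name and hash format are assumptions), each uploaded file's document could carry a checksum that a Logstash node compares against the hash of its local copy before deciding whether to re-download:

{
  "id": "postfix_grok_patterns",
  "type": "grok_patterns",                         // hypothetical type value
  "ancillary_config": { "patterns": "FOO ...\nBAR ..." },
  "checksum": "sha256:3b4f9a..."                   // hypothetical field
}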

elasticmachine commented 6 years ago

Original comment by @ycombinator:

Between Elastic{ON}, EAH, my VTO, and other projects taking priority, this issue got moved to the back burner. I'm ready to work on it again now, so I want to summarize the discussion and make a concrete proposal again for a (hopefully final) review. After that we can break this proposal into individual issues in various repos and start working on them. So here goes...

Proposal

Kibana UI

Persistence in Elasticsearch

Logstash

Backwards compatibility

It is possible for users to get into a situation where they have older Logstashes (e.g. version 6.2.0) running against a newer version of Elasticsearch (e.g. version 6.4.0) that has the updated .logstash mapping and potentially documents in .logstash representing ancillary configs.

Such older Logstashes should continue to function without error unless a centrally-managed pipeline they're responsible for executing is updated to reference ancillary configs. This will cause the older Logstash to download this pipeline, which would contain ${MANAGED_PIPELINE_FILES} references in some of its plugins' settings, and then try to execute this pipeline. At that time, the MANAGED_PIPELINE_FILES environment variable would not be initialized and the pipeline would likely fail when the plugin in question tries to resolve the path in the relevant setting.
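
For example, a plugin setting in such a pipeline might look like the sketch below; on the 6.2.0 node MANAGED_PIPELINE_FILES is never set, so the reference cannot be resolved to a real path (sketch only, exact syntax TBD):

grok {
  patterns_dir => [ "${MANAGED_PIPELINE_FILES}/postfix_grok_patterns" ]
  match => { "message" => "... %{FOO} ..." }
}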


@andrewvc @yaauie @pickypg What do you think? Is this a fair summary of the discussion so far or did I miss something?

elasticmachine commented 6 years ago

Original comment by @ycombinator:

Related but largely orthogonal (IMO) issue: LINK REDACTED

elasticmachine commented 6 years ago

Original comment by @andrewvc:

@ycombinator this looks great!

Question: the need to support directories seems to be complicating the design quite a bit. If grok just took an array of file paths OR directories (and we deprecated patterns_dir) would that simplify things? That'd be a pretty easy change to make, I think.

elasticmachine commented 6 years ago

Original comment by @ycombinator:

@andrewvc It would definitely simplify things but I also don't know how bad directories would be, both in the UI and for Logstash. I'm also thinking that users will eventually want to organize all their files somehow and having directories for that purpose might be useful too. Let me take a crack at it and if it turns out to be a beast, we can consider the option you brought up.

richard-mauri commented 3 years ago

As an alternative approach, can we consider leveraging the Elasticsearch cluster settings API? I am exploring a custom Logstash filter that will call that API to access a section below the metadata, cache the results, and periodically refresh the cache to avoid too-frequent remote calls to ES. This plugin will set metadata fields in the event based on the discovered response. As a start, we will use the ES cluster PUT API and/or the Kibana dev UI to define the config settings. It's a rough idea, but we need a central config for Logstash now.