elasticmachine opened 6 years ago
Original comment by @ycombinator:
@andrewvc @yaauie @original-brownbear As this proposal impacts Logstash core functionality, would you mind looking it over and providing feedback? Thank you!
Original comment by @ycombinator:
/cc @acchen97 @jordansissel
Original comment by @yaauie:
@ycombinator I appreciate your desire to come up with a solution for this; it's obviously one of the limiting factors in the current centralised-config setup.
Presently, each and every plugin that loads ancillary configuration files does so on its own, using either Java or Ruby standard libraries, or by passing the given paths to their dependencies, which do so; they are in effect bypassing Logstash-core and interacting with the filesystem directly.

Since that is the current state of the world, any solution that doesn't put the files on the filesystem will effectively need to be applied to each and every plugin individually as hand-crafted, bespoke, non-GMO patches. That's a pretty big cost.

That's every option except 1.v from the above "Open Questions" (assuming that Logstash expands those environment variables before handing the values off to the plugin for initialisation; I'm unsure that it does, but that at least would be a single place to make a change):

> 1.v: Env var prefix. `${CCM_ANC_CONFS}/postfix_grok_patterns`. X-pack-logstash then places files under `CCM_ANC_CONFS` and sets this env var.

Doing so would require that each Logstash, upon connecting to centralised management (and perhaps periodically thereafter), download all files to a directory on the local filesystem, since the plugins are working with the filesystem directly (we have no opportunity for just-in-time retrieval unless we (A) provide that facility in Logstash-core and (B) apply the above-mentioned bespoke patches to each and every plugin).
While I understand the desire for a seamless UX, I'm a little concerned that the proposal to use Elasticsearch to hold the binary data from all the necessary files is a bit like Maslow's Hammer:
> I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.

To me, it sounds like a good use-case for LINK REDACTED or LINK REDACTED.
Original comment by @ycombinator:
Thanks for the feedback, @yaauie.
> assuming that Logstash expands those environment variables before handing the values off to the plugin for initialisation; I'm unsure that it does, but that at least would be a single place to make a change
I didn't follow why Logstash would need to expand the `CCM_ANC_CONFS` environment variable. I was thinking Logstash would set this variable in the environment with a value that makes sense to Logstash --- i.e. a temporary folder, perhaps?
> Doing so would require that each Logstash, upon connecting to centralised management (and perhaps periodically thereafter), download all files to a directory on the local filesystem;
Yes, this is what I was thinking too. Logstash would need to download all the files locally upon initially connecting to centralized management, but also when a pipeline is to be restarted. The latter case would require Logstash to parse the pipeline config to determine which files might need re-downloading (in case they have changed).
> While I understand the desire for a seamless UX, I'm a little concerned that the proposal to use Elasticsearch to hold the binary data from all the necessary files is a bit like Maslow's Hammer... To me, it sounds like a good use-case for rsync on cron or nfs
It's more the case that we need to somehow get the ancillary files from the end-user to all (or some subset of) Logstash nodes. Since Kibana is the UI for centralized config management, and it talks to LS via docs in ES for pipeline configs, I thought it made sense to use the same mechanism for ancillary configs as well. What is the concern around using ES as a binary store for this purpose? Also, if we were to use rsync on cron or nfs, how would that fit in with the centralized config UI being the starting point for the user to upload/create ancillary config files?
Original comment by @yaauie:
> > assuming that Logstash expands those environment variables before handing the values off to the plugin for initialisation; I'm unsure that it does, but that at least would be a single place to make a change
>
> I didn't follow why Logstash would need to expand the `CCM_ANC_CONFS` environment variable.
I think we're arguing the same point here, just using different terms.
Variable Expansion (or Parameter Expansion) is the process of taking a string and replacing references to parameters with the values of those parameters; that is, given `CCM` with value `/tmp/logstash/ccm-conf`, expanding `${CCM}/my_file.txt` produces the string `/tmp/logstash/ccm-conf/my_file.txt`.

If Logstash doesn't expand the variable before instantiating the plugins (that is, replace the variable name reference in the given string with the variable's value), then the plugins would attempt to load a path with the literal string `"${CCM}/foo"`, which would fail either because `$`, `{`, and `}` aren't legal in a filename path or because there is no file at that literal path. Moreover, it would fail in unpredictable ways, because each plugin interacts with the filesystem in its own way: e.g., a glob-type reference may simply silently return zero matches, while a `File.open` would likely fail more noisily.
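A minimal illustration of that expansion step (a hypothetical sketch, not Logstash-core's actual implementation):

```ruby
# Given a parameter table, expand each ${NAME} reference in a string.
env = { "CCM" => "/tmp/logstash/ccm-conf" }

expanded = "${CCM}/my_file.txt".gsub(/\$\{(\w+)\}/) { env[Regexp.last_match(1)] }
expanded # => "/tmp/logstash/ccm-conf/my_file.txt"
```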
> > Doing so would require that each Logstash, upon connecting to centralised management (and perhaps periodically thereafter), download all files to a directory on the local filesystem;
>
> Yes, this is what I was thinking too. Logstash would need to download all the files locally upon initially connecting to centralized management, but also when a pipeline is to be restarted. The latter case would require Logstash to parse the pipeline config to determine which files might need re-downloading (in case they have changed).
Logstash has no knowledge about the use of individual config parameters for any of the hundreds of plugins; all that Logstash knows is that it is handing off a `String`, but how this `String` is used (e.g., as a file's path) is entirely up to the plugin.

This prevents us from being able to auto-detect which files need to get re-downloaded when we load a plugin. It would be an all-or-nothing event, with a possible optimisation that we could set a time-based marker and only get Elasticsearch documents modified after that timestamp (but even this gets tricky and prone to race-conditions).
> What is the concern around using ES as a binary store for [keeping these files in sync]?

- Keeping files in sync across multiple machines isn't trivial; there will be a lot of overhead in keeping track of who has what version of what, and a lot of opportunities for race conditions and security vulnerabilities.
- `*.jar` files, e.g., (a) contain executable code and (b) can be tens of megabytes in size; what security and performance implications would we need to address? We would definitely need checksums, but would the IO impact Elasticsearch performance?
- `rsync`, `nfs`, and friends are the cumulative effort of decades of work to solve the complexities of just this one set of problems; do we really think we'll nail it perfectly and quickly on our first go?

That said, I don't have answers or suggestions on how to make this seamless. Creating an NFS volume and mounting it on all hosts requires configuration, orchestration, and firewall profiles. Rsync too. I'm just a bit wary of reinventing distributed filesystems from first principles.
Original comment by @ycombinator:
> I think we're arguing the same point here, just using different terms.

++. As long as Logstash can set `CCM` in the environment at a certain point before it expands that variable (along with others), I think this could work?
> Logstash has no knowledge about the use of individual config parameters for any of the hundreds of plugins; all that Logstash knows is that it is handing off a `String`, but how this `String` is used (e.g., as a file's path) is entirely up to the plugin.
Makes sense. What if the document that LS pulls from ES for a pipeline config also contained a field that listed all the CCM file references in that pipeline? Would LS then be able to pull down these files before (re-)starting the pipeline?
> Keeping files in sync across multiple machines isn't trivial; there will be a lot of overhead in keeping track of who has what version of what, and a lot of opportunities for race conditions and security vulnerabilities.

Fair point, but isn't this an issue with the centralized pipeline configs today as well?
> the `*.jar` files, e.g., (a) contain executable code and (b) can be tens of megabytes in size; what security and performance implications would we need to address? we would definitely need checksums, but would the IO impact Elasticsearch performance?
Yeah, I'm tempted to draw the line at executable code -- meaning, we don't allow generic file uploads as part of this feature, at least in an initial release. Instead, we restrict ourselves to a few specific types of non-executable ancillary configs, like custom grok patterns and translate plugin dictionaries. Would that help mitigate some of the security and performance concerns?
> `rsync`, `nfs`, and friends are the cumulative effort of decades of work to solve the complexities of just this one set of problems; do we really think we'll nail it perfectly and quickly on our first go? That said, I don't have answers or suggestions on how to make this seamless. Creating an NFS volume, and mounting it on all hosts requires configuration, orchestration, and firewall profiles. Rsync too. I'm just a bit wary of reinventing distributed filesystems from first principles.
Yeah, this is a tough one. On one hand, I agree it would be unwise to try to reinvent a distributed filesystem AND get it right the first time. On the other hand, we've already dipped our toes into this space by distributing pipeline config docs across many LS nodes. Ultimately I'll defer to your judgement on this one, as it impacts LS core more than the UI.
I appreciate your detailed thoughts here. I certainly wouldn't want to invest time in a UI for this feature until we have a certain level of comfort and confidence on the core part of it. Thanks much!
Original comment by @pickypg:
I think that we should ignore any storage of executable binaries as part of this effort. As far as I am aware, there is no desire to store and transmit code on behalf of Logstash; it was just brought up because of the listing of paths that happened to be jars (e.g., for JDBC). The idea was always to simply orchestrate Logstash nodes from what I had heard.
This includes simpler ideas, like allowing users to add extra Grok patterns and Netflow definitions. I think we're getting sidetracked by discussing binary stores, including the GeoIP database. If the user wants to add a custom database, then I think it is fair that they install it with Logstash. On that note, it also seems remarkably dangerous to consider deploying Ruby code from Elasticsearch to any Logstash node that will listen, which is as bad as deploying arbitrary jar files.
My vote:

- Index option 1, extending and reusing the same index. I would just use the name "config".
As I have thought about this more, my other vote might be to tie this information to the pipeline itself and separately, which is how visualizations and dashboards are separated in Kibana. From there, any change to the configuration would be apply-able to the associated pipelines, but the user could just create a new pipeline and test it out without impacting an existing pipeline (the UI could then show, based on the config's hash, which pipelines were using it). I think that this would also simplify the Logstash side of the implementation by continuing to allow it to only fetch a single document.
Finally, I would ignore any option that requires the storage of binaries intended for some type of execution as well as strings intended for arbitrary execution; as far as I know, Ruby is not a safe scripting language and such an opening would allow a wide variety of critical paths to be vulnerable. Binary data is a different beast and Elasticsearch is not a bad place to store that (as a non-indexed field of course, which is what both options show). At the very least, that can be a future phase.
Original comment by @ycombinator:
> I think that we should ignore any storage of executable binaries as part of this effort. As far as I am aware, there is no desire to store and transmit code on behalf of Logstash; it was just brought up because of the listing of paths that happened to be jars (e.g., for JDBC).
Ahh, yes, sorry, I should've been clearer: the appendix is more of an audit of what paths we have in Logstash plugins today, just to get an idea of what types of files we're looking at. It wasn't necessarily meant to be all the types of files we should support with this feature [EDIT: in an initial release, anyway]!
> As I have thought about this more, my other vote might be to tie this information to the pipeline itself and separately, which is how visualizations and dashboards are separated in Kibana. From there, any change to the configuration would be apply-able to the associated pipelines, but the user could just create a new pipeline and test it out without impacting an existing pipeline (the UI could then show, based on the config's hash, which pipelines were using it). I think that this would also simplify the Logstash side of the implementation by continuing to allow it to only fetch a single document.
++ I like this idea! [EDIT: This will still require (in my mind):

- Logstash to choose a filesystem location where centrally-managed ancillary files should live,
- Set `CCM` (or some other name) env. var to this filesystem location,
- Grab the ancillary config content from the pipeline doc and write it to the filesystem location,
- Expand `CCM` (as it does with any env vars) before handing it off to the plugin for execution.]

Thoughts on this reduced-scope proposal, @yaauie?
Original comment by @yaauie:
I like the idea of ancillary configs being a part of an individual pipeline; while it slightly reduces the reusability of those ancillary configs, it significantly limits the scope of what needs to be synchronised.
> This will still require (in my mind):
>
> - Logstash to choose a filesystem location where centrally-managed ancillary files should live,
> - Set `CCM` (or some other name) env. var to this filesystem location,
> - Grab the ancillary config content from the pipeline doc and write it to the filesystem location,
> - Expand `CCM` (as it does with any env vars) before handing it off to the plugin for execution. -- LINK REDACTED
In general, this makes sense to me; within Logstash, we could create a temporary directory upon each managed-pipeline reload, and populate it with ancillary config files from the config document we already fetched from Elasticsearch before registering the plugins. We'll also need to consider a cleanup phase (and the ability to opt-out of cleanup so we can debug troublesome setups).
Potential points for confusion:
To make its use/intent clear and easy to debug, both the environment variable name and resulting temporary data path should be tightly-linked; if/when a plugin has an issue, it will raise/log with the expanded path, so we need to have an obvious connection to the environment variable name as seen in the config. It should also indicate that it is for managed pipelines (which hopefully directs people to managed pipeline documentation).
To reduce the likelihood that a user attempts to "cross the streams" and use one pipeline's ancillary config from another pipeline, the environment variable name (and resulting path) should also clearly indicate that it represents a file-store for this pipeline.
With this in mind, what about the following?

- Environment variable name: `MANAGED_PIPELINE_FILES`
- Resulting path: `${TMP}/managed-pipeline-files/${PIPELINE_ID}-${PIPELINE_EPHEMERAL_ID}/`, where
  - `TMP` is a Logstash-configured or generated temporary data directory,
  - `PIPELINE_ID` is a pipeline's external id (e.g., `main`), and
  - `PIPELINE_EPHEMERAL_ID` is a pipeline's ephemeral id (e.g., a uuid, to prevent stale files from infecting a given pipeline run after reload).
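For illustration, a rough sketch of those mechanics (hypothetical code, not actual x-pack-logstash; the file names and contents are made up):

```ruby
require "fileutils"
require "securerandom"
require "tmpdir"

pipeline_id  = "main"
ephemeral_id = SecureRandom.uuid

# ${TMP}/managed-pipeline-files/${PIPELINE_ID}-${PIPELINE_EPHEMERAL_ID}/
base = File.join(Dir.tmpdir, "managed-pipeline-files", "#{pipeline_id}-#{ephemeral_id}")
FileUtils.mkdir_p(base)

# Ancillary files as fetched from the pipeline document (assumed shape).
ancillary_files = { "custom_grok_patterns/postfix" => "FOO [a-z]+\n" }
ancillary_files.each do |relative_path, contents|
  path = File.join(base, relative_path)
  FileUtils.mkdir_p(File.dirname(path))
  File.write(path, contents)
end

# Expanded by Logstash wherever a config references ${MANAGED_PIPELINE_FILES}.
ENV["MANAGED_PIPELINE_FILES"] = base
```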
Original comment by @ycombinator:
@yaauie That proposal makes sense to me. I like the scoping per-pipeline, per-ephemeral-ID, so we can debug which running instance of a pipeline used what ancillary state.
I'll defer to the LS core folks on details, but does it make sense to use the `data` folder (generally whatever `path.data` points to) instead of `${TMP}` --- mostly so we have one place to look for any LS-managed state on disk? Again, that's a detail that is hidden from the UI code's POV, so I'm good with whatever you folks think is best there.
With this, I think I have enough information now to start designing/prototyping a UI. I'm thinking:
The UI will (at least initially) allow users to create specific types of ancillary config files, probably starting out with custom grok patterns and translate filter lookup files. @acchen97, do these seem like a good starting point to you? We can always add more later if they make sense and are safe (see discussion in previous comments above). Users will have to give each of these a unique ID, at least unique within that type of config file (e.g. `my-custom-grok-patterns`).
When a user saves an ancillary config file, Kibana will persist it into ES (in the `.logstash` index, using the `ancillary_config` field and not populating the `pipeline*` fields). The `_id` of the document can be auto-generated by ES (see Technical design > Option 1 in the issue description for more details). Since Logstash does a `GET .logstash/<pipeline id>` to retrieve specific pipeline documents, these ancillary config docs should have no effect on Logstash.
A user may then create/edit a pipeline from the UI as they do today. When they get to a `grok` filter, they will be able to reference their ancillary config file in the `patterns_dir` option like so:
```
grok {
  patterns_dir => "${MANAGED_PIPELINE_FILES}/my-custom-grok-patterns"
  ...
}
```
Imagine something similar for the `translate` filter and its `dictionary_path` setting.
When the user saves this pipeline config, the UI code will parse it and determine the ancillary config IDs referenced in the pipeline. It will then fetch those docs from ES, "inline" them into the pipeline document itself, say in a new field called `pipeline_ancillary_configs`, and persist the pipeline doc into `.logstash`. Logstash would then pick up this doc (as before), persist the inlined ancillary config content in `pipeline_ancillary_configs` to the path (as proposed by @yaauie), set the `MANAGED_PIPELINE_FILES` env var, and run the pipeline.
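For illustration, the stored pipeline document might then look roughly like this (a sketch; the inlined contents and pipeline snippet are hypothetical):

```
{
  "pipeline": "filter { grok { patterns_dir => \"${MANAGED_PIPELINE_FILES}/my-custom-grok-patterns\" } }",
  "pipeline_ancillary_configs": {
    "my-custom-grok-patterns": "FOO [a-z]+"
  }
}
```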
Original comment by @andrewvc:
I love the way this discussion has gone so far, and it lines up with my expectations. Some thoughts:
- I would prefer Index Option 1 with a modification. I don't think it's ideal to rely on search to get these new documents vs. the GET API. I propose that we namespace future configs with `logstash-config-{config_name}` and have Logstash x-pack try both paths for backward compatibility. Editing a config in Kibana would migrate it.
- WRT tmp data, I assume we'll just use LINK REDACTED from Java. The data directory is more for user-generated data.
- I'm in favor of storing JARs and other stuff in Elasticsearch; it actually works better than you might expect with the Elasticsearch `binary` type. I'd say a JAR is about the max, though. I wouldn't store a 100MB file, but I would store a 10MB one.
Original comment by @ycombinator:
> I don't think it's ideal to rely on search to get these new documents vs. the GET API.
I'm curious: why is it less ideal to rely on `_search` with a `term` query on the new `id` field than to continue using the `GET` API like x-pack-logstash does today?
> I propose that we namespace future configs with `logstash-config-{config_name}` and have logstash x-pack try both paths for backward compatibility. Editing a config in Kibana would migrate it.
++ to this. I assume you are okay with auto-generated `_id`s for ancillary config file documents, however?
Original comment by @andrewvc:
@ycombinator the primary problem with the `_search` API is that it isn't always instantly right (it usually is); `GET` is always realtime. You can force it via a `_refresh`, but better to avoid that complexity, right?

So there's a possibility of a race: if an LS gets a new config, it could pull a different doc. It's small, but I'd rather avoid it. I also think it'll make the code easier to maintain going forward, because document IDs are the only thing in ES that are like a primary key.
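To make the distinction concrete (index and field names as used in this thread; the IDs are examples):

```
# Realtime: a GET by document ID always reflects the latest indexed version.
GET .logstash/my-pipeline-id

# Near-realtime: a search may miss documents indexed since the last refresh
# (refreshes happen roughly once per second by default).
GET .logstash/_search
{
  "query": { "term": { "id": "my-custom-grok-patterns" } }
}
```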
Original comment by @ycombinator:
> the primary problem with the `_search` API is that it isn't always instantly right, it usually is. `GET` is always realtime. You can force it via a `_refresh`, but better to avoid that complexity right?
Ah yes, of course, that buffer! Thanks for refreshing my memory on this.
> So there's a possibility of a race: if an LS gets a new config, it could pull a different doc. It's small, but I'd rather avoid it. I also think it'll make the code easier to maintain going forward, because document IDs are the only thing in ES that are like a primary key.
Makes sense.
Original comment by @ycombinator:
I was just reminded that the `translate` filter has a `refresh_interval` option. Logstash re-reads the translate dictionary from disk into memory every `refresh_interval` seconds (defaulting to 300 seconds = 5 minutes). I see some implications of this for centrally-managed ancillary configs, like a translate filter dictionary.
If a user updates the translate dictionary object in Kibana, should Kibana also update all pipeline documents using this object? If so, Logstash could do one of two things:
1. Write the translate dictionary object to disk (perhaps by first writing to a temporary file and then `mv`ing it into place so the write is atomic and the translate dictionary file is never empty while being read by Logstash), or
2. Create an entirely new pipeline instance (i.e. a new `ephemeral_id`) that uses the new translate dictionary object, thereby reloading the pipeline.
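For option 1, a minimal sketch of that write-then-`mv` pattern (a hypothetical helper, not actual x-pack-logstash code; it relies on a same-filesystem `File.rename` being atomic):

```ruby
require "fileutils"
require "tempfile"

# Atomically replace the on-disk dictionary so a concurrent reader (e.g. the
# translate filter's periodic refresh) never sees a partially-written file.
def atomic_write(final_path, contents)
  dir = File.dirname(final_path)
  FileUtils.mkdir_p(dir)

  # Write to a temp file in the same directory so the rename below stays on
  # one filesystem, which POSIX guarantees to be atomic.
  temp = Tempfile.create(File.basename(final_path), dir)
  begin
    temp.write(contents)
    temp.fsync
  ensure
    temp.close
  end

  File.rename(temp.path, final_path) # the "mv into place" step
end
```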
Personally, I'd think option 2 is preferable but I'll defer to the Logstash core folks on this. Just wanted to raise this user story so we have a solution for it. Thoughts, @andrewvc @yaauie?
Original comment by @andrewvc:
@ycombinator great point! I actually think 1 is preferable. Reloading a pipeline can affect performance.
This also makes me think we should checksum all files that are uploaded as a way of checking what's local vs. remote.
We could possibly use document versions as well, but checksumming seems more foolproof.
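A sketch of what that check could look like (hypothetical helper; assumes a SHA-256 checksum is stored alongside each file's document in Elasticsearch):

```ruby
require "digest"

# True if the local copy is missing or differs from the remote checksum,
# i.e. the file needs to be re-downloaded.
def stale?(local_path, remote_sha256)
  return true unless File.exist?(local_path)
  Digest::SHA256.file(local_path).hexdigest != remote_sha256
end
```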
Original comment by @ycombinator:
Between Elastic{ON}, EAH, my VTO, and other projects taking priority, this issue got moved to the back burner. I'm ready to work on it again now, so I want to summarize the discussion and make a concrete proposal for (hopefully final) review. After that, we can break this proposal into individual issues in various repos and start working on them. So here goes...
Under Management > Logstash, we currently have a link labeled Pipelines; clicking this takes users to the CRUD UI for centrally managing Logstash pipelines. We would add an Ancillary Files (final name might be different) link as a sibling to the Pipelines link.
Clicking on the Ancillary Files link would bring users to a CRUD UI for such files.

- Users will choose an `id` for their file so that it may be used as a reference in pipeline definitions, e.g. `http_codes_map`.
- Users will be able to organize files into folders, e.g. `custom_grok_patterns` or `my_translate_dictionaries`. Folders are especially necessary because certain plugins' settings want folder paths, not file paths (e.g. the `patterns_dir` setting of the `grok` plugin). However, folders also give users the ability to group related files together, which might be useful.

Over in the Pipelines CRUD UI, when users create or edit a pipeline, they will be able to reference any centrally-managed ancillary files or folders (depending on the plugin's demands) via their paths like so:
```
translate {
  dictionary_path => "${MANAGED_PIPELINE_FILES}/my_translate_dictionaries/http_codes_map"
  ...
}
```
or
```
grok {
  patterns_dir => "${MANAGED_PIPELINE_FILES}/custom_grok_patterns"
}
```
When users create new ancillary config files via the UI, Kibana will store them as separate documents in the `.logstash` index. This will require creating some new fields in the `.logstash` mapping, notably `type` and `id`, to "make room" for documents representing ancillary config files, since the index today is used exclusively for documents representing pipelines.
- Pipeline documents will continue to set the `_id` field to the pipeline ID as entered by users in the UI. This will help preserve backwards compatibility (more about this in the Backwards Compatibility section below). Additionally, such documents will set `type` to `pipeline` and `id` to the pipeline ID (same as the value of `_id`).
- Ancillary config file documents will set `type` to `config` (or `ancillary_config`, since everything is a config, after all?) and `id` to the ancillary config ID entered by the user in the UI. The `_id` will be auto-generated by ES. Additionally, ancillary config file documents will have a `binary` field, `config_contents` (or `ancillary_config_contents`), to hold the actual contents of the ancillary config file.

When a user makes such references in the pipeline definition, the UI code will parse out such references and build up a list of files required by the pipeline. It will then retrieve the contents of those files from the `.logstash` index and insert them into the pipeline's document in Elasticsearch, as a sibling of the pipeline definition itself, in a new field called `pipeline_ancillary_configs`.
When Logstash makes its `GET .logstash/<pipeline-id>` request to retrieve a centrally-managed pipeline, it will also get any ancillary config files required by that pipeline as part of the response. X-Pack Logstash code will then:

- Compute each file's target path, `${TMP}/managed-pipeline-files/${PIPELINE_ID}-${PIPELINE_EPHEMERAL_ID}/${ANCILLARY_CONFIG_FILE_PATH}`, where
  - `TMP` is a Logstash-configured or generated temporary data directory,
  - `PIPELINE_ID` is a pipeline's external id (e.g., `main`),
  - `PIPELINE_EPHEMERAL_ID` is a pipeline's ephemeral id (e.g., a uuid, to prevent stale files from infecting a given pipeline run after reload), and
  - `ANCILLARY_CONFIG_FILE_PATH` is the path specified by the user in the UI, comprising any folders and the `id` of the ancillary config file.
- Atomically write (by first writing to a temporary file and then `mv`ing it into the final location) the files determined in the previous step.
- Set the environment variable `MANAGED_PIPELINE_FILES` to the value `${TMP}/managed-pipeline-files/${PIPELINE_ID}-${PIPELINE_EPHEMERAL_ID}/`.
Backwards Compatibility

It is possible for users to get into a situation where they have older Logstashes (e.g. version 6.2.0) running against a newer version of Elasticsearch (e.g. version 6.4.0) that has the updated `.logstash` mapping and, potentially, documents in `.logstash` representing ancillary configs.
Such older Logstashes should continue to function without error unless a centrally-managed pipeline they're responsible for executing is updated to reference ancillary configs. This would cause the older Logstash to download this pipeline, which would contain `${MANAGED_PIPELINE_FILES}` references in some of its plugins' settings, and then try to execute it. At that time, the `MANAGED_PIPELINE_FILES` environment variable would not be initialized, and the pipeline would likely fail when the plugin in question tries to resolve the path in the relevant setting.
@andrewvc @yaauie @pickypg What do you think? Is this a fair summary of the discussion so far, or did I miss something?
Original comment by @ycombinator:
Related but largely orthogonal (IMO) issue: LINK REDACTED
Original comment by @andrewvc:
@ycombinator this looks great!
Question: the need to support directories seems like it's complicating the design quite a bit. If grok just took an array of file paths OR directories (and we deprecated `patterns_dir`), would that simplify things? That'd be a pretty easy change to make, I think.
Original comment by @ycombinator:
@andrewvc It would definitely simplify things, but I also don't know how bad directories would be, both in the UI and for Logstash. I'm also thinking that users will eventually want to organize all their files somehow, and having directories for that purpose might be useful too. Let me take a crack at it, and if it turns out to be a beast, we can consider the option you brought up.
As an alternative approach, can we consider leveraging the Elasticsearch cluster settings API? I am exploring a custom Logstash filter that calls that API to access a section below the metadata, caches the results, and periodically refreshes the cache to avoid too-frequent remote calls to ES. This plugin will set metadata fields in the event based on the discovered response. As a start, we will use the ES cluster put API and/or the Kibana dev UI to define the config settings. It's a rough idea, but we need a central config for Logstash now.
Original comment by @ycombinator:
Motivation
Certain Logstash plugins in a pipeline configuration can accept references to ancillary configuration files. The plugins read these files and use their contents as part of the plugin's execution in the pipeline.
For example, users may define a custom grok pattern named `FOO` in a file named `postfix` placed under the folder `/tmp/custom_grok_patterns/`. They can then reference this folder and pattern in the grok filter like so:
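```
filter {
  grok {
    # illustrative; assumes /tmp/custom_grok_patterns/postfix defines FOO
    patterns_dir => ["/tmp/custom_grok_patterns"]
    match => { "message" => "%{FOO:my_field}" }
  }
}
```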
X-Pack Basic and above license users have the ability to centrally manage their Logstash pipeline configurations. Users can CRUD pipeline configurations in a Kibana Management UI, and these configurations are (effectively) pushed out to Logstash nodes for execution.
However, users of centralized configuration management are unable to also centrally manage ancillary configuration files like custom grok patterns today. This proposal details how we might provide that capability.
User stories and corresponding UX

Sysadmin Sally wants to centrally manage some custom grok patterns useful for Postfix log processing

- She creates a new custom grok pattern collection, giving it the ID `postfix_grok_patterns`, and populates the actual custom patterns as well, say `FOO` and `BAR`. She saves the form, thereby creating the centrally-managed custom grok pattern collection.

Data Analyst Dan wants to use the `FOO` custom grok pattern from the `postfix_grok_patterns` collection in his centrally-managed pipeline configuration

- When he reaches the grok filter definition, he references a centrally-managed custom pattern collection like so (exact syntax might need discussion; see open questions below):
Technical design

The current `.logstash` index was designed to hold pipeline config documents. The document IDs correspond to user-defined pipeline IDs. The mapping has top-level pipeline-specific fields, `pipeline` and `pipeline_metadata`.
We could try to store ancillary configs in the same `.logstash` index with some mapping changes. Or we could introduce a new `.logstash-ancillary-configs` (or better/shorter-named :)) index. Details of both options, including pros and cons, are listed below.

Option 1: Reuse `.logstash` index

First, we will need to "make room" for other types of documents in the `.logstash` index. This means adding a few new fields to the mapping. The new mapping would then look like this:
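Roughly, as a sketch (exact field types are assumptions; the new fields are `type`, `id`, and the `ancillary_config` object discussed below):

```
{
  "properties": {
    "pipeline":          { "type": "text" },
    "pipeline_metadata": { "type": "object" },
    "type":              { "type": "keyword" },
    "id":                { "type": "keyword" },
    "ancillary_config": {
      "properties": {
        "config_contents": { "type": "binary" }
      }
    }
  }
}
```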
Additionally, we'd also update the `logstash-index-template` index template with the above mapping.

When creating/updating pipeline objects, we do everything the same as now, notably:

- set `_id` to the pipeline ID

Additionally, we:

- set `type` to `pipeline`. Obviously, pipelines that already exist won't have this field set, so we account for that in the search query for listing pipelines.
- set `id` to the pipeline ID, which is also reflected in `_id`, for backwards compatibility.

When creating ancillary objects, we:
- set `type` to the type of ancillary object, e.g. `custom_grok_patterns`, `translate_filter_dictionary`, etc.
- set the `ancillary_config` object with fields specific to the type of ancillary object.
- let `_id` be auto-populated by Elasticsearch.

Current versions of Logstash (`x-pack-logstash`) perform a `GET .logstash/<pipeline-id>` to retrieve a pipeline definition. This can continue to work as before. For ancillary objects, however, Logstash will need to perform a search query based on `type` and `id`.
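Such a search might look like this (a sketch; the values are examples):

```
GET .logstash/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "custom_grok_patterns" } },
        { "term": { "id": "postfix_grok_patterns" } }
      ]
    }
  }
}
```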
Pros

- Works with the existing `logstash_admin` role, as it allows `create`, `delete`, `index`, `read`, and `manage` operations on `.logstash*`.
Cons

- Existing code (in `x-pack-kibana` and `x-pack-logstash`) handles pipelines and ancillary objects differently, which is a bit annoying from a code understanding and maintenance point of view.

Option 2: Create new `.logstash-ancillary-configs` index
We leave the `.logstash` index as-is and continue to use it as we do currently for storing pipeline configs. Additionally, we create a `.logstash-ancillary-configs` (or better/shorter-named) index to hold ancillary config documents. This new index will have the following mapping:
index usage or breaking BWC ever, code becomes easier to understand and maintainlogstash_admin
role as it allowscreate
,delete
,index
,read
, andmanage
operations on.logstash*
(note the*
at the end)Cons
Open Questions
How to reference centrally-managed ancillary pipeline configs in pipeline definitions while keeping backwards compatibility for referencing locally-managed ancillary pipeline configs? Some ideas:

- A scheme prefix, e.g. `ccm://`. Given that centralized config management is x-pack and many of the plugins that reference ancillary configs are open-source, where would the parsing and resolution of such references live?
- New plugin options, e.g. for the `grok` plugin, `ccm_patterns` alongside `patterns_dir`. Again, would the knowledge of this live in open-source plugins even though CCM is x-pack?
- A well-known local path, e.g. `/tmp/ccm/patterns_dir/postfix`, which Logstash populates.
- (1.v) Env var prefix, e.g. `${CCM_ANC_CONFS}/postfix_grok_patterns`. X-pack-logstash then places files under `CCM_ANC_CONFS` and sets this env var.

What about binary files like GeoIP databases?
Appendix
List of plugins that take options of type `path`. (Thanks, Joao, for generating this list.)