elastic / ecs-logging

ECS Logging - Common resources and issues for the language-specific ECS loggers
https://www.elastic.co/guide/en/ecs-logging/overview/master/intro.html
Apache License 2.0

Add data_stream fields to spec #38

Open felixbarny opened 3 years ago

felixbarny commented 3 years ago

Data streams are a new way of managing indices and are used by Elastic Agent. See also https://github.com/elastic/ecs/pull/1145. To enable users to influence the data stream (and consequently, the index) the data is sent to, ECS loggers should offer configuration options for data_stream.dataset and data_stream.namespace. The data stream for logs is logs-{data_stream.dataset}-{data_stream.namespace}. As the settings are optional, Filebeat will fill in defaults (generic for the dataset, default for the namespace) so that the data stream would be logs-generic-default.

The event.dataset field is going to be deprecated in favor of data_stream.dataset and is scheduled to be removed in 8.0.

As we're planning to GA Elastic Agent in the 7.x timeframe, we'll have to support both event.dataset and data_stream.dataset for a while. Therefore, ECS loggers should set both fields with the same value, ideally before their initial GA release. This will make 7.x ECS loggers compatible with the 8.x stack.
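For illustration, a log line from a 7.x ECS logger with both fields set might look like the following (the myapp dataset and production namespace are hypothetical values; data_stream.type is fixed to logs for log events). Filebeat would route it to the logs-myapp-production data stream:

```json
{
  "@timestamp": "2021-03-18T16:43:33.123Z",
  "log.level": "INFO",
  "message": "Order processed",
  "ecs.version": "1.2.0",
  "event.dataset": "myapp",
  "data_stream.type": "logs",
  "data_stream.dataset": "myapp",
  "data_stream.namespace": "production"
}
```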

When 8.0 comes along, ECS loggers should drop support for the event.dataset field. This would still allow 8.x loggers to be used with a 7.x stack. The only implication would be that the ML categorization job, which relies on event.dataset, wouldn't work. To fix that, users can use an ingest node processor to copy the value of data_stream.dataset to event.dataset, or downgrade their ECS logger to 7.x.
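As a minimal sketch of that workaround, the ingest pipeline could use a set processor with a template value (the pipeline name here is made up):

```json
PUT _ingest/pipeline/restore-event-dataset
{
  "description": "Copy data_stream.dataset into event.dataset for ML categorization",
  "processors": [
    {
      "set": {
        "field": "event.dataset",
        "value": "{{{data_stream.dataset}}}",
        "ignore_empty_value": true
      }
    }
  ]
}
```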

ebeahan commented 3 years ago

> The event.dataset field is going to be deprecated in favor of data_stream.dataset and is scheduled to be removed in 8.0.

@felixbarny Deprecating event.dataset is a topic that has been discussed previously, but I'm not aware of a finalized decision from the ECS side.

The current guidance included in the data_stream fields RFC notes that event.dataset and data_stream.dataset should use the same value: https://github.com/elastic/ecs/blob/master/rfcs/text/0009-data_stream-fields.md#L123

ruflin commented 3 years ago

@felixbarny I would also assume that event.dataset and data_stream.dataset will coexist.

felixbarny commented 3 years ago

> I would also assume that event.dataset and data_stream.dataset will coexist.

That's how I specified it. Both event.dataset and data_stream.dataset will be set by the loggers. But starting with version 8.0 of the loggers, we can stop sending event.dataset if ECS drops the field in 8.0 and if the logs UI uses data_stream.dataset for the log rate anomaly detection.

> Default for data_stream.dataset is generic

I've updated the spec. But to be clear: the loggers would not set data_stream.dataset=generic by default. The default for the loggers is to not add the data_stream.dataset field. Instead, Filebeat will apply the defaults if they're missing.

ruflin commented 3 years ago

I'm torn on whether it should be set by default or not. The advantage of setting it in the log is that even if another tool picks the logs up, they will contain all the data needed. But it adds to each log line :-(

felixbarny commented 3 years ago

I really think it's the responsibility of the "other tool" to add the defaults. It already needs to add metadata about the host, pod, container, etc. anyway.

weltenwort commented 3 years ago

To add some context from the Logs UI side:

It will use event.dataset for the foreseeable future, because it has to support log entries in plain indices as well as in data streams. If storage size is a concern, it should work fine if event.dataset were defined as an alias for data_stream.dataset.
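For reference, such an alias could be declared in the mappings of an index template, as a sketch (the template name and index pattern are assumptions; the concrete data_stream.dataset field must exist for the alias to point at):

```json
PUT _index_template/logs-ecs-alias
{
  "index_patterns": ["logs-*-*"],
  "data_stream": {},
  "template": {
    "mappings": {
      "properties": {
        "data_stream": {
          "properties": {
            "dataset": { "type": "constant_keyword" }
          }
        },
        "event": {
          "properties": {
            "dataset": { "type": "alias", "path": "data_stream.dataset" }
          }
        }
      }
    }
  }
}
```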

felixbarny commented 3 years ago

@ruflin is adding event.dataset as an alias for data_stream.dataset something you're considering in the index templates created by Elastic Agent?

Should ECS loggers add both event.dataset and data_stream.dataset fields?

The advantage is that this would work with the widest range of Filebeat versions and with the metrics UI.

The downside is that there's a duplication of data. Also, if event.dataset is defined as an alias for data_stream.dataset, it might cause issues if the log records also contain event.dataset. Maybe newer versions of Filebeat should just drop event.dataset if they see that data_stream.dataset is present?
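As a sketch of that last idea, the drop could be expressed as a conditional remove processor in an ingest pipeline (hypothetical; whether this would live in Filebeat itself or in a pipeline is exactly the open question):

```json
{
  "remove": {
    "field": "event.dataset",
    "if": "ctx.data_stream?.dataset != null",
    "ignore_missing": true
  }
}
```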

ruflin commented 3 years ago

Elastic Agent does not have any templates; I assume you are referring to the templates shipped by Elasticsearch? As you mentioned, there is a problem around event.dataset already existing. Filebeat is the only shipper that we control; there might be others.

Also, there are the parts which don't use the new indexing strategy yet and still use event.dataset. Maybe that could be the differentiator? For indices which match logs-*-*, data_stream.dataset is used; for everything else it falls back to event.dataset?

felixbarny commented 3 years ago

> Elastic Agent does not have any templates; I assume you are referring to the templates shipped by Elasticsearch?

I was referring to the templates shipped with Filebeat that are installed by Elastic Agent, IIUC.

> For indices which match logs-*-*, data_stream.dataset is used; for everything else it falls back to event.dataset?

But where should that decision be made? The ECS loggers don't and shouldn't know who's shipping the logs.

If ECS loggers add data_stream fields, non-datastream-aware shippers (such as older Filebeat versions) would just add them to the index, resulting in unnecessary keyword fields.

Maybe ECS loggers should just add event.dataset and newer versions of Filebeat should add a data_stream.dataset field with the same value? But that means users wouldn't be able to set data_stream.namespace in their logging config. That would need to be done in the Filebeat config.

To serve the Logs UI's log rate anomaly ML job, even data-stream-based logs-*-* indices still need the event.dataset field, it seems.

ruflin commented 3 years ago

The templates are loaded by Elasticsearch directly. Here is an open PR to modify these, it should indicate where these are: https://github.com/elastic/elasticsearch/pull/64978

Agree, ECS loggers should not care who ships the logs. What if we use the data_stream.dataset field only? To keep it compatible with older event.dataset consumers, it would be up to the collector or even the Logs UI / ML job to add an alias / runtime fields.
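A runtime field along those lines could be supplied per search request by consumers that still query event.dataset, e.g. (a sketch assuming a 7.11+ cluster; the myapp value is made up):

```json
GET logs-*-*/_search
{
  "runtime_mappings": {
    "event.dataset": {
      "type": "keyword",
      "script": { "source": "emit(doc['data_stream.dataset'].value)" }
    }
  },
  "query": {
    "term": { "event.dataset": "myapp" }
  }
}
```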

Let's try to discuss the end goal first and then work backwards from there to see what we can do to mitigate the issues that come up in the short and mid-term.

apmmachine commented 3 years ago

:broken_heart: Build Failed

The CI build failed after 3 min 53 sec at the `Read yaml from files in the workspace or text.` step: `.ci/.jenkins-loggers.yml` does not exist in the workspace.

felixbarny commented 3 years ago

Let's try to get this one over the finish line.

> Let's try to discuss the end goal first and then work backwards from there to see what we can do to mitigate the issues that come up in the short and mid-term.

The goal of ECS loggers is to be compatible with the widest range of stack versions possible. For that reason, I think it makes sense to include both event.dataset and data_stream.dataset, and to make sure they are always in sync. If ECS loggers only added data_stream.dataset, older Filebeat versions would not add an event.dataset field. Future versions of Filebeat may drop the event.dataset field and create an alias in the mapping instead to optimize storage.

ruflin commented 3 years ago

++, let's move forward with both as you suggested, @felixbarny. We can always adjust / optimise later.

felixbarny commented 3 years ago

I hit an issue when testing this end-to-end with https://github.com/elastic/ecs-logging-java/pull/124 and Elastic Agent 7.11.2: the data_stream.dataset field in the log files was overridden by the Dataset name configured in the policy editor. I vaguely remember that an integration can currently only write to a specific dataset, but that this is going to change, as APM Server also needs to be able to write to different data streams (one per service.name).

[Screenshot: Screen Shot 2021-03-18 at 16 43 33]

ruflin commented 3 years ago

I think we need to open an issue for this in Beats. It basically means we need some logic along the lines of (see the sketch below):

if data_stream.dataset is set { do nothing } else { add the defaults }

It is possible that apm-server can already do this today, but not the logfile input, AFAIK.
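A hypothetical sketch of that conditional in Filebeat configuration, using the add_fields processor with a has_fields condition (the actual fix in Beats may look different):

```yaml
processors:
  - add_fields:
      target: data_stream
      fields:
        dataset: generic    # defaults applied only when the log line
        namespace: default  # did not already carry data_stream fields
      when:
        not:
          has_fields: ['data_stream.dataset']
```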

felixbarny commented 3 years ago

I've created an issue: https://github.com/elastic/beats/issues/24683

I'll mark this PR as blocked, at least until we have a consensus that this should be implemented in the near to mid-term.

ruflin commented 1 year ago

@felixbarny I assume the routing feature might bring a new twist on this discussion?

felixbarny commented 1 year ago

It definitely does. Reading this thread is almost nostalgic 😄. This thread was the reason I got pulled in more and more into routing and it seems https://github.com/elastic/elasticsearch/pull/76511 is finally landing soon.

The reroute processor will also unblock