Closed jasontedor closed 5 years ago
Pinging @elastic/es-core-infra
Pinging @ruflin; let me know if I missed anything!
In some scenarios, an exception occurs, a large amount of logs will be printed, tens of MB per second, if it can be optimized, the same type exception is suppressed ,as print info :
"2016-07-09 11:01:31 +0430 [warn]: suppressed same stacktrace"
A large number of exception log writes, seriously affecting disk write performance.
Now that we have a Logging UI in Kibana, we were hoping to show ES deprecation (and potentially other types of) logs in the UI. The user would discover said logs via the Monitoring UI, filtered down to the context they are in: a cluster, a node within a cluster, etc. In order for this to work correctly, item 2 in the checklist above needs to be implemented:
Each log line should contain the node ID and cluster ID (machine friendly, and serve as unique identifiers), and node name and cluster name (human friendly). Filebeat wants to see these for instances where logs are being collected to a single sink from multiple nodes and multiple clusters.
Any chance this specific item could be prioritized to enable the Monitoring UI use case described above?
Any chance this specific item could be prioritized to enable the Monitoring UI use case described above?
We have been meaning to follow-up on this requirement, because as written we can’t meet it. We need to understand more how important and how strong this requirement is. It is simply not possible include the node ID and cluster ID on every log line. We emit log lines before these data are available.
We emit log lines before these data are available.
Is this true for all types of logs? For example, would it be possible to emit the node ID and cluster ID on every log line for deprecation logs?
Are there any guarantees about when the data becomes available or, conversely, what types of log messages might not have this data in them since it's not available yet?
Yes, it could be true for the deprecation log too. Also, it would be that we are considering combining all the log files into a single log file for all instances, not only for the Docker images.
Are there any guarantees about when the data becomes available or, conversely, what types of log messages might not have this data in them since it's not available yet?
There aren't guarantees per se. They can be categorized as all startup log messages up to the point that we recover node state would not have the node ID, and all startup log messages up to the point that we recover cluster state would not have the cluster ID (superset of those without the node ID).
Thanks Jason. Roughly how many log entries are we talking about without cluster UUID or node ID? The reason I ask is: thinking a bit out of the box, would it be silly to re-emit those messages with the IDs in them, once the IDs become available? Just want to consider all possibilities before we say it's okay to "drop" certain messages that don't have IDs.
Roughly how many log entries are we talking about without cluster UUID or node ID?
I can not bound it because we can always add more in the future, and if debug or trace logging is enabled (or even only for certain components) it can be quite a lot of log messages.
In that case, I'm in favor of doing a best effort here. That is, I'm in favor of ES including the cluster UUID and node ID on every log line as and when that information becomes available.
It would be an improvement over what we can do today in the UI with the logs in terms of correlating them with other information about the node or cluster. And we'd still have the un-correlatable logs (but fewer in number now) around in filebeat indices, in case we later think of some way to display these meaningfully to the user.
I opened a draft PR to discuss the technical details of the implementation. Feel free to review/comment.
In regards to the format itself. The idea would be to change the log line from:
[2018-12-19T11:10:33,609][INFO ][o.e.l.LicenseService ] [node-0] license [1d313428-f9d3-4085-81e0-c77f032359a1] mode [basic] - valid
To:
{"type": "rolling", "timestamp": "2018-12-19T12:18:58,402", "level": "INFO ", "class": "o.e.l.LicenseService", "cluster_name": distribution_run", "node_name": "node-0", "cluster_uuid": "dlQqhmTYQ3OfoF0eLIdsCw", "node_id": "uy-oB2QFS1m73g8Gg9yAXw", "message": "license [3cbed08b-c198-4b3d-b7d6-62efb94c1c2e] mode [basic] - valid"}
Before the first ClusterStateUpdate is received, the cluster_id and node_id would not be present in log lines:
{"type": "rolling", "timestamp": "2018-12-19T12:18:57,525", "level": "INFO ", "class": "o.e.t.TransportService", "cluster_name": distribution_run", "node_name": "node-0", "message": "publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}"}
Or can they be null? like: "cluster_uuid": null, "node_id": null
In order to support multiple output streams from docker we will have a type
field that indicate what appender is the log from. For instance from deprecation log:
{"type": "deprecation_rolling","timestamp": "2018-12-19T11:24:24,949", "level": "WARN ", "class": "o.e.d.l.RestPostStartTrialLicense", "node_name": "node-0", "node_id": "MGKz9QIFTyS-IR85E5J2Fg", "nodeid2": "unknown_node_id", "message": "[POST /_xpack/license/start_trial] is deprecated! Use [POST /_license/start_trial] instead."}
I am stil not sure how to format stacktraces in the log lines. Should they also contain the fields mentioned in 1st post by Jason?
also, should we allow users to change the pattern we create now in a JSON form? I could imagine some users would be tempted to do this and we don't have a way to control how they change it and if they follow the json structure.
Hi @pgomulka this is looking great!
I am stil not sure how to format stacktraces in the log lines. Should they also contain the fields mentioned in 1st post by Jason?
You mean because they are multiline? If so, what about escaping the newlines in the message
field in the JSON?
I have a request for one more logging change to support the Beats Elasticsearch module. Please let me know if I should add it as a checklist item in this issue's description OR if you'd prefer that I make a new issue OR if this comment is sufficient.
Currently we see lines like these in the Elasticsearch logs:
[2018-07-03T11:45:45,604][WARN ][o.e.m.j.JvmGcMonitorService] [srvmulpvlsk252_md] [gc][3449992] overhead, spent [1.6s] collecting in the last [1.8s]
Notice the durations in these lines, 1.6s
and 1.8s
. Would it be possible to normalize the unit here, perhaps to ms
? That would make parsing these durations a lot easier.
@ycombinator in regards to this duration units inconsistency, that might be a bug I think. Can you please a raise a separate issue?
@pgomulka Done! https://github.com/elastic/elasticsearch/issues/36896
@ycombinator In regards to exception formatting. I am thinking of a format like this:
{"type": "console", "timestamp": "2018-12-27T14:45:10,386", "level": "WARN ", "class": "r.suppressed", "cluster_name": "distribution_run", "node_name": "node-0", "cluster_uuid": "XSKjGHufRR25tdwlkpCTig", "node_id": "iP2awnO9QX6ajH2z6w1byQ", "message": "", "path": "/myindex", "params": "{index=myindex}", "exception": ["java.lang.NullPointerException: asdf",
"at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:149) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]",
"at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:596) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]",
"at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:287) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]",
"at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]",
"at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:685) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]",
"at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:310) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]",
"at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:211) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]",
"at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:143) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]",
"at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]",
"at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]",
"at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]",
"at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
"at java.lang.Thread.run(Thread.java:834) [?:?]"] }
To make a stacktrace human readable in JSON format I have enclosed each line of stacktrace in a separate array element. I hope this should still be easy for beats to parse. Let me know what you think
Also I am wondering if we should add a new schema to ECS. We are after all integrating beats with ES. We have log.yaml schema at the moment, I was thinking of having a schema with all fields we will use:
type;
timestamp;
level;
component;
cluster.name;
node.name;
cluster.uuid;
node.idd;
message;
Array exception;
Another problem to consider is how should we handle exceptions, messages written directly to the console. That is not a problem when running elasticsearch from a binary - log files will be parsed - but might be an issue when running from docker.
For instance if elasticsearch fails bootstrap checks we will write the exception message directly to the org.elasticsearch.cli.Terminal
terminal.println(Terminal.Verbosity.SILENT, "ERROR: " + e.getMessage());
In Boostrap.java we remove ConsoleAppender, expecting this to print out from Terminal again later.
// disable console logging, so user does not see the exception twice (jvm will show it already)
final Logger rootLogger = LogManager.getRootLogger();
final Appender maybeConsoleAppender = Loggers.findAppender(rootLogger, ConsoleAppender.class);
if (foreground && maybeConsoleAppender != null) {
Loggers.removeAppender(rootLogger, maybeConsoleAppender);
}
but since docker is only using ConsoleAppender, it will miss the message in json format and will later only print out the non-json message on a Terminal
Logs from docker: {"type": "console", "timestamp": "2019-01-04T16:23:22,319+0100", "level": "INFO", "component": "o.e.b.BootstrapChecks", "cluster.name": "elasticsearch", "node.name": "Przemyslaws-MacBook-Pro.local", "message": "explicitly enforcing bootstrap checks" } ERROR: [1] bootstrap checks failed [1]: the default discovery settings are unsuitable for production use; at least one of [discovery.zen.ping.unicast.hosts, discovery.zen.hosts_provider, cluster.initial_master_nodes] must be configured {"type": "console", "timestamp": "2019-01-04T16:23:22,339+0100", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "elasticsearch", "node.name": "Przemyslaws-MacBook-Pro.local", "message": "stopping ..." }
Logs from log file
{"type": "console", "timestamp": "2019-01-04T17:12:03,836+0100", "level": "ERROR", "component": "o.e.b.Bootstrap", "cluster.name": "distribution_run", "node.name": "node-0", "message": "node validation exception\n[1] bootstrap checks failed\n[1]: the default discovery settings are unsuitable for production use; at least one of [discovery.zen.ping.unicast.hosts, discovery.zen.hosts_provider, cluster.initial_master_nodes] must be configured" }
Should we aim to format Terminal messages as well?
Hey folks
I'm hoping to get some traction on this issue. We're interested in pushing forward this deprecation log in the monitoring UI initiative and we've discussed that we need cluster_uuid
in the deprecation log in order to move forward, as we can't definitely know which cluster that deprecation log appeared on.
Any update here?
@chrisronline the PR #36833 with ES changes is on track to be merged into 7.0. This will have cluster.uuid, but as discussed earlier only once the ES started up. see comment
you can see sample messages in here
@pgomulka Thanks! That's helpful for sure!
We're hoping to get our side done before 7.0 so users can take advantage of the UI for the 7.0 upgrade. Is it possible to tackle this work in phases, where the final phase is the PR you have linked, but we can have an initial phase where we add the cluster_uuid
to the deprecation log in the current format?
@chrisronline in my view it would be better to push that all together. The majority of the work is related to propagating cluster.uuid and node.id and testing. Most of the test I wrote already assume the json format, so that would be actually more work now to isolate that change Unless that would mean that the json format itself would be postponed?
@danielmitterdorfer @nik9000 any views on this?
I agree with @pgomulka that we should push this together considering that the PR is already in review.
@chrisronline @albertzaharovits Should we consider unite security audit logs and Elasticsearch logs to some degree? Seems like cluster.uuid field is not present in audit logs: https://github.com/elastic/elasticsearch/blob/master/x-pack/plugin/core/src/main/config/log4j2.properties Chris, is monitoring UI going to look into these information as well?
@ycombinator is Beats going to be happy if those logs are different?
Hi @pgomulka, when you say "unite", I assume you mean that they should have some common fields like cluster.uuid
and node.uuid
. That is definitely desirable because we intend to show all different types of logs in the Monitoring UI per cluster and per node.
Purely from Filebeat's perspective it's okay if the logs are different since each log type (which corresponds to a fileset in Filebeat) provides its own piece of the ES mapping. Of course, what we wouldn't want to do is have a field called cluster_uuid
in one log and cluster.uuid
in another, both meaning the same thing. But AFAICT, that's not the case — all logs that have cluster UUID in them use cluster.uuid
.
@rjernst raised an important problem that the readability is greatly affected with this change. We can work that around with keeping the old logs and new logs together.
Old logs still using previous pattern and *.log
filetype suffix.
New logs on the other hand would use a new pattern and *.json
file type suffix.
That might be concerning that there will be duplicated amount of logs (and files), but on the other side we don't log a massive amounts anyway.
Any views on this?
Old logs still using previous pattern and .log filetype suffix. New logs on the other hand would use a new pattern and .json file type suffix.
Just wanted to point out that the audit log is not following this convention currently: plaintext logs are being sent to {clustername}_access.log
and JSON logs are being sent to {clustername}_audit.log
. Starting 7.0 they plan to just emit the JSON logs IIRC. /cc @albertzaharovits
raised an important problem that the readability is greatly affected with this change. We can work that around with keeping the old logs and new logs together. Old logs still using previous pattern and
*.log
filetype suffix. New logs on the other hand would use a new pattern and*.json
file type suffix. That might be concerning that there will be duplicated amount of logs (and files), but on the other side we don't log a massive amounts anyway.Any views on this?
I am a bit skeptical about this for the following reasons:
logs/
and possibly end up with duplicate logs in their log environment.For the record, with the introduction of json logging, at least the docker use case looks really nice and easy to read esp. piping it to jq
with something as simple as:
docker logs <container_name> | jq .
producing very readable output IMHO like:
One concern though is the possible amount of frustration with new users not preparing themselves for this switch and frantically looking on how to change back to normal log output.
I chatted with @pgomulka earlier and suggested whether it makes sense to ship the log4j2.properties
files that #36833 bring, clearly marked with comments and then an additional section with the normal old-style logging, fully commented out, ready to be uncommented out.
For example:
WDYT?
I chatted with @pgomulka earlier and suggested whether it makes sense to ship the log4j2.properties files that #36833 bring, clearly marked with comments and then an additional section with the normal old-style logging, fully commented out, ready to be uncommented out.
Un-commenting the section requires a node restart. In the case of the audit log, we went with both formats enabled out-of-the-box, with two independent loggers, and we strongly advised disabling the logger for the old format by using the cluster settings API.
Just wanted to point out that the audit log is not following this convention currently: plaintext logs are being sent to {clustername}_access.log and JSON logs are being sent to {clustername}_audit.log. Starting 7.0 they plan to just emit the JSON logs IIRC.
This is correct.
Old logs still using previous pattern and .log filetype suffix. New logs on the other hand would use a new pattern and .json file type suffix. That might be concerning that there will be duplicated amount of logs (and files), but on the other side we don't log a massive amounts anyway.
We named the new audit log {clustername}_audit.log
instead of {clustername}_audit.json
because the file is not really a properly formatted json document (It is not an array of comma separated json docs, it is a log of JSON docs separated by endlines).
The json format will be a default one in the long run. However, we will have a transition period where we will keep log files in both JSON and previous formats. This will allow us to create additional tooling and gather feedback from users in regards to readability. Since keeping two sets of files will have an impact on storage we will reduce the log accumulation size from 2GB to 1GB As a followup we will schedule a removal of the previous log format.
Since keeping two sets of files will have an impact on storage we will reduce the log accumulation size from 2GB to 1GB
We can consider this; I am not convinced that it's a change that we need to make.
I think it's worth mentioning on top of @pgomulka 's notes that, going forward, logs on the console for the docker image will be in json
format by default.
If we go down the path of keeping both flavors of logs - plaintext and json - I'd like to know what globs/extensions each flavor would use. That way Filebeat can look in the right places and know how to parse what it sees.
@ycombinator we will produce following log files with new JSON format
${cluster_name}_server.json
${cluster_name}_deprecation.json
${cluster_name}_index_search_slowlog.json
${cluster_name}_index_indexing_slowlog.json
and following log files with an old pattern (not containing newly added IDs like cluster.uuid)
${cluster_name}.log
${cluster_name}_deprecation.log
${cluster_name}_index_search_slowlog.log
${cluster_name}_index_indexing_slowlog.log
gc.log
${cluster_name}_audit.log - I feel *.json would be ok anyway
This is perfect, thanks @pgomulka!
@pgomulka Looking closer at the log file names, I have a couple of questions:
For the server log, it appears that today we write it to the log file ${cluster_name}.log
. Just want to confirm that once this issue is resolved, we will be writing it to ${cluster_name}_server.log
instead? If that's the case, then I will make sure to update the current Filebeat elasticsearch module to look for either one.
What about gc and audit log files? Could you add their names to both lists in your comments so we have the complete list in one place?
Thanks!
I do not think we should change the name of the existing logs; that is a breaking change with little benefit to us.
I have renamed the ${cluster_name}.log
to ${cluster_name}_server.log
for consistency, but agree with @jasontedor that it won't benefit us and it would be a unnecessary breaking change.
We will stick with ${cluster_name}.log
for old-format logs. updated comment
I am however wondering if audit logs should not be in ${cluster_name}_audit.json
instead of ${cluster_name}_audit.log
. I understand @albertzaharovits concerns about this file not being a json document. I think however that this is widely used and ok to have multiple json documents in a file named *.json
. Notice that we use that in our Elastic documentation as well sample tutorial from ES
GC logs stay untouched.
There is not direct benefit for Elasticsearch but I think we should change this for Filebeat to make collection easier. Right now on the Filebeat side, we define what a server log is by excluding all the other log types:
exclude_files: [".gz$","_slowlog.log$","_access.log$","_deprecation.log$"]
If a new log type is added, this config breaks which I don't think it should. Having *_server.log
as pattern would be then enough and not exclude patterns are needed.
I just had a sync up with @pgomulka . Here the conclusions:
*.log
names will stay the same in 6.x and 7.x*.json
logs will get the _server
postfix@ycombinator The last point has some effects on the Filebeat module. For Filebeat 7 I think the default should be to only have the JSON logs in the path patterns. If someone wants to ingest the old logs from 7, some manual config changes for the paths are needed by the user. We need docs for this.
@pgomulka Is there a way that the Elasticsearch user can disable the "double" logging and only have JSON logs?
@ruflin Users are free to opt out from either logs by deleting unwanted appenders from log4j2.properties
. I also mentioned about that in migration docs.
We are aware of the additional log size, but during core-infra meeting we agreed that this is not a major concern. We can reduce the log size if needed by changing the rolling parameter accumulation size
from 2GB to 1GB. comment
This relates to the effort in Beats (filebeat) to build a module for Elasticsearch. To support this effort, we want to make some changes to the logs in Elasticsearch including removing some existing pain points. The main highlight of these changes will be shipping with JSON logging.
server.log
. Today the main log file is<clustername>.log
and this creates complexity for Filebeat; it would be simpler for them if they could rely on an include pattern based on a predictable log file name (another possibility here isserver-<clustername>.log
). Note that this will be a breaking change.timestamp
,log_level
,component
,node_name
,index_name
,shard_id
,message
).