389ds / 389-ds-base

The enterprise-class Open Source LDAP server for Linux
https://www.port389.org/

nsslapd-logging-backend syslog should log instance name and error/access #3436

Open 389-ds-bot opened 4 years ago

389-ds-bot commented 4 years ago

Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/50377


When the syslog backend is used for ns-slapd log messages, all logging becomes a single stream. The information implied by the file system log paths and names is not included, so messages logged to syslog do not carry enough information to determine which instance generated a message, or whether it belongs to an access log or an error log.

Package Version and Platform

Package: 389-ds-base-1.3.8.4-23.el7_6.x86_64
Platform: Red Hat Enterprise Linux Server release 7.6 (Maipo)

Steps to reproduce

  1. Set the configuration attribute nsslapd-logging-backend to syslog (see the sketch after this list)
  2. Restart the dirsrv instance
  3. Perform any LDAP operations
  4. Review the system syslog files
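If it helps to script step 1, a minimal sketch using the third-party python ldap3 package (host, port and the bind password are placeholders; a plain ldapmodify against cn=config works just as well):

from ldap3 import Server, Connection, MODIFY_REPLACE

# Placeholders: adjust the host and the Directory Manager password.
server = Server("ldap://localhost:389")
conn = Connection(server, "cn=Directory Manager", "password", auto_bind=True)

# Switch logging from the default on-disk files to syslog.
conn.modify("cn=config",
            {"nsslapd-logging-backend": [(MODIFY_REPLACE, ["syslog"])]})
print(conn.result)  # confirm success before restarting the instance
conn.unbind()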

Actual results

The syslog file contains something like

2019-05-15T10:24:02 host user.info ns-slapd[59081]: conn=10 fd=64 slot=64 connection from 1.1.1.1 to 1.1.1.1
2019-05-15T10:24:02 host user.info ns-slapd[59081]: conn=10 op=0 BIND dn="cn=Directory Manager" method=128 version=3
2019-05-15T10:24:02 host user.info ns-slapd[59081]: conn=10 op=0 RESULT err=0 tag=97 nentries=0 etime=0.0000191639 dn="cn=directory manager"
2019-05-15T10:24:02 host user.info ns-slapd[59081]: conn=10 op=1 SRCH base="" scope=0 filter="(objectClass=*)" attrs="subschemaSubentry"

Expected results

All log lines are prefaced with "log=[access|error] instance=[instance_name]":

2019-05-15T10:24:02.210-07:00 host user.info ns-slapd[59081]: log=access instance=name conn=10 fd=64 slot=64 connection from 1.1.1.1 to 1.1.1.1
2019-05-15T10:24:02.211-07:00 host user.info ns-slapd[59081]: log=access instance=name conn=10 op=0 BIND dn="cn=Directory Manager" method=128 version=3
2019-05-15T10:24:02.211-07:00 host user.info ns-slapd[59081]: log=access instance=name conn=10 op=0 RESULT err=0 tag=97 nentries=0 etime=0.0000191639 dn="cn=directory manager"
2019-05-15T10:24:02.212-07:00 host user.info ns-slapd[59081]: log=access instance=name conn=10 op=1 SRCH base="" scope=0 filter="(objectClass=*)" attrs="subschemaSubentry"

This will allow the syslog stream to explicitly contain all of the information that is currently only implied by the file system log file names.
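To illustrate the point, a short Python sketch (the log= and instance= fields are the proposed format above, not an existing one) can split the prefixed syslog stream back into the per-instance, per-type views that the on-disk file names used to provide:

import re

# Matches the proposed prefix from the expected-results example.
PREFIX = re.compile(r"ns-slapd\[\d+\]: log=(?P<log>access|error) "
                    r"instance=(?P<instance>\S+) (?P<rest>.*)")

def split_stream(lines):
    """Route each syslog line into an (instance, log type) bucket."""
    buckets = {}
    for line in lines:
        m = PREFIX.search(line)
        if m:
            key = (m.group("instance"), m.group("log"))
            buckets.setdefault(key, []).append(m.group("rest"))
    return buckets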

389-ds-bot commented 4 years ago

Comment from firstyear (@Firstyear) at 2019-05-16 03:34:58

Hi there,

This is an interesting point, and I agree we should annotate these structures. We are actually in the middle of a reasonably large discussion about a logging system rewrite, as I think the current system may not easily support the format changes you would like to see. These links are relevant to the discussion:

#3409

#2474

https://lists.fedoraproject.org/archives/list/389-devel@lists.fedoraproject.org/thread/YGPRPWN7PQWPH7ILEP3Z5GBPTSBKXNQB/

So I think the answer is "yes, I would like to do this", but it may land in 1.4.x as part of the changes we are implementing over the coming weeks/months. Do you have any other log information or changes you would like to see? If you wanted to participate in the thread and bring up some other design ideas, that would be great.

Thanks,

389-ds-bot commented 4 years ago

Comment from firstyear (@Firstyear) at 2019-05-16 03:34:59

Metadata Update from @Firstyear:

389-ds-bot commented 4 years ago

Comment from erempel at 2019-05-16 07:32:49

#3409

I have a concern with "holding" onto a bunch of log elements so that they can be logged as a cohesive block.

For example, perhaps a query almost always takes 100ms but sometimes takes 500ms. If log messages were logged individually, at the time each action took place, you could see that every time the action took 500ms there was a different query, started in the middle of the first one, that degraded its performance. This kind of interaction between actions cannot be observed if logs are collected and logged as a block.

I think there is already enough detail in the log line to correlate all actions. The connection (conn=) and operation (op=) allow the correlation of complete actions.
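As a sketch of that correlation (assuming the current access-log format shown in the actual-results example), grouping on the conn=/op= pair takes only a few lines of Python:

import re
from collections import defaultdict

CONN_OP = re.compile(r"conn=(?P<conn>\d+) op=(?P<op>\d+)")

def correlate(lines):
    """Group log lines by (conn, op) so a complete action
    (e.g. SRCH ... RESULT) can be read as one unit."""
    ops = defaultdict(list)
    for line in lines:
        m = CONN_OP.search(line)
        if m:
            ops[(int(m.group("conn")), int(m.group("op")))].append(line)
    return ops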

If you are going to add statistics and debug messages, they should be clearly marked like BIND, SRCH, RESULT, etc., with other keywords such as STAT, DEBUG, etc.

#2474

I agree that logging needs to perform well. My observation over the past 20 years is that logging to disk files is almost always slower than logging to syslog. We commonly see syslog peak rates of 50,000 messages per second and sustained rates of 30,000 messages per second.

What turns out to be more useful is the ability to increase or decrease the logging level while the process is running (without a restart). I have seen SIGUSR1 and SIGUSR2 used to raise and lower the logging level of a running process.
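A minimal Python sketch of that signal-driven pattern (which signal raises and which lowers verbosity is an arbitrary choice here):

import logging
import signal

# Ordered from most to least verbose; start at INFO.
LEVELS = [logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR]
current = LEVELS.index(logging.INFO)

def _shift(delta):
    """Move the root logger's threshold along the LEVELS list."""
    global current
    current = max(0, min(len(LEVELS) - 1, current + delta))
    logging.getLogger().setLevel(LEVELS[current])

# SIGUSR1 increases verbosity, SIGUSR2 decreases it - no restart needed.
signal.signal(signal.SIGUSR1, lambda signum, frame: _shift(-1))
signal.signal(signal.SIGUSR2, lambda signum, frame: _shift(+1))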

REFERENCE POINT DEBUG

Just thinking while typing: it would be possible to generate a complete debug-level stream of log messages while buffering the most recent messages (bounded by time or message count), but only send messages above the normal logging level (say, notice) to the log output handlers. If a severe error is appended to the stream, the level of messages sent to the output handlers could be temporarily widened to debug until the severe error itself is flushed, at which point the level returns to notice. This gives you debug-level logging leading up to a severe error while doing regular notice-level logging the rest of the time.
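Python's standard library ships a close relative of this idea; here is a minimal sketch with logging.handlers.MemoryHandler (note it flushes, rather than discards, when full, so it approximates rather than exactly matches the ring-buffer variant described above):

import logging
from logging.handlers import MemoryHandler

target = logging.StreamHandler()  # the real log output handler

# Buffer up to 1000 records; nothing reaches the target until a record
# at ERROR or above arrives, at which point the whole buffer - the
# debug context leading up to the severe error - is flushed.
buffered = MemoryHandler(capacity=1000, flushLevel=logging.ERROR,
                         target=target)

root = logging.getLogger()
root.setLevel(logging.DEBUG)      # generate the full debug-level stream
root.addHandler(buffered)

root.debug("context that is normally invisible")
root.error("severe error - the buffered context above is emitted too")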

I feel that I need to point out that debug logging is useful to developers but of limited use to systems administrators, but that is a different philosophical discussion.

https://lists.fedoraproject.org/archives/list/389-devel@lists.fedoraproject.org/thread/YGPRPWN7PQWPH7ILEP3Z5GBPTSBKXNQB/

There is a lot of information in that thread. I'm not sure if I should post here, or post to that thread. I am posting here to keep all of my feedback in one place.

The thread started with these ideas:

  • The performance of logging should be improved
  • The amount of details (fine grain) and information in logs should improve
  • The structure of the log content should be improved to aid interaction (possibly even machine parsable)

As concepts these are all desirable. Of course, the details of how these are implemented is the key to success.

The next point is that there are different reasons for logging:

  • statistics
  • auditing (authentications, modifications)
  • debugging (for my usage the most important aspect)

What is being described here is what the logs are used for, and one use case is not listed. Let me describe this a little differently.

Debugging - There are two sides to this.

One is for debugging code errors. This is only useful to the developers, but the actual developers don't have access to the logs created in the production environment of some company running the product. So the best that can be hoped for is that these logs are included when the end user files a bug report with the development team. The more detail that is logged, the more useful a bug report can be (think stack traces, possibly with data values or the decision process). This level of detail is unrealistic for continuous logs in a production environment. See REFERENCE POINT DEBUG above.

The second debugging use is for debugging data errors: some operation could not be completed because the data was invalid. This type of error is very useful to the organization that is running ldap. They may have a data flow from some authoritative identity process that is not compatible with ldap. We had a client who tried to set their password using emojis! These errors are of limited use to the developers, with the exception of feature requests.

Auditing - useful to auditors.

Auditors are interested in who, what, when and where: access to data, change (also delete) of data. This is more about the story of what happened. Who logged in, where did they log in from, what did they access and what did they change? Taking a page from the SQL community, there is a configuration of which audit logs to include: none, access, modify (includes delete), denied. Perhaps in the ldap context there might be more granular access or modify in terms of classes of attributes. I'm not an ldap specialist, so I can't comment too much on that.

I think statistics are fairly obvious, but I'll add details to be thorough. Statistics include details of specific transactions and averaged transactions. Details may be:

  • duration (scoped to type of operation)
  • data throughput
  • simultaneous connections
  • simultaneous users
  • % utilization of maximum settings (80% of max connections reached)

With statistics one needs to think of monitoring. Perhaps configurable alert thresholds can be set, with additional log lines of "over threshold X of Y". This permits systems alerting to be triggered easily, before what is effectively a denial of service causes an outage.
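A sketch of such a threshold line (the 80% trigger and the names are illustrative only):

import logging

log = logging.getLogger("stats")

def check_threshold(name, value, maximum, warn_at=0.8):
    """Log an 'over threshold' line once utilization crosses the
    configured fraction of a hard limit, e.g. 80% of max connections."""
    if value >= warn_at * maximum:
        log.warning("over threshold %s: %d of %d (%.0f%%)",
                    name, value, maximum, 100.0 * value / maximum)

check_threshold("connections", 820, 1024)  # emits: over threshold ...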

The class of logs that is missing is the rest of the operational activity: service start/stop, data files opened/closed, data corruption (different from invalid incoming data), replication logs, active/passive failover and promotion, multi-master conflict resolution, backup, restore, bulk load, and so on.

There were other things in that thread.

"merging of logs: I do not like the idea to have everything in one log file, if you say we can provide tools to split, we can also provide tools to merge. The logs have a different format and for good reasons, eg the audit log is ldif like and can easily be used to replay changes, why force it into another format."

I used to feel the same way; however, it turns out to be much more difficult to merge logs than to split them. To split them you just filter them and you are done. To merge them you must sort them by logged timestamp, and any messages with identical timestamps may be ordered incorrectly, which means we don't get a proper representation of what actually happened - and isn't that what logs are supposed to tell us? Additionally, sorting is one of the more computationally expensive operations, while splitting is much cheaper, provided the log messages are tagged with useful details.
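The asymmetry shows up directly in code. A sketch, assuming each input stream is already in timestamp order and each line begins with its timestamp (as in the syslog samples above):

import heapq

def split(lines, predicate):
    # Splitting is a stateless filter: one pass, no sorting, no buffering.
    return [line for line in lines if predicate(line)]

def merge(*streams):
    # Merging needs a timestamp-ordered k-way merge even when every input
    # is already sorted; lines sharing a timestamp are ordered arbitrarily,
    # which is exactly the ambiguity described above.
    return list(heapq.merge(*streams,
                            key=lambda line: line.split(" ", 1)[0]))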

The other item is

"Also, in my opinion, the different logs have a different life time. You might want to keep all audit logs, keep the error log for the main events for some longer time than the fast growing access log, defining log rotation based on the log type is a benfit in my opinion"

This statement presumes the retention intentions of the organization operating the ldap service. Leave retention decisions to the company that runs ldap; if they want different retention, let them split/filter the logs for their purposes. Since merging logs is more difficult than splitting them, let the company split them and retain what they want.

Finally, I'm not sure how we connect outside of this issue report, but I'm willing to discuss a lot more. My primary role at my company is real-time log processing and alerting.

https://www.uvic.ca/search/q/directory.php?qtype=pers&pidm=10005543

389-ds-bot commented 4 years ago

Comment from firstyear (@Firstyear) at 2019-05-16 07:55:03

#3409

I have a concern with "holding" onto a bunch of log elements so that they can be logged as a cohesive block.

  • many logging systems and log processing systems (SIEM - Security Information and Event Management) such as Splunk, Loggly, Riemann and others all rely on single-line log messages. Logging a block is very difficult to work with.

They can also work with structured logs, and the log block would be emitted as multi-line with indicators. I've used SIEMs before and even run Splunk at home, so I'm designing this to work with them.

  • there is no way to see time differences between start and completion events.

Actually, this is improved because you can see precise timing of the start and end of operations - we also aren't removing the access log, so you still have those timing details.

An example being:

[
    [ timestamp, "start event" ],
    ....
    [ timestamp, "end event" ]
]

We could even provide automated durations to accompany this (taken from a monotonic clock, so clock skew/steps cannot affect the timings).
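A sketch of that measurement: wall-clock timestamps for correlation with other logs, a monotonic clock for the duration (the structured-event shape here is illustrative, not a committed format):

import json
import time
from datetime import datetime, timezone

def timed_event(name, func, *args):
    """Run func, then emit one structured event carrying start/end
    wall-clock times plus a monotonic-clock duration that NTP steps
    or clock skew cannot distort."""
    start_wall = datetime.now(timezone.utc).isoformat()
    start_mono = time.monotonic()
    result = func(*args)
    print(json.dumps({
        "event": name,
        "start": start_wall,
        "end": datetime.now(timezone.utc).isoformat(),
        "duration_s": time.monotonic() - start_mono,
    }))
    return result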

  • there is no way to see/debug overlapping tasks.

Hmmm. I find that overlapping log info makes debugging harder - I do a lot of 389 debugging, so this could just be my opinion though :). I'd rather see each operation in isolation, along with details about the timings and states the operation went through. Given the server design, most resources are meant to be isolated from other threads anyway, so any interaction between threads would show up in the detailed timings.

For example, perhaps a query almost always takes 100ms but sometimes it takes 500ms. If log messages were logged individually at the time the action took place, it could be seen that every time the action took 500ms, there was a different query that was started in the middle of the first query that caused a performance degradation that resulted in the first query taking 500ms.

I'd argue we get better information, because we can then profile the shared resource in the block that caused the delay - i.e. this plugin took 450ms because the lock was held by the other thread.

This type of interaction between actions can not be observed if logs are collected and logged as a block.

I think there is already enough detail in the log line to correlate all actions. The connection (conn=) and operation (op=) allow the correlation of complete actions.

If you are going to add statistics and debug messages they should be clearly marked like the BIND, SRCH, RESULT etc with other key words such as STAT, DEBUG etc

We aren't removing the access log (it was discussed, and we decided not to), so this will remain in its current format.

#2474

I agree that logging needs to perform well. It is my observation over the past 20 years that logging to disk files is almost always slower than logging to syslog. We see syslog peak rates in the 50,000 messages per second commonly and sustained rates in the 30,000 messages per second.

What turns out being more useful is the ability to increase or decrease the logging level while the process is active (not having to restart). I have seen where the signals of USR1 and USR2 are used to increase and decrease the logging level of a running process.

I think we can already do this, without signals - just change the log level in cn=config via LDAP. The design of the new system already has a plan to accommodate live changes to log details, so I hope this concern is covered for you :)
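As an illustration of such a live change (a sketch with the third-party ldap3 package; host and credentials are placeholders, and 8192 is the replication-debugging value of nsslapd-errorlog-level) - it takes effect without a restart:

from ldap3 import Server, Connection, MODIFY_REPLACE

conn = Connection(Server("ldap://localhost:389"),
                  "cn=Directory Manager", "password", auto_bind=True)
# Raise error-log verbosity on the running instance.
conn.modify("cn=config",
            {"nsslapd-errorlog-level": [(MODIFY_REPLACE, ["8192"])]})
conn.unbind()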

REFERENCE POINT DEBUG

Just thinking while typing, it would be possible to generate a completed debug level stream of log messages while buffering the most recent messages (based on time or message count), but only send to the log output handlers messages that are above the normal logging level (say notice). If a severe error is appended to the stream, the log level of messages that are sent to the log output handlers could be temporarily increased to debug until the severe error gets sent to the log output handlers. At that time the log level is returned to the previous notice level. This gives you debug level logging leading up to the severe error while doing regular notice level logging the rest of the time.

I feel that I need to point out that debug logging is useful to developers but of limited use to systems administrators, but that is a different philosophical discussion.

Well, it's great when a sysadmin who has a problem is empowered to investigate it themselves, or can give us the info we need, rather than us asking for core dumps or similar. We have a serious issue where people say "ohh my server was slow at 3am" and we have nothing else to proceed on. Having a higher level by default means admins can retrospectively give us the details we need.

https://lists.fedoraproject.org/archives/list/389-devel@lists.fedoraproject.org/thread/YGPRPWN7PQWPH7ILEP3Z5GBPTSBKXNQB/

There is a lot of information in that thread. I'm not sure if I should post here, or post to that thread. I am posting here to keep all of my feedback in one place. The thread started with these ideas:

  • The performance of logging should be improved
  • The amount of details (fine grain) and information in logs should improve
  • The structure of the log content should be improved to aid interaction (possibly even machine parsable)

As concepts these are all desirable. Of course, the details of how these are implemented is the key to success. The next points are that there are different reasons for logging:

  • statistics
  • auditing (authentications, modifications)
  • debugging (for my usage the most important aspect)

What is being described here is what the logs are used for, and one use case is not listed. Let me describe this a little differently.

Debugging - There are two sides to this. One is for debugging code errors. This is only useful to the developers, but the actual developers don't have access to the logs created in the production environment of some company running the product. So the best that can be hoped for is that these logs are included when the end user files a bug report with the development team. The more detail that is logged, the more useful a bug report can be (think stack traces, possibly with data values or the decision process). This level of detail is unrealistic for continuous logs in a production environment. See REFERENCE POINT DEBUG above. The second debugging use is for debugging data errors: some operation could not be completed because the data was invalid. This type of error is very useful to the organization that is running ldap. They may have a data flow from some authoritative identity process that is not compatible with ldap. We had a client who tried to set their password using emojis! These errors are of limited use to the developers, with the exception of feature requests.

Auditing - useful to auditors. Auditors are interested in who, what, when and where: access to data, change (also delete) of data. This is more about the story of what happened. Who logged in, where did they log in from, what did they access and what did they change? Taking a page from the SQL community, there is a configuration of which audit logs to include: none, access, modify (includes delete), denied. Perhaps in the ldap context there might be more granular access or modify in terms of classes of attributes. I'm not an ldap specialist, so I can't comment too much on that.

I think statistics are fairly obvious, but I'll add details to be thorough. Statistics include details of specific transactions and averaged transactions. Details may be:

  • duration (scoped to type of operation)
  • data throughput
  • simultaneous connections
  • simultaneous users
  • % utilization of maximum settings (80% of max connections reached)

We also want to add timing for code sections, i.e. memberof took this long, backend took this long, etc., so you know which plugin or segment is causing delays.

With statistics one needs to think of monitoring. Perhaps configurable alert thresholds can be set, with additional log lines of "over threshold X of Y". This permits systems alerting to be triggered easily, before what is effectively a denial of service causes an outage.

This is a good point. If the etime > X, log more detail so it can be investigated. This could be a good balance between verbosity and information for admins to diagnose potential pain points.

The class of logs that is missing is the rest of the operational activity: service start/stop, data files opened/closed, data corruption (different from invalid incoming data), replication logs, active/passive failover and promotion, multi-master conflict resolution, backup, restore, bulk load, and so on.

Ahhh true. I have (foolishly) lumped these into the server error log as it currently exists, but I think you correctly identify that it should be a separate idea of "here is my current status, all okay, you can see how the server is going".

There were other things in that thread.

"merging of logs: I do not like the idea to have everything in one log file, if you say we can provide tools to split, we can also provide tools to merge. The logs have a different format and for good reasons, e.g. the audit log is ldif like and can easily be used to replay changes, why force it into another format."

I used to feel the same way; however, it turns out to be much more difficult to merge logs than to split them. To split them you just filter them and you are done. To merge them you must sort them by logged timestamp, and any messages with identical timestamps may be ordered incorrectly, which means we don't get a proper representation of what actually happened - and isn't that what logs are supposed to tell us? Additionally, sorting is one of the more computationally expensive operations, while splitting is much cheaper, provided the log messages are tagged with useful details.

This is true - part of my thinking here is that the new operation log will be a combination of the current access/audit/error logs, so you could theoretically disable those three and use the operation log in isolation to get this detail.

The other item is:

"Also, in my opinion, the different logs have a different life time. You might want to keep all audit logs, keep the error log for the main events for some longer time than the fast growing access log, defining log rotation based on the log type is a benefit in my opinion"

This statement presumes the retention intentions of the organization operating the ldap service. Leave retention decisions to the company that runs ldap; if they want different retention, let them split/filter the logs for their purposes. Since merging logs is more difficult than splitting them, let the company split them and retain what they want.

Well, if we don't split them, we think it makes retention harder. Assume the user here is someone who does not have a SIEM or other central log system. We need to give them choices around rotation and retention within 389 itself.

So for you, with syslog + SIEM, I think you have plenty of ways to choose what to retain from the single stream of events going into syslog, but others who rely only on the on-disk logs like access/audit/error may have different needs.

Finally, I'm not sure how we connect outside of this issue report but I'm willing to discuss a lot more. My primary role at my company is real time log processing and alerting. https://www.uvic.ca/search/q/directory.php?qtype=pers&pidm=10005543

Commenting here is just fine - alternately, you can email the mailing list (you may need to join first though ...), or me directly - wbrown at suse.de.

I appreciate your really thoughtful and extensive comment. I think there are some good things in here for me to review again and consider further.

389-ds-bot commented 4 years ago

Comment from firstyear (@Firstyear) at 2019-05-16 08:02:24

PS: I'm working on a design document at the moment and will move some of these ideas into it. I will summarise the content later, when I am closer to development, to allow a design review and to help ensure I'm not missing the points you are raising. Thank you!

389-ds-bot commented 4 years ago

Comment from firstyear (@Firstyear) at 2019-05-22 02:43:16

Metadata Update from @Firstyear:

389-ds-bot commented 4 years ago

Comment from erempel at 2019-05-22 03:23:37

#3409

I have a concern with "holding" onto a bunch of log elements so that they can be logged as a cohesive block.

  • many logging systems and log processing systems (SIEM - Security Information and Event Management) such as Splunk, Loggly, Riemann and others all rely on single-line log messages. Logging a block is very difficult to work with.

They can also work with structured logs, and the log block would be emitted as multi-line with indicators. I've used SIEMs before and even run Splunk at home, so I'm designing this to work with them.

Although structured logs can be stored and presented, parsing them appropriately for indexing and searching is still very problematic for most log visualization engines. I would give some serious thought before committing to structured log messages. Structured metadata is a different thing, but that does not apply to the actual message.

there is no way to see time differences between start and completion events.

Actually, this is improved because you can see precise timing of start and end of operations - we also aren't removing the access log so you still have those timing details. An example being:

[
    [ timestamp, "start event" ],
    ....
    [ timestamp, "end event" ]
]

We could even provide automated durations here to accompany this (taken from monotonic to prevent clock skew/shifts affecting timings).

You have to keep in mind that logs are often not viewed in isolation. As a sysadmin I might be looking at the ldap logs alongside kernel SCSI I/O messages or firewall logs. Delaying the "story" that the ldap logs are telling until the end of the chapter makes correlation with the rest of the logs very difficult.

Think of a problem case where an ldap action takes minutes to complete (I know that doesn't happen), but the log is not produced until the end of that minute. For the whole minute, there is no indication that anything is being done.

The other difficulty is that it becomes near impossible to filter different logging details for different retention policies. If you add code timings to this log object, as you allude to below, it would be difficult to filter those timings out of the long-term retention of logs. What would happen is that all of the different kinds of log messages:

all end up in the same log message and can no longer be filtered.

All of the extra detail that is not very useful for operational activities starts to get in the way of operational responses.

I can see how trace-level logging might want a single event profile, and if you think of that log line as a "timing"-class line, it makes sense as a single log event. The operational log messages tell a different story - logged in, search, update, logout. These are different classes of actions and should be logged independently, at the time they occur.

there is no way to see/debug overlapping tasks.

Hmmm. I find that overlapping log info makes debugging harder - I do a lot of 389 debugging so this could just be my opinion though :). I'd rather see each operation in isolation, along with details about what timings and states the operation went through. Given the server design, most resources are meant to be isolated from other threads anyway, so any interaction between threads would highlight in the detailed timings.

You need to keep in mind that it isn't just overlapping ldap operations; there might be overlapping OS-level logs as well. A fictitious example:

  1. request arrives
  2. memory is requested
  3. OS reports no contiguous memory
  4. OS takes 2 seconds to allocate contiguous memory
  5. request is processed.

... etc.

There might be other overlapping events, such as the firewall being restarted or a disk defragmentation taking place. Perhaps a backup is running, or a deduplication data scan. Maybe there was a hot-add of a CPU in a virtualized environment. There are literally hundreds of other events that might affect the function of the ldap server and need to be viewed in-line with the ldap logs.

For example, perhaps a query almost always takes 100ms but sometimes it takes 500ms. If log messages were logged individually at the time the action took place, it could be seen that every time the action took 500ms, there was a different query that was started in the middle of the first query that caused a performance degradation that resulted in the first query taking 500ms.

I'd argue we get better information because we can then profile the shared resource in the block that caused the delay - IE this plugin took 450ms because the lock was held by the other thread.

With the operation ID (or, even more granular, the thread ID or others) this can easily be viewed/parsed from the log. You are really just talking about presentation, which should be done with presentation tools. Logs are time-series data and should natively be presented in the order the events occur. Profiling shared resources is a trace or debugging log that has very limited use for operational requirements.

This was a short response, but I don't want to gloss over the issue: it is not the source logging that should deal with presentation. Presentation should happen at viewing time. We see log sources that use variable numbers of spaces to align log data from one line to the next; trying to manipulate, process, and alert on those logs becomes a computational nightmare due to the regular expressions needed to handle the variable whitespace - all because presentation was attempted at the logging source.

I can't emphasize enough that presentation should not be a concern of source logging. It has been a disaster in every case I have seen.

#2474

I agree that logging needs to perform well. It is my observation over the past 20 years that logging to disk files is almost always slower than logging to syslog. We see syslog peak rates in the 50,000 messages per second commonly and sustained rates in the 30,000 messages per second.

What turns out being more useful is the ability to increase or decrease the logging level while the process is active (not having to restart). I have seen where the signals of USR1 and USR2 are used to increase and decrease the logging level of a running process.

I think we can already do this, without signals. Just change the loglevel in cn=config via ldap. The design of the new system already has a plan to accommodate live changes to log details, so I hope this concern is covered for you :)

Again, you are thinking only within the scope of ldap. Many IT shops have ldap admins in an identity department and separate systems administrators in an infrastructure group. The infrastructure group may need to adjust the logging level, but they don't have access to update ldap data, so they cannot change anything in cn=config.

REFERENCE POINT DEBUG

Just thinking while typing, it would be possible to generate a completed debug level stream of log messages while buffering the most recent messages (based on time or message count), but only send to the log output handlers messages that are above the normal logging level (say notice). If a severe error is appended to the stream, the log level of messages that are sent to the log output handlers could be temporarily increased to debug until the severe error gets sent to the log output handlers. At that time the log level is returned to the previous notice level. This gives you debug level logging leading up to the severe error while doing regular notice level logging the rest of the time.

I feel that I need to point out that debug logging is useful to developers but of limited use to systems administrators, but that is a different philosophical discussion.

Well, it's great when a sysadmin has a problem that they can be empowered to investigate themself, or be able to give us the info we need, rather than asking for core dumps or similar. We have a serious issue where people say "ohh my server was slow at 3am". and we have nothing else to proceed on. Having a higher level by default, means admins can retrospectively give us the details we need.

https://lists.fedoraproject.org/archives/list/389-devel@lists.fedoraproject.org/thread/YGPRPWN7PQWPH7ILEP3Z5GBPTSBKXNQB/

There is a lot of information in that thread. I'm not sure if I should post here, or post to that thread. I am posting here to keep all of my feedback in one place. The thread started with these ideas:

  • The performance of logging should be improved
  • The amount of details (fine grain) and information in logs should improve
  • The structure of the log content should be improved to aid interaction (possibly even machine parsable)

As concepts these are all desirable. Of course, the details of how these are implemented is the key to success. The next points are that there are different reasons for logging:

  • statistics
  • auditing (authentications, modifications)
  • debugging (for my usage the most important aspect)

What is being described here is what the logs are used for, and one use case is not listed. Let me describe this a little differently.

Debugging - There are two sides to this. One is for debugging code errors. This is only useful to the developers, but the actual developers don't have access to the logs created in the production environment of some company running the product. So the best that can be hoped for is that these logs are included when the end user files a bug report with the development team. The more detail that is logged, the more useful a bug report can be (think stack traces, possibly with data values or the decision process). This level of detail is unrealistic for continuous logs in a production environment. See REFERENCE POINT DEBUG above. The second debugging use is for debugging data errors: some operation could not be completed because the data was invalid. This type of error is very useful to the organization that is running ldap. They may have a data flow from some authoritative identity process that is not compatible with ldap. We had a client who tried to set their password using emojis! These errors are of limited use to the developers, with the exception of feature requests.

Auditing - useful to auditors. Auditors are interested in who, what, when and where: access to data, change (also delete) of data. This is more about the story of what happened. Who logged in, where did they log in from, what did they access and what did they change? Taking a page from the SQL community, there is a configuration of which audit logs to include: none, access, modify (includes delete), denied. Perhaps in the ldap context there might be more granular access or modify in terms of classes of attributes. I'm not an ldap specialist, so I can't comment too much on that.

I think statistics are fairly obvious, but I'll add details to be thorough. Statistics include details of specific transactions and averaged transactions. Details may be:

  • duration (scoped to type of operation)
  • data throughput
  • simultaneous connections
  • simultaneous users
  • % utilization of maximum settings (80% of max connections reached)

We also want to add timing for code sections, IE memberof took this long, backend took this long etc. So you know what plugin or segment is causing delays.

Timings for code sections are more about trace-level logging, and trace-level logging is not useful to operations teams. If you are going to enable such logs, they need to be done in a way that lets them be filtered out and given a different retention.

If you are going to add that much detail to the logs, you may want configurable "slow action" logging, so classes of actions can have a timing threshold above which they are logged. Or perhaps a set of thresholds that map the timing onto a log-level class such as info, notice, warning, critical, etc., and then they would be logged according to the loglevel set for the timing class of messages. This would allow each site to define the verbosity of timing logs, which could be very voluminous.

Something like

plugin_duration=(1 5 10)

to map timings to info, warning, error

then

plugin_loglevel=warning

would result in all plugin durations of 5 or more milliseconds being logged, along with any other warning-level or higher log messages from plugins.
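A sketch of how those hypothetical settings could be realized (the configuration names above are illustrative, as is this Python rendering of them):

import logging

# plugin_duration=(1 5 10): thresholds in milliseconds mapped to
# increasing severities, checked from the highest threshold down.
PLUGIN_DURATION = [(10, logging.ERROR),
                   (5, logging.WARNING),
                   (1, logging.INFO)]

def duration_level(ms):
    for threshold, level in PLUGIN_DURATION:
        if ms >= threshold:
            return level
    return logging.DEBUG

log = logging.getLogger("plugin")
log.setLevel(logging.WARNING)  # plugin_loglevel=warning

def log_duration(plugin, ms):
    log.log(duration_level(ms), "plugin=%s duration=%.1fms", plugin, ms)

log_duration("memberof", 7.2)  # >= 5ms maps to WARNING, so it is emitted
log_duration("memberof", 2.0)  # maps to INFO, below the configured level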

With statistics one needs to think of monitoring. Perhaps configurable alert thresholds can be set, with additional log lines of "over threshold X of Y". This permits systems alerting to be triggered easily, before what is effectively a denial of service causes an outage.

This is a good point. If the etime > X, log more detail so it can be investigated. This could be a good balance between verbosity and information for admins to diagnose potential pain points.

It is more than just timings - essentially anywhere there is a limit: number of connections, amount of RAM, query rates, update rates, replication delay, replication data rate (think leased lines and costs). Just giving you food for thought.

The other item is:

"Also, in my opinion, the different logs have a different life time. You might want to keep all audit logs, keep the error log for the main events for some longer time than the fast growing access log, defining log rotation based on the log type is a benefit in my opinion"

This statement presumes the retention intentions of the organization operating the ldap service. Leave retention decisions to the company that runs ldap; if they want different retention, let them split/filter the logs for their purposes. Since merging logs is more difficult than splitting them, let the company split them and retain what they want.

Well, if we don't split them, it makes the retention harder we think. Assume someone who does not have a SIEM or other central log system is a user here. We need to give them choices around the rotation and retention just within 389 itself.

So for you with syslog + SIEM, I think you have plenty of ways to choose what to retain from a single stream of events going into syslog, but others who only rely on the on-disk logs like access/audit/error may have different needs.

I am not confident that the desired retention will match the boundaries along which 389 splits the logs.

Short term retention (used for debugging/monitoring).

For the purposes of monitoring and debugging, only a recent set of logs is required. The disk savings from rotating these different classes separately are so small they aren't worth talking about. Just keep x days of logs on the host and ship all the logs to the long-term retention framework - whether that is centralized logging or just copying the files to some storage location does not matter. I consider our university, with 20,000+ students and 5,000+ staff, a larger-than-average ldap site; our ldap logs are only 15GB per day, so keeping 2-3 days of logs without filtering anything is trivial. Allocating 100-200GB of resources to a core service like ldap should be easy to justify.

Long Term retention (possibly off-line)

If a company wants different retention for different logs, I would argue that the company should define the retention policy and filter/retain what it wants. Assuming that retention happens at the boundary of a complete log file is a little forceful. For instance, I might want to keep all login and access events for 1 year, but only login and modify events for 5 years, and possibly queries and their durations (possibly even post-processed into averages) for 3 years.

In reality, I think you will find that all logs are retained for the same duration. The cost of storage compared to log size is cheap, and it is better to have the logs than not. So I don't think per-type retention is really something that should be encouraged; what should be encouraged is keeping all logs for some retention period. This means all of the logs can go into the same file with the same retention.

Evan.

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2019-06-13 17:21:22

Metadata Update from @mreynolds389:

389-ds-bot commented 4 years ago

Comment from mreynolds (@mreynolds389) at 2020-03-04 16:52:18

Metadata Update from @mreynolds389: