Replace infoSteram by a logging framework (SLF4J) [LUCENE-1482]

asfimport commented 15 years ago

Lucene makes use of infoStream to output messages in its indexing code only. For debugging purposes, when the search application is run on the customer side, getting messages from other code flows, like search, query parsing, analysis etc can be extremely useful. There are two main problems with infoStream today:

It is owned by IndexWriter, so if I want to add logging capabilities to other classes I need to either expose an API or propagate infoStream to all classes (see for example DocumentsWriter, which receives its infoStream instance from IndexWriter).
I can either turn debugging on or off, for the entire code.

Introducing a logging framework can allow each class to control its logging independently, and more importantly, allows the application to turn on logging for only specific areas in the code (i.e., org.apache.lucene.index.*).

I've investigated SLF4J (stands for Simple Logging Facade for Java) which is, as it names states, a facade over different logging frameworks. As such, you can include the slf4j.jar in your application, and it recognizes at deploy time what is the actual logging framework you'd like to use. SLF4J comes with several adapters for Java logging, Log4j and others. If you know your application uses Java logging, simply drop slf4j.jar and slf4j-jdk14.jar in your classpath, and your logging statements will use Java logging underneath the covers.

This makes the logging code very simple. For a class A the logger will be instantiated like this: public class A { private static final logger = LoggerFactory.getLogger(A.class); }

And will later be used like this: public class A { private static final logger = LoggerFactory.getLogger(A.class);

public void foo() { if (logger.isDebugEnabled()) { logger.debug("message"); } } }

That's all !

Checking for isDebugEnabled is very quick, at least using the JDK14 adapter (but I assume it's fast also over other logging frameworks).

The important thing is, every class controls its own logger. Not all classes have to output logging messages, and we can improve Lucene's logging gradually, w/o changing the API, by adding more logging messages to interesting classes.

I will submit a patch shortly

Migrated from LUCENE-1482 by Shai Erera (@shaie), 2 votes, resolved May 31 2010 Attachments: LUCENE-1482.patch, LUCENE-1482-2.patch, slf4j-api-1.5.6.jar, slf4j-nop-1.5.6.jar

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

This patch covers:

Using SLF4J in all the classes that used infoStream
A test which uses the JDK14 adapter to make sure it works, as well as fixing few tests which relied on some messages
Deprecation of setInfoStream(), getInfoStream() etc. in several classes who exposed this API.

Few notes:

As in many customer environments I know of the INFO level is usually turned on, and we were forbidden to output any message in the INFO level, unless it's really INFO, WARNING or SEVERE, I assumed Lucene logging messages should be in the DEBUG level (which is one less than TRACE).
I wasn't sure what to do with the set/get infoStream methods, so I just deprecated them and do nothing (i.e., setInfoStream does nothing and getInfoStream always returns a null). Not sure how's that align with Lucene's back-compat policy, but on the other hand I didn't think I should keep both infoStream and SLF4J logging in the code.
Should I attach the slf4j jars separately?
I didn't go as far as measuring performance because:
1. The code uses isDebugEnabled everywhere, which is at least judging by the JDK14 adapter very fast (just checks a member on the actual logger instance) and is almost equivalent to infoStream != null check.
2. It really depends on the adapter that's being used. I used JDK14, but perhaps some other adapter will perform worse on these calls, although I expect these calls to be executed quickly, if not even being inlined by the compiler.

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> Should I attach the slf4j jars separately?

If we go with SLF4J, we'd want to include the -api jar in Lucene for sure, along with a single implementation. My vote would be for the -nop implementation. Then, folks who want logging can include the implementation they like.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Thanks Doug,

I've replaced the JDK14 jar with the NOP jar and deleted the logging test I added (since NOP does not log anything).

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Forgot to clean up some code in tests which made use of JDK logging.

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

safeDebugMsg is protected in a public class, which means it will appear in the javadoc, which it should not. Also, logging the thread ID should be done by the logging system, not by Lucene. So that method should just be removed, no?

Also, you've added braces to all of the log statements. This is in conformance with our style guidelines, but I prefer that logging add a minimum of vertical space, so that more real logic is visible at once. I suggest you not make this style change in this patch, but propose it separately, if at all.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

I kept safeDebugMsg because it was used by a class which extended IndexWriter and relied on that method to be called. However, I fixed the class by overriding testPoint instead. So I can now remove safeDebugMsg.

As for the output format, I agree that it should be handled by the logging system, but wanted to confirm that with other members before I change it. I'm glad that you agree to that too.

Attached is a new patch which removes the method.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Is there anything else I can do in order to help drive this issue forward?

asfimport commented 15 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

In the mail thread, Yonik said (http://lucene.markmail.org/message/j3r53azx2tmljijg?q=Java+Logging+in+Lucene):

I'm leery of going down this logging road because people may add logging statements in inappropriate places, believing that isLoggable() is about the same as infoStream != null

They seem roughly equivalent because of the context in which they are tested: coarse grained logging where the surrounding operations eclipse the logging check.

isLoggable() involves volatile reads, which prevent optimizations and instruction reordering across the read. On current x86 platforms, no memory barrier instructions are needed for a volatile read, but that's not true of other architectures.

Thoughts on how to address this? Have you done any performance testing of this patch versus the current system, both w/ infoStream == null and infoStream != null.

I also think it is important to address Yonik's point about "inappropriate" places. In other words, we need guidelines about where and when to using logging and committers need to be on the lookout for logging uses. I realize that is as much a community policing problem as a patch problem, but, we should address them before we adopt logging.

asfimport commented 15 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

While the patch itself is likely minor, the implications of applying the patch are significant, IMO

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

I haven't done any performance test since after following Doug's proposal, I switched to using the NOP logger which all of its isLoggable()-like calls return false. That means that OOtB calling isDebugEnabled returns immediately w/ false, w/o doing any checks. This may even be inlined by the compiler.

If someone decides to drop the slf4j-jdk14.jar (for example), then that means that someone already uses Java logging in other places in his/her application (not necessarily the search parts only) and therefore the overhead of checking whether logging is enabled is already been taken into consideration. However, I don't think it's the Lucene's community's responsibility to take care of these situations, because like I wrote before, someone might extend SLF4J and provide a very inefficient isDebugEnabled() implementation, and it wouldn't (and shouldn't) be the community's responsibility.

What's important is that Lucene uses the NOP adapter, which really does nothing.

If you insist, I can run some performance tests, but I really don't think it's necessary.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Grant, given what I wrote below, having Lucene use NOP adapter, are you still worried w.r.t. the performance implications?

If there is a general reluctance to add a dependency on SLF4J, can we review the other options I suggested - using infoStream as a class with static methods? That at least will allow adding more prints from other classes, w/o changing their API.

I prefer SLF4J because IMO logging is important, but having infoStream as a service class is better than what exists today (and I don't believe someone can argue that calling a static method has any significant, if at all, performance implications).

If the committers want to drop that issue, please let me know and I'll close it. I don't like to nag :-)

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

It seems we should take into consideration the performance of a real logger (not the NOP logger) because real applications that already use SLF4J can't use NOP adapter. Solr just switched to SLF4J for example.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

Like I wrote before, I believe that if someone will use a real logger, it's most probably because his application already uses such a logger in other places of the code, not necessarily just the search parts. Therefore, the performance implications of using a logger is not important, IMO. For the sake of argument, what if some writes his own adapter, which performs really bad on isDebugEnabled() (for example) - is that the concern of the Lucene community? The way I view it - this patch gives those who want to control Lucene logging, better control of it. The fact that Lucene ships with the NOP adapter means it will not be affected by the logger's isDebugEnabled() calls. If you want to always output the log messages, drop an adapter which always returns true.

I wonder if there is a general reluctance to use SLF4J at all, and that's why you continue to raise the performance implications. Because I seriously don't understand why you think that checking if debug is enabled can pose any performance hit, even when used with a real logger. If performance measurement is what's keeping this patch from being committed, I'll run one of the indexing algoirhtms w/ and w/o the patch. I'll use the NOP adapter and the Java logger adapter so we'll have 3 measurements. However, if performance is not what's blocking that issue, please let me know now, so I won't spend test cycles for nothing.

And ... I also proposed another alternative, which is not as good as logging IMO, but still better than what we have today - offer an InfoStream class with static methods verbose() and message(). It can be used by all Lucene classes, w/o changing their API and thus allows adding more messages gradually w/o being concerned w/ API backward compatibility.

I prefer SLF4J, but if the committers are against it, then this one should be considered also.

Shai

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I'm not arguing for or against SLF4J at this point, but simply pointing out that I didn't think it was appropriate to base any analysis on the NOP adapter, which can't be used for any project already using SLF4J.

I think using a logger to replace the infostream stuff is probably acceptable. What I personally don't want to see happen is instrumentation creep/bloat, where debugging statements slowly make their way all throughout Lucene.

Because I seriously don't understand why you think that checking if debug is enabled can pose any performance hit, even when used with a real logger.

I've tried to explain - these calls can be costly if used in the wrong place, esp on the wrong processor architectures. What appears in inner loop will vary widely by application, and there are a ton of lucene users out there using it in all sorts of ways we can't imagine. For example, I'd rather not see debugging in Query/Weight/Scorer classes - for most applications, query and weight construction won't be a bottleneck, but there are some where it could be (running thousands of stored queries against each incoming document via memoryindex for example).

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

{bq} I think using a logger to replace the infostream stuff is probably acceptable. What I personally don't want to see happen is instrumentation creep/bloat, where debugging statements slowly make their way all throughout Lucene. {bq}

Grant wrote a few posts back: "I also think it is important to address Yonik's point about "inappropriate" places. In other words, we need guidelines about where and when to using logging and committers need to be on the lookout for logging uses. I realize that is as much a community policing problem as a patch problem, but, we should address them before we adopt logging."

IMO, adding logging messages to outer classes, like QueryParser, is unnecessary since the application can achieve the same thing by itself (logging the input query text, used Analyzer and the output Query object). But logging internal places, like merging, is very important, because you usually can't reproduce it in your dev env. (it requires the exact settings to IndexWriter, the exact stream of documents and the exact operations (add/remove)). Like I said, logging in Lucene is mostly important when you're trying to debug an application which is out of your hands. Customers are rarely willing to share their content. Also, community-wise, being able to ask someone to drop a log of operations that has happened and caused a certain problem is valuable. Today you can ask it only on IndexWriter output, which may not be enough.

{bq} I've tried to explain - these calls can be costly if used in the wrong place, esp on the wrong processor architectures. What appears in inner loop will vary widely by application, and there are a ton of lucene users out there using it in all sorts of ways we can't imagine. For example, I'd rather not see debugging in Query/Weight/Scorer classes - for most applications, query and weight construction won't be a bottleneck, but there are some where it could be (running thousands of stored queries against each incoming document via memoryindex for example). {bq}

I'm sorry, but I don't buy this (or I'm still missing something). What's the difference between logger.isDebugEnabled to indexOutput.writeInt? Both are method calls on a different object. Why is the latter acceptable and the former not? I'm not saying that we should drop any OO design and programming, but just pointing out that Lucene's code is already filled with many method calls on different objects, inside as well as outside of loops. The only way I think you could claim the two are different is because indexOutput.writeInt is essential for Lucene's operation, while logger.isDebugEnabled is not. But I believe logging in Lucene is as much important (and valuable) as encoding its data structures.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I'm going to go out on a limb and say this one is too contentious to make 2.9.

If you strongly disagree, please do feel free to flip it back.

asfimport commented 15 years ago

Shai Erera (@shaie) (migrated from JIRA)

I'm not even sure if this issue should be kept around, given the responses I got to it. The question is - do we think logging should be improved in Lucene or not? If YES, then let's keep the issue around and come back in 3.0. If NO, then let's cancel it. It hasn't been commented on for \~6 months, so there's no point to keep it around unless we think it should be resolved at some point

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

So who is in favor of this? Personally, I'm not - Shai put up the call for support months ago - one more chance I guess - without, I will close this soon.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Well ... since Mark hasn't closed it yet (thanks Mark :)), I thought to try once more. Perhaps w/ the merge of Lucene/Solr this will look more reasonable now? I personally feel that just setting InfoStream on IW is not enough. I don't think we need to control logging per level either. I think it's important to introduce this in at least one of the following modes:

We add SLF4J and allow the application to control logging per package(s), but the logging level won't matter - as long as it's not OFF, we log.
We add a static factory LuceneLogger or something, which turns logging on/off, in which case all components/packages either log or not.

I think (1) gives us greater flexibility (us as in the apps developers), but (2) is also acceptable. As long as we can introduce logging messages from more components w/o passing infoStream around ... On #3415 for example, a closeSafely method was added which suppresses IOExceptions that may be caused by io.close(). You cannot print the stacktrace because that would be unacceptable w/ products that are not allowed to print anything unless logging has been enabled, but on the other hand suppressing the exception is not good either ... in this case, a LuceneLogger could have helped because you could print the stacktrace if logging was enabled.

asfimport commented 14 years ago

Jukka Zitting (migrated from JIRA)

We use SLF4J in Jackrabbit, and having logs from the embedded Lucene index available through the same mechanism would be quite useful in some situations.

BTW, using isDebugEnabled() is often not necessary with SLF4J, see http://www.slf4j.org/faq.html#logging_performance

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

I still think that calling isDebugEnabled is better, because the message formatting stuff may do unnecessary things like casting, autoboxing etc. IMO, if logging is enabled, evaluating it twice is not a big deal ... it's a simple check.

I'm glad someone here thinks logging will be useful though :). I wish there will be quorum here to proceed w/ that.

Note that I also offered to not create any dependency on SLF4J, but rather extract infoStream to a static InfoStream class, which will avoid passing it around everywhere, and give the flexibility to output stuff from other classes which don't have an infoStream at hand.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Closing this one as we clearly don't reach a consensus about it. And it introduces a static variable to each class that is interested in logging, which is a problem if Lucene is used inside the same JVM by several instances - turning the logging on for one application will affect all the rest as well.

Anyway, it's been open for too long and I don't see that we're anywhere near a consensus (perhaps rightfully), so no point keeping it open.

apache / lucene

Replace infoSteram by a logging framework (SLF4J) [LUCENE-1482] #2556