apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.68k stars 1.04k forks source link

Problem with SPI components (e.g., AnalysisSPILoader) when refreshing from different classloader and closing previous classloader #13101

Open AndreyBozhko opened 9 months ago

AndreyBozhko commented 9 months ago

Description

Applications like Solr allow users to reference token filters, tokenizers, etc. by their SPI names - and Lucene provides utility methods to discover and load these services.

The discovery relies on AnalysisSPILoader#reload(ClassLoader) method which scans the provided classloader's classpath, and loads and caches available services.

However, if this method is called again (with a different classloader), it makes no attempt to invalidate the cache of available services.

This means that if:

then:

This behavior places restrictions on how the applications can reload the SPI - and it does not work well with how Solr manages the lifecycle of its classloaders.

Specific example

Basically, whenever Solr adds jars to classpath at runtime, it replaces its current classloader with a new instance of URLClassLoader, closes the old classloader, and reloads the SPI.

So this may lead to a situation where the AnalysisSPILoader cache keeps a reference to MyServiceFactory class that was defined by classloader1 (which is now closed because Solr closed it and switched to classloader2).

As a result, when the application code retrieves MyServiceFactory by its SPI name, and the factory executes code which requires loading some other class from classloader1's classpath - the loading fails because Solr already invoked ((URLClassLoader) classloader1).close().

And from users's perspective, this issue manifests in rather cryptic ClassNotFoundException.

Version and environment details

Lucene 9.8.0

uschindler commented 9 months ago

This is a general problem of Service providers in Java.

As Lucene only has a single instance in the parent classloader (Lucene is part of Solr's classloader) it only has a single lookup to find analysis components.

Your proposal will cause the same problem if other cores. If you unload a core and it removes it's analysis components, the anal is in another core will fail from that point on.

To fix this correctly, Solr would need to have a separate Lucene impl per core. This is undoable. In general the per core classpaths in Solr are strange. Codecs, PostingsFormats, DocvaluesFormats and analysis components should be loaded in same classloader like Solr and Lucene itself.

There is no solution to this well known problem.

AndreyBozhko commented 9 months ago

Thanks for taking a look @uschindler - I have to admit that I didn't fully grasp your comment about having a separate Lucene per core.

My original understanding was that for the example I described, the loading and reloading of SPI occurs during the Solr node (and not core) startup. So very early in Solr node's lifecycle, it would load all Lucene classes (makes total sense that these are per-node), and also load other libraries (user-provided plugins) shared at the node level.

But even if we only deal with shared classes, users can still encounter a ClassNotFoundException similar to the example above.

@HoustonPutman Please correct me if there's something that I misinterpreted about the issue from our initial discussion.

AndreyBozhko commented 9 months ago

I do realize that the issue title may have sounded too prescriptive - I certainly didn't mean that. But I would definitely like to have a discussion on what path forward (if any) makes the most sense.

I realize that the feedback may be along the lines of "The behavior of Lucene's AnalysisSPILoader is set in stone, and Solr should just work around that".

Or, as a hypothetical alternative, Lucene could expose a way to invalidate the SPI cache, and applications like Solr may choose to use that (having full control, but also full responsibility over what they're doing).

Or perhaps there are other ways to go about this.

What are your thoughts?

uschindler commented 9 months ago

Hi, I think you should also open an issue in Apache Solr, because basically the problem is more on Solr's side. As said before th

I also thought yesterday evening about the same ide: Allowing to clear the cache. The problem with that is the following:

I could think of another idea:

I could live with that, but I have to state: It is just a workaround for something that is generally broken in Solr. JARs/plugins with Analysis components should be installed in the uber Solr classloader and not per core. This has already brought various problems with ICU plugin and hundreds of support issues of people that have no idea where to place the JAR file and which config files to change. It also caused hazard if different versions of same SPI components lurking in different cores. This would also cause major hazards here.

I tend to think that Solr should "deprecate" the per-Core classloader and have the "lib/" syntax not in solrconfig.xml but in the solr.xml that boots the Solr framework. This is why I think there should be a separate Solr issue.

The current behaviour of the existing method should not be changed, as it works like other SPI components seen in Java's core framework. All of them cache the loaded SPIs (XML components, PNG image loaders, filesystems,....). Lucene only has the option to find "new ones" not yet discovered (to add ones suddenly dropped in or for plugins running in separate classloaders like Elasticserach/Opensearch has). But those plugins should never be unloaded. Therefore removing no longer visible SPIs from the cache is against the usual conventions in Java world.

P.S.: The same problems you see with reloading web applications in Tomcat. That's the reason why nobody does that anymore (it was cool at around 2005, but never worked). Solr's classloader per core has the same problem like Tomcat.

@HoustonPutman @noblepaul I am not familar with recent plugin architecture: Do we have "Server level plugins in Solr already"? I'd really suggest to make sure that people can't load analysis or codec/postingsformat/... (+ soon to be coming vector similarities) SPI plugins for individual core via solrconfig.xml anymore, but have to install those plugins top-level in Solr's main class loader - hopefully with the new plugin architecture. I hope the plugin infrastructure allows that.