apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.64k stars 1.03k forks source link

Fully embrace the java module system [LUCENE-10255] #11291

Open asfimport opened 2 years ago

asfimport commented 2 years ago

I've experimented a bit trying to move the code to the JMS. It is {}surprisingly difficult{}... A PoC that almost passes all checks is here: -https://github.com/dweiss/lucene/tree/jms- https://github.com/dweiss/lucene/tree/jms2

Here are my conclusions so far:


Migrated from LUCENE-10255 by Dawid Weiss (@dweiss), updated Feb 01 2022 Attachments: screenshot-1.png, screenshot-2.png Linked issues:

Sub-tasks:

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

There are differences at runtime that are hard to anticipate - for example resource lookups via class loader no longer work (I fixed this in Luke).

That was always an anti-pattern, should not be used anywhere. The problem of Luke was that some resources were in a different package. With module system you can only load resources from same package that your class is in.

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

A Java module contains metadata (such as the module version or main class) that is completely detached from any source file. These things live in a class bytecode of the compiled module-info; interestingly, there is no source-level way to specify it - these class attributes are injected by the 'jar' tool. Gradle has some fancy on-the-fly asm conversion filter that injects it.

Normally you have a module-info.java that is compiled by javac? Or is this further additional metadata? Because version numbers should not exist for modules, the JMS requires you to encode it into module name. Like a module "org.apache.lucene.core.v9".

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Probably the biggest challenge (not covered in the PoC) are with our custom javadoc and ecj linter tasks - they see the module-info.java and can't cope with it. At the same time, there is no easy way to exclude that one particular file: ecj would have to accept a full set of sources (command argument limit will be a problem), javac can accept a full set of java sources (external file) but then it doesn't copy doc-files properly anymore (this is probably easier to fix).

How about putting the module-info.java into a separate directory (maybe next to build.gradle) and tell javac to also read that file. So it will be invisible to all source processing tools like ecj or javadoc.

asfimport commented 2 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Probably the biggest challenge (not covered in the PoC) are with our custom javadoc and ecj linter tasks - they see the module-info.java and can't cope with it. At the same time, there is no easy way to exclude that one particular file: ecj would have to accept a full set of sources (command argument limit will be a problem), javac can accept a full set of java sources (external file) but then it doesn't copy doc-files properly anymore (this is probably easier to fix).

I'm checking out your branch, I will try to look into some of this part. I tried to prototype such stuff before (likely going not nearly as far as you did) and made some progress, although I can't remember what I did :)

I would expect that e.g. the javadoc task, we may have to change -sourcepath to --module-source-path and similar stuff like that. maybe similar stuff for ecj. I'm sure I am oversimplifying it.

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

> Normally you have a module-info.java that is compiled by javac? Or is this further additional metadata?

This is additional metadata injected by 'jar'. It's actually kind of funny to look at the source code of that... See "-module-version=" and "--main-class=" options here: https://docs.oracle.com/en/java/javase/17/docs/specs/man/jar.html

> Because version numbers should not exist for modules, the JMS requires you to encode it into module name. Like a module "org.apache.lucene.core.v9".

It's not really like this. Modules can have numbers; it's just they're not currently used for anything (you can't make a versioned dependency or enforce it). This is "outside of the scope of the jms". But you can see those versions, for example:

java --list-modules
...
jdk.sctp@14.0.1
jdk.security.auth@14.0.1
...

The branch I created embeds Lucene's version into the created JAR (of Luke) as well.

I also thought about what you suggested - keep a separate source folder for just the module-info.java. It is an alternative that may be the easiest one to follow (until all subprojects have module-info.java information). I just thought it was... not elegant somehow. Seemed hacky... But given the complexity of workarounds for ecj and javadoc it may be a lesser evil - I'll look into this.

Robert - don't spent too much time on this; I'll reiterate with Uwe's suggestion - it seems like a cleaner alternative. When all subprojects have module-infos it may be easier to just move the tooling to use full modules (instead of the classpath).

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I tweaked the build so that the changes required to existing tasks and infrastructure are minimal. I checked that Luke works (as a module) and I also converted lucene.core to a module... and things seems to work beautifully so far. In fact, you can immediately see the gain of the jms because the compiler shows you API glitches - package-private classes in arguments or return types of public API classes (I had to disable that lint check in the patch, for now).

It's quite fun.

https://github.com/apache/lucene/pull/470

asfimport commented 2 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

I converted Luke to a full module - it works

Yes... I noticed that a few days ago when I saw the new Luke start scripts.

It works but there is one side effect - its "Analysis > Preset analyzers" no longer work since this had relied on reflection tricks that are not compatible with the module system. While I don't have any objection we also need to drop the zombie function (GUI layer) from Luke. I have little time for now but will try to find time for that.

 

Luke is a GUI app - which means it (unfortunately) has many features that cannot be covered by unit tests. It is totally okay to use this for experimenting purposes; but could you please let me know when you change its infrastructure...? I'll try to make necessary follow-ups or adjustments to the GUI for the changes.

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

It works but there is one side effect - its "Analysis > Preset analyzers" no longer work since this had relied on reflection tricks that are not compatible with the module system. While I don't have any objection we also need to drop the zombie function (GUI layer) from Luke. I have little time for now but will try to find time for that.

I'd replace this by a hardcoded list as beginning.

Generally Luke should support CustomAnalyzer and build its analyzers using components based on SPI, which is of course a bit much work for quickly testing.

Another option is to add SPI functionality to analyzers, but I disagree with that. IMHO at some points those hardcoded analyzers (which is code duplication everywhere, one for each language) should go away and replaced by a factory class returning preconfigured CustomAnalyzer instances. E.g. like Analyzer ana = Analyzers.german().

asfimport commented 2 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

Generally Luke should support CustomAnalyzer and build its analyzers using components based on SPI, which is of course a bit much work for quickly testing.

Yes Luke has already supported CustomAnalyzer ever since I rewrote it (though I think few of you may know that). Its "Preset analyzers" serves the default of "Analysis" UI, but this is for convenience - we can drop it and replace the default with "Custom analyzers".

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

You can still run it in classpath mode, right? Then it'll work like before.

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

You can still run it in classpath mode, right? Then it'll work like before.

Of course. I also noticed that Luke's module-info.java currently only depends on lucene-core, not analyzers-common, so you are limited anyways with analysis (only StandardTokenizer is provided by SPI no filters nothing...). What would be the best user experience here? Add some commandline option to add analyzers-common to module path, or add it by default (e.g. with --add-modules lucene.analyzers.common)? Or how can you make it pluggable without falling back to classpath mode?

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

You overestimate my experience with the module system, Uwe... As far as I understand, prior to this PR, everything would work because automatic modules are in effect a loophole to get into the entire unnamed module space. After this PR - Luke won't see those SPIs (I think...). I just wanted something that is minimal, compiles and launches - I'm sure there are quirks looking to be addressed.

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

It will see the SPIs if the module is added to the runtime. This can be done directly inside module-info, or dynamically on command line.

If somebody writes an application and depends on both modules, you will see all SPIs. If you want it pluggable just add modules at runtime using commandline parameter or with the module system API (Luke could add an UI to select JAR files and then query their SPIs using the module reflection APIs).

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Oh I figured out, theoretically it should be enough to have it on module-path. No command line parameters needed. Let me play around :-) Maybe it did not work because of the mix "analyzers-common is an automatic module" vs. "lucene-core" is real module.

So in general, adding a JAR file to the module-path should make the SPI implementations inside visible to ServiceLoader.

To test this we should maybe do so module-enablement of lucene-codecs and see if they appear, or modularize analyzers-common.

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

There is also require static, right? So that Luke could declare the optional requirement (that can be unsatisfied)?

asfimport commented 2 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

I mean - I'll try to follow core/infrastructure changes and make necessary modifications on Luke GUI; meanwhile, I don't have sufficient spare time to follow every change. As a fact, since 9.0 Luke's Analysis UI is broken from the users' perspective, even if it is the right decision for the future of the project. I'd be glad that if you just let me know when you change luke module's infrastructure before it's not too late.

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

there's also "uses com.service.api.AnInterface;"

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

This looks like helpful: https://www.infoq.com/articles/java11-aware-service-module/

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

> As a fact, since 9.0 Luke's Analysis UI is broken from the users' perspective

There is still time to reiterate the release and modify those scripts to run Luke in classpath mode. Or fix the module declaration so that it sees what it needs to see. I'll take a look at this.

I don't think I will be able to explicitly shout out when something is changed that affects Luke - had I known about the problem, I'd have tried to fix it... (I ran Luke and toyed around but didn't click on each and every option).

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

As far as I see: lucene.core must also declare "uses XyzFactory" because the cod calling ServiceLoader is there. This also enables it to "see" implementations from unrelated modules on classpath.

It works at moment, because lucene-core is a singleton modue and all it needs is inside. But the factories won't see codecs from other modules unless you say "uses".

Let's go tep by step. We have to also make sure that tests run with dependencies as "modules".

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

> Let's go tep by step. We have to also make sure that tests run with dependencies as "modules".

This should be handled automagically by gradle... Much like compilation already is.

asfimport commented 2 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

There is still time to reiterate the release and modify those scripts to run Luke in classpath mode. Or fix the module declaration so that it sees what it needs to see. I'll take a look at this.

I would keep it as is - anyway the reflection hack is not good and should be dropped. I can't take time for now, but I will make a follow-up on 9.1.

I don't think we should reiterate 9.0 release vote for Luke.

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

No worries, Tomoko. Thank you for taking your time and pointing at the problem! Uwe is on it. I'm sure if I send a box of candy to @jpountz he won't mind making a third respin of 9.0...

asfimport commented 2 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

Please, no... I don't think there is any good reason to stop ongoing release for this.

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

We won't have real modules in 9.0, so who cares about this bug? If the reflection hack shows StandardAnalyzer it's fine.

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

If I have some time the next days I will read more about Services and Modules. There were some pitfalls which you don't notice easily unless you have many SPI splattered over several modules.

So we should first go and enable a real module-info for things like codecs module or backwards module. And then see what happens and add the correct provides/uses in combination with requires/exports. Because exporting does not make the SPI interface available for consumption, it must be marked as SPI first, so the module system knows "this interface" is a service provider. That's all missing (it just works because it is all self-contained).

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Yes we need to declare the SPI's we provide for external use, see java.base exaple:

https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/module-info.java#L368-L398

It says "uses" for all the SPI interfaces it exports. In addition the implementations are listed like in Dawid's PR.

So we must add:

  uses org.apache.lucene.codecs.Codec;

Because our code consumes the Codec SPI provided by other modules. Downstream modules like Luke do not need to do this, because they don't call ServiceLoader directly, they use a factory method Codec.forName().

and also the implementations (already done):

  provides org.apache.lucene.codecs.Codec with
      org.apache.lucene.codecs.lucene90.Lucene90Codec;
asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

The rules are simple: Code that calls "ServiceProvider.load() must have "uses..." in the module-info. For Lucene all the SPI stuff is done inside lucene.core (be sure and grep on the code for ServiceLoader), so we need to add an extry "uses..." for every SPI (Codec, PostingsFormat, TokenizerFactory, TokenFilterFactory - also for those we have no implementations in core, but we consume all when people use TokenFilterFactory.forName(...).

Should I commit to your branch? Maybe the Luke problems with CustomAnalyzer goes away then.

Implementation classes of services must be listed as "provides".

Of course all classes must be exported, but that's already done.

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Tomoko, can you explain what is broken? If it makes Luke hardly usable for most users I'd vote for respinning.

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

@jpountz, there are another Bug in Luke's startup scripts. Because of missing quotes in both the Windows and Linux shell scripts, Luke does not start.

I will open blocker issue. I am currently testing the Lucene release and this was the first thing I noticed (you know most Windows users have whitespace in their user name, so when you download to desktop and double click, luke does not start at all).

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

See #11295

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

The issue with Luke that Tomoko was referring to is this: screenshot-1.png

It only shows StandardAnalyzer, no other analyzers. I don't think it's fatal, but not so nice. We may fix it as described by removing the analyzers or switch to the "custom" view by default:

screenshot-2.png

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

(with full module system support as in Dawid's PR, the custom view does not work, too).

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

> you know most Windows users have whitespace in their user name

Right... not me - that's why I didn't notice. Listen... there is also the possibility of just reverting all those module-related changes and going back to plain classpath. I don't mind (although I think it's a step backwards).

Uwe - please feel free to commit to this branch anything you think is appropriate, that's why I posted it - counting on your feedback.

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

We don't need to revert module system for luke's startup scripts in 9.0, but we should fix 2 issues:

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

About the analyzers... I am thinking of a solution to replace the hack. Otherwise I tend to show only StandardAnalyzer in screenshot, but maybe change default to a carefully populated custom analyzer?

asfimport commented 2 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

As I said - let's keep it as it is in 9.0. Please don't disturb the release for this nor revert the module-related stuff.

Could you please read my lengthy explanation if it matters to you:

@jpountz thanks, there is no critical issue that should be reported to RM in Luke. If so, I would cast a -1 instead of complaining here. Please go ahead!

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

11295 is a horrible experience on windows. PR is there and we agreed to fix it already. The Analyzer question was discussed and personally I see no big issues.

The version number in info box was just an observation, but problem is also known.

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

In short I agree with you on analyzers or revert of modules.

Still before any decisions about Module system, I want to open an older index from 8.x and check if it works (requires SPI to work). Will try tomorrow, it's now too late.

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi,

About compatibility with the to-be-released Lucene/Luke 9.0 and auto-modules: I opened a Lucene 8.8 index yesterday (I chose a random ZIP from the backwards tests of branch 9.0) in Luke 9.0.0 (non RC, fresh built) yesterday. So discovery with auto-modules worked. That's good.

Of course with module-info files we need to declare the "uses" (in lucene-core) and "provides" clauses (all modules that have SPI implementations for each entry in META-INF/services) clauses.

Uwe

asfimport commented 2 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

About compatibility with the to-be-released Lucene/Luke 9.0 and auto-modules: I opened a Lucene 8.8 index yesterday (I chose a random ZIP from the backwards tests of branch 9.0) in Luke 9.0.0 (non RC, fresh built) yesterday. So discovery with auto-modules worked. That's good.

Thank you!

 

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi, I would like to bring in one more thing to investiagte: The current Lucene modules as of 9.0 are named without full reverse domain names. We should investigate on other ASF projects if there is a "standard" how to name modules. I don't like it that the Maven group:artifact name is totally different from the module name. IMHO the Lucene module should be named with "org.apache.lucene.XXXX" instead of plain "lucene.XXXXX".

The log4j module uses this pattern already, and we should coordinate that. Maybe ASF has a standard already. I'd ask on the "Apache Commons" project to figure out how they plan to handle it.

Changing the current syntax of module name is not a problem, because except for Luke we don't expose the modules in our documentation.

As said before I am in favor to name the modules like "groupid.artifactid" based on Maven coordinates (append with "." inbetween).

Thoughts?

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I did that intentionally. I hate those long prefixes. They make life much more complicated and I don't think there's a risk of running into a conflict with anything existing...

java -m org.apache.lucene.core sounds way less attractive than just java -m lucene.core.

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Think this way: java's internal modules don't have the domain prefix either - they rely on the uniqueness of the first part (jdk., java.). I think this is sufficient. No need to be paranoid./

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

java -m org.apache.lucene.core sounds way less attractive than just java -m lucene.core

This is IMHO no argument for shorter names: If your own project is using the module system then you have a module-info.java, too. Then you can start it without hassle and won't specify any extra options.

I would ask around if there's a standard already. I would really like to see consistent module names. "java", "jdk" prefix is different, because Java never had any modules names before, but Maven has/had package prefixes. There were discussions about this already on JDK mailing list together with Maven people, but I have to find them. I think Maven uses the artifact coordinates also for module name, but I am not 100% sure. Maybe @rfscholte has some more information what community standards have evolved.

-1 to use "lucene" as module name prefix, +1 to use "org.apache.lucene" as prefix.

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Sorry but I remain unconvinced that typing a million times "org.apache." in various contexts wins you or me anything. Sure - maven coordinates are there as an example where this sort of makes sense (because all of the bazillion artifacts live under the same namespace tree). The module system is different though - there will be no name conflicts there if you shorten the module name to just "lucene". I don't see any gain in prefixing it with anything - the opposite, adding a prefix is a nuisance if the 'lucene' prefix is sufficiently unique to guarantee no conflicts with anything else.

Even in the maven namespace some people opt for shorter prefixes (including various Apache commons libraries) [1].

[1] https://repo1.maven.org/maven2/commons-net/commons-net/

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi, please have an overview on Maven central and the good work done by Christian Stein:

This table has module names and their artifact names extracted by a script from Maven central: https://github.com/sormuras/modules/blob/main/doc/Top1000-2020.txt.md (see also the repo: https://github.com/sormuras/modules)

When looking at the Top 1000, you will se that all module names that can be found on Maven Central use the package names / coordinate names. If you read the JLS, they recommend for packages and modules only "simple names" for small projects without large outreach.

Here is conclusion what he recommends: https://sormuras.github.io/blog/2019-08-04-maven-coordinates-and-java-module-names.html

So I just repeat myself: Module names should really be unqiue. I don't care about 9.0, because its not officially announced, but when we enable the module system we should use unique names.

Or should we rename also all packages in Lucene's source code and strip off "org.apache."?

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

You don't understand me, Uwe. I agree on maven central coordinates. I don't agree on full prefixes for module naming. I think "lucene." is unique enough. This is a subjective opinion and it's really no convincing me otherwise. If you want to push the full prefix - I'll live with it, but I don't agree it is necessary or useful or solves anything.

asfimport commented 2 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

We can disagree, for sure, but I'd like to get some more opinions. This MUST be a community decision. I gave my well educated opinion and invite everybody to read this blog post: https://sormuras.github.io/blog/2019-08-04-maven-coordinates-and-java-module-names.html; Christian Stein explains very well how a module name should look like.

The module names inside java/jdk are short, but the same is for package names. There is also the satement: The package names in every module should start with the module name (this is not always fully possible, but a good rule is that module name and package name should have a common prefix).

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Sure, Uwe. I think I expressed my personal opinion. :) Some of our current module naming cannot be converted to modules (anything with a dash). If you want consistency then the first step would be to rename those modules in the repo.