apache / lucenenet

Apache Lucene.NET
https://lucenenet.apache.org/
Apache License 2.0

When will the 4.8.0 version be released? #793

Closed. zhanghaixingxing closed this 3 weeks ago

rclabo commented 1 year ago

This question was asked a few months ago and the answer hasn't likely changed much. Please see #778. In that thread you will see references to #437 Comment which contains a wealth of information on the topic.

In all of this, one thing you should keep in mind is that many people (even Microsoft!) currently use portions of Lucene 4.8 in production. So that is to say the product is already very stable.

Companies using 4.8 in production just need to be aware there may be some minor API changes on the road to a final 4.8 release. But honestly, that's not a lot different than using a Production release that has a major new release.

jeffreywstevens commented 1 year ago

In my opinion, if you know people are using it, and you feel it is stable, you might as well call it a release and remove the beta designation.

I am confident it works for most use cases. However, I can't get some developers to consider it. If it gets escalated, it will most likely get dropped.

I feel it is time to go forward.

Shazwazza commented 1 year ago

I'm definitely in agreement with @jeffreywstevens here. I know there are some recent commits and changes done in the past several months which would warrant a new beta release, but after that I also think we should procure an RTM release. Any changes after that can just be patch versions.

Would need buy in from @rclabo + @NightOwl888

rclabo commented 1 year ago

I'm totally on board with that. In the past two years I've had to make way more changes to my code base due to ASP.NET Core API changes than I have had to for Lucene.NET 4.8 API changes. I feel that LuceneNET 4.8 is super stable (hats off to @NightOwl888!!!) and is worthy of an RTM release.

Sure there are some aspects that may not be perfectly on par with Java Lucene 4.8 (OpenNLP comes to mind), but those areas tend to be auxiliary functionality that have no easy route for porting.

The core functionality seems rock solid and awesome from my perspective. And I think so many more devs will use this amazing project if it's RTM.

rclabo commented 1 year ago

I know @NightOwl888 is under a lot of pressure right now due to a deadline on another project, so he might not be able to chime in for a while.

That doesn't mean he isn't interested in this discussion; it just means he is juggling a lot at the moment.

Shazwazza commented 1 year ago

@rclabo IIRC there were quite a lot of commits and fixes since the last release, do you think we should look to ship one more beta?

rclabo commented 1 year ago

@Shazwazza Probably, but honestly...I feel like that question is above my pay grade :-)

I have enormous respect for @NightOwl888 and would certainly defer to his judgment.

laimis commented 1 year ago

I have some time on my hands and can dedicate helping out with the efforts.

Just from looking over the NuGet download stats, the 4.8 beta packages outnumber the last production 3.0.3 release, in terms of downloads:

https://www.nuget.org/stats/packages/Lucene.Net?groupby=Version

image

These numbers could be misleading and inflated due to automated CI builds but still paint a good picture for 4.8 usage. Some of us are using 4.8 betas in prod without issues, and anecdotally we hear about that from other people too. If another beta release makes prod 4.8 a reality, let's go for it.

Until @NightOwl888 can chime in, I will start pulling together the changelog and see what the new beta release would look like, and we can try pushing that out and get the ball rolling.

laimis commented 1 year ago

I put together a draft for the next release; I believe people with commit access should be able to see it in the releases page:

https://github.com/apache/lucenenet/releases

It's pretty meaty.

As the next step, I will review the communication from @NightOwl888 from the previous beta build and see what we need to do to proceed. Off the top of my head, there is a PMC vote and then the publishing of the NuGet packages if the vote passes. I also want to set up local tests for handling indexes produced by previous versions, to ensure the current version can open and work with them.

laimis commented 1 year ago

Just a quick status update, I am going through the steps outlined here to ensure I have all the bits correctly set up locally to do the release.

One thing I am not clear about is the Azure Pipelines and the access that is needed there to make a release. But I haven't gotten to that part yet, so I haven't explored it too deeply.

laimis commented 1 year ago

A quick update. We have sorted out access etc., and are actively working on finishing up a few remaining things that will allow us to push 4.8 beta 17. I can dedicate a decent amount of time now and have been pushing PRs with the remaining fixes. Shad has chimed in as well and has started some work too. No ETA, but we are back up and pushing to the finish line. You can observe our progress here by watching PRs coming in and out: https://github.com/apache/lucenenet/pulls?q=is%3Apr+is%3Aclosed

Our focus areas are: 1) fixing the findings from SonarCloud code scans that flag converted code where base class constructors call virtual methods that can be overridden in subclasses, which can cause issues with state not being initialized properly; and 2) fixing the Close/Dispose issue with the analyzers, #271.
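To make the first focus area concrete, here is a minimal sketch of the constructor-calls-virtual-method problem. It is written in Python rather than C#, and the class and method names (`Base`, `Derived`, `describe`) are hypothetical, not taken from the Lucene.NET code base. In Python the symptom is an `AttributeError`; in Java or C# the subclass field would more typically still hold its default value when the overridden method runs.

```python
class Base:
    def __init__(self):
        # The base constructor calls an overridable method before any
        # subclass state has been set up.
        self.description = self.describe()

    def describe(self):
        return "base"


class Derived(Base):
    def __init__(self, name):
        super().__init__()  # describe() runs here, before self.name exists
        self.name = name

    def describe(self):
        # Relies on state that the Derived constructor has not assigned yet.
        return "derived:" + self.name


def try_construct():
    """Return a short status string describing what construction did."""
    try:
        Derived("analyzer")
        return "ok"
    except AttributeError as exc:
        return "failed: " + type(exc).__name__
```

Calling `try_construct()` reports the failure instead of raising, which makes the ordering problem easy to see in isolation.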

And then we will regroup and see where we are with the 4.8 release. There are still some work items and decisions to be made about ICU4N. No ETA, as I don't think we can estimate how long this will take, but we can take it as we go and make regular status updates.

laimis commented 1 year ago

We made some more progress. One of the items on the "TODO" before the release, #670 , has been addressed.

I am taking a look at what to take care of next, most likely #271. I lack a lot of context there, but hopefully I can find a way to clear it up a bit more.

eladmarg commented 1 year ago

I think we're ready, @NightOwl888 what's your opinion? maybe RC?

laimis commented 1 year ago

I had to go on a month+ trip, but I'm back now. I haven't heard much from @NightOwl888 recently; he must be busy with some other commitments. The last piece of work I pushed before taking off was #852. It's not entirely clear if I can pull into main what we have there or if Shad was considering more changes to the approach.

Having re-familiarized myself with the project and talking with Shad more about why it's been difficult to make a production release, I think I can give this explanation for it:

The difficulty lies in what to do if we release 4.8.1 and find a bug. OK, we make a patch release, 4.8.2 that fixes that bug. But now, Java Lucene does not have 4.8.2 version. Worse, what if the issue we discover requires a breaking change? In theory we would increment the minor version and end up with a 4.9.x release whose API changes are not compatible with the Java Lucene 4.9.x releases that exist. And fixes to those would once again give us releases that don't exist in the Java world, e.g. 4.9.3.

And I think that's the main issue why Shad has been extremely careful and reluctant to do production releases of the project. We know that bugs are lurking in the code base, but with each pass they are more and more difficult to find, and we can't guarantee that 4.8.1 we release will not require changes.

Careful discussion and consideration are needed here, but one way forward would be for the remaining committers who still at least chime in to come together as a group and draw a line in the sand: the 4.8.1 production release is our attempt to be as close as possible to the Lucene 4.8.1 release. All releases after that will attempt to stay close at the "major" version level, but minor and patch releases can and will deviate greatly.

I am not proposing this lightly, but it does seem to offer some sort of way forward with making a production release and potentially allowing for a more frequent prod update cadence without keeping ourselves accountable for those versions to be one-to-one mapped to Java world.

rclabo commented 1 year ago

@laimis I think that makes a lot of sense. Given that we haven't previously had an approach for versioning when rolling bug fixes or breaking changes once Lucene.NET 4.8 is released it's very understandable that we have held a very high bar to what needs to be achieved before doing a production release.

I personally think that what you propose as a solution seems reasonable. And who knows, perhaps someone will offer up other solutions that may be even better. But I think as a dev community we need to rally around some versioning approach whatever it is. Having a versioning approach and an understanding of what versions align with Java Lucene and which ones don't will give us the freedom to get Lucene.NET released.

Doing a production release of the library will untie the hands of developers that would love to use it but who are restricted from doing so due to company policies not allowing pre-release software into production environments. Releasing the software will thus grow our developer community and hopefully our committer pool as well. Also, releasing the software will grow the use cases that are actively being utilized and provide valuable feedback on where the library can be improved.

What you propose seems reasonable however it's a bit challenging that this is a release of 4.8.1 rather than 4.0. As such we only have 4.9 as a potential breaking change release, then we hit a major version 5.0. This could cause us to be forced to release a breaking change as a point release, say 4.9.1. This challenge of course goes away in the future if the next major release of Lucene.NET has a low minor release number, e.g. 10.3, but we have the same issue in the future if the next major release of Lucene.NET has a high minor release number like 9.7 (the current version of Java Lucene). It's a bit challenging I guess, but we may just have to get comfortable with the idea of a breaking change in a point release, i.e. 4.9.1. (shrug)

laimis commented 1 year ago

@rclabo thank you for chiming in. Curious about this part that you mention:

What you propose seems reasonable however it's a bit challenging that this is a release of 4.8.1 rather than 4.0. As such we only have 4.9 as a potential breaking change release, then we hit a major version 5.0

After 4.9, wouldn't we have 4.10.x as an option? 4.11.x after, etc?

rclabo commented 1 year ago

@laimis That really made me laugh (at myself). You are totally right. For some reason when I wrote that it didn't even occur to me that we could have a 4.10.x! That's pretty funny. Definitely, after 4.9 we can have 4.10.x as an option, and after that 4.11.x. Thanks for being gracious in your question. ;-)
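The slip is an easy one to make because version numbers look like decimals but are compared component by component. A small sketch of that ordering follows; `parse_version` is a hypothetical helper written for this illustration, not part of any Lucene.NET or NuGet tooling.

```python
def parse_version(text):
    """Split a dotted version string into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def is_newer(candidate, baseline):
    """Compare two dotted versions numerically, component by component."""
    return parse_version(candidate) > parse_version(baseline)


# Numeric comparison puts 4.10.x after 4.9.x.
newer_numeric = is_newer("4.10.0", "4.9.3")

# Plain string comparison gets it backwards, because "1" sorts before "9".
newer_as_text = "4.10.0" > "4.9.3"
```

Here `newer_numeric` is `True` while `newer_as_text` is `False`, which is the same trap as reading 4.10 as "four point one".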

nikcio commented 1 year ago

I've been reading your latest comments about the problems with versioning and a production release and think I have an idea to solve this problem. What if we use the Lucene version as is and then add an extra number to the end to signal the current iteration of the version? That way you still have the consistency of matching the Lucene Java version and .Net version but can still make improvements like bug fixes that weren't caught in a preview/beta phase. (See this image for an example)

Group 2
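One way to read this proposal is as a four-part version whose first three parts name the Java Lucene release and whose last part counts revisions of the port. The sketch below assumes that layout; `PortVersion` and `split_port_version` are hypothetical names used only for illustration, and the example version string is made up.

```python
from typing import NamedTuple


class PortVersion(NamedTuple):
    upstream: str   # the Java Lucene version the port tracks, e.g. "4.8.1"
    iteration: int  # hypothetical port-specific revision counter


def split_port_version(text):
    """Split 'major.minor.patch.iteration' into upstream version and iteration."""
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated parts: " + text)
    return PortVersion(upstream=".".join(parts[:3]), iteration=int(parts[3]))


example = split_port_version("4.8.1.2")
```

With this layout, `example.upstream` stays pinned to the Java release while `example.iteration` can grow with each bug-fix release of the port.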

We also have to remember that no one can be sure that they have bug-free software and that unforeseen problems do come up no matter how long we work on something. So I think it would be better to use iterations instead of never-ending beta releases like there have been for a while now with the current 4.8.0 release. This also gives a better signal of when you can use Lucene in production, which, as has been mentioned time and time again, many people already do even though it's in beta.

rclabo commented 1 year ago

@nikcio - I think this is a fine proposal and in some ways, I like this approach better because it makes it more clear which version LuceneNET is in rough alignment with. The one thing lost with this approach is the ability to tell, via the version number, if an iteration of a version is a breaking change. But honestly, that doesn't bother me personally a bit. In my case, if I'm upgrading to a newer version of LuceneNET for my project, then I'm probably reading the release notes to see what new goodies it includes. And in that process, I'd be made aware of any breaking changes and the nature of those changes. That's sufficient for me and probably for a lot of devs. However, I know versioning can be an opinionated topic so it will be interesting to see how others on the dev mailing list feel.

rickardp commented 1 year ago

I realize this is a lot bigger topic, but I think the maintainers of this project should seriously consider breaking off from the exact version scheme of the upstream Java Lucene.

As a consumer of this library, naturally I would like to know what API version of Lucene this corresponds to, but that could easily be solved by a version mapping table in documentation.

Examples such as

The difficulty lies in what to do if we release 4.8.1 and find a bug. OK, we make a patch release, 4.8.2 that fixes that bug. But now, Java Lucene does not have 4.8.2 version.

indicate just how hard it is to keep the versions of distinct code bases the same. Especially the patch number is troublesome as that typically designates implementation and bug fixes, but I think the same applies to minor and major.

By releasing yourself from this constraint you would have the flexibility to release stable versions of the functionality that you have implemented without waiting for 100% feature parity with a given upstream Java version.

This way you may opt to never be 100% feature complete with Java Lucene 4.8 (for example), because the community is more in need for some 7.x features that can then be prioritized over the long tail of rarely used 4.8 features (just as a made up example). By following your own version scheme you can instead document version X as "compatible with Lucene 4.8 minus features Y and Z".

It would also possibly be easier to get contributors, as most consumers of a library would rather contribute a PR that just adds a feature from a later version that they need for their application. Sorry to be blunt, but it's going to be very hard to get contributors chasing the last bits of 4.8 compatibility.

The additional value is that you can now follow semantic versioning more strictly, something I would argue is an industry standard these days. It would sure make maintaining libraries that depend on Lucene.NET easier.

NightOwl888 commented 1 year ago

First of all, the versioning scheme had been decided some time ago and is in fact documented and made part of the build. At this point I don't see any reason to go back and revisit this scheme which was part of the work that was done during the first 4.8.0 beta.

By releasing yourself from this constraint you would have the flexibility to release stable versions of the functionality that you have implemented without waiting for 100% feature parity with a given upstream Java version.

This way you may opt to never be 100% feature complete with Java Lucene 4.8 (for example), because the community is more in need for some 7.x features that can then be prioritized over the long tail of rarely used 4.8 features (just as a made up example). By following your own version scheme you can instead document version X as "compatible with Lucene 4.8 minus features Y and Z".

This assumes usability and API are the entire issue, but they are not.

Lucene.NET is the most difficult application I have ever had the pleasure of debugging in my 25 years as a developer. When we go off the map like this, we literally throw away our best debugging tool, which is to run the same version of Lucene and Lucene.NET side by side to see where the execution paths diverge. I don't have an answer for how we could debug if we combine different versions of Lucene. Do you?

Furthermore, the binary structure of the index does change from one version to the next, making them incompatible and making it literally impossible to bring many Lucene 9.x features back to Lucene.NET 4.x. We had this issue with back-porting the analyzers-nori package.

We have 100% compatibility with creating an index in Lucene and opening it in Lucene.NET with the same version and plan to keep it that way going forward (and it worked once the other way around, but hasn't been tested in quite a while). The index isn't the only binary format that is kept in sync between versions.

There are other problems with disjointed versioning between Lucene and Lucene.NET. Case in point: Lucene.NET 3.0.3. There was no release of Lucene 3.0.3. Despite trying to sleuth an answer I have no idea what commit Lucene.NET 3.0.3 is a port of. I could guess that it is a port from 3.0.1 (which actually was released), but I can't be 100% sure. I didn't even know what commit in this repo corresponded to the 3.0.3 release until I found it on an obscure blog (they released 3.0.3 RC2 by renaming it, but didn't make a tag corresponding to the 3.0.3 release). Both of these issues are the primary reason we have never done a maintenance release of Lucene.NET 3.0.3. While we could incorporate the actual version number as part of the InformationalVersion and make it disjointed, it would be very confusing for users who see numbers that overlap Lucene releases that don't correspond to them or their binary formats. Strict version compatibility avoids getting into this situation again.

For usability, there are also issues. Existing Lucene blog posts may not be useful if the API is different than the major version of Lucene the post is about.

The bottom line is there is no maintenance plan for making a Frankenstein version of Lucene that incorporates features from different versions. The best way is to try to sync the entire project to a single Git commit. The story goes way beyond keeping the API in sync. It also means keeping the execution paths, binary formats, tests, and documentation in sync.

While we could simply abandon 4.8.0 and start working on the latest version of Lucene now, we would be stuck in a situation where we have all of the same work to finish we do now plus an estimated 1800 hours of upgrading work. This upgrade estimate could be off if we run into any major gaps that mean more JDK features we need to find or build replacements for. Right now, we are in a situation where our remaining work still has an undefined scope because of gaps that we may not know about. The plan is to try to close all of the gaps so when we finally do start working on the upgrade we have a mostly well-defined scope of work instead of a fuzzy "research this and figure out what we need to do here" situation, where research is often most of the work (meaning to create an issue about it, we need to do most of the work first to define the scope of the issue).

Also, it seems like a total waste to do that. Most of the work that is remaining is on ICU4N. I have almost convinced myself that we may be able to release ICU4N as stable earlier by not strictly following the ICU versioning scheme but instead allowing each major release to have breaking API changes until we stabilize it (we are 13 versions behind so we have some wiggle room, but it does mean we will have to do a full upgrade every time we make a breaking API change). But we should probably still conditionally compile out the "draft" APIs and other APIs that are considered unstable in the NuGet package or at least make them invisible to the IDE. There are still other issues to deal with, such as the fact that NuGet doesn't actually deploy resource files for cultures it doesn't recognize. There are many decisions to make like that in ICU4N where there are gaps between Java and .NET. Unfortunately, nobody here seems willing to talk about the actual work that remains. Most want to move on to the next version of Lucene and pretend that we don't need to do this work for the upgrade, anyway.

We could alternatively move on to 4.8.0 release while keeping the Lucene.Net.ICU and components that depend on it unstable, but unfortunately that means either splitting up the lucene-cli component or releasing it as stable with unstable dependencies. I would argue we need to focus 100% on the remaining things that could break the API before we do such a thing (such as automated query parser generation), which could still be time-consuming. It also means we won't have a completely stable 4.8.0 release; the first fully stable release might be something like 4.8.0.17. Or else we would need to set up our build to make separate stable and unstable release packages to comply with the Apache release procedure. And we still wouldn't technically be able to start working on upgrading until we have a stable ICU4N, anyway. I don't see how this improves the situation; it only adds more work to do to make it stable and makes the versioning history more difficult to understand.

It really sucks for us to have to reject what would ordinarily be good ideas from the community, but unfortunately, most of these ideas never take everything into consideration when providing such advice, only the "normal stuff" that most projects deal with.

rclabo commented 1 year ago

Shad, thank you for that. I feel like it just pulled me back into reality.

So I guess what you are saying is we can't have a "stable" Lucene.NET release unless its dependencies are stable and currently Lucene.NET.ICU is a work in progress with a changing API surface.

I'm reading into that, ICU4N, which Lucene.NET.ICU depends on, is also probably a work in progress. And it's certainly worth noting that ICU support is something the Java Lucene team got for free in the JDK that unfortunately isn't included in the .NET Framework (full or core). Hence the need to create ICU4N to provide that support. A nontrivial endeavor in its own right.

In using Lucene.NET to create a search index for an e-commerce marketplace, I've never hit any ICU-related functionality that was missing that I felt I needed. Unfortunately, I have no prior history with ICU so my only learnings about it have been here on the Lucene.NET project. So I guess for me, it's often an out-of-sight, out of mind, portion of Lucene.

But when I review the docs for Lucene.Net.ICU and see what's included, it feels very central to a search library and encompasses such basic functionality as finding word boundaries and line break boundaries. While this seems trivial in languages like English it's anything but trivial in languages like Chinese 要弄清楚如何分解中文單字是很困難的。or Japanese 中国語で単語を区切る方法を理解するのは難しいです.
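A quick way to see why word boundaries are non-trivial is to try the naive approach of splitting on whitespace. The sketch below uses only Python's built-in `str.split` and is not how Lucene.Net.ICU or ICU4N segments text; the English sentence is simply an illustrative counterpart to the Chinese one above.

```python
def whitespace_tokens(text):
    """Naive tokenizer: split on runs of whitespace only."""
    return text.split()


english = "It is hard to work out how to split Chinese words."
chinese = "要弄清楚如何分解中文單字是很困難的。"

english_tokens = whitespace_tokens(english)
chinese_tokens = whitespace_tokens(chinese)
```

The English sentence falls apart into separate words, while the Chinese sentence comes back as a single token because it contains no spaces. Finding the words inside it needs rule-based or dictionary-based boundary analysis of the kind ICU provides.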

Given that a great many of the developers using Lucene.NET only use it for English text, or other languages that use the Latin alphabet, it's easy to see how we can sometimes lose sight of what ICU is and why it's so important. Based on your post, I now better understand why Lucene.NET hasn't had a public release yet. Still, it seems very unfortunate that such a stable product (at least for indexing Latin languages) has a current version (beta) that doesn't indicate it's production-ready for Latin languages.

I'm with you 100% that doing a Frankenstein version of Lucene that incorporates features from different versions is a non-starter. Being able to compare execution paths with a corresponding Java version is too valuable to give up.

NightOwl888 commented 1 year ago

So I guess what you are saying is we can't have a "stable" Lucene.NET release unless its dependencies are stable and currently Lucene.NET.ICU is a work in progress with a changing API surface.

Not exactly. We could do a release if we go over the API surface of the core and other completed components to finalize it AND build a multi-release scheme so we have 2 different release labels, one for the stable components and one for the unstable components. While the API work is something we have to do anyway, changing the build, release policy, Git labeling scheme, etc. isn't exactly free.

Lucene.Net.ICU will likely change because the CharacterIterator still needs to be converted to a .NETified component and put into J2N (right now it exists in ICU4N.Support, which is meant to go away from the public API). CharacterEnumerator was made for this purpose, but it had to be commented out because I couldn't get it working on Lucene.NET components although it worked fine in ICU4N. This modification will definitely break the public API. I don't think there are any other things that will break it, though.

I'm reading into that, ICU4N, which Lucene.NET.ICU depends on, is also probably a work in progress. And it's certainly worth noting that ICU support is something the Java Lucene team got for free in the JDK that unfortunately isn't included in the .NET Framework (full or core). Hence the need to create ICU4N to provide that support. A nontrivial endeavor in its own right.

Yes, ICU4N is still a work in progress. There are several tests that still fail, often due to gaps that we haven't yet covered. There are also some concurrency bugs to track down. Since it is only a partial port, we have lots of tests to go through that might be able to be ported, as well. The intention is not to port any more of the production code (except for perhaps some of the formatters and parsers because that is where most of its funding has come from so far).

The ICU4J functionality is not in the JDK. Instead ICU4N is a port of ICU4J. But it is hard to integrate because the gap between Java and ICU4J is not the same as the gap between .NET and ICU4N. Although, it is made easier because ICU is documented pretty well.

In short, ICU4N and ICU4J extend the text processing capabilities of .NET and Java by providing rules-based versions of some of the included components (such as the CompareInfo .NET class which corresponds to the more powerful RuleBasedCollator in ICU4N). These components allow you to control the behavior in custom ways that simply can't be done on the raw .NET or JDK platforms. There are also many other features that are super valuable, such as the UnicodeSet which can be used like a regex character class but is much more powerful (it can even be passed a string to match all of the characters in a specific version of Unicode).

We use the ICU4N BreakIterator in all cases where the JDK BreakIterator is required because .NET is totally lacking this feature (even though it depends on ICU now, the API for this is not exposed anywhere). This has also caused some compatibility issues because of differences between how ICU4J and the JDK behave, so we had to patch the ThaiAnalyzer and basically write our own tests for some of the highlighters. Unfortunately, the highlighters won't work exactly the same unless we do the research to work out what to recommend as the "JDK format" by providing custom rules that correspond to the Java behavior.

But when I review the docs for Lucene.Net.ICU and see what's included, it feels very central to a search library and encompasses such basic functionality as finding word boundaries and line break boundaries. While this seems trivial in languages like English it's anything but trivial in languages like Chinese 要弄清楚如何分解中文單字是很困難的。or Japanese 中国語で単語を区切る方法を理解するのは難しいです.

Given that a great many of the developers using Lucene.NET only use it for English text, or other languages that use the Latin alphabet, it's easy to see how we can sometimes lose sight of what ICU is and why it's so important. Based on your post, I now better understand why Lucene.NET hasn't had a public release yet. Still, it seems very unfortunate that such a stable product (at least for indexing Latin languages) has a current version (beta) that doesn't indicate it's production-ready for Latin languages.

Actually, there are several use cases that make it valuable even for Western European languages. For example, removing diacritics from words. In .NET, this cannot be done without a hack because the normalization feature is missing the case fold option that ICU has. I have seen many people post this hack in their questions about Lucene.NET even though they could just use the ICUFoldingFilter or ICUNormalizer2Filter instead.

These make it so words with accent characters such as resume, résumé, and resumé all normalize to the same root word for searches.
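As a rough illustration of the goal, the sketch below folds those three spellings to one form using only Python's standard `unicodedata` module. This is an assumption-laden stand-in and not what ICUFoldingFilter or ICUNormalizer2Filter do internally: it decomposes characters, drops combining marks, and case-folds, which covers simple accents but far less than ICU's folding rules.

```python
import unicodedata


def fold(text):
    """Approximate accent and case folding with the standard library only."""
    # Decompose characters such as "é" into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold so that "Resume" and "resume" also match.
    return stripped.casefold()


folded = {fold(word) for word in ["resume", "résumé", "resumé", "Resume"]}
```

All four inputs collapse to the single key `"resume"`, which is the behavior a search index wants when a user types any of the variants.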

Although the components inside of the Lucene.Net.ICU assembly are indeed valuable as is, the real value is in using ICU4N to build custom analysis components.

rickardp commented 1 year ago

Thank you for the really nice and transparent explanation, @NightOwl888! Ultimately, it is down to a fundamental architectural decision on whether this is a line-by-line, version-by-version port of the Java Lucene or if this is a full-text search library based on Java Lucene. This decision is one that would be made by the maintainers, and respected by the users of this library.

While we could simply abandon 4.8.0 and start working on the latest version of Lucene now, we would be stuck in a situation where we have all of the same work to finish we do now plus an estimated 1800 hours of upgrading work

If I read the entire thread correctly, there was never a suggestion to just abandon 4.8, but instead to decide the API is stable and focus on bug fixes, then release 4.8 and figure out a different way to version the library so that API changes can be done later. This way, going from beta to release would mean the current feature set is stable, but without the guarantees of implementing 100% of the APIs of the Java version.

Just to give an example, speaking only from my experience with the library, I personally was not aware of the desire to keep on-disk binary formats the same between Java and .NET. We are only using a subset of all this functionality, and we would definitely not be using the Java version, let alone on the same data. We don't care about Java Lucene at all, we just want a really good .NET full text search engine (actually we don't care about on-disk format at all as we are 100% in memory, but that's a different story).

The bottom line is there is no maintenance plan for making a Frankenstein version of Lucene that incorporates features from different versions

I respect the decision to do a line-by-line port of Java Lucene, but I do like to point out that porting the most relevant features would not necessarily lead to a "Frankenstein" version. Obviously any feature that goes into the codebase have to be well architected and any technical dependencies for this feature have to be implemented properly. But consider if the goal was just to make the best .NET full text search engine out there, maybe omitting the long tail of rarely used features to not have to spend 1800 hours on version 4.8, instead focusing on the most popular features (again, building on robust foundation) may be serving the community better. This could perhaps lead to a higher engagement from the community (in terms of collaboration/PRs and possibly funding). You could still use Java Lucene as a blueprint for the implementation, but with the additional insight in what turned out well and what did not turn out so well there, without being burdened like they have by keeping compatibility also with less used and less well designed features.

We could alternatively move on to 4.8.0 release while keeping the Lucene.Net.ICU and components that depend on it unstable

To be blunt, and with all due respect, it might get hard to find funding for hundreds or thousands of dev hours fixing the ICU library to support rare scripts and languages, until someone with a clear business case for it turns up. Just for comparison, if some company needed, say, vector valued fields (just as a random example) they might have the resources to fund the maintainers directly or devote professional developers to work with you on implementing this feature. But since, as I understand it, you want to go to 9.something directly after 4.8, maybe we'll see a lot more contributions coming in as the field will be more open for new features.

but unfortunately that means either splitting up the lucene-cli component or releasing it as stable with unstable dependencies

If you have policies against pre-release libraries this is probably also a no go. I think policies like this are based on the assumption that pre-release means unstable implementation, while you mean unstable API. This is probably the core of this discussion, as it is clear that the code base is very stable from a bugs point of view.

It sounds like you have made a well-motivated and conscious decision w.r.t the versioning policy and the way to integrate new features. Your code, your versioning policy. Thank you for an awesome effort!

NightOwl888 commented 1 year ago

If I read the entire thread correctly, there was never a suggestion to just abandon 4.8, but instead to decide the API is stable and focus on bug fixes, then release 4.8 and figure out a different way to version the library so that API changes can be done later. This way, going from beta to release would mean the current feature set is stable, but without the guarantees of implementing 100% of the APIs of the Java version.

Just to give an example, speaking only from my experience with the library, I personally was not aware of the desire to keep on-disk binary formats the same between Java and .NET. We are only using a subset of all this functionality, and we would definitely not be using the Java version, let alone on the same data. We don't care about Java Lucene at all, we just want a really good .NET full text search engine (actually we don't care about on-disk format at all as we are 100% in memory, but that's a different story).

I respect the decision to do a line-by-line port of Java Lucene, but I would like to point out that porting the most relevant features would not necessarily lead to a "Frankenstein" version. Obviously any feature that goes into the codebase has to be well architected, and any technical dependencies for this feature have to be implemented properly. But consider if the goal was just to make the best .NET full text search engine out there: omitting the long tail of rarely used features, so as not to have to spend 1800 hours on version 4.8, and instead focusing on the most popular features (again, building on a robust foundation) might serve the community better. This could perhaps lead to higher engagement from the community (in terms of collaboration/PRs and possibly funding). You could still use Java Lucene as a blueprint for the implementation, but with the additional insight into what turned out well and what did not turn out so well there, without being burdened, as they have been, by keeping compatibility with less used and less well designed features.

You are making some assumptions that just aren't true here.

  1. You are assuming that we have the high-level knowledge of each component to make such a derivative version.
  2. You are assuming that we would have some way to keep the feature set in line with Lucene if it were not a line-by-line port.
  3. You are assuming that we know which features our users find most valuable. While it is clear that a component such as Lucene.Net.Analysis.Nori (for Korean) will have very limited scope, it isn't so clear for more generalized components such as Lucene.Net.ICU that are useful in a lot more scenarios that Lucene.Net.Analysis.Common simply doesn't cover.
  4. You are assuming that we could get the tests to function the same way in .NET as they do in Java without a line-by-line port. Lucene has a custom test framework that uses repeatable randomized tests. This test framework is upgraded between versions of Lucene along with the tests.
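To illustrate what "repeatable randomized tests" means in point 4, here is a minimal sketch in Python. It is not Lucene's test framework; the function under test and all names are made up for illustration. The only idea it shows is that every random choice derives from one seed, so a failing run can be replayed exactly by reusing that seed.

```python
import random


def reverse_words(text):
    """Hypothetical function under test: reverse the order of words."""
    return " ".join(reversed(text.split()))


def random_sentence(rng, max_words=8):
    """Build a random sentence using only the supplied generator."""
    words = ["lucene", "index", "codec", "query", "token", "segment"]
    return " ".join(rng.choice(words) for _ in range(rng.randint(0, max_words)))


def run_randomized_test(seed, iterations=100):
    """Run one randomized test; every random choice derives from `seed`."""
    rng = random.Random(seed)
    inputs = []
    for _ in range(iterations):
        sentence = random_sentence(rng)
        inputs.append(sentence)
        # Property checked: reversing twice returns the original sentence.
        assert reverse_words(reverse_words(sentence)) == sentence, (
            f"failed for {sentence!r}; reproduce with seed={seed}"
        )
    return inputs
```

Calling `run_randomized_test(42)` twice generates the same inputs both times, which is what makes a randomized failure reproducible.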

Without keeping the binary formats the same, we would have to recreate all of the corrupt indexes for the tests. Arguably, the index format is the one thing that the Lucene team gave the most thought to about making Lucene portable across programming languages. Granted, we could use the documented format and try to reinvent the wheel for the rest, but there are a lot of components that would have to be analyzed at a high level so they could be recreated.

In addition, Lucene also has pluggable codecs so a newer version of Lucene can read the binary format from an older version so users can upgrade the software first and then upgrade the index later. Maybe you don't use this feature, but for users of apps with high availability, this feature is a must.
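As a rough sketch of the pluggable codec idea (not Lucene's actual codec API), the following Python example stores a codec name in front of each segment so that a reader can pick the matching decoder. The two formats, the `CODECS` registry, and the 2-byte name prefix are all assumptions made for illustration.

```python
import struct


class CodecV1:
    """Hypothetical old format: each integer stored as 4 big-endian bytes."""
    name = "V1"

    def encode(self, values):
        return b"".join(struct.pack(">I", v) for v in values)

    def decode(self, payload):
        return [struct.unpack(">I", payload[i:i + 4])[0]
                for i in range(0, len(payload), 4)]


class CodecV2:
    """Hypothetical new format: each integer stored as a variable-length int."""
    name = "V2"

    def encode(self, values):
        out = bytearray()
        for v in values:
            while v >= 0x80:
                out.append((v & 0x7F) | 0x80)  # low 7 bits, continuation flag set
                v >>= 7
            out.append(v)
        return bytes(out)

    def decode(self, payload):
        values, current, shift = [], 0, 0
        for byte in payload:
            current |= (byte & 0x7F) << shift
            if byte & 0x80:
                shift += 7
            else:
                values.append(current)
                current, shift = 0, 0
        return values


# Registry of every format this (newer) software version can still read.
CODECS = {codec.name: codec for codec in (CodecV1(), CodecV2())}


def write_segment(values, codec_name):
    """Prefix the payload with the 2-byte codec name so readers can dispatch."""
    return codec_name.encode("ascii") + CODECS[codec_name].encode(values)


def read_segment(segment):
    """Look up the codec named in the segment header and decode with it."""
    codec_name = segment[:2].decode("ascii")
    return CODECS[codec_name].decode(segment[2:])
```

Because the old decoder stays registered, segments written in the old format remain readable after the software is upgraded, and they can be rewritten in the new format later.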

There are over 3000 code files in Lucene and it is not documented well - it could easily take years of analysis before we even start writing anything. We wouldn't even have much of an idea which features are important and which are not without tons of analysis and research. And when we are finished, there would be no reasonable way to incorporate features of new versions of Lucene (which is what happened on the NUnit project).

As for upgrading a single feature ahead of where it is in Lucene, this is where we run into problems. We have no idea before porting it what other patches it depends upon and whether any of those depend on binary formats that have changed. So we could start off porting to get the "future" feature in 4.8.0 only to find out later that it is incompatible and all of the work porting that one feature would go out the window. It would take much longer to port Lucene feature by feature than it would be to port the diff between 2 commits to get to a higher version. And we would always be sure to have a version that works (at least as well as it worked in Java).

We could alternatively move on to 4.8.0 release while keeping the Lucene.Net.ICU and components that depend on it unstable

To be blunt, and with all due respect, it might be hard to find funding for hundreds or thousands of dev hours fixing the ICU library to support rare scripts and languages until someone with a clear business case for it turns up. Just for comparison, if some company needed, say, vector valued fields (just as a random example), they might have the resources to fund the maintainers directly or devote professional developers to work with you on implementing this feature. But since, as I understand it, you want to go to 9.something directly after 4.8, maybe we'll see a lot more contributions coming in as the field will be more open for new features.

That is true about funding. But the fact of the matter is that ICU4N has had more funding than Lucene.NET even though it is an alpha with unstable APIs and we are still working out how to properly package it. Maybe it is easier to get people to fund Lucene.NET if ICU4N is a done deal, but if Lucene.NET moves on without ICU4N, my fear is that ICU4N will never be released.

It is a tough sell to "release" Lucene.NET 4.8.0 and then ask for funding to "finish" it (which is basically to subsidize ICU4N). And it doesn't seem right to sell people on the idea that we are collecting funding for the upgrade only to shift that funding to finish ICU4N. It is far easier to finish ICU4N first, then release it, then release Lucene.NET, then ask for Lucene.NET funding for the 1800 hours to upgrade it (which is a pretty well defined scope).

You are right in that doing it in this order means there is less help on Lucene.NET, but that isn't really where the help is needed until the upgrade anyway. We have analyzed this pretty well and this is by far the fastest path (even though it is taking years because of limited funding and help).

but unfortunately that means either splitting up the lucene-cli component or releasing it as stable with unstable dependencies

If you have policies against pre-release libraries this is probably also a no go. I think policies like this are based on the assumption that pre-release means unstable implementation, while you mean unstable API. This is probably the core of this discussion, as it is clear that the code base is very stable from a bugs point of view.

For the most part, yes. There are a few intermittently failing tests we have yet to track down. We mostly just have several APIs that are likely to break before the release.

Since lucene-cli contains the utilities to maintain the index, it doesn't seem right to make it a prerelease when the rest of the code is a release. But it is a command line app, so it isn't like anyone will depend on it directly. Lucene.Net.ICU is another matter, though. I suspect it is the big companies that will require it most and those companies are the ones that are also most likely to have policies against pre-release libraries.
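The "stable with unstable dependencies" problem can be made concrete with a small Python sketch of the kind of check a no-pre-release policy implies. It assumes SemVer-style version strings, where a hyphen after the core version marks a pre-release; the function names are hypothetical and this is not how NuGet or any particular tool implements it.

```python
def is_prerelease(version):
    """SemVer-style check: a hyphen after the core version marks a pre-release."""
    without_build = version.split("+", 1)[0]  # drop build metadata, if any
    return "-" in without_build


def unstable_dependencies(package_version, dependencies):
    """Return the pre-release dependencies of a stable package, sorted by name.

    `dependencies` maps a package name to the version string it resolves to.
    A package that is itself a pre-release is not flagged.
    """
    if is_prerelease(package_version):
        return []
    return sorted(name for name, version in dependencies.items()
                  if is_prerelease(version))
```

For example, a hypothetical stable `4.8.0` tool that depends on a package at `4.8.0-beta00016` would be flagged, while the same tool published as a beta would not.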

rclabo commented 1 year ago

You are assuming that we have the high-level knowledge of each component to make such a derivative version. You are assuming that we would have some way to keep the feature set in line with Lucene if it were not a line-by-line port.

These are excellent points. Lucene is relatively easy to use as a library so it's easy not to realize just how sophisticated it is under the hood. It's hands down the most sophisticated software I have ever worked on. The amount of brilliant propeller head thinking that has gone into this product can't be overstated. Some of the best minds in search have contributed to Lucene. It's truly an amazing piece of software. And making changes to its internals is not for the faint-hearted. :-)

rickardp commented 1 year ago

There are over 3000 code files in Lucene and it is not documented well - it could easily take years of analysis before we even start writing anything.

Points like these really sold your "line by line" approach to me. I made (incorrect) assumptions about how familiar most or all of the contributors and maintainers are with the (Java) Lucene codebase compared with the core Lucene devs, and about the degree of communication between the projects. Admittedly, these were assumptions I made without looking it up. If they are not true, then any other approach would fail, agreed.

rclabo commented 1 year ago

Just to clarify, Lucene has a lot of documentation, and Lucene.NET has its flavor of that documentation as well. By many standards, it's decent documentation. But it's one thing to document how developers can use an expansive library like Lucene, and quite another to document why each design choice was made the way it was and how the specific implementation details of that design enable the insanely fast overall indexing and search speeds of Lucene.

There are many small aspects of the system that use such advanced software engineering approaches that a dev could easily spend more than a month if they wanted to understand that aspect of the system deeply. Lucene's use of automata is one example. Here is a video at a conference that does a high-level overview of how and why Lucene uses automata. If a dev wants to understand automata they will need to watch videos like that one and ultimately hunt down the whitepapers. Once those whitepapers have been digested, maybe the dev will have the ability to understand that portion of the code. Maybe. We are assuming a very senior dev here.

A dev is not going to find deep documentation on automata in Lucene's source code or external documentation. (shrug) There is, of course, the Lucene dev mailing list archive, an archive of completed issues, and PR notes. All three of which contain a fantastic amount of history and insights.
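To give a flavor of what "automata" means in a search context, here is a toy Python sketch of a Levenshtein automaton, which accepts every word within a given edit distance of a term (the idea behind fuzzy matching). It is an illustration only and not Lucene's implementation: it simulates the automaton one state at a time, whereas a production engine would typically precompile and optimize the transitions.

```python
def levenshtein_matcher(term, max_edits):
    """Return a function that accepts words within `max_edits` of `term`.

    The automaton's state is one row of the edit-distance table: entry j is
    the distance between the characters read so far and term[:j].
    """
    start = list(range(len(term) + 1))

    def step(state, ch):
        # Reading one more character adds one deletion against the empty prefix.
        nxt = [state[0] + 1]
        for i, t in enumerate(term):
            cost = 0 if t == ch else 1
            nxt.append(min(nxt[i] + 1,        # skip a character of the term
                           state[i + 1] + 1,  # skip the character just read
                           state[i] + cost))  # match or substitute
        return nxt

    def accepts(word):
        state = start
        for ch in word:
            state = step(state, ch)
            if min(state) > max_edits:  # dead state: no suffix can recover
                return False
        return state[-1] <= max_edits

    return accepts
```

With `matcher = levenshtein_matcher("lucene", 1)`, the words `lucene`, `lucen`, and `lucane` are accepted, while `lucnee` (two substitutions away) is rejected.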

Jeevananthan-23 commented 1 year ago

As for upgrading a single feature ahead of where it is in Lucene, this is where we run into problems. We have no idea before porting it what other patches it depends upon and whether any of those depend on binary formats that have changed. So we could start off porting to get the "future" feature in 4.8.0 only to find out later that it is incompatible and all of the work porting that one feature would go out the window. It would take much longer to port Lucene feature by feature than it would be to port the diff between 2 commits to get to a higher version. And we would always be sure to have a version that works (at least as well as it worked in Java).

Good explanation. When I was working on adding the Sequence Number feature I ran into the same issue, and I am really uncertain about the Lucene.NET roadmap. @NightOwl888 / @rclabo, can anyone list the issues I can work on? I am open to contributing, with a focus on production-grade features.

superkelvint commented 7 months ago

Furthermore, the binary structure of the index does change from one version to the next, making them incompatible and making it literally impossible to bring many Lucene 9.x features back to Lucene.NET 4.x. We had this issue with back-porting the analyzers-nori package.

We have 100% compatibility with creating an index in Lucene and opening it in Lucene.NET with the same version and plan to keep it that way going forward (and it worked once the other way around, but hasn't been tested in quite a while). The index isn't the only binary format that is also kept in sync between versions.

@NightOwl888 I am a Lucene Java programmer myself and am happy to help in any efforts to maintain two-way compatibility between Lucene and Lucene.NET.

turowicz commented 3 months ago

Any way of getting updated nuget packages from master or do we need to build on our own?

NightOwl888 commented 3 months ago

Any way of getting updated nuget packages from master or do we need to build on our own?

@turowicz - We are working on a Lucene.NET beta release now. And we could use some help as there are several tasks to complete before the release. This is still a work in progress, but I have a milestone set up for J2N with some open tasks: https://github.com/NightOwl888/J2N/milestone/3.

The plan is to roll out a release of ICU4N and there are several tasks to work on in this project as well.

turowicz commented 3 months ago

Does that mean just building master from source is not a good idea, then? For .NET 8.

NightOwl888 commented 3 months ago

You are welcome to build from source. We don't currently have a target for net8.0 so there may be some build issues to work out when the target is added. We will be adding it before the release, though.

It seems that help is coming from some PMC members to work on the release, but they are not available immediately. So, between that and the fact that we have to have a release vote (which takes 3 days), it could be 2 - 3 weeks or so until a release is available on NuGet. Unless of course we get more volunteers.

NightOwl888 commented 3 months ago

Furthermore, the binary structure of the index does change from one version to the next, making them incompatible and making it literally impossible to bring many Lucene 9.x features back to Lucene.NET 4.x. We had this issue with back-porting the analyzers-nori package. We have 100% compatibility with creating an index in Lucene and opening it in Lucene.NET with the same version and plan to keep it that way going forward (and it worked once the other way around, but hasn't been tested in quite a while). The index isn't the only binary format that is also kept in sync between versions.

@NightOwl888 I am a Lucene Java programmer myself and am happy to help in any efforts to maintain two-way compatibility between Lucene and Lucene.NET.

@superkelvint - Sorry for the late reply. I didn't see your comment back in April.

Thanks for offering to help with compatibility. One way you might be able to help us is to add support (even if it is unofficial) to the latest version of Lucene to read 4.8.0 codecs. The backwards-codecs package only goes back to Lucene 5.x.

Our plan is to jump ahead to the current version once Lucene.NET 4.8.0 is stable, so it would be beneficial if Lucene.NET users could upgrade the software first and upgrade their index at some later point. It would save us some time if we didn't have to grab the 4.x codecs from the last version that supported them and try to splice them into the backwards-codecs package, as offhand I don't really know what is involved.

Another way you could help us (since you linked to the issue) is to provide some guidance on the analysis-nori module (latest work here). We got most of it working, but there are 3 test failures that were difficult to find an answer for. The tests are TestRandomHugeStringsMockGraphAfter, TestUserDict, and TestLookup. The biggest issue is that it is ported from Lucene 8.2.0 and the FST implementation has completely changed. I tried recreating the UserDictionary with our ported code, but the UserDict test still doesn't pass. I also tried porting over the earliest version, but FST had changed before then.

Now, since the kuromoji module is almost identical and it runs on 4.8.0, I suspect there is a solution. I have already asked the Lucene team, but their advice was just to wait until we upgrade. However, if we have someone who is willing to help us find a solution, maybe we can make this available sooner.

turowicz commented 3 months ago

You are welcome to build from source. We don't currently have a target for net8.0 so there may be some build issues to work out when the target is added. We will be adding it before the release, though.

It seems that help is coming from some PMC members to work on the release, but they are not available immediately. So, between that and the fact that we have to have a release vote (which takes 3 days), it could be 2 - 3 weeks or so until a release is available on NuGet. Unless of course we get more volunteers.

Happy to wait for 3-4 weeks for a nuget release. Unfortunately I can't contribute.

turowicz commented 2 months ago

@NightOwl888 3 weeks have passed, how's it going 😄

rclabo commented 2 months ago

I can't speak for the group, but here's my perspective. I had hoped to help out some starting a week ago but got bogged down in another task at work that has a hard deadline with big consequences for missing. So I'm probably another week out before I can help at all. @NightOwl888 is amazing, but many hands make light work, and right now, we appear to be short of hands. So I wouldn't be surprised if it takes a few months to get the next version out. We'll see.

turowicz commented 2 months ago

In the meantime, how should one go about using the latest version? How to tell what is "stable"?

NightOwl888 commented 2 months ago

We generally keep the master branch in a releasable state, so it is best to build from there. You can either follow the build instructions on the README, or download the nuget artifact from one of the nightly builds.

We are getting some traction on this release, but there are still many tasks to complete. Most importantly, getting J2N and ICU4N in a releasable state and fixing the API doc generation.

rclabo commented 2 months ago

In the meantime, how should one go about using the latest version? How to tell what is "stable"?

@turowicz you may have a specific reason that you want to build from master or to get nugets from the nightly build, and if so, that's cool. But it's probably worth mentioning that Lucene.Net 4.8.0-beta00016 at nuget.org is also very stable and it's what I'm currently using.

turowicz commented 2 months ago

@rclabo I need to upgrade to .NET 8 and for that I need the latest version.

Due to this: https://github.com/apache/lucenenet/issues/933

paulirwin commented 2 months ago

re: blockers for the next beta release, I've been looking into #911 and upgrading DocFx to the latest version, but it is not a cakewalk. I've made good progress and can mostly build docs without plugins using the combined docfx json file, but the individual files don't work with the TOC yet. At some point we might have to decide to push out a new beta release without docs (or with half-broken docs), unless anyone with experience with this can jump in and help me get this across the finish line.

Shazwazza commented 2 months ago

I put the docfx docs together before. IIRC if you update to latest docfx, it might not support the old plugin system which we need. I'll try to find time next week to see what I can do.

paulirwin commented 2 months ago

@Shazwazza Thanks, I'd appreciate any assistance! It indeed does not support the old plugins, but those just replace markdown elements with things like environment variables, so I'm punting on that for now and saving that for the end. So currently I'm just trying to get the docs to build correctly without plugins enabled, and that's causing issues. It's complaining about a duplicate toc.yml file, which Shad told me you were able to work around previously. I think this is a large part of why the individual docfx json files (scripted via docs.ps1) fail to generate correct output for the individual library "sites." You can check out my progress here so you don't have to start from the beginning: https://github.com/paulirwin/lucene.net/tree/issue/911 - I've improved the docs.ps1 script to install the latest version of docfx as a global tool, fixed a lot of doc build warnings, etc.

Running the docfx global tool manually on the combined file generates a site that mostly works (again without plugins yet, so env vars don't work), but if you clean your repo and run docs.ps1 you'll see that the site is broken for all of the libraries off the main site.

Shazwazza commented 2 months ago

@paulirwin ok sounds good, ideally we're running latest docfx for sure so if we can get there, that will be the best bet. Converting the docs is a pretty crazy challenge, especially when we have to deal with the craziness of docfx too - though believe it or not, it's less crazy today :P I'll try to make time next week and let you know. Have pinged you now on Slack.

nikcio commented 2 months ago

Hey @Shazwazza and @paulirwin I had a little look at what you've been doing and added a bit to it. I've added a description in this PR of what progress I've made: https://github.com/apache/lucenenet/pull/958

Just close the PR when you're done using it. I don't think I have the correct knowledge to add what is still missing 😅.

paulirwin commented 2 months ago

An update on this thread: #961 gets the docfx build working again and has the added bonus of using the latest docfx, so we're now up to date and can even build docs cross-platform. Thanks again to @nikcio, @Shazwazza, and @rclabo for the assistance!

I'm now looking into helping @NightOwl888 with ICU4N and J2N as he needs help to get that done.

sgf commented 1 month ago

In all of this, one thing you should keep in mind is that many people (even Microsoft!) currently use portions of Lucene 4.8 in production. So that is to say the product is already very stable.

But more .NET systems (even Microsoft!) have already chosen Elasticsearch (Lucene) or Meilisearch.

Lucene.NET is, just like the elderly, moving too slowly and needs to pick up the pace.

paulirwin commented 1 month ago

Let's please leave out comments like that about elderly people. However, I understand and sympathize with the sentiment of the rest of your message. Those of us that are actively working on this project are currently giving it all we can to get this done.

To provide an update which is due anyways, we are nearly done with preparing the next release. The ICU4N and J2N releases to support .NET 8 and satellite assemblies are done and dependencies upgraded, so that we can now target .NET 8, which is PR #928 that was just merged today. This should hopefully be the last major item for the beta 17 release. We are reviewing now to see if there are any remaining issues that need to be included. This will be our first beta release in a couple years and we're eager to get it done, but we also want to get any breaking changes out of the way all at once.

After the beta 17 release, my personal goal is to help us get to a final "RTM" release of 4.8 this calendar year if at all possible. We'll need the community's help in testing the beta 17 release to make sure it's solid. This means not only filing bugs, but letting us know in the affirmative that it's working well for your workload so that we can have some confirmation of our work and how it is being used. Please do not jump in to port post-4.8 features at this time; any PRs for these will likely be closed and your efforts might be wasted. Likewise, we apologize but we are not accepting significant architectural changes (library structure, .NET version, C# language features, etc.) at this time. We can't have this be a moving goalpost; we need to focus on getting the release done.

Stay tuned, we're pushing hard, and we'll hopefully have a beta 17 release ready for testing very, very soon.