apache / lucenenet

Apache Lucene.NET
https://lucenenet.apache.org/
Apache License 2.0

When is the release version planned? #437

Open VollyBird opened 3 years ago

VollyBird commented 3 years ago

Due to some annoying rules, we have to use a release version. So, when is the release version planned?

eladmarg commented 3 years ago

the version is very mature but still not the final release. I guess we'll have 3-4 more betas before the final one.

NightOwl888 commented 3 years ago

Hi. Sorry for the late reply. Our goal is to have a release candidate by the end of 2021.

Note that we are still analyzing to make sure we are aware of all of the issues and gaps that need to be finished before the release as well as stabilizing APIs and dependencies such as ICU4N and J2N. Although we have all of the modules ported except for a few features, there are likely some issues that are yet unknown which will come up during review and it is not possible to predict whether they are blocking the release or how long they will take to complete.

benjamin-stern commented 2 years ago

We are getting pretty close to the end of 2021. Any updates?

NightOwl888 commented 2 years ago

We are working on 4.8.0-beta00016 at present and will likely be releasing it within the next couple of weeks.

While we don't consider Lucene.Net.ICU and Lucene.Net.Analysis.OpenNLP to be blocking the release, it would take a considerable amount of effort to be able to release them on a different schedule than the rest of the components (probably around 30% of the work it would take to just complete them). Several components depend on ICU4N, including lucene-cli, which also has index maintenance tools that should be version synced with Lucene.NET.

We are always looking for additional people to help us reach the goal. Per community request to provide more info about what to work on, we moved from JIRA to GitHub issues and marked several issues up for grabs. While we are committed to completing Lucene.NET, if you are waiting for us to complete it with the small number of resources we have, make no mistake about it: you will be waiting. On the other hand, we have recently added some information on the NuGet Readme page on how to get involved and/or sponsor the project if you wish to help us make Lucene.NET 4.8.0 a production release faster.

benjamin-stern commented 2 years ago

Unfortunately the company I am working for is quite small and doesn't have the extra resources to assist with the development in any substantial way.

Thank you for all your hard work and detailed drilldown. I think it is important that the roadmap be visible so that individuals or companies with the resources can assist with the development where they can.

NightOwl888 commented 2 years ago

any substantial way

I guess this is the crux of the misunderstanding. We don't need a few companies to put in tons of time or funding, we need many companies to put in a small amount of time or funding to keep the project going.

We are getting more than 3400 downloads per day on NuGet, which is significantly higher than it was in 2016 when I started working on this (around 600 per day). If only a fraction of the companies and individuals doing those downloads would put in 1 weekend, or 1 day, or 1 hour of time or were financially contributing $50, $20, or $5 per month, we would be moving significantly faster than we are now (mind you, porting on Lucene.NET 4.8.0 started in September 2014).

For most small businesses, $50 per month is not going to break their budget. But for us, it is a crucial lifeline to completing the port.

Shad Storhaug Sponsor with GitHub Sponsors Sponsor with PayPal
Shannon Deminick Sponsor with GitHub Sponsors
Ron Clabo Sponsor with GitHub Sponsors

I think it is important that the roadmap be visible so that individuals or companies with the resources can assist with the development where they can.

Thanks for the feedback.

Creating such a list is a drain on our already small number of resources. Porting work isn't quite the same as developing new applications where the requirements can be well-defined. There is a significant amount of analysis and research that goes into finding the best API or technology to map to on the new platform, and if it doesn't exist or behaves radically differently, we have to build it. What was done initially was a "best guess" for how it should fit together, but in some cases the wrong technology was picked and in others a significant performance degradation happened or unintentional bug was introduced because the new API differs in behavior from the original one. For many of the "tasks" that we are aware of to complete, the analysis has not yet been done. In other words, they are just a general "analyze this and determine how to break this into more tasks".

Of course, crowd sourcing the analysis work would be a big help. If everyone reading this spent just 30 minutes to pick a random file and compare the Lucene.NET source line by line against lucene 4.8.0, Googling to verify that each line they are unsure of is ported in a reasonable way, and made us aware of all of the uncommented differences, we would have a leg up on the analysis work.

Most of what has been determined has already been posted above or as a GitHub issue. There is a spreadsheet that exists to keep track at a high level, but it is a significant amount of work to keep it up to date. I have made an effort to put some of the more well-defined tasks into GitHub issues that would take anywhere from 30 minutes to a weekend to complete, but despite some of them being on the board for 3 years, I am the one who ends up closing them. I am just not convinced that spending all of my time to pre-analyze everything and update a list is going to bring more help. We have tried this to some degree, and it has not worked. Even for basic stuff like creating a wrapper batch file, creating icons for our NuGet package dependencies, and updating the theme of the code colorizer on our website.

As part of #460, I got involved with the effort to revive IKVM. On that project, people offer to help out frequently. But since IKVM was abandoned by its original contributor and has some confusing native bits, it is difficult to navigate through how to help. I can see how creating a list could help out on that project. But for some reason, despite Lucene.NET having nearly 13 million downloads on NuGet and a steady increase in demand over time, we aren't getting many offers for help to complete the port.

NOTE: We have a plan to upgrade to the latest version of Lucene once the Lucene.NET 4.8.0 port is complete. The Lucene design hasn't changed significantly and the dependencies (ICU4N and J2N) now exist in .NET, so we can do it in around 1800 hours rather than the 5000+ hours that it took to get from 3.x to 4.x.

alexhiggins732 commented 2 years ago

Of course, crowd sourcing the analysis work would be a big help. If everyone reading this spent just 30 minutes to pick a random file and compare the Lucene.NET source line by line against lucene 4.8.0, Googling to verify that each line they are unsure of is ported in a reasonable way, and made us aware of all of the uncommented differences, we would have a leg up on the analysis work.

If this is indeed the approach, then perhaps splitting the work up into buckets with todos for people to take up pieces as part of a roadmap would be helpful. Unfortunately, there is no visibility on this project, and citing reliance on a complete rework of a now-alpha version of IKVM means this project will likely not see a release for a very long time.

Additionally, I just spent 30 minutes verifying that what you suggest is not feasible. There is no one-to-one file map between the repos. The file and folder structures differ. Perhaps if this repo were reorganized, then tools like WinMerge would at least facilitate a file-by-file mapping.

Then, digging through to find files, there is also not a 1-to-1 line mapping within the files. For example:

Line: 192: https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Analysis.Common/Analysis/Wikipedia/WikipediaTokenizer.cs

            int tokenType = scanner.GetNextToken();

            if (tokenType == WikipediaTokenizerImpl.YYEOF)
            {
                return false;
            }

Line: 192 https://github.com/apache/lucene/blob/releases/lucene-solr/4.8.0/lucene/analysis/common/src/java/org/apache/lucene/analysis/wikipedia/WikipediaTokenizer.java

    String type = WikipediaTokenizerImpl.TOKEN_TYPES[tokenType];
    if (tokenOutput == TOKENS_ONLY || untokenizedTypes.contains(type) == false){
      setupToken();
    } else if (tokenOutput == UNTOKENIZED_ONLY && untokenizedTypes.contains(type) == true){
      collapseTokens(tokenType);

So if we can come up with a strategy to synchronize the file/folder structure and then reorganize the C# files line by line to match the java, we can take the suggested approach.

Alternatively, there is the triage approach of focusing on producing release version(s) of the core product(s). This could be code reviewed and released first, while external dependencies which have no visibility or will take a long time to complete (like NLP) can be moved to separate pre-release packages. Additionally, similar decisions can be made about other parts of the code base that aren't 'core' functionality. For example, given that the tokenizer is likely now out of date, being over 8 years old with GitHub warning "This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository", perhaps even parts of shared namespaces can be separately packaged.

alexhiggins732 commented 2 years ago

Didn't realize Lucene was on version 9.1 already. This is just a beta version of 4.8, which was released on Apr 27, 2014, nearly 8 years ago. Looking at the current features and changes, the CSharp codebase is leaps and bounds behind. IMHO, I am highly skeptical this will ever be brought on par, or that we will see a production-grade release within the next few years, if ever. With this project being blocked by much larger projects like IKVM, which are estimated at 30% of the outstanding work, and the numerous unaddressed bugs in the codebase (estimated at ~70% of the remaining work), it is unfortunate to say, with all of the incredible work that has been done, but this repo has the hallmarks of open source projects left on the vine to die. But who knows?

NightOwl888 commented 2 years ago

There is no one-to-one file map between the repos. The file and folder structures differ. Perhaps if this repo were reorganized, then tools like WinMerge would at least facilitate a file-by-file mapping.

The file-by-file mapping is accurate (in fact, we have kept the file names the same even if we ended up renaming the type inside the file to follow .NET conventions), but since we are building a .NET application and not a Java application, the deep folder structure has changed to move the files closer to the top level of the project.

For example, the files in the https://github.com/apache/lucene/tree/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/index directory exactly correspond to the files in https://github.com/apache/lucenenet/tree/d6f3c3e7aad1847f5df69e4c080f45dad318a3ad/src/Lucene.Net/Index.

But let us know if you find any files that were renamed or were added in the Java-ported directories. All of the files that we have added are supposed to be under the Lucene.Net/Support directory.

The Lucene.Net.Core project was renamed to Lucene.Net because on early prerelease versions of .NET Core the folder name had to be the same as the project, but other than that one exception, the convention of Lucene.Net.<package> was followed consistently.
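
As a rough illustration of that layout rule (a sketch, not a tool that exists in either repository), the two mappings confirmed in this thread can be expressed as a small path translator. The class name `LucenePathMapper` and the method `toDotNetPath` are hypothetical, and capitalizing only the first letter of each package folder is an assumption that may not hold for multi-word package names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: translates a Java Lucene 4.8.0 source path into the
// Lucene.NET path where the port of that file is expected to live.
final class LucenePathMapper {
    // Everything before this marker identifies the Java module.
    private static final String PACKAGE_ROOT = "/src/java/org/apache/lucene/";

    // Only the two module-to-project pairs mentioned in this thread.
    private static final Map<String, String> MODULE_TO_PROJECT = new LinkedHashMap<>();

    static {
        MODULE_TO_PROJECT.put("lucene/core", "Lucene.Net");
        MODULE_TO_PROJECT.put("lucene/analysis/common", "Lucene.Net.Analysis.Common");
    }

    private LucenePathMapper() {}

    static String toDotNetPath(String javaPath) {
        int rootIndex = javaPath.indexOf(PACKAGE_ROOT);
        if (rootIndex < 0) {
            throw new IllegalArgumentException("Not under " + PACKAGE_ROOT + ": " + javaPath);
        }
        String project = MODULE_TO_PROJECT.get(javaPath.substring(0, rootIndex));
        if (project == null) {
            throw new IllegalArgumentException("No known project for: " + javaPath);
        }
        StringBuilder result = new StringBuilder("src/").append(project);
        for (String segment : javaPath.substring(rootIndex + PACKAGE_ROOT.length()).split("/")) {
            if (segment.isEmpty()) {
                continue;
            }
            if (segment.endsWith(".java")) {
                // File names are kept; only the extension changes.
                segment = segment.substring(0, segment.length() - ".java".length()) + ".cs";
            } else {
                // Assumption: package folders are capitalized on the first letter only.
                segment = Character.toUpperCase(segment.charAt(0)) + segment.substring(1);
            }
            result.append('/').append(segment);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(toDotNetPath(
            "lucene/analysis/common/src/java/org/apache/lucene/analysis/wikipedia/WikipediaTokenizer.java"));
    }
}
```

Any other module would need its own entry in the map, and a folder whose .NET name differs from this simple capitalization would need a manual override.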

Then, digging through to find files, there is also not a 1-to-1 line mapping within the files. For example:

Line: 192: https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Analysis.Common/Analysis/Wikipedia/WikipediaTokenizer.cs

            int tokenType = scanner.GetNextToken();

            if (tokenType == WikipediaTokenizerImpl.YYEOF)
            {
                return false;
            }

Line: 192 https://github.com/apache/lucene/blob/releases/lucene-solr/4.8.0/lucene/analysis/common/src/java/org/apache/lucene/analysis/wikipedia/WikipediaTokenizer.java

    String type = WikipediaTokenizerImpl.TOKEN_TYPES[tokenType];
    if (tokenOutput == TOKENS_ONLY || untokenizedTypes.contains(type) == false){
      setupToken();
    } else if (tokenOutput == UNTOKENIZED_ONLY && untokenizedTypes.contains(type) == true){
      collapseTokens(tokenType);

When I said "pick a random file", I didn't mean anywhere, I meant specifically in the Lucene.Net project. Lucene.Net.Analysis.Common was ported from 4.8.1, which included a few minor changes like some guard clauses.

Of course, due to the differences in using statements, file headers, coding style, etc., we don't expect that every line in a file will correspond to the same line number as it was in Java. As you can see, the Java line 192 corresponds with line 198 in .NET. However, unless commented otherwise (or at least it should be) we have generally kept the order of members in the same order, unless there is no sensible way to port the functionality in .NET.
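
Because line numbers drift like this, one way to line up the two files is to search the ported file for a distinctive identifier and take the match closest to the Java line number. The following is only a sketch of that idea; `NearestLineFinder` and `findNearestLine` are hypothetical names, and it assumes the chosen identifier survives the port unchanged, which is not always true since member names are often changed to follow .NET conventions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: finds the 1-based line in a ported file that contains
// a marker string and lies closest to the line number in the original file.
final class NearestLineFinder {
    private NearestLineFinder() {}

    // Returns -1 when the marker does not occur at all.
    static int findNearestLine(List<String> lines, String marker, int expectedLine) {
        int best = -1;
        for (int i = 0; i < lines.size(); i++) {
            if (!lines.get(i).contains(marker)) {
                continue;
            }
            int lineNumber = i + 1;
            // On a tie, the earlier match is kept.
            if (best == -1 || Math.abs(lineNumber - expectedLine) < Math.abs(best - expectedLine)) {
                best = lineNumber;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> ported = Arrays.asList(
            "int tokenType = scanner.GetNextToken();",
            "",
            "string type = WikipediaTokenizerImpl.TOKEN_TYPES[tokenType];");
        System.out.println(findNearestLine(ported, "TOKEN_TYPES[tokenType]", 1));
    }
}
```

A reviewer would then compare the surrounding lines by eye; the helper only narrows down where to look.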

That being said, many types were de-nested to make them more easily discoverable in .NET, but we always ensure those types remain in the same file as the Java code.

So if we can come up with a strategy to synchronize the file/folder structure and then reorganize the C# files line by line to match the java, we can take the suggested approach.

Not sure what we expect to gain by this. This is not an automated port, but mostly a manual one. The parts that were automatically converted with a tool had major problems that we had to go back and rework, which took longer than simply porting manually. We are still finding some breaking problems, such as incorrect index file naming, that are keeping us as a pre-release, and the only way to track them down is by analyzing and scrutinizing the code.

Alternatively, there is the triage approach of focusing on producing release version(s) of the core product(s). This could be code reviewed and released first, while external dependencies which have no visibility or will take a long time to complete (like NLP) can be moved to separate pre-release packages. Additionally, similar decisions can be made about other parts of the code base that aren't 'core' functionality.

Right, that is the 30% of the remainder that I referred to before. We would have to change the build to allow us to do partial releases on different schedules and fit it in with the Apache release policy to allow us to do separate release votes for each batch of components we are releasing.

NLP is a non-issue, really, as only the Lucene.Net.Analysis.OpenNLP project depends on it, and the public API of Lucene.Net.Analysis.OpenNLP probably won't need to change when support for .NET Core is stable (do note there was recently a patch https://github.com/ikvm-revived/ikvm/pull/46 that makes the .NET Core build function, but I haven't had a chance to work on the build pipeline to release it). At the end of the day, if it is not ready when Lucene.Net is ready for release, we will move on without it (as was pointed out in #460).

The main blocker is ICU4N. Due to the fact that .NET has no built-in BreakIterator or anything like it, we ended up using the ICU4N BreakIterator as a base class for everything that requires it. So, instead of a Lucene.Net.Analysis.ICU project, we have a more general Lucene.Net.ICU project that contains highlighters and the Thai analyzers in addition to the ICU analyzers.

In turn, Lucene.Net.Analysis.SmartCn, Lucene.Net.Benchmark, and lucene-cli depend on Lucene.Net.ICU, which would all need to remain unstable if we went that route. The stable parts of lucene-cli could be released, but only if we break it up into 2 separate tools, which is really not a road we want to go down, if possible.

For example, given that the tokenizer is likely now out of date, being over 8 years old with GitHub warning "This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository", perhaps even parts of shared namespaces can be separately packaged.

You are receiving that warning because Lucene recently split off SOLR into a separate repository, so they did some reworking of the repo history to keep it in place on both projects.

The tokenizer is 8 years old, true, but it doesn't get many updates and is not something you are likely to notice.

Although the documentation (incorrectly) states that Character classes in .NET Regex use Unicode 8.0, I have used ICU4N's UnicodeSet to analyze it, and it actually only supports Unicode 3.0.1 characters (the version that was released in August, 2000 just before .NET 1.0 was released). Has this had any impact on your usage of Regex?

If you do need more up-to-date Unicode, do note that ICU4N is ported from 60.1 (we started when it was still a release candidate) which supports up to Unicode 10.0. Lucene 4.8.0 depended on ICU4J 52.1.

Alternatively, the analyzers are relatively painless to port from newer versions of Lucene. The components in the Lucene.Net (core) assembly are by comparison non-trivial to port.

Didn't realize Lucene was on version 9.1 already. This is just a beta version of 4.8, which was released on Apr 27, 2014, nearly 8 years ago. Looking at the current features and changes, the CSharp codebase is leaps and bounds behind. IMHO, I am highly skeptical this will ever be brought on par, or that we will see a production-grade release within the next few years, if ever. With this project being blocked by much larger projects like IKVM, which are estimated at 30% of the outstanding work, and the numerous unaddressed bugs in the codebase (estimated at ~70% of the remaining work), it is unfortunate to say, with all of the incredible work that has been done, but this repo has the hallmarks of open source projects left on the vine to die. But who knows?

In 2016, when I started working on Lucene.Net, it was being downloaded around 600 times per day. Now it is being downloaded 3600 times per day according to NuGet. It is one of the top 250 packages on NuGet.org. That doesn't seem to be the hallmark of a dying project.

ICU4N is probably around 70% of the remaining work (it would be more if we were planning to port more of it, but the plan is to fix the gaps and failing tests we currently have and mark components internal that don't have stable APIs without porting any additional components). We are not even including IKVM in our estimates as it is not blocking the Lucene.Net release.

What I said previously was it would probably take 30% of the total amount of work remaining (that is roughly half of the amount of effort it takes to finish up ICU4N) to break up Lucene.NET into multiple segments that we can release separately, which would effectively defer the work on ICU4N until later.

I haven't pushed the changes yet, but the biggest ICU4N issue of moving the embedded resources into satellite assemblies that end users can delete if they are not using them is nearly completed. In addition, we have gone from 6 NuGet packages down to just 1 code package and 1 data package, as was done in ICU4J.

As @rclabo has just documented in the new quick start section that will soon be on our website, more than 90% of the features that changed in Lucene were between 3.x and 4.x. Since then, there have only been a couple of dozen new features and a handful of new modules. He also goes into an overview of how we ported it, so it is definitely worth a read.

Unless Lucene decides to have a major redesign again, this will be the last full port of Lucene.NET. Once it is stable, we will simply upgrade it by porting only the changes in each changed file to the latest version. It has been a while since I have checked, but when I did (I believe it was version 7.2.0) around 80% of the files have had less than 10 changed lines since 4.8.0. Most of the changes have been to the structure of the index.

We have just 5 tests that are marked with the [AwaitsFix] attribute, and the causes of the problems are now fairly well understood (just not quite so simple to fix immediately). Adding the support for repeatable randomized testing that was missing for so long has made debugging much easier, and we have added Source Link support so you can now step into the code in our repository. Finding a workaround for the lack of enumerator on .NET Framework for ConditionalWeakTable<TKey, TValue> was a major victory and our .NET Framework support is now more stable than 3.0.3 will ever be. And the list of gaps to close is getting smaller.

Changing our target to 9.1 now only means we add ~1800 extra hours of work on top of the few hundred hours of work that remain. It also robs the community of a stable 4.8.0 release that will work on .NET Core for the amount of time it takes to do the upgrade. We still have the same unstable dependency issues we have now, plus more work on new dependencies that Lucene has taken on since 4.8.0. We would also have to figure out a way to make Lucene 9.1 read Lucene.Net 3.0.3 indexes to allow people to upgrade the software first and the index later, which is something that doesn't exist in Lucene 9.1 (although, it is fairly easy by comparison to port the backwards-codecs support for 4.x since it is tested independently from everything else), where the Lucene 3.x tests are integrated directly into Lucene.Net.TestFramework.

Of course, there is benefit to doing the upgrade to 9.x after we have a stable 4.8.0 release, it just doesn't make much sense to do it first, especially when most of the gaps of porting to 4.8.0 have already been worked out, all of the modules have been ported, and all but 5 of the tests are passing.

rclabo commented 2 years ago

@alexhiggins732 I agree with @NightOwl888, it’s much better to fully complete version 4.8 before working to upgrade it to whatever the current version of Java Lucene is at that time.

You mentioned:

Didn't realize Lucene was on version 9.1 already. This is just a beta version of 4.8, which was released on Apr 27, 2014, nearly 8 years ago. Looking at the current features and changes, the CSharp codebase is leaps and bounds behind.

I can understand how the casual observer might reach that conclusion. In fact when I first discovered Lucene.NET 4.8 Beta a couple years ago I too wondered whether its feature set was current enough to be of value.

But the more I dug into the project the more I was blown away by Lucene.NET 4.8’s power and the advanced engineering it contains. It’s truly a remarkable piece of software. Its architecture and features are as relevant today as ever.

I know that people see the 4.8 version number, Beta status and length of time it’s taken to port it and they wonder how relevant it is. That’s understandable, and that’s the reason I recently wrote a blog article, Lucene.NET 4.8 vs Java Lucene 9.x, to help people realize that Lucene.NET 4.8 has a wealth of features and the majority of the features of Java Lucene 9.x.

Release Schedule Analysis

Besides what I mention in the blog article, it’s important to realize that the time period between major releases was much longer during the version 1.x to version 4.8 era. And since then, the Lucene team, like most software teams, has moved to doing smaller more frequent releases.

Let’s look at the release timeline of Java Lucene.

2000 – First open source version of Lucene
2002 – Lucene 1.2 released under Apache License
2003 – Lucene 1.3 released
2006 – Lucene 2.9 released
2009 – Lucene 3.0 released
2012 – Lucene 4.0 released
2014 – Lucene 4.8 released
2015 – Lucene 5.0 released
2016 – Lucene 6.0 released
2017 – Lucene 7.0 released
2019 – Lucene 8 released
2021 – Lucene 9 released

Source: https://www.elastic.co/celebrating-lucene#2020

I don’t know how many years Doug Cutting worked on Lucene before making it open source in 2000. But ya gotta guess it was at least two years. Now look when Lucene 4.8 was released, in 2014. So at the release of Lucene 4.8 there was probably 16+ years of development behind it. That’s not labor years, just calendar years. Think about that for a minute.

But how many years difference is there between Lucene 4.8 and Lucene 9? 7 years. Now 7 years is a lot, I grant you that. But it’s important to remember that the single biggest new set of features ever to be introduced in the history of Lucene came in version 4.0. That’s why it took more than 3 years for the Java team to get to the 4.0 release even with a large team.

In our case 8 years have passed in going from Lucene.NET 3.0.3 to 4.8. But our team is much smaller and we don’t currently have any corporate backing (Java Lucene has LOTS of corporate backing).

Why even compare to Java Lucene?

Honestly, I don’t see a lot of point in comparing Lucene.NET to Java Lucene, unless you are equally happy to use a Java library as a .NET one. If that’s the case, the comparison is valid and you should seriously consider using Java Lucene 9.1. But if you are a .NET developer developing a .NET application, website or mobile app, then the comparison isn’t really between Lucene.NET 4.8 and Java Lucene 9.1, it’s between Lucene.NET and other .NET-based search libraries.

What I see when I look at Lucene.NET 4.8

So when I look at Lucene.NET 4.8 I see something totally different than what you see.

I see a software architecture that was initially created by someone (Doug Cutting) who was creating his “fifth search engine, having previously written two while at Xerox PARC, one at Apple, and a fourth at Excite.” (source) Hats off to Doug for sharing this with the world. Wow!

I see an insanely large feature set hammered out by numerous developers over a 16+ year timeframe.

I see more than 644K lines of code (not counting dependencies!) that have been ported to C#, and run on .NET on Windows, Linux or macOS.

I see a powerful search library that can be used to search-enable desktop applications, websites or mobile apps (Android or iOS).

I see a multi-targeted search library that runs on the .NET Full Framework, .NET Core 3.1 LTS, .NET 5 or even the latest and greatest .NET 6.

I see a project that while calling itself beta because a few method signatures may still change, has 7800+ passing unit tests and is clearly production worthy right now in my mind.

Varorbc commented 1 year ago

@NightOwl888 any update?