apache / lucenenet

Apache Lucene.NET
https://lucenenet.apache.org/
Apache License 2.0

NLP Support (OpenNLP) #460

Closed NightOwl888 closed 9 months ago

NightOwl888 commented 3 years ago

I don't know if this is an issue or a discussion yet, but it seems logical to document this somewhere in case we make it to release with gaps in support for NLP.

First of all, Lucene 4.8.0 didn't support Apache OpenNLP; it supported Apache UIMA. So we picked a newer Lucene version (8.2.0) and did what it did, choosing OpenNLP instead of UIMA (which is seemingly now part of the OpenNLP package). This was primarily because there were no options available for UIMA in .NET but some options for OpenNLP, and it didn't make sense to do the work to support UIMA in .NET when that support would only last through one Lucene.NET major release.

Options for NLP Support in .NET

Option: Port OpenNLP from version 1.9.1 tag to .NET
Issues:
  • The project is large and would take a lot of effort to port and maintain.

Option: Use AlexPoint/OpenNlp from NuGet
Issues:
  • The API has been refactored significantly from OpenNLP, and it would take a high-level analysis to use the new API.
  • It isn't clear what version of OpenNLP this is, as the version number doesn't seem to track the one in Java, but it is probably from long before 1.9.1 and seems to be missing features Lucene uses.
  • Currently only supports .NET Framework 4.5+.

Option: Use Stanford.NLP.NET
Issues:
  • The API is significantly different from OpenNLP, and it would take a high-level analysis to determine whether it has the features we need.
  • It is an IKVM port, which currently only supports .NET Framework 3.5.
  • Its GNU GPL v2 license is too restrictive to use in an Apache project (we can depend on, but not import code).
Notes: There is a project called Tweet NLP that extends it and seems to supply much of the functionality Lucene uses.

Option: Use AboditNLP from GitHub
Issues:
  • A high-level analysis is required to determine if it supports the functionality Lucene uses.
  • Closed-source; only demos and the NuGet package are available.
Notes: Targets .NET Framework 4.7.2, .NET Standard 2.0, and .NET Standard 2.1.

Option: Use CherubNLP from NuGet
Issues:
  • Would require a high-level analysis to determine if it supports the functionality Lucene uses.
Notes: Targets .NET Standard 2.0.

Option: Use OpenNLP.NET from GitHub
Issues:
  • It is an IKVM port, which currently only supports .NET Framework 3.5.
Notes: This is the option we currently use. Someone created a strong-named package named OpenNLP.NET.Signed. It would be preferable to get the original package owner to strong-name, but I suppose that would mean incrementing to at least version 1.9.1.1, or upgrading to a newer version of OpenNLP.

There are some other options, but the ones listed above seem to be the most "official". However, there are currently no options for .NET Core/.NET 5+ support of OpenNLP with the same API as OpenNLP 1.9.1.

IKVM

Unfortunately, while IKVM has been a reasonable go-to way to quickly support Java-based apps in the past, it was abandoned by its main contributor in 2017 and has no .NET Core/.NET Standard support.

There is an effort to get it working on .NET Core named ikvm-revived (to which I have contributed), but it seems to have been stalled for about a year and, as of this writing, there isn't even a pre-release on NuGet. There is some debate about whether it should support .NET Framework, but even if it didn't, we would still be able to target the current OpenNLP.NET version on .NET Framework.

See NuGet Repository?

Alternatives to IKVM

There was an announcement on the Microsoft Blog about .NET 5 supporting interoperability with Java, but it isn't clear what they meant by that.

https://devblogs.microsoft.com/dotnet/announcing-net-5-0-preview-1/#comment-4932

In fact, others are mentioning in the comments they cannot use NLP on .NET Core and are hoping to resolve that in .NET 5.

I have searched but cannot find any examples of how .NET 5 supports Java interop. If it does, that would probably be a better path forward than IKVM for NLP support. However, it sounds as if this feature was punted from the official .NET 5 release.

Current Support for NLP in Lucene.NET

Since we are depending on the IKVM-based OpenNLP.NET project, our current support is limited to .NET Framework 4.5.1+.

We do have some minor issues (namely lack of InternalsVisibleTo support) due to the fact that the library is not strong-named, but these are internal. Time will tell if lack of strong-naming is going to be an issue for end users, but ideally to get strong naming we should contribute to OpenNLP.NET rather than using the strong-named clone named OpenNLP.NET.Signed.

Supporting NLP in .NET Core/.NET 5+

Most options for supporting NLP on .NET Core would require some work to put into play, and it isn't clear how much work is involved to analyze this at a high level. It also isn't clear how big the demand for this functionality will be.

While we could make an effort to change dependencies, it would be sensible to create a new assembly named after the new dependency (in the src/dotnet folder) so it is clear what it depends on and leave the existing Lucene.Net.Analysis.OpenNLP project as-is.

Another option is just to wait and see whether ikvm-revived releases a .NET Core-targeted package on NuGet, and then support it when they finally do.

NOTE: If we bring back support for native .NET Collation in Lucene.Net.Analysis.Common, it is possible that its SortKeys would not be portable between .NET Framework and .NET Core/.NET 5+ (see Caveats and Comparisons). If we don't have .NET Core/.NET 5+ support for Lucene.Net.Analysis.OpenNLP, that collator option could cause some issues if indexing can only be done on .NET Framework but searching is done on .NET Core or .NET 5+. However, we have a collator in Lucene.Net.ICU that is stable across .NET target frameworks and could be used instead in that scenario.
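
The portability concern in the note above can be checked empirically. The following is a minimal sketch, assuming that a native .NET collator would be built on CompareInfo.GetSortKey (the note does not say which API would be used). The class and method names are hypothetical. It prints the raw sort key bytes for a string so that the output of the same program can be compared across target frameworks. .NET Framework uses NLS for collation on Windows, while .NET 5+ uses ICU by default, so the bytes may differ between the two.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of Lucene.NET.
public static class SortKeyDemo
{
    // Returns the invariant-culture sort key of `text` as a hex string.
    // The bytes come from the platform's collation library, so they are
    // only comparable with keys produced by that same library.
    public static string GetSortKeyHex(string text)
    {
        CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;
        SortKey key = compareInfo.GetSortKey(text);
        return BitConverter.ToString(key.KeyData);
    }

    public static void Main()
    {
        // Run this on each target framework and compare the printed bytes.
        Console.WriteLine(GetSortKeyHex("Lucene"));
    }
}
```

If the printed bytes differ between frameworks, sort keys written to an index on one framework should not be compared against keys generated at search time on the other.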

Finally, I will add that while we would prefer not to hold up the release of Lucene.NET 4.8.0 to wait on .NET Core/.NET 5+ support of NLP, if some companies are willing to provide some resources/funding because they need to have either an OpenNLP or an IKVM option on .NET Core/.NET 5+, we could probably make that happen.

NightOwl888 commented 3 years ago

I have opened an issue to see whether we can get OpenNLP.NET to strong-name their assembly.

paulirwin commented 10 months ago

> Unfortunately, while IKVM has been a reasonable go-to way to quickly support Java-based apps in the past, it has been abandoned by its main contributor in 2017 and has no .NET Core/NET Standard support.

FYI for everyone that ikvm-revived is now IKVM proper, and has .NET 6+ support as of v8.7. https://github.com/ikvmnet/ikvm/releases/tag/8.7.0