apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Contributing a deep-learning, BERT-based analyzer #13065

Open lmessinger opened 9 months ago

lmessinger commented 9 months ago

Description

Hi,

We are building an open-source custom Hebrew/Arabic analyzer (lemmatizer and stopwords), based on a BERT model. We'd like to contribute it to this repository. How can we do that and get it accepted? Can we compile it to native code and use JNI or Panama? If not, what is the best approach?

https://github.com/apache/lucene/issues/12502#issuecomment-1675084211

@uschindler, we would be very happy to hear what you think.

benwtrent commented 9 months ago

For the analyzer, do you mean something that tokenizes into an embedding?

Or just creates the tokens (wordpiece + dictionary)?

lmessinger commented 9 months ago

I mean, create just the tokens - the lemmas / wordpieces

benwtrent commented 9 months ago

@lmessinger I don't see why text tokenization would need any native code. Word piece is pretty simple and just a dictionary look up.

Do y'all not have a Java one?

Or does this model actually need inference to do the lemmatization? (e.g. https://huggingface.co/dicta-il/dictabert-joint) ?
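For readers unfamiliar with the "just a dictionary look up" point: the sketch below shows greedy longest-match-first WordPiece splitting in plain Java. It is a minimal illustration, not the analyzer discussed in this issue and not a Lucene API. The class name `WordPieceSketch`, the toy vocabulary, and the `##` / `[UNK]` conventions (borrowed from common BERT vocabularies) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: greedy longest-match-first WordPiece over a vocabulary set.
final class WordPieceSketch {
  static final String UNKNOWN = "[UNK]";     // assumed unknown-token marker
  static final String CONTINUATION = "##";   // assumed prefix for non-initial pieces

  // Splits text on whitespace, then splits each word into vocabulary pieces.
  static List<String> tokenize(String text, Set<String> vocab) {
    List<String> out = new ArrayList<>();
    for (String word : text.trim().split("\\s+")) {
      if (word.isEmpty()) {
        continue; // blank input yields no tokens
      }
      out.addAll(splitWord(word, vocab));
    }
    return out;
  }

  // At each position, take the longest substring found in the vocabulary.
  // If some position has no match at all, the whole word maps to UNKNOWN.
  static List<String> splitWord(String word, Set<String> vocab) {
    List<String> pieces = new ArrayList<>();
    int start = 0;
    while (start < word.length()) {
      int end = word.length();
      String match = null;
      while (end > start) {
        String candidate = (start == 0 ? "" : CONTINUATION) + word.substring(start, end);
        if (vocab.contains(candidate)) {
          match = candidate;
          break;
        }
        end--;
      }
      if (match == null) {
        return List.of(UNKNOWN);
      }
      pieces.add(match);
      start = end;
    }
    return pieces;
  }

  public static void main(String[] args) {
    Set<String> vocab = Set.of("un", "##aff", "##able", "search");
    // prints [un, ##aff, ##able, search, [UNK]]
    System.out.println(tokenize("unaffable search xyz", vocab));
  }
}
```

No model inference is involved here, which is why this step alone would not need native code. Context-dependent lemmatization is a different matter.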

lmessinger commented 9 months ago

Hi,

In Hebrew and other Semitic languages, lemmas are context-dependent. For example, שמן could be interpreted as "fat", "oil", "their name", or "from", all depending on the context, so yes, we do need inference. To do inference, Python is the language. Either we compile the Python into native code (not so easy, but possible) or we run it in a container, as a web server.
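To make the web-server option concrete, here is a minimal sketch of how a Java caller might talk to such a container using the JDK's `java.net.http` client (Java 11+). Everything about the service is an assumption for illustration: the class name `RemoteLemmatizerSketch`, the endpoint URL passed in by the caller, the `{"text": ...}` request shape, and the idea that the response body carries the lemmas. None of it comes from the project discussed here.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

// Hypothetical client for a lemmatization service running in a container.
final class RemoteLemmatizerSketch {

  // Minimal JSON encoding of the sentence; escapes only backslash, quote and newline.
  static String toJson(String sentence) {
    String escaped = sentence
        .replace("\\", "\\\\")
        .replace("\"", "\\\"")
        .replace("\n", "\\n");
    return "{\"text\":\"" + escaped + "\"}";
  }

  // Builds the POST request; the endpoint and payload shape are assumptions.
  static HttpRequest buildRequest(URI endpoint, String sentence) {
    return HttpRequest.newBuilder(endpoint)
        .timeout(Duration.ofSeconds(5))
        .header("Content-Type", "application/json; charset=utf-8")
        .POST(HttpRequest.BodyPublishers.ofString(toJson(sentence), StandardCharsets.UTF_8))
        .build();
  }

  // Sends the sentence and returns the raw response body, left unparsed
  // because the response format is unknown.
  static String lemmatize(HttpClient client, URI endpoint, String sentence)
      throws IOException, InterruptedException {
    HttpResponse<String> response = client.send(
        buildRequest(endpoint, sentence),
        HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
    if (response.statusCode() != 200) {
      throw new IOException("Lemmatizer service returned HTTP " + response.statusCode());
    }
    return response.body();
  }
}
```

The trade-off with this design is one network round trip per analyzed text, in exchange for keeping all native and Python dependencies out of the JVM process.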

dweiss commented 9 months ago

It will be a major headache to maintain native bindings for all major platforms. I think such an analyzer should be a downstream project (then you can restrict the platforms on which it's available to whatever you wish to maintain). We can point at such a project from Lucene documentation, for example.

lmessinger commented 9 months ago

Hi,

Got it. Pointing to the project from the documentation would actually be very valuable to the Hebrew community. How can that be done? Is the documentation also on GitHub, so that we can add it there as a PR for approval?

thanks! Lior

dweiss commented 9 months ago

> How can that be done?

This is a question that is much harder to answer than I thought... Lucene doesn't have a tutorial/user guide. The only place I could think of was here, in the javadocs:

https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/overview.html

An alternative would be to include an empty package for Hebrew and only add the package-info.java file, telling folks where they can find downstream Hebrew analyzers. I really don't have any better ideas.

chatman commented 9 months ago

How about something with the source maintained in the sandbox dir (along with instructions to build), but no corresponding official release artifact?

uschindler commented 9 months ago

> How can that be done?
>
> This is a question that is much harder to answer than I thought... Lucene doesn't have a tutorial/user guide. The only place I could think of was here, in the javadocs:
>
> https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/overview.html
>
> An alternative would be to include an empty package for Hebrew and only add the package-info.java file, telling folks where they can find downstream Hebrew analyzers. I really don't have any better ideas.

It would be better located in the module "analysis" (which is just the parent of all analyzers). Unfortunately this module does not create javadocs, so analysis-common is the only location.

I think it would be a good idea to add a list of external analysis components there. Lucene is a flexible library with extension points through SPI, so we can list all external contributions there.

This page is also missing an overview of the analysis submodules.

An alternative (and in my opinion better) idea is to put the list as a Markdown file into the documentation package: https://github.com/apache/lucene/tree/main/lucene/documentation/src/markdown

All md files there are compiled to HTML and can be linked in the template file for index.html, too.

> How about something with the source maintained in the sandbox dir (along with instructions to build), but no corresponding official release artifact?

I don't think this is a good idea. It won't be tested (as we can't run the build), and it is also inconsistent.

We had that in the past for DirectIODirectory and WindowsDirectory. Neither was maintained, and they no longer built, although there were build scripts. The Java parts built, but the JNI parts no longer matched the Java implementations. I may be wrong, but when we looked into this, it was almost impossible to make it work again.

Luckily, they were rewritten using Java 11+ APIs and are now part of the official distribution.