SPARKNLP-962: UAEEmbeddings

DevinTDHa commented 4 months ago

Description

This PR adds an Annotator for UAE embeddings. For this, new pooling operations for word embeddings have been added.

Namely, pooling by:

Using a token at a specific index (such as the [CLS] token, or the last token)
Max pooling across the sequence dimension
[CLS] + Mean of the embeddings

These can be set with setPoolingStrategy for the annotator.
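
For reviewers, here is a minimal sketch of what the pooling operations listed above compute. It is written in plain Python over lists of token vectors and is not the annotator's actual implementation. The function names are hypothetical, padding and attention masks are ignored, and "[CLS] + Mean" is assumed to mean the element-wise average of the [CLS] vector and the mean vector.

```python
from typing import List

Vector = List[float]


def pool_token(tokens: List[Vector], index: int) -> Vector:
    """Pick one token's embedding, e.g. index 0 for [CLS] or -1 for the last token."""
    return list(tokens[index])


def pool_mean(tokens: List[Vector]) -> Vector:
    """Average every dimension across the sequence."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def pool_max(tokens: List[Vector]) -> Vector:
    """Take the maximum of every dimension across the sequence."""
    return [max(column) for column in zip(*tokens)]


def pool_cls_mean(tokens: List[Vector]) -> Vector:
    """Assumption: combine [CLS] and the mean by averaging the two vectors."""
    cls = pool_token(tokens, 0)
    mean = pool_mean(tokens)
    return [(c + m) / 2.0 for c, m in zip(cls, mean)]


# Toy sequence of three tokens with two-dimensional embeddings;
# the first row stands in for the [CLS] token.
tokens = [
    [1.0, 4.0],
    [3.0, 0.0],
    [2.0, 2.0],
]

cls_vector = pool_token(tokens, 0)
last_vector = pool_token(tokens, -1)
max_vector = pool_max(tokens)
cls_mean_vector = pool_cls_mean(tokens)
```

Each function reduces a sequence of shape (tokens, dimensions) to a single vector of length (dimensions), which is what the sentence-level embedding needs.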

Additionally, it fixes a bug with serializing ONNX models that do not have a .onnx_data file (b73dc0b1ecdb49af9f2fa6e47b0af23d47442a53). @prabod I think you worked on this part, could you check whether the fix looks good? I provided a description in the commit message. Thanks!

How Has This Been Tested?

New tests and old tests are passing.

Screenshots (if appropriate):

Types of changes

[x] Bug fix (non-breaking change which fixes an issue)
[ ] Code improvements with no or little impact
[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

[x] My code follows the code style of this project.
[x] My change requires a change to the documentation.
[x] I have updated the documentation accordingly.
[x] I have read the CONTRIBUTING page.
[x] I have added tests to cover my changes.
[x] All new and existing tests passed.

maziyarpanahi commented 4 months ago

Hi @DevinTDHa

Regarding the fix in onnx serialization, is it related to this issue: https://github.com/JohnSnowLabs/spark-nlp/issues/14194 (https://colab.research.google.com/drive/119u6hXoT1PRB9F38InuEV-bm4g1uu9UH?usp=sharing)

DevinTDHa commented 4 months ago

Hi @maziyarpanahi,

Yes, the fix should prevent the error in the notebook as well.
