JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

SPARKNLP-962: UAEEmbeddings #14199

Closed DevinTDHa closed 3 months ago

DevinTDHa commented 4 months ago

Description

This PR adds an annotator for UAE embeddings. To support it, new pooling operations for word embeddings have been added.

Namely, pooling by:

  1. Using a token at a specific index (such as the [CLS] token, or the last token)
  2. Max pooling across the sequence dimension
  3. [CLS] + Mean of the embeddings

These can be set with setPoolingStrategy for the annotator.
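For illustration, the three pooling operations listed above can be sketched in plain Python. This is only a minimal sketch and not the annotator's implementation: the function names are hypothetical, padding and attention masks are ignored, and "[CLS] + Mean" is assumed here to mean the element-wise average of the [CLS] vector and the mean vector.

```python
from typing import List

# Token embeddings for one sequence: shape (sequence_length, hidden_size).
Matrix = List[List[float]]


def pool_index(tokens: Matrix, index: int) -> List[float]:
    """Pick the embedding at one position: 0 is the first ([CLS]) token, -1 the last."""
    return list(tokens[index])


def pool_max(tokens: Matrix) -> List[float]:
    """Element-wise maximum across the sequence dimension."""
    return [max(column) for column in zip(*tokens)]


def pool_mean(tokens: Matrix) -> List[float]:
    """Element-wise mean across the sequence dimension."""
    return [sum(column) / len(tokens) for column in zip(*tokens)]


def pool_cls_mean(tokens: Matrix) -> List[float]:
    """Assumed reading of '[CLS] + Mean': average of the [CLS] vector and the mean vector."""
    cls = pool_index(tokens, 0)
    mean = pool_mean(tokens)
    return [(c + m) / 2.0 for c, m in zip(cls, mean)]


# Toy input: 3 tokens with 2 dimensions each; the first row plays the role of [CLS].
tokens = [[1.0, 4.0], [3.0, 0.0], [5.0, 2.0]]

cls_vector = pool_index(tokens, 0)       # [1.0, 4.0]
last_vector = pool_index(tokens, -1)     # [5.0, 2.0]
max_vector = pool_max(tokens)            # [5.0, 4.0]
cls_mean_vector = pool_cls_mean(tokens)  # [2.0, 3.0]
```

Each function reduces a (sequence_length, hidden_size) matrix to a single hidden_size vector, which is what makes the result usable as a sentence-level embedding.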

Additionally, this PR fixes a bug in serializing ONNX models that do not have a .onnx_data file (b73dc0b1ecdb49af9f2fa6e47b0af23d47442a53). @prabod I think you worked on this part, could you review whether the fix looks good? I provided a description in the commit message. Thanks!

How Has This Been Tested?

Both the new tests and the existing tests pass.

maziyarpanahi commented 4 months ago

Hi @DevinTDHa

Regarding the fix for ONNX serialization, is it related to this issue: https://github.com/JohnSnowLabs/spark-nlp/issues/14194 (https://colab.research.google.com/drive/119u6hXoT1PRB9F38InuEV-bm4g1uu9UH?usp=sharing)?

DevinTDHa commented 4 months ago

Hi @maziyarpanahi,

Yes, the fix should prevent the error in the notebook as well.