release/533-release-candidate

https://github.com/JohnSnowLabs/spark-nlp/pull/14196
example notebook for DocumentCharacterTextSplitter
example notebook for DeBertaForZeroShotClassification
example notebooks for BGEEmbeddings and MPNetEmbeddings
example notebook for MPNetForQuestionAnswering
example notebook + path for MPNetForSequenceClassification
Delete examples/python/annotation/text/english/language-translation/Multilingual_Translation_with_M2M100.ipynb
Add files via upload
Delete examples/python/annotation/text/english/language-translation/Multilingual_Translation_with_M2M100.ipynb
fixing colab link for M2M100 notebook
https://github.com/JohnSnowLabs/spark-nlp/pull/14199

Sentence embeddings using Universal AnglE Embedding (UAE). UAE is a novel angle-optimized text embedding model, designed to improve semantic textual similarity tasks, which are crucial for Large Language Model (LLM) applications. By introducing angle optimization in a complex space, AnglE effectively mitigates saturation of the cosine similarity function.
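The saturation mentioned above can be seen with plain arithmetic: cosine is nearly flat close to 0 degrees, so the same change in angle moves the similarity score very little there. The following is a minimal sketch in plain Python, not the UAE model or the Spark NLP API; the helper names are illustrative only.

```python
import math


def unit_vector(angle_deg):
    """Return a 2-D unit vector at the given angle (degrees) from the x-axis."""
    rad = math.radians(angle_deg)
    return (math.cos(rad), math.sin(rad))


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


reference = unit_vector(0.0)

# The same 5-degree change in angle, once near 0 degrees (flat region of cosine)
# and once near 90 degrees (steep region of cosine).
flat_drop = (cosine_similarity(reference, unit_vector(1.0))
             - cosine_similarity(reference, unit_vector(6.0)))
steep_drop = (cosine_similarity(reference, unit_vector(85.0))
              - cosine_similarity(reference, unit_vector(90.0)))

print(f"drop near 0 degrees:  {flat_drop:.5f}")
print(f"drop near 90 degrees: {steep_drop:.5f}")
```

The drop near 0 degrees is more than ten times smaller than the drop near 90 degrees, even though the angle changed by the same amount in both cases. Working with the angle directly avoids that flat region.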

Additionally, this PR fixes a bug with serializing ONNX models that do not have a .onnx_data file (https://github.com/JohnSnowLabs/spark-nlp/commit/b73dc0b1ecdb49af9f2fa6e47b0af23d47442a53). @prabod I think you worked on this part; could you review whether the fix looks good? I provided a description in the commit message. Thanks!

https://github.com/JohnSnowLabs/spark-nlp/pull/14224
1 - Call gets3Object, which includes getLastModified() (it contains only a summary and does not download the whole metadata.json file).
2 - Check whether the cache contains up-to-date metadata.
3 - If the cache contains up-to-date metadata, return it; otherwise, download it, store it in the cache, and return it.
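The three steps of the metadata lookup can be sketched as follows. This is a hedged illustration in Python, not the Scala code from the PR: `MetadataCache`, `head`, and `download` are hypothetical names, and the remote store is faked with a dictionary so the example runs without S3 access.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CachedMetadata:
    last_modified: float  # remote timestamp at the time the content was cached
    content: str


class MetadataCache:
    """Hypothetical cache keyed by object name and validated by last-modified time."""

    def __init__(self, head: Callable[[str], float], download: Callable[[str], str]):
        self._head = head          # cheap call: returns only the last-modified time
        self._download = download  # expensive call: fetches the full file
        self._entries: Dict[str, CachedMetadata] = {}

    def get(self, key: str) -> str:
        remote_modified = self._head(key)  # step 1: summary only, no download
        cached = self._entries.get(key)
        # Step 2: is the cached copy at least as new as the remote object?
        if cached is not None and cached.last_modified >= remote_modified:
            return cached.content  # step 3a: serve from the cache
        # Step 3b: download, store in the cache, and return.
        content = self._download(key)
        self._entries[key] = CachedMetadata(remote_modified, content)
        return content


# Fake remote store: name -> (last_modified, content). Assumption for the demo only.
remote = {"metadata.json": (100.0, "v1")}
downloads = []


def head(key):
    return remote[key][0]


def download(key):
    downloads.append(key)
    return remote[key][1]


cache = MetadataCache(head, download)
first = cache.get("metadata.json")   # cache miss, triggers a download
second = cache.get("metadata.json")  # cache hit, no download
remote["metadata.json"] = (200.0, "v2")
third = cache.get("metadata.json")   # stale cache, triggers a second download
```

Only the first and third calls reach `download`; the second is answered from the cache after the cheap timestamp check.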
https://github.com/JohnSnowLabs/spark-nlp/pull/14225
This PR improves the processing of the CoNLL-U format, which is used to train Dependency Parsers. The key improvements are:

Enhanced Multiword Token Handling: Lines whose id column identifies them as multiword tokens (e.g., 2-3 no) are now processed properly, so multiword tokens are recognized and handled correctly throughout parsing.
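In CoNLL-U, a multiword token line carries an id range such as 2-3, while ordinary words carry a single integer id. One way to recognize such lines is sketched below in Python; this is an illustration under that assumption, not the parser code from the PR, and `multiword_span` and the sample rows are hypothetical.

```python
import re

# An id made of two integers joined by a hyphen marks a multiword token range.
MULTIWORD_ID = re.compile(r"^(\d+)-(\d+)$")


def multiword_span(token_id):
    """Return (start, end) for a multiword token id such as '2-3', else None."""
    match = MULTIWORD_ID.match(token_id)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


# Illustrative (id, form) pairs only; real CoNLL-U lines have ten columns.
rows = [("1", "Estou"), ("2-3", "no"), ("2", "em"), ("3", "o"), ("4", "parque")]

# Separate the surface-level multiword tokens from the syntactic words.
multiword_rows = [(form, multiword_span(i)) for i, form in rows if multiword_span(i)]
word_rows = [(i, form) for i, form in rows if multiword_span(i) is None]
```

Here `multiword_rows` holds the single range line, and `word_rows` holds the four words that a dependency parser would actually attach to heads.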

Improved Handling of Missing uPos Values: Before this change, lines with unavailable uPos values could disrupt parsing. Such lines are now handled gracefully, so parsing continues even when uPos values are absent.
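A tolerant way to read the uPos column might look like the sketch below. It assumes the standard CoNLL-U layout, where UPOS is the fourth tab-separated column and an underscore means "unspecified". The function name `read_upos` and the fallback behavior are hypothetical and not taken from the PR.

```python
UPOS_COLUMN = 3    # zero-based index of the UPOS field in a CoNLL-U line
UNSPECIFIED = "_"  # CoNLL-U placeholder for an unspecified value


def read_upos(line, default=None):
    """Return the UPOS tag of a CoNLL-U line, or `default` when it is missing."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) <= UPOS_COLUMN:
        return default  # truncated line: no UPOS column at all
    value = columns[UPOS_COLUMN].strip()
    if value in ("", UNSPECIFIED):
        return default  # column present but empty or unspecified
    return value


tagged = read_upos("2\tem\tem\tADP\t_\t_\t4\tcase\t_\t_")
untagged = read_upos("2-3\tno\t_\t_\t_\t_\t_\t_\t_\t_")
truncated = read_upos("2\tem")
```

Returning a default instead of raising lets the caller decide whether to skip the line, substitute a placeholder tag, or log it.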

Beyond these functional changes, this PR refactors the underlying code to improve readability and maintainability, which should make future modifications and debugging easier.


release/533-release-candidate #14227