2000%+ Speedup on small data, 63 new models for 100+ Languages with 6 new supported Transformer classes including BERT, XLM-RoBERTa, alBERT, Longformer, XLnet based models, 48 NER profiling helathcare pipelines and much more in John Snow Labs NLU 3.3.0

We are incredibly excited to announce NLU 3.3.0 has been released! It comes with a up to 2000%+ speedup on small datasets, 6 new Types of Deep Learning transformer models, including RoBertaForTokenClassification,XlmRoBertaForTokenClassification,AlbertForTokenClassification,LongformerForTokenClassification,XlnetForTokenClassification,XlmRoBertaSentenceEmbeddings. In total there are 63 NLP Models 6 New Languages Supported which are Igbo, Ganda, Dholuo, Naija, Wolof,Kinyarwanda with their corresponding ISO codes ig, lg, lou, pcm, wo,rw with New SOTA XLM-RoBERTa models in Luganda, Kinyarwanda, Igbo, Hausa, and Amharic languages and 2 new Multilingual Embeddings with 100+ supported languages via XLM-Roberta are available.

On the healthcare NLP side we are glad to announce 18 new NLP for Healthcare models including NER Profiling pretrained pipelines to run 48 different Clinical NER and 21 Different Biobert Models At Once Over the Input Text New BERT-Based Deidentification NER Model, Sentence Entity Resolver Models For German Language New Spell Checker Model For Drugs , 3 New Sentence Entity Resolver Models (3-char ICD10CM, RxNorm_NDC, HCPCS) 5 New Clinical NER Models (Trained By BertForTokenClassification Approach) ,Radiology NER Model Trained On cheXpert Datasetand New UMLS Sentence Entity Resolver Models

Additionally 2 new tutorials are avaiable, NLU & Streamlit Crashcourse and NLU for Healthcare Crashcourse of every of the 50 + healthcare Domains and 200+ healthcare models

New Features and Improvements

2000%+ Speedup prediction for small datasets

NLU pipelines now predict up to 2000% faster by optimizing integration with Spark NLP's light pipelines. NLU will configure usage of this automatically, but it can be turned off as well via multithread=False
NLU 3.3.0 Benchmark

50x faster saving of NLU Pipelines

Up to 50x faster saving Spark NLP/ NLU models and pipelines! We have improved the way we package TensorFlow SavedModel while saving Spark NLP models & pipelines. For instance, it used to take up to 10 minutes to save the xlm_roberta_base model before Spark NLP 3.3.0, and now it only takes up to 15 seconds!

New Annotator Classes Integrated

The following new transformer classes are available with various pretrained weights in 1 line of code :

New Transformer Models

The following models are available from the amazing Spark NLP 3.3.0 and 3.3.1 releases which includes NLP models for Yiddish, Ukrainian, Telugu, Tamil, Somali, Sindhi, Russian, Punjabi, Nepali, Marathi, Malayalam, Kannada, Indonesian, Gujrati, Bosnian, Igbo, Ganda, Dholuo, Naija, Wolof,Kinyarwanda

Language	NLU Reference	Spark NLP Reference	Task
ig	ig.embed.xlm_roberta	xlm_roberta_base_finetuned_igbo	Embeddings
ig	ig.embed_sentence.xlm_roberta	sent_xlm_roberta_base_finetuned_igbo	Embeddings
lg	lg.embed.xlm_roberta	xlm_roberta_base_finetuned_luganda	Embeddings
lg	lg.embed_sentence.xlm_roberta	sent_xlm_roberta_base_finetuned_luganda	Embeddings
wo	wo.embed_sentence.xlm_roberta	sent_xlm_roberta_base_finetuned_wolof	Embeddings
wo	wo.embed.xlm_roberta	xlm_roberta_base_finetuned_wolof	Embeddings
rw	rw.embed_sentence.xlm_roberta	sent_xlm_roberta_base_finetuned_kinyarwanda	Embeddings
rw	rw.embed.xlm_roberta	xlm_roberta_base_finetuned_kinyarwanda	Embeddings
sw	sw.embed_sentence.xlm_roberta	sent_xlm_roberta_base_finetuned_swahili	Embeddings
sw	sw.embed.xlm_roberta	xlm_roberta_base_finetuned_swahili	Embeddings
ha	ha.embed.xlm_roberta	xlm_roberta_base_finetuned_hausa	Embeddings
ha	ha.embed_sentence.xlm_roberta	sent_xlm_roberta_base_finetuned_hausa	Embeddings
am	am.embed.xlm_roberta	xlm_roberta_base_finetuned_amharic	Embeddings
am	am.embed_sentence.xlm_roberta	sent_xlm_roberta_base_finetuned_amharic	Embeddings
yo	yo.embed_sentence.xlm_roberta	sent_xlm_roberta_base_finetuned_yoruba	Embeddings
yo	yo.embed.xlm_roberta	xlm_roberta_base_finetuned_yoruba	Embeddings
fa	fa.classify.token_roberta_token_classifier_zwnj_base_ner	roberta_token_classifier_zwnj_base_ner	Named Entity Recognition
yi	detect_sentence	sentence_detector_dl	Sentence Detection
uk	detect_sentence	sentence_detector_dl	Sentence Detection
te	detect_sentence	sentence_detector_dl	Sentence Detection
ta	detect_sentence	sentence_detector_dl	Sentence Detection
so	detect_sentence	sentence_detector_dl	Sentence Detection
sd	detect_sentence	sentence_detector_dl	Sentence Detection
ru	detect_sentence	sentence_detector_dl	Sentence Detection
pa	detect_sentence	sentence_detector_dl	Sentence Detection
ne	detect_sentence	sentence_detector_dl	Sentence Detection
mr	detect_sentence	sentence_detector_dl	Sentence Detection
ml	detect_sentence	sentence_detector_dl	Sentence Detection
kn	detect_sentence	sentence_detector_dl	Sentence Detection
id	detect_sentence	sentence_detector_dl	Sentence Detection
gu	detect_sentence	sentence_detector_dl	Sentence Detection
bs	detect_sentence	sentence_detector_dl	Sentence Detection
en	en.classify.token_roberta_large_token_classifier_conll03	roberta_large_token_classifier_conll03	Named Entity Recognition
en	en.classify.token_roberta_base_token_classifier_ontonotes	roberta_base_token_classifier_ontonotes	Named Entity Recognition
en	en.classify.token_roberta_base_token_classifier_conll03	roberta_base_token_classifier_conll03	Named Entity Recognition
en	en.classify.token_distilroberta_base_token_classifier_ontonotes	distilroberta_base_token_classifier_ontonotes	Named Entity Recognition
en	en.classify.token_albert_large_token_classifier_conll03	albert_large_token_classifier_conll03	Named Entity Recognition
en	en.classify.token_albert_base_token_classifier_conll03	albert_base_token_classifier_conll03	Named Entity Recognition
en	en.classify.token_xlnet_base_token_classifier_conll03	xlnet_base_token_classifier_conll03	Named Entity Recognition
en	en.classify.token_roberta.large_token_classifier_ontonotes	roberta_large_token_classifier_ontonotes	Named Entity Recognition
en	en.classify.token_albert.xlarge_token_classifier_conll03	albert_xlarge_token_classifier_conll03	Named Entity Recognition
en	en.classify.token_xlnet.large_token_classifier_conll03	xlnet_large_token_classifier_conll03	Named Entity Recognition
en	en.classify.token_longformer.base_token_classifier_conll03	longformer_base_token_classifier_conll03	Named Entity Recognition
xx	xx.classify.token_xlm_roberta.token_classifier_ner_40_lang	xlm_roberta_token_classifier_ner_40_lang	Named Entity Recognition
xx	xx.embed.xlm_roberta_large	xlm_roberta_large	Embeddings

New Healthcare models

The following models are available from the amazing Spark NLP for Healthcare releases 3.3.0, 3.2.3, 3.3.1, which includes 48 Multi-NER tuning pipelines, BERT-based DEidentification, German NER resolvers, Spell Checkers for Drugs, 5 ner NER models trained via BErtForTokenClassification, NER models for Radiology CID10CM, RxNORM NDC and HCPCSS models and UMLS sentence resolver models

Language	NLU Reference	Spark NLP Reference	Task
de	de.resolve.snomed	sbertresolve_snomed	Entity Resolution
de	de.resolve.icd10gm	sbertresolve_icd10gm	Entity Resolution
en	en.med_ner.profiling_clinical	ner_profiling_clinical	Pipeline Healthcare
en	en.med_ner.profiling_biobert	ner_profiling_biobert	Pipeline Healthcare
en	en.med_ner.chexpert	ner_chexpert	Named Entity Recognition
en	en.classify.token_bert.ner_bacteria	bert_token_classifier_ner_bacteria	Named Entity Recognition
en	en.classify.token_bert.ner_anatomy	bert_token_classifier_ner_anatomy	Named Entity Recognition
en	en.classify.token_bert.ner_drugs	bert_token_classifier_ner_drugs	Named Entity Recognition
en	en.classify.token_bert.ner_jsl_slim	bert_token_classifier_ner_jsl_slim	Named Entity Recognition
en	en.classify.token_bert.ner_ade	bert_token_classifier_ner_ade	Named Entity Recognition
en	en.resolve.rxnorm_ndc	sbiobertresolve_rxnorm_ndc	Entity Resolution
en	en.resolve.icd10cm_generalised	sbiobertresolve_icd10cm_generalised	Entity Resolution
en	en.resolve.hcpcs	sbiobertresolve_hcpcs	Entity Resolution
en	en.spell.drug_norvig	spellcheck_drug_norvig	Spell Check
en	en.classify.token_bert.ner_deid	bert_token_classifier_ner_deid	Named Entity Recognition
en	en.classify.token_bert.ner_chemical	bert_token_classifier_ner_chemicals	Named Entity Recognition
en	en.resolve.umls_disease_syndrome	sbiobertresolve_umls_disease_syndrome	Entity Resolution
en	en.resolve.umls_clinical_drugs	sbiobertresolve_umls_clinical_drugs	Entity Resolution

Updated Model Names

The nlu model references have been updated to better reflect their use-cases.

en.classify.token_bert.conll03
en.classify.token_bert.large_conll03
en.classify.token_bert.ontonote
en.classify.token_bert.large_ontonote
en.classify.token_bert.few_nerd
en.classify.token_bert.classifier_ner_btc
es.classify.token_bert.spanish_ner
ja.classify.token_bert.classifier_ner_ud_gsd
fa.classify.token_bert.parsbert_armanner
fa.classify.token_bert.parsbert_ner
fa.classify.token_bert.parsbert_peymaner
sv.classify.token_bert.swedish_ner
tr.classify.token_bert.turkish_ner
en.classify.token_bert.ner_clinical
en.classify.token_bert.ner_jsl

New Tutorial Videos

Optional get_embeddings parameter for pipelines

NLU pipelines can now be forced to not return embeddings via get_embeddings parameter.

Updated Compatibility Docs

Added documentation section regarding compatibility of NLU, Spark NLP and Spark NLP for healthcare

Bugfixes

Fixed a bug with Pyspark versions 3.0 and below that caused failure of predicting with pipeline
Fixed a bug that caused the results of TokenClassifier Models to not be properly extracted

Additional NLU ressources

140+ NLU Tutorials
Streamlit visualizations docs
The complete list of all 4000+ models & pipelines in 200+ languages is available on Models Hub.
Spark NLP publications
NLU in Action
NLU documentation
Discussions Engage with other community members, share ideas, and show off how you use Spark NLP and NLU!

1 line Install NLU on Google Colab

!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash

1 line Install NLU on Kaggle

!wget https://setup.johnsnowlabs.com/nlu/kaggle.sh -O - | bash

Install via PIP

! pip install nlu pyspark streamlit==0.80.0

JohnSnowLabs / nlu