JohnSnowLabs / nlu

1 line for thousands of state-of-the-art NLP models in hundreds of languages. The fastest and most accurate way to solve text problems.
Apache License 2.0

2000%+ Speedup on small data, 63 new models for 100+ Languages with 6 new supported Transformer classes including BERT, XLM-RoBERTa, ALBERT, Longformer, XLNet based models, 48 NER profiling healthcare pipelines and much more in John Snow Labs NLU 3.3.0 #83

Closed C-K-Loan closed 2 years ago

C-K-Loan commented 2 years ago

We are incredibly excited to announce that NLU 3.3.0 has been released! It comes with a speedup of up to 2000%+ on small datasets and 6 new types of deep learning transformer models: RoBertaForTokenClassification, XlmRoBertaForTokenClassification, AlbertForTokenClassification, LongformerForTokenClassification, XlnetForTokenClassification, and XlmRoBertaSentenceEmbeddings. In total there are 63 new NLP models and 6 newly supported languages: Igbo, Ganda, Dholuo, Naija, Wolof, and Kinyarwanda, with the corresponding ISO codes ig, lg, lou, pcm, wo, and rw. New SOTA XLM-RoBERTa models are available for Luganda, Kinyarwanda, Igbo, Hausa, and Amharic, along with 2 new multilingual embeddings that support 100+ languages via XLM-RoBERTa.

On the healthcare NLP side, we are glad to announce 18 new NLP for Healthcare models, including: NER profiling pretrained pipelines that run 48 different clinical NER models and 21 different BioBERT models at once over the input text; a new BERT-based deidentification NER model; Sentence Entity Resolver models for German; a new spell checker model for drugs; 3 new Sentence Entity Resolver models (3-char ICD10CM, RxNorm_NDC, HCPCS); 5 new clinical NER models (trained with the BertForTokenClassification approach); a radiology NER model trained on the CheXpert dataset; and new UMLS Sentence Entity Resolver models.

Additionally, 2 new tutorials are available: the NLU & Streamlit Crashcourse and the NLU for Healthcare Crashcourse, which covers each of the 50+ healthcare domains and 200+ healthcare models.

New Features and Improvements

2000%+ prediction speedup for small datasets

NLU pipelines now predict up to 2000% faster thanks to optimized integration with Spark NLP's light pipelines. NLU configures this automatically, but it can be turned off via multithread=False.
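
As a minimal sketch of switching the speedup off, the helper below passes multithread=False to a pipeline's predict call. It assumes the flag is accepted by predict on the object returned by nlu.load; the helper names are hypothetical, and nlu is imported lazily so the snippet can be read and imported without a Spark installation.

```python
from typing import Any, Sequence


def predict_single_threaded(pipe: Any, texts: Sequence[str]) -> Any:
    """Predict with the automatic light-pipeline speedup switched off.

    `pipe` is expected to be the object returned by `nlu.load(...)`;
    anything exposing a compatible `predict` method works as well.
    """
    return pipe.predict(list(texts), multithread=False)


def load_and_predict(model_ref: str, texts: Sequence[str]) -> Any:
    """Load an NLU model by its reference, then predict without the speedup."""
    import nlu  # lazy import: needs `pip install nlu pyspark`

    return predict_single_threaded(nlu.load(model_ref), texts)
```

For example, load_and_predict("en.classify.token_roberta_base_token_classifier_conll03", ["John Snow Labs is based in Delaware"]) would run one of the token classifiers from this release with the fast path disabled, which can help when comparing results or timings against the default behaviour.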
NLU 3.3.0 Benchmark

50x faster saving of NLU Pipelines

Up to 50x faster saving of Spark NLP / NLU models and pipelines! We have improved the way the TensorFlow SavedModel is packaged when saving Spark NLP models and pipelines. For instance, saving the xlm_roberta_base model used to take up to 10 minutes before Spark NLP 3.3.0, and it now takes only up to 15 seconds!
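
To put the quoted figures in perspective, the short sketch below computes the speedup implied by the xlm_roberta_base example (10 minutes versus 15 seconds) and provides a small timing helper for measuring a save on your own setup. The helper names are hypothetical and not part of NLU.

```python
import time
from typing import Callable


def speedup(before_seconds: float, after_seconds: float) -> float:
    """Return how many times faster the new duration is than the old one."""
    if after_seconds <= 0:
        raise ValueError("after_seconds must be positive")
    return before_seconds / after_seconds


def timed(action: Callable[[], object]) -> float:
    """Run `action` once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    action()
    return time.perf_counter() - start


# Figures quoted above for xlm_roberta_base: up to 10 minutes vs. up to 15 seconds.
print(speedup(10 * 60, 15))  # prints 40.0
```

With a loaded pipeline, something like timed(lambda: pipe.save("/tmp/my_pipe")) could measure a save directly; the save method and its signature are an assumption here, so check the NLU documentation for the exact call.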

New Annotator Classes Integrated

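The following new transformer classes are available with various pretrained weights in 1 line of code: RoBertaForTokenClassification, XlmRoBertaForTokenClassification, AlbertForTokenClassification, LongformerForTokenClassification, XlnetForTokenClassification, and XlmRoBertaSentenceEmbeddings.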

New Transformer Models

The following models are available from the amazing Spark NLP 3.3.0 and 3.3.1 releases, which include NLP models for Yiddish, Ukrainian, Telugu, Tamil, Somali, Sindhi, Russian, Punjabi, Nepali, Marathi, Malayalam, Kannada, Indonesian, Gujarati, Bosnian, Igbo, Ganda, Dholuo, Naija, Wolof, and Kinyarwanda.

| Language | NLU Reference | Spark NLP Reference | Task |
|---|---|---|---|
| ig | ig.embed.xlm_roberta | xlm_roberta_base_finetuned_igbo | Embeddings |
| ig | ig.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_igbo | Embeddings |
| lg | lg.embed.xlm_roberta | xlm_roberta_base_finetuned_luganda | Embeddings |
| lg | lg.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_luganda | Embeddings |
| wo | wo.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_wolof | Embeddings |
| wo | wo.embed.xlm_roberta | xlm_roberta_base_finetuned_wolof | Embeddings |
| rw | rw.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_kinyarwanda | Embeddings |
| rw | rw.embed.xlm_roberta | xlm_roberta_base_finetuned_kinyarwanda | Embeddings |
| sw | sw.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_swahili | Embeddings |
| sw | sw.embed.xlm_roberta | xlm_roberta_base_finetuned_swahili | Embeddings |
| ha | ha.embed.xlm_roberta | xlm_roberta_base_finetuned_hausa | Embeddings |
| ha | ha.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_hausa | Embeddings |
| am | am.embed.xlm_roberta | xlm_roberta_base_finetuned_amharic | Embeddings |
| am | am.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_amharic | Embeddings |
| yo | yo.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_yoruba | Embeddings |
| yo | yo.embed.xlm_roberta | xlm_roberta_base_finetuned_yoruba | Embeddings |
| fa | fa.classify.token_roberta_token_classifier_zwnj_base_ner | roberta_token_classifier_zwnj_base_ner | Named Entity Recognition |
| yi | detect_sentence | sentence_detector_dl | Sentence Detection |
| uk | detect_sentence | sentence_detector_dl | Sentence Detection |
| te | detect_sentence | sentence_detector_dl | Sentence Detection |
| ta | detect_sentence | sentence_detector_dl | Sentence Detection |
| so | detect_sentence | sentence_detector_dl | Sentence Detection |
| sd | detect_sentence | sentence_detector_dl | Sentence Detection |
| ru | detect_sentence | sentence_detector_dl | Sentence Detection |
| pa | detect_sentence | sentence_detector_dl | Sentence Detection |
| ne | detect_sentence | sentence_detector_dl | Sentence Detection |
| mr | detect_sentence | sentence_detector_dl | Sentence Detection |
| ml | detect_sentence | sentence_detector_dl | Sentence Detection |
| kn | detect_sentence | sentence_detector_dl | Sentence Detection |
| id | detect_sentence | sentence_detector_dl | Sentence Detection |
| gu | detect_sentence | sentence_detector_dl | Sentence Detection |
| bs | detect_sentence | sentence_detector_dl | Sentence Detection |
| en | en.classify.token_roberta_large_token_classifier_conll03 | roberta_large_token_classifier_conll03 | Named Entity Recognition |
| en | en.classify.token_roberta_base_token_classifier_ontonotes | roberta_base_token_classifier_ontonotes | Named Entity Recognition |
| en | en.classify.token_roberta_base_token_classifier_conll03 | roberta_base_token_classifier_conll03 | Named Entity Recognition |
| en | en.classify.token_distilroberta_base_token_classifier_ontonotes | distilroberta_base_token_classifier_ontonotes | Named Entity Recognition |
| en | en.classify.token_albert_large_token_classifier_conll03 | albert_large_token_classifier_conll03 | Named Entity Recognition |
| en | en.classify.token_albert_base_token_classifier_conll03 | albert_base_token_classifier_conll03 | Named Entity Recognition |
| en | en.classify.token_xlnet_base_token_classifier_conll03 | xlnet_base_token_classifier_conll03 | Named Entity Recognition |
| en | en.classify.token_roberta.large_token_classifier_ontonotes | roberta_large_token_classifier_ontonotes | Named Entity Recognition |
| en | en.classify.token_albert.xlarge_token_classifier_conll03 | albert_xlarge_token_classifier_conll03 | Named Entity Recognition |
| en | en.classify.token_xlnet.large_token_classifier_conll03 | xlnet_large_token_classifier_conll03 | Named Entity Recognition |
| en | en.classify.token_longformer.base_token_classifier_conll03 | longformer_base_token_classifier_conll03 | Named Entity Recognition |
| xx | xx.classify.token_xlm_roberta.token_classifier_ner_40_lang | xlm_roberta_token_classifier_ner_40_lang | Named Entity Recognition |
| xx | xx.embed.xlm_roberta_large | xlm_roberta_large | Embeddings |
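
The fine-tuned XLM-RoBERTa references listed above follow a regular pattern, so a reference can be built from a language code. The sketch below is illustrative only: the helper names are hypothetical, the language set is copied from the list above, and the loading step assumes the usual nlu.load(reference).predict(data) usage, with nlu imported lazily.

```python
from typing import Any, Sequence

# Language codes with fine-tuned XLM-RoBERTa embeddings in the list above.
FINETUNED_LANGS = {"ig", "lg", "wo", "rw", "sw", "ha", "am", "yo"}


def xlm_roberta_ref(lang: str, sentence_level: bool = False) -> str:
    """Build the NLU reference for a fine-tuned XLM-RoBERTa embedding model."""
    if lang not in FINETUNED_LANGS:
        raise ValueError(f"no fine-tuned XLM-RoBERTa model listed for {lang!r}")
    kind = "embed_sentence" if sentence_level else "embed"
    return f"{lang}.{kind}.xlm_roberta"


def embed(lang: str, texts: Sequence[str], sentence_level: bool = False) -> Any:
    """Load the matching model and compute embeddings for `texts`."""
    import nlu  # lazy import: needs `pip install nlu pyspark`

    return nlu.load(xlm_roberta_ref(lang, sentence_level)).predict(list(texts))
```

For instance, xlm_roberta_ref("ig") gives "ig.embed.xlm_roberta" and xlm_roberta_ref("yo", sentence_level=True) gives "yo.embed_sentence.xlm_roberta", both of which appear in the list above.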

New Healthcare models

The following models are available from the amazing Spark NLP for Healthcare releases 3.2.3, 3.3.0, and 3.3.1, which include 48 multi-NER tuning pipelines, BERT-based deidentification, German NER resolvers, spell checkers for drugs, 5 new NER models trained via BertForTokenClassification, a NER model for radiology, ICD10CM, RxNorm NDC, and HCPCS resolver models, and UMLS sentence resolver models.

| Language | NLU Reference | Spark NLP Reference | Task |
|---|---|---|---|
| de | de.resolve.snomed | sbertresolve_snomed | Entity Resolution |
| de | de.resolve.icd10gm | sbertresolve_icd10gm | Entity Resolution |
| en | en.med_ner.profiling_clinical | ner_profiling_clinical | Pipeline Healthcare |
| en | en.med_ner.profiling_biobert | ner_profiling_biobert | Pipeline Healthcare |
| en | en.med_ner.chexpert | ner_chexpert | Named Entity Recognition |
| en | en.classify.token_bert.ner_bacteria | bert_token_classifier_ner_bacteria | Named Entity Recognition |
| en | en.classify.token_bert.ner_anatomy | bert_token_classifier_ner_anatomy | Named Entity Recognition |
| en | en.classify.token_bert.ner_drugs | bert_token_classifier_ner_drugs | Named Entity Recognition |
| en | en.classify.token_bert.ner_jsl_slim | bert_token_classifier_ner_jsl_slim | Named Entity Recognition |
| en | en.classify.token_bert.ner_ade | bert_token_classifier_ner_ade | Named Entity Recognition |
| en | en.resolve.rxnorm_ndc | sbiobertresolve_rxnorm_ndc | Entity Resolution |
| en | en.resolve.icd10cm_generalised | sbiobertresolve_icd10cm_generalised | Entity Resolution |
| en | en.resolve.hcpcs | sbiobertresolve_hcpcs | Entity Resolution |
| en | en.spell.drug_norvig | spellcheck_drug_norvig | Spell Check |
| en | en.classify.token_bert.ner_deid | bert_token_classifier_ner_deid | Named Entity Recognition |
| en | en.classify.token_bert.ner_chemical | bert_token_classifier_ner_chemicals | Named Entity Recognition |
| en | en.resolve.umls_disease_syndrome | sbiobertresolve_umls_disease_syndrome | Entity Resolution |
| en | en.resolve.umls_clinical_drugs | sbiobertresolve_umls_clinical_drugs | Entity Resolution |

Updated Model Names

The NLU model references have been updated to better reflect their use cases.

New Tutorial Videos

Optional get_embeddings parameter for pipelines

NLU pipelines can now be forced to not return embeddings via the get_embeddings parameter.
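
A minimal sketch of how this might look in practice, assuming get_embeddings is passed as a keyword argument to predict on a loaded pipeline (the helper name is hypothetical):

```python
from typing import Any, Sequence


def predict_without_embeddings(pipe: Any, texts: Sequence[str]) -> Any:
    """Predict while asking the pipeline not to return embeddings.

    `pipe` is expected to be the object returned by `nlu.load(...)`.
    Dropping embeddings keeps the result small when only labels or
    entities are needed downstream.
    """
    return pipe.predict(list(texts), get_embeddings=False)
```

Calling predict_without_embeddings(nlu.load("en.classify.token_albert_base_token_classifier_conll03"), ["Angela Merkel visited Paris"]) would then be expected to return the predictions without the embedding values.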

Updated Compatibility Docs

Added a documentation section on the compatibility of NLU, Spark NLP, and Spark NLP for Healthcare.

Bugfixes

Additional NLU resources

1 line Install NLU on Google Colab

!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash

1 line Install NLU on Kaggle

!wget https://setup.johnsnowlabs.com/nlu/kaggle.sh -O - | bash

Install via PIP

! pip install nlu pyspark streamlit==0.80.0