JohnSnowLabs / nlu

1 line for thousands of State of the Art NLP models in hundreds of languages. The fastest and most accurate way to solve text problems.
Apache License 2.0

1 line to OCR images, PDFs and DOCX; Text Generation with GPT2 and new T5 models; Sequence Classification with XlmRoBerta, RoBerta, Xlnet, Longformer and Albert; Transformer-based medical NER with MedicalBertForTokenClassifier; 80 new models; 20+ new languages including various African and Scandinavian languages; and much more in John Snow Labs NLU 3.4.0! #92

Closed C-K-Loan closed 2 years ago

C-K-Loan commented 2 years ago

We are incredibly excited to announce that John Snow Labs NLU 3.4.0 has been released! This release features 11 new annotator classes and 80 new models, including 3 OCR Transformers which enable you to extract text from various file types, support for GPT2 and new pretrained T5 models for Text Generation, and dozens more new transformer-based models for Token and Sequence Classification. This includes 8 new Sequence Classifier models which can be pretrained in Huggingface and imported into Spark NLP and NLU. Finally, the NLU tutorial page listing the 140+ notebooks has been updated.

New NLU OCR Features

3 new OCR-based spells are supported, which enable extracting text from files of type JPEG, PNG, BMP, WBMP, GIF, JPG, TIFF, DOCX and PDF in just 1 line of code. You need a Spark OCR license to use them, which is available for free here. For details, refer to the new OCR tutorial notebook.

New NLU Healthcare Features

The healthcare side features a new MedicalBertForTokenClassifier annotator, a BERT-based model for token classification problems such as Named Entity Recognition, Part-of-Speech tagging and more. Overall there are 28 new models. These include German De-Identification models; English NER models for extracting Drug Development Trials and Clinical Abbreviations and Acronyms; NER models for chemical compounds/drugs and genes/proteins; and updated MedicalBertForTokenClassifier NER models for the medical domains Adverse Drug Events, Anatomy, Chemicals, Genes, Proteins, Cellular/Molecular Biology, Drugs, Bacteria, De-Identification, and general Medical and Clinical Named Entities.
For Entity Relation Extraction between entity pairs, there are new models for interactions between Drugs and Proteins.
For Entity Resolution, there are new models for resolving Clinical Abbreviations and Acronyms to their full-length names, a model for resolving Drug Substance Entities to the categories Clinical Drug, Pharmacologic Substance, Antibiotic, and Hazardous or Poisonous Substance, and new resolvers for the LOINC and SNOMED terminologies.

New NLU Open source Features

On the open source side we have new support for OpenAI's GPT2 for various text sequence-to-sequence problems. Additionally, the following new Transformer models are supported: RoBertaForSequenceClassification, XlmRoBertaForSequenceClassification, LongformerForSequenceClassification, AlbertForSequenceClassification and XlnetForSequenceClassification, as well as Word2Vec with various pre-trained weights for various problems!

New GPT2 models for generating text conditioned on some input.
New T5 style transfer models for active-to-passive, formal-to-informal, informal-to-formal and passive-to-active sequence-to-sequence generation.
Additionally, a new T5 model for generating SQL code from natural language input is provided.

On top of this, dozens of new Transformer-based Sequence Classifiers and Token Classifiers have been released. The new Token Classifier models include:
Multi-lingual general NER models for 10 African languages (Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian Pidgin, Swahili, Wolof, and Yorùbá),
10 high-resourced languages (Arabic, German, English, Spanish, French, Italian, Latvian, Dutch, Portuguese and Chinese),
6 Scandinavian languages (Danish, Norwegian-Bokmål, Norwegian-Nynorsk, Swedish, Icelandic, Faroese),
Uni-lingual NER models for general entities in Chinese, Hindi, Icelandic and Indonesian,
and finally English NER models for extracting entities related to Stock Ticker Symbols, Restaurants and Time.

For Sequence Classification, there are new models for classifying Toxicity in Russian text and English models for Movie Reviews, News Categorization, Sentimental Tone and General Sentiment.
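As a rough illustration of how the NER references from this release could be chosen per language, the sketch below prefers a uni-lingual model when one is mentioned above and otherwise falls back to the multi-lingual model covering that language group. The language codes and the helper names `ner_reference_for` and `tag_entities` are assumptions made for this example and are not part of NLU; only `nlu.load(...).predict(...)` is the library call.

```python
# Multi-lingual NER references from this release, keyed to the language
# groups they cover. The language codes are assumptions for illustration.
MULTILINGUAL_NER = {
    "xx.ner.masakhaner": {"am", "ha", "ig", "rw", "lg", "luo", "pcm", "sw", "wo", "yo"},
    "xx.ner.high_resourced_lang": {"ar", "de", "en", "es", "fr", "it", "lv", "nl", "pt", "zh"},
    "xx.ner.scandinavian": {"da", "nb", "nn", "sv", "is", "fo"},
}

# Uni-lingual general NER references named in this release.
UNILINGUAL_NER = {"zh": "zh.ner", "hi": "hi.ner", "is": "is.ner", "id": "id.ner"}


def ner_reference_for(lang: str) -> str:
    """Pick an NLU NER reference for a language code (hypothetical helper)."""
    lang = lang.lower()
    # Prefer a dedicated uni-lingual model when the release provides one.
    if lang in UNILINGUAL_NER:
        return UNILINGUAL_NER[lang]
    # Otherwise fall back to the multi-lingual model covering the language.
    for reference, languages in MULTILINGUAL_NER.items():
        if lang in languages:
            return reference
    raise KeyError(f"No NER reference from this release covers {lang!r}")


def tag_entities(text: str, lang: str):
    """Load the chosen NER model and run it on `text` (hypothetical helper)."""
    import nlu  # deferred so the lookup above works without NLU installed

    return nlu.load(ner_reference_for(lang)).predict(text)
```

With NLU installed, `tag_entities("Mwalimu Nyerere alizaliwa Butiama", "sw")` would load `xx.ner.masakhaner` under these assumptions; the first call downloads the model.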

New NLU OCR Models

The following Transformers have been integrated from Spark OCR

| NLU Spell | Transformer Class |
|---|---|
| nlu.load('img2text') | ImageToText |
| nlu.load('pdf2text') | PdfToText |
| nlu.load('doc2text') | DocToText |
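A minimal sketch of how one of the three OCR spells could be selected from a file's extension. The extension-to-spell mapping is an assumption inferred from the file types listed in this announcement (image formats to `img2text`, PDF to `pdf2text`, DOCX to `doc2text`), and `ocr_spell_for` and `extract_text` are hypothetical helper names, not NLU functions. The `nlu` import is deferred so the mapping can be checked without NLU or a Spark OCR license in place.

```python
from pathlib import Path

# Assumed mapping of image extensions to the img2text spell, inferred from
# the file types named in this release note; not an official NLU table.
_IMAGE_EXTENSIONS = {".jpeg", ".jpg", ".png", ".bmp", ".wbmp", ".gif", ".tiff"}


def ocr_spell_for(path: str) -> str:
    """Return the OCR spell name matching the file extension of `path`."""
    ext = Path(path).suffix.lower()
    if ext in _IMAGE_EXTENSIONS:
        return "img2text"
    if ext == ".pdf":
        return "pdf2text"
    if ext == ".docx":
        return "doc2text"
    raise ValueError(f"No OCR spell known for extension {ext!r}")


def extract_text(path: str):
    """Load the matching OCR spell and run it on `path` (hypothetical helper).

    NLU, Spark OCR and a valid license must already be set up.
    """
    import nlu  # deferred: only needed when OCR is actually run

    return nlu.load(ocr_spell_for(path)).predict(path)
```

Keeping the lookup separate from the `nlu.load` call means unsupported file types fail fast, before any Spark session or model download is started.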

New Open Source Models

Integration for the 49 new models from the colossal Spark NLP 3.4.0 release

| Language | NLU Reference | Spark NLP Reference | Task | Annotator Class |
|---|---|---|---|---|
| en | en.gpt2.distilled | gpt2_distilled | Text Generation | GPT2Transformer |
| en | en.gpt2 | gpt2 | Text Generation | GPT2Transformer |
| en | en.gpt2.medium | gpt2_medium | Text Generation | GPT2Transformer |
| en | en.gpt2.large | gpt_large | Text Generation | GPT2Transformer |
| en | en.t5.active_to_passive_styletransfer | t5_active_to_passive_styletransfer | Text Generation | T5Transformer |
| en | en.t5.formal_to_informal_styletransfer | t5_formal_to_informal_styletransfer | Text Generation | T5Transformer |
| en | en.t5.grammar_error_corrector | t5_grammar_error_corrector | Text Generation | T5Transformer |
| en | en.t5.informal_to_formal_styletransfer | t5_informal_to_formal_styletransfer | Text Generation | T5Transformer |
| en | en.t5.passive_to_active_styletransfer | t5_passive_to_active_styletransfer | Text Generation | T5Transformer |
| en | en.t5.wikiSQL | t5_small_wikiSQL | Text Generation | T5Transformer |
| xx | xx.ner.masakhaner | xlm_roberta_large_token_classifier_masakhaner | Named Entity Recognition | XlmRoBertaForTokenClassification |
| xx | xx.ner.high_resourced_lang | xlm_roberta_large_token_classifier_hrl | Named Entity Recognition | XlmRoBertaForTokenClassification |
| xx | xx.ner.scandinavian | bert_token_classifier_scandi_ner | Named Entity Recognition | BertForTokenClassification |
| en | en.embed.electra.medical | electra_medal_acronym | Embeddings | BertEmbeddings |
| en | en.ner.restaurant | nerdl_restaurant_100d | Named Entity Recognition | NerDLModel |
| en | en.embed.word2vec.gigaword_wiki | word2vec_gigaword_wiki_300 | Embeddings | Word2VecModel |
| en | en.embed.word2vec.gigaword | word2vec_gigaword_300 | Embeddings | Word2VecModel |
| en | en.classify.xlm_roberta.imdb | xlm_roberta_base_sequence_classifier_imdb | Text Classification | XlmRoBertaForSequenceClassification |
| en | en.classify.xlm_roberta.ag_news | xlm_roberta_base_sequence_classifier_ag_news | Text Classification | XlmRoBertaForSequenceClassification |
| en | en.classify.roberta.imdb | roberta_base_sequence_classifier_imdb | Text Classification | RoBertaForSequenceClassification |
| en | en.classify.roberta.ag_news | roberta_base_sequence_classifier_ag_news | Text Classification | RoBertaForSequenceClassification |
| en | en.classify.albert.ag_news | albert_base_sequence_classifier_ag_news | Text Classification | AlbertForSequenceClassification |
| en | en.classify.albert.imdb | albert_base_sequence_classifier_imdb | Text Classification | AlbertForSequenceClassification |
| en | en.classify.ag_news.longformer | longformer_base_sequence_classifier_ag_news | Text Classification | LongformerForSequenceClassification |
| en | en.classify.imdb.xlnet | xlnet_base_sequence_classifier_imdb | Text Classification | XlnetForSequenceClassification |
| en | en.classify.finance_sentiment | bert_sequence_classifier_finbert_tone | Sentiment Analysis | BertForSequenceClassification |
| en | en.classify.imdb.longformer | longformer_base_sequence_classifier_imdb | Text Classification | LongformerForSequenceClassification |
| en | en.ner.time | roberta_token_classifier_timex_semeval | Named Entity Recognition | RoBertaForTokenClassification |
| en | en.ner.stocks_ticker | roberta_token_classifier_ticker | Named Entity Recognition | RoBertaForTokenClassification |
| ru | ru.classify.toxic | bert_sequence_classifier_toxicity | Text Classification | BertForSequenceClassification |
| it | it.classify.sentiment | bert_sequence_classifier_sentiment | Sentiment Analysis | BertForSequenceClassification |
| es | es.ner | wikiner_6B_100 | Named Entity Recognition | NerDLModel |
| is | is.ner | roberta_token_classifier_icelandic_ner | Named Entity Recognition | RoBertaForTokenClassification |
| id | id.pos | roberta_token_classifier_pos_tagger | Part of Speech Tagging | RoBertaForTokenClassification |
| tr | tr.ner | turkish_ner_840B_300 | Named Entity Recognition | NerDLModel |
| id | id.ner | xlm_roberta_large_token_classification_ner | Named Entity Recognition | XlmRoBertaForTokenClassification |
| de | de.ner | xlm_roberta_large_token_classifier_conll03 | Named Entity Recognition | XlmRoBertaForTokenClassification |
| hi | hi.ner | bert_token_classifier_hi_en_ner | Named Entity Recognition | BertForTokenClassification |
| nl | nl.ner | wikiner_6B_100 | Named Entity Recognition | NerDLModel |
| zh | zh.ner | bert_token_classifier_chinese_ner | Named Entity Recognition | BertForTokenClassification |
| fr | fr.classify.xlm_roberta.allocine | xlm_roberta_base_sequence_classifier_allocine | Text Classification | XlmRoBertaForSequenceClassification |
| ur | ur.classify.fakenews | classifierdl_urduvec_fakenews | Text Classification | ClassifierDLModel |
| ur | ur.classify.news | classifierdl_bert_news | Text Classification | ClassifierDLModel |
| fi | fi.embed_sentence.bert.uncased | bert_base_finnish_uncased | Embeddings | BertSentenceEmbeddings |
| fi | fi.embed_sentence.bert | bert_base_finnish_uncased | Embeddings | BertSentenceEmbeddings |
| fi | fi.embed_sentence.bert.cased | bert_base_finnish_cased | Embeddings | BertSentenceEmbeddings |
| te | te.embed.distilbert | distilbert_uncased | Embeddings | DistilBertEmbeddings |
| sw | sw.embed.xlm_roberta | xlm_roberta_base_finetuned_swahili | Embeddings | XlmRoBertaEmbeddings |

New Healthcare Models

Integration for the 28 new models from the amazing Spark NLP for healthcare 3.4.0 release

| Language | NLU Reference | Spark NLP Reference | Task | Annotator Class |
|---|---|---|---|---|
| en | en.med_ner.chemprot.bert | bert_token_classifier_ner_chemprot | Named Entity Recognition | MedicalBertForTokenClassifier |
| en | en.classify.token_bert.ner_bacteria | bert_token_classifier_ner_bacteria | Named Entity Recognition | MedicalBertForTokenClassifier |
| en | en.classify.token_bert.ner_anatomy | bert_token_classifier_ner_anatomy | Named Entity Recognition | MedicalBertForTokenClassifier |
| en | en.classify.token_bert.ner_drugs | bert_token_classifier_ner_drugs | Named Entity Recognition | MedicalBertForTokenClassifier |
| en | en.classify.token_bert.ner_jsl_slim | bert_token_classifier_ner_jsl_slim | Named Entity Recognition | MedicalBertForTokenClassifier |
| en | en.classify.token_bert.ner_ade | bert_token_classifier_ner_ade | Named Entity Recognition | MedicalBertForTokenClassifier |
| en | en.classify.token_bert.ner_deid | bert_token_classifier_ner_deid | Named Entity Recognition | MedicalBertForTokenClassifier |
| en | en.classify.token_bert.ner_clinical | bert_token_classifier_ner_clinical | Named Entity Recognition | MedicalBertForTokenClassifier |
| en | en.classify.token_bert.ner_jsl | bert_token_classifier_ner_jsl | Named Entity Recognition | MedicalBertForTokenClassifier |
| en | en.classify.token_bert.ner_chemical | bert_token_classifier_ner_chemicals | Named Entity Recognition | MedicalBertForTokenClassifier |
| en | en.classify.token_bert.bionlp | bert_token_classifier_ner_bionlp | Named Entity Recognition | MedicalBertForTokenClassifier |
| en | en.classify.token_bert.cellular | bert_token_classifier_ner_cellular | Named Entity Recognition | MedicalBertForTokenClassifier |
| en | en.med_ner.abbreviation_clinical | ner_abbreviation_clinical | Named Entity Recognition | MedicalNerModel |
| en | en.med_ner.drugprot_clinical | ner_drugprot_clinical | Named Entity Recognition | MedicalNerModel |
| en | en.ner.drug_development_trials | bert_token_classifier_drug_development_trials | Named Entity Recognition | BertForTokenClassification |
| en | en.med_ner.chemprot | ner_chemprot_biobert | Named Entity Recognition | MedicalNerModel |
| en | en.relation.drugprot | redl_drugprot_biobert | Relation Extraction | RelationExtractionDLModel |
| en | en.relation.drugprot.clinical | re_drugprot_clinical | Relation Extraction | RelationExtractionModel |
| en | en.resolve.clinical_abbreviation_acronym | sbiobertresolve_clinical_abbreviation_acronym | Entity Resolution | SentenceEntityResolverModel |
| en | en.resolve.umls_drug_substance | sbiobertresolve_umls_drug_substance | Entity Resolution | SentenceEntityResolverModel |
| en | en.resolve.loinc_cased | sbiobertresolve_loinc_cased | Entity Resolution | SentenceEntityResolverModel |
| en | en.resolve.loinc_uncased | sbluebertresolve_loinc_uncased | Entity Resolution | SentenceEntityResolverModel |
| en | en.embed_sentence.biobert.rxnorm | sbiobert_jsl_rxnorm_cased | Entity Resolution | BertSentenceEmbeddings |
| en | en.embed_sentence.bert_uncased.rxnorm | sbert_jsl_medium_rxnorm_uncased | Embeddings | BertSentenceEmbeddings |
| en | en.resolve.snomed_drug | sbiobertresolve_snomed_drug | Entity Resolution | SentenceEntityResolverModel |
| de | de.med_ner.deid_subentity | ner_deid_subentity | Named Entity Recognition | MedicalNerModel |
| de | de.med_ner.deid_generic | ner_deid_generic | Named Entity Recognition | MedicalNerModel |
| de | de.embed.w2v | w2v_cc_300d | Embeddings | WordEmbeddingsModel |

Additional NLU resources

1 line Install NLU on Google Colab

!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash

1 line Install NLU on Kaggle

!wget https://setup.johnsnowlabs.com/nlu/kaggle.sh -O - | bash

Install via pip

!pip install nlu pyspark streamlit==0.80.0
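After any of the install routes above, a quick check that the packages are importable can save a confusing first run. The sketch below uses only the Python standard library; `missing_packages` is a hypothetical helper name, and the package list simply mirrors the pip command above.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Package names taken from the pip install line in this announcement.
    missing = missing_packages(["nlu", "pyspark"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("NLU and PySpark are importable.")
```

`find_spec` only locates the packages without importing them, so the check stays fast and does not start a Spark session.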