JohnSnowLabs / nlu

1 line for thousands of state-of-the-art NLP models in hundreds of languages. The fastest and most accurate way to solve text problems.
Apache License 2.0

Zero-Shot-Relation-Extraction, DeBERTa for Sequence Classification, 150+ new models, 60+ Languages in John Snow Labs NLU 3.4.3 #116

Closed C-K-Loan closed 2 years ago

C-K-Loan commented 2 years ago

We are very excited to announce NLU 3.4.3 has been released!

This release features new models for Zero-Shot Relation Extraction, DeBERTa for Sequence Classification, and Deidentification in French and Italian, as well as Lemmatizers, Part of Speech Taggers, and Word2Vec Embeddings for over 66 languages. 20 languages are covered by NLU for the first time, including ancient and exotic languages like Ancient Greek, Old Russian, Old French, and many more. Once again, we would like to thank our community for making this release possible.

NLU for Healthcare

On the healthcare NLP side, a new ZeroShotRelationExtractionModel is available, which can extract relations between clinical entities in an unsupervised fashion, no training required! Additionally, new French and Italian Deidentification models are available for clinical and healthcare domains. Powered by the fantastic Spark NLP for Healthcare 3.5.0 release.

Zero-Shot Relation Extraction

Zero-shot relation extraction extracts relations between clinical entities without any training dataset:

import nlu

pipe = nlu.load('med_ner.clinical relation.zeroshot_biobert')
# Configure relations to extract
pipe['zero_shot_relation_extraction'].setRelationalCategories({
    "CURE": ["{{TREATMENT}} cures {{PROBLEM}}."],
    "IMPROVE": ["{{TREATMENT}} improves {{PROBLEM}}.", "{{TREATMENT}} cures {{PROBLEM}}."],
    "REVEAL": ["{{TEST}} reveals {{PROBLEM}}."],
}).setMultiLabel(False)
df = pipe.predict("Paracetamol can alleviate headache or sickness. An MRI test can be used to find cancer.")
# Select the relation columns (a list of column names needs double brackets)
df[['relation', 'relation_confidence', 'relation_entity1', 'relation_entity1_class',
    'relation_entity2', 'relation_entity2_class']]
This results in the following table:

| relation | relation_confidence | relation_entity1 | relation_entity1_class | relation_entity2 | relation_entity2_class |
|---|---|---|---|---|---|
| REVEAL | 0.976004 | An MRI test | TEST | cancer | PROBLEM |
| IMPROVE | 0.988195 | Paracetamol | TREATMENT | sickness | PROBLEM |
| IMPROVE | 0.992962 | Paracetamol | TREATMENT | headache | PROBLEM |
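
To make the role of the prompt templates more concrete, here is a minimal sketch of how a template such as `{{TREATMENT}} cures {{PROBLEM}}.` could be expanded into candidate statements for each pair of detected entities. This only illustrates the general idea and is not the implementation inside ZeroShotRelationExtractionModel: the function name `expand_templates` and the hand-written entity list are hypothetical, and the scoring step that the model performs on each statement is left out.

```python
import re
from itertools import permutations

# Relation templates, copied from the configuration shown above.
RELATION_TEMPLATES = {
    "CURE": ["{{TREATMENT}} cures {{PROBLEM}}."],
    "IMPROVE": ["{{TREATMENT}} improves {{PROBLEM}}.", "{{TREATMENT}} cures {{PROBLEM}}."],
    "REVEAL": ["{{TEST}} reveals {{PROBLEM}}."],
}

# Matches placeholders such as {{TREATMENT}} and captures the entity class.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def expand_templates(entities, templates):
    """Yield (relation, entity1, entity2, statement) for every ordered pair of
    entities whose classes match the two placeholders of a template.

    `entities` is a list of (text, entity_class) tuples (hypothetical format).
    """
    for relation, prompts in templates.items():
        for prompt in prompts:
            slots = PLACEHOLDER.findall(prompt)
            if len(slots) != 2:
                continue  # this sketch only handles templates with two placeholders
            for (text1, class1), (text2, class2) in permutations(entities, 2):
                if [class1, class2] != slots:
                    continue
                # Fill the placeholders left to right with the two entity texts.
                fillers = iter((text1, text2))
                statement = PLACEHOLDER.sub(lambda _match: next(fillers), prompt)
                yield relation, text1, text2, statement


if __name__ == "__main__":
    # Entities of the first example sentence, written by hand for illustration.
    sentence_entities = [
        ("Paracetamol", "TREATMENT"),
        ("headache", "PROBLEM"),
        ("sickness", "PROBLEM"),
    ]
    for row in expand_templates(sentence_entities, RELATION_TEMPLATES):
        print(row)
```

Each generated statement would then be scored against the original sentence by the model. With `setMultiLabel(False)`, presumably only one relation is kept per entity pair, which would be consistent with the single IMPROVE row per pair in the table above.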

New Healthcare Models overview

| Language | NLU Reference | Spark NLP Reference | Task | Annotator Class |
|---|---|---|---|---|
| en | en.relation.zeroshot_biobert | re_zeroshot_biobert | Relation Extraction | ZeroShotRelationExtractionModel |
| fr | fr.med_ner.deid_generic | ner_deid_generic | De-identification | MedicalNerModel |
| fr | fr.med_ner.deid_subentity | ner_deid_subentity | De-identification | MedicalNerModel |
| it | it.med_ner.deid_generic | ner_deid_generic | Named Entity Recognition | MedicalNerModel |
| it | it.med_ner.deid_subentity | ner_deid_subentity | Named Entity Recognition | MedicalNerModel |

NLU general

On the general NLP side, we have new transformer-based DeBERTa v3 sequence classifier models fine-tuned for sentiment and news classification in Urdu, French, and English. Additionally, there are 100+ Part of Speech Taggers and Lemmatizers for 66 languages, and new Word2Vec embeddings for 7 languages: hi, azb, bo, diq, cy, es, it. Powered by the amazing Spark NLP 3.4.3 release.
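
As a rough usage sketch, the new classifiers can be loaded by their NLU reference in the same way as the healthcare pipeline above. The references below are the English DeBERTa IMDB sentiment models from the models overview; the dictionary name `DEBERTA_IMDB_REFS`, the helper `classify_sentiment`, and the example sentences are assumptions made for illustration. Running it requires an installed `nlu` package with a working Spark and Java setup, and the exact output columns are not shown because they are not stated in this release note.

```python
# NLU references of the new English DeBERTa v3 IMDB sentiment classifiers,
# copied from the models overview of this release.
DEBERTA_IMDB_REFS = {
    "xsmall": "en.classify.sentiment.imdb.deberta",
    "small": "en.classify.sentiment.imdb.deberta.small",
    "base": "en.classify.sentiment.imdb.deberta.base",
    "large": "en.classify.sentiment.imdb.deberta.large",
}


def classify_sentiment(texts, size="xsmall"):
    """Load one DeBERTa IMDB classifier and return NLU's prediction DataFrame.

    Hypothetical helper: `size` selects one of the references above.
    """
    # Imported lazily so this module can be inspected without nlu installed.
    import nlu

    pipe = nlu.load(DEBERTA_IMDB_REFS[size])
    return pipe.predict(texts)


if __name__ == "__main__":
    predictions = classify_sentiment(
        ["A wonderful film with a moving story.", "Two hours I will never get back."]
    )
    print(predictions)
```

The same pattern should apply to the other references in the overview, for example `fr.classify.allocine` or one of the lemmatizer references, by passing a different reference string to `nlu.load`.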

New Languages covered:

Languages covered by NLU for the first time are: South Azerbaijani, Tibetan, Dimli, Central Kurdish, Southern Altai, Scottish Gaelic, Faroese, Literary Chinese, Ancient Greek, Gothic, Old Russian, Church Slavic, Old French, Uighur, Coptic, Croatian, Belarusian, Serbian

Their respective ISO 639-3 and ISO 639-2 codes are: azb, bo, diq, ckb, lt, gd, fo, lzh, grc, got, orv, cu, fro, qtd, ug, cop, hr, be, qhe, sr

New NLP Models Overview

| Language | NLU Reference | Spark NLP Reference | Task | Annotator Class |
|---|---|---|---|---|
| en | en.classify.sentiment.imdb.deberta | deberta_v3_xsmall_sequence_classifier_imdb | Text Classification | DeBertaForSequenceClassification |
| en | en.classify.sentiment.imdb.deberta.small | deberta_v3_small_sequence_classifier_imdb | Text Classification | DeBertaForSequenceClassification |
| en | en.classify.sentiment.imdb.deberta.base | deberta_v3_base_sequence_classifier_imdb | Text Classification | DeBertaForSequenceClassification |
| en | en.classify.sentiment.imdb.deberta.large | deberta_v3_large_sequence_classifier_imdb | Text Classification | DeBertaForSequenceClassification |
| en | en.classify.news.deberta | deberta_v3_xsmall_sequence_classifier_ag_news | Text Classification | DeBertaForSequenceClassification |
| en | en.classify.news.deberta.small | deberta_v3_small_sequence_classifier_ag_news | Text Classification | DeBertaForSequenceClassification |
| ur | ur.classify.sentiment.imdb | mdeberta_v3_base_sequence_classifier_imdb | Text Classification | DeBertaForSequenceClassification |
| fr | fr.classify.allocine | mdeberta_v3_base_sequence_classifier_allocine | Text Classification | DeBertaForSequenceClassification |
| ur | ur.embed.bert_cased | bert_embeddings_bert_base_ur_cased | Embeddings | BertEmbeddings |
| fr | fr.embed.bert_5lang_cased | bert_embeddings_bert_base_5lang_cased | Embeddings | BertEmbeddings |
| de | de.embed.medbert | bert_embeddings_German_MedBERT | Embeddings | BertEmbeddings |
| ar | ar.embed.arbert | bert_embeddings_ARBERT | Embeddings | BertEmbeddings |
| bn | bn.embed.bangala_bert | bert_embeddings_bangla_bert_base | Embeddings | BertEmbeddings |
| zh | zh.embed.bert_5lang_cased | bert_embeddings_bert_base_5lang_cased | Embeddings | BertEmbeddings |
| hi | hi.embed.bert_hi_cased | bert_embeddings_bert_base_hi_cased | Embeddings | BertEmbeddings |
| it | it.embed.bert_it_cased | bert_embeddings_bert_base_it_cased | Embeddings | BertEmbeddings |
| ko | ko.embed.bert | bert_embeddings_bert_base | Embeddings | BertEmbeddings |
| tr | tr.embed.bert_cased | bert_embeddings_bert_base_tr_cased | Embeddings | BertEmbeddings |
| vi | vi.embed.bert_cased | bert_embeddings_bert_base_vi_cased | Embeddings | BertEmbeddings |
| hif | hif.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel |
| azb | azb.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel |
| bo | bo.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel |
| diq | diq.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel |
| cy | cy.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel |
| es | es.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel |
| it | it.embed.word2vec | w2v_cc_300d | Embeddings | WordEmbeddingsModel |
| af | af.lemma | lemma | Lemmatization | LemmatizerModel |
| lt | lt.lemma | lemma_alksnis | Lemmatization | LemmatizerModel |
| nl | nl.lemma | lemma | Lemmatization | LemmatizerModel |
| gd | gd.lemma | lemma_arcosg | Lemmatization | LemmatizerModel |
| es | es.lemma | lemma | Lemmatization | LemmatizerModel |
| ca | ca.lemma | lemma | Lemmatization | LemmatizerModel |
| el | el.lemma.gdt | lemma_gdt | Lemmatization | LemmatizerModel |
| en | en.lemma.atis | lemma_atis | Lemmatization | LemmatizerModel |
| tr | tr.lemma.boun | lemma_boun | Lemmatization | LemmatizerModel |
| da | da.lemma.ddt | lemma_ddt | Lemmatization | LemmatizerModel |
| cs | cs.lemma.cac | lemma_cac | Lemmatization | LemmatizerModel |
| en | en.lemma.esl | lemma_esl | Lemmatization | LemmatizerModel |
| bg | bg.lemma.btb | lemma_btb | Lemmatization | LemmatizerModel |
| id | id.lemma.csui | lemma_csui | Lemmatization | LemmatizerModel |
| gl | gl.lemma.ctg | lemma_ctg | Lemmatization | LemmatizerModel |
| cy | cy.lemma.ccg | lemma_ccg | Lemmatization | LemmatizerModel |
| fo | fo.lemma.farpahc | lemma_farpahc | Lemmatization | LemmatizerModel |
| tr | tr.lemma.atis | lemma_atis | Lemmatization | LemmatizerModel |
| ga | ga.lemma.idt | lemma_idt | Lemmatization | LemmatizerModel |
| ja | ja.lemma.gsdluw | lemma_gsdluw | Lemmatization | LemmatizerModel |
| es | es.lemma.gsd | lemma_gsd | Lemmatization | LemmatizerModel |
| en | en.lemma.gum | lemma_gum | Lemmatization | LemmatizerModel |
| zh | zh.lemma.gsd | lemma_gsd | Lemmatization | LemmatizerModel |
| lv | lv.lemma.lvtb | lemma_lvtb | Lemmatization | LemmatizerModel |
| hi | hi.lemma.hdtb | lemma_hdtb | Lemmatization | LemmatizerModel |
| pt | pt.lemma.gsd | lemma_gsd | Lemmatization | LemmatizerModel |
| de | de.lemma.gsd | lemma_gsd | Lemmatization | LemmatizerModel |
| nl | nl.lemma.lassysmall | lemma_lassysmall | Lemmatization | LemmatizerModel |
| lzh | lzh.lemma.kyoto | lemma_kyoto | Lemmatization | LemmatizerModel |
| zh | zh.lemma.gsdsimp | lemma_gsdsimp | Lemmatization | LemmatizerModel |
| he | he.lemma.htb | lemma_htb | Lemmatization | LemmatizerModel |
| fr | fr.lemma.gsd | lemma_gsd | Lemmatization | LemmatizerModel |
| ro | ro.lemma.nonstandard | lemma_nonstandard | Lemmatization | LemmatizerModel |
| ja | ja.lemma.gsd | lemma_gsd | Lemmatization | LemmatizerModel |
| it | it.lemma.isdt | lemma_isdt | Lemmatization | LemmatizerModel |
| de | de.lemma.hdt | lemma_hdt | Lemmatization | LemmatizerModel |
| is | is.lemma.modern | lemma_modern | Lemmatization | LemmatizerModel |
| la | la.lemma.ittb | lemma_ittb | Lemmatization | LemmatizerModel |
| fr | fr.lemma.partut | lemma_partut | Lemmatization | LemmatizerModel |
| pcm | pcm.lemma.nsc | lemma_nsc | Lemmatization | LemmatizerModel |
| pl | pl.lemma.pdb | lemma_pdb | Lemmatization | LemmatizerModel |
| grc | grc.lemma.perseus | lemma_perseus | Lemmatization | LemmatizerModel |
| cs | cs.lemma.pdt | lemma_pdt | Lemmatization | LemmatizerModel |
| fa | fa.lemma.perdt | lemma_perdt | Lemmatization | LemmatizerModel |
| got | got.lemma.proiel | lemma_proiel | Lemmatization | LemmatizerModel |
| fr | fr.lemma.rhapsodie | lemma_rhapsodie | Lemmatization | LemmatizerModel |
| it | it.lemma.partut | lemma_partut | Lemmatization | LemmatizerModel |
| en | en.lemma.partut | lemma_partut | Lemmatization | LemmatizerModel |
| no | no.lemma.nynorsklia | lemma_nynorsklia | Lemmatization | LemmatizerModel |
| orv | orv.lemma.rnc | lemma_rnc | Lemmatization | LemmatizerModel |
| cu | cu.lemma.proiel | lemma_proiel | Lemmatization | LemmatizerModel |
| la | la.lemma.perseus | lemma_perseus | Lemmatization | LemmatizerModel |
| fr | fr.lemma.parisstories | lemma_parisstories | Lemmatization | LemmatizerModel |
| fro | fro.lemma.srcmf | lemma_srcmf | Lemmatization | LemmatizerModel |
| vi | vi.lemma.vtb | lemma_vtb | Lemmatization | LemmatizerModel |
| qtd | qtd.lemma.sagt | lemma_sagt | Lemmatization | LemmatizerModel |
| ro | ro.lemma.rrt | lemma_rrt | Lemmatization | LemmatizerModel |
| hu | hu.lemma.szeged | lemma_szeged | Lemmatization | LemmatizerModel |
| ug | ug.lemma.udt | lemma_udt | Lemmatization | LemmatizerModel |
| wo | wo.lemma.wtb | lemma_wtb | Lemmatization | LemmatizerModel |
| cop | cop.lemma.scriptorium | lemma_scriptorium | Lemmatization | LemmatizerModel |
| ru | ru.lemma.syntagrus | lemma_syntagrus | Lemmatization | LemmatizerModel |
| ru | ru.lemma.taiga | lemma_taiga | Lemmatization | LemmatizerModel |
| fr | fr.lemma.sequoia | lemma_sequoia | Lemmatization | LemmatizerModel |
| la | la.lemma.udante | lemma_udante | Lemmatization | LemmatizerModel |
| ro | ro.lemma.simonero | lemma_simonero | Lemmatization | LemmatizerModel |
| it | it.lemma.vit | lemma_vit | Lemmatization | LemmatizerModel |
| hr | hr.lemma.set | lemma_set | Lemmatization | LemmatizerModel |
| fa | fa.lemma.seraji | lemma_seraji | Lemmatization | LemmatizerModel |
| tr | tr.lemma.tourism | lemma_tourism | Lemmatization | LemmatizerModel |
| ta | ta.lemma.ttb | lemma_ttb | Lemmatization | LemmatizerModel |
| sl | sl.lemma.ssj | lemma_ssj | Lemmatization | LemmatizerModel |
| sv | sv.lemma.talbanken | lemma_talbanken | Lemmatization | LemmatizerModel |
| uk | uk.lemma.iu | lemma_iu | Lemmatization | LemmatizerModel |
| te | te.pos | pos_mtg | Part of Speech Tagging | PerceptronModel |
| ta | ta.pos | pos_ttb | Part of Speech Tagging | PerceptronModel |
| cs | cs.pos | pos_ud_pdt | Part of Speech Tagging | PerceptronModel |
| bg | bg.pos | pos_btb | Part of Speech Tagging | PerceptronModel |
| af | af.pos | pos_afribooms | Part of Speech Tagging | PerceptronModel |
| es | es.pos.gsd | pos_gsd | Part of Speech Tagging | PerceptronModel |
| en | en.pos.ewt | pos_ewt | Part of Speech Tagging | PerceptronModel |
| gd | gd.pos.arcosg | pos_arcosg | Part of Speech Tagging | PerceptronModel |
| el | el.pos.gdt | pos_gdt | Part of Speech Tagging | PerceptronModel |
| hy | hy.pos.armtdp | pos_armtdp | Part of Speech Tagging | PerceptronModel |
pt