JohnSnowLabs / nlu

1 line for thousands of State of The Art NLP models in hundreds of languages. The fastest and most accurate way to solve text problems.
Apache License 2.0

combining 'sentiment' and 'emotion' models causes crash #106

Open jonnyzwi opened 2 years ago

jonnyzwi commented 2 years ago

I'm working in a Google Colab notebook and I set up via

!wget http://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash

import nlu

A quick version check with nlu.version() confirms 3.4.2.

Several of the official tutorial notebooks (for example, the XLNet notebook) create a multi-model pipeline that includes both 'sentiment' and 'emotion'.

Direct copy of content from the notebook:

import pandas as pd

# Download the dataset 
!wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sarcasm/train-balanced-sarcasm.csv -P /tmp

# Load dataset to Pandas
df = pd.read_csv('/tmp/train-balanced-sarcasm.csv')

pipe = nlu.load('sentiment pos xlnet emotion') 

df['text'] = df['comment']

max_rows = 200

predictions = pipe.predict(df.iloc[0:100][['comment','label']], output_level='token')

predictions

However, running a prediction on this pipe results in the following error:


sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]
xlnet_base_cased download started this may take some time.
Approximate size to download 417.5 MB
[OK!]
classifierdl_use_emotion download started this may take some time.
Approximate size to download 21.3 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
---------------------------------------------------------------------------
IllegalArgumentException                  Traceback (most recent call last)
<ipython-input-1-9b2e4a06bf65> in <module>()
     34 
     35 # NLU to gives us one row per embedded word by specifying the output level
---> 36 predictions = pipe.predict( df.iloc[0:5][['text','label']], output_level='token' )
     37 
     38 display(predictions)

9 frames
/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py in raise_from(e)

IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in SentimentDLModel_6c1a68f3f2c7.

Current inputCols: sentence_embeddings@glove_100d. Dataset's columns:
(column_name=text,is_nlp_annotator=false)
(column_name=document,is_nlp_annotator=true,type=document)
(column_name=sentence,is_nlp_annotator=true,type=document)
(column_name=sentence_embeddings@tfhub_use,is_nlp_annotator=true,type=sentence_embeddings).
Make sure such annotators exist in your pipeline, with the right output names and that they have following annotator types: sentence_embeddings

Having experimented with various combinations of models, it turns out that the problem occurs whenever the 'sentiment' and 'emotion' models are specified in the same pipeline, regardless of pipeline order or which other models are listed. The traceback above suggests why: the SentimentDLModel is wired to sentence_embeddings@glove_100d, but at that point the pipeline only produces sentence_embeddings@tfhub_use.

Running pipe = nlu.load('emotion ANY OTHER MODELS') or pipe = nlu.load('sentiment ANY OTHER MODELS') succeeds, so the failure really does appear to come only from combining 'sentiment' and 'emotion' (see the sketch below).
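For reference, a minimal sketch of the combinations described above (the specific companion models and column selection mirror the earlier snippet and are only examples; exact behaviour may vary by NLU version):

import nlu
import pandas as pd

df = pd.read_csv('/tmp/train-balanced-sarcasm.csv')
sample = df.iloc[0:100][['comment', 'label']].copy()
sample['text'] = sample['comment']

# Works: 'emotion' together with other models, but without 'sentiment'
nlu.load('emotion pos xlnet').predict(sample, output_level='token')

# Works: 'sentiment' together with other models, but without 'emotion'
nlu.load('sentiment pos xlnet').predict(sample, output_level='token')

# Fails with the IllegalArgumentException shown above: 'sentiment' and 'emotion' combined
nlu.load('sentiment pos xlnet emotion').predict(sample, output_level='token')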

Is this a known bug? Does anyone have any suggestions for fixing?

My temporary solution has been to run emoPipe = nlu.load('emotion').predict() in isolation and then inner join the resulting dataframe to the output of pipe = nlu.load('sentiment pos xlnet').predict(), sketched below.
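A minimal sketch of that workaround, assuming both pipelines preserve the pandas index so the two prediction frames can be joined on it; the 'emotion' output column name is also an assumption and may differ by NLU version, so check emo_df.columns first:

import nlu
import pandas as pd

df = pd.read_csv('/tmp/train-balanced-sarcasm.csv')
sample = df.iloc[0:100][['comment', 'label']].copy()
sample['text'] = sample['comment']

# Run the emotion model on its own
emoPipe = nlu.load('emotion')
emo_df = emoPipe.predict(sample)

# Run the remaining models in a separate pipeline
pipe = nlu.load('sentiment pos xlnet')
main_df = pipe.predict(sample)

# Inner join the two prediction frames on their shared index,
# keeping only the emotion prediction column from the first run.
# 'emotion' is an assumed column name; inspect emo_df.columns for the actual one.
predictions = main_df.join(emo_df[['emotion']], how='inner')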

However, I would like to understand better what is failing and whether there is a way to streamline the inclusion of all the models.

Thanks

C-K-Loan commented 2 years ago

Thank you @jonnyzwi for this issue.

This is indeed a bug in the way NLU generates NLP pipelines, and we are looking into fixing it.