JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

BertEmbeddings doesn't generate an embedding for every token #6367

Closed Aditi00a closed 3 years ago

Aditi00a commented 3 years ago

I am still getting this issue with Spark NLP v3.3.1.

Description

I am tokenizing the document below and noticing that the number of embeddings produced by BertEmbeddings is less than the number of tokens produced by Spark NLP's tokenizer.

Expected Behavior

I expect the number of embeddings returned by BertEmbeddings to be exactly equal to the number of tokens returned by the tokenizer.

Current Behavior

For the example sentence given below, BertEmbeddings returns a list of 98 embeddings for 124 tokens. I have also tried a solution suggested in the thread referenced above (applying unidecode), but that does not work either.

Possible Solution

If there is a sensible reason for a token not to be embedded, it would at least be nice for a placeholder to be returned, so that the length of the list of embeddings equals the length of the list of tokens.
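For illustration, a minimal post-processing sketch of that idea, applied to the predictions DataFrame produced by the reproduction steps below. The pad_embeddings helper and the 768-dimension assumption for bert_base_uncased are mine, not part of Spark NLP:

# Hypothetical workaround: pad the per-token vectors with zero vectors so
# len(vectors) == len(tokens). Assumes bert_base_uncased (768 dimensions)
# and that truncation only drops tokens from the end.
from pyspark.sql import functions as sf
from pyspark.sql.types import ArrayType, FloatType

DIM = 768  # hidden size of bert_base_uncased (assumption)

@sf.udf(returnType=ArrayType(ArrayType(FloatType())))
def pad_embeddings(tokens, vectors):
    padded = [list(v) for v in vectors]
    padded += [[0.0] * DIM] * (len(tokens) - len(padded))
    return padded

aligned = predictions.withColumn(
    'padded_embeddings',
    pad_embeddings(sf.col('token.result'), sf.col('embeddings.embeddings')))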

Steps to Reproduce

# Imports assumed for this snippet (Spark NLP + PySpark); also assumes an
# active Spark session, e.g. from sparknlp.start().
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, BertEmbeddings
from pyspark.ml import Pipeline
from pyspark.sql import functions as sf
from pyspark.sql.types import StructType, StructField, StringType

pretrained_model = "bert_base_uncased"
text= ['White Hill Music presents the Latest Punjabi Songs 2021 ""Jeena Paauni Aa"" from the album Jug ni by Maninder Buttar Lyrics by Maninder Buttar Music by MixSingh A Film By Drip Films Produced By Gunbir Singh Sidhu & Manmord Sidhu Listen Full Album Jug ni on Spotify Apple Music iTunes Gaana Amazon Music Amazon Prime Music YouTube Music Wynk Hungama Resso Operator Codes * Airtel & Airtel Hellotune Link Airtel Subscribers to Set as Hello tune Click on Wynk music link * Set Vi CRBT Click Vi Subscribers for Caller Tune Direct Dial 9 * BSNL N & W Direct Dial BSNL N & W Subscribers Direct Dial 89 Credits Song Jeena Paauni Aa Singer/Lyrics/Composer Maninder Buttar Music Director MixSingh Music Composed']
test_df = spark.createDataFrame(text, "string").toDF("text")
test_df.show()
# Pipeline stages: raw text -> document -> sentences -> tokens -> embeddings
document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(['document']) \
    .setOutputCol('sentence') 

token = Tokenizer() \
    .setInputCols(['sentence']) \
    .setOutputCol('token')

embeddings = BertEmbeddings.pretrained(pretrained_model, 'en') \
    .setInputCols(['sentence', 'token']) \
    .setOutputCol('embeddings') \
    .setBatchSize(8)

ner_prediction_pipeline = Pipeline(
    stages=[
        document,
        sentence,
        token,
        embeddings
    ])

columns = StructType([StructField('text', StringType(), True)])

empty_data = spark.createDataFrame(data=[], schema=columns)

prediction_model = ner_prediction_pipeline.fit(empty_data)
predictions = prediction_model.transform(test_df)

predictions_df = predictions \
    .selectExpr("document", "sentence", "token",
                "embeddings.result embeddings", "token.result bert_token") \
    .withColumn('embeddings_count', sf.size('embeddings')) \
    .withColumn('token_count', sf.size('bert_token')) \
    .withColumn('count_difference', sf.col('token_count') - sf.col('embeddings_count'))
predictions_df.show()

Output:
+--------------------+--------------------+--------------------+--------------------+--------------------+----------------+-----------+----------------+
|            document|            sentence|               token|          embeddings|          bert_token|embeddings_count|token_count|count_difference|
+--------------------+--------------------+--------------------+--------------------+--------------------+----------------+-----------+----------------+
|[[document, 0, 70...|[[document, 0, 70...|[[token, 0, 4, Wh...|[white, hill, mus...|[White, Hill, Mus...|              98|        124|              26|
+--------------------+--------------------+--------------------+--------------------+--------------------+----------------+-----------+----------------+
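As a quick diagnostic (hypothetical, and assuming the dropped tokens form a suffix of the token list, which holds when truncation simply cuts off the end), the tokens that received no vector can be listed with Spark SQL's slice:

# List the trailing tokens that received no embedding. slice() is 1-based;
# greatest() guards against a negative length when nothing was dropped.
predictions.selectExpr(
    "slice(token.result, size(embeddings.result) + 1, "
    "greatest(size(token.result) - size(embeddings.result), 0)) AS dropped_tokens"
).show(truncate=False)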

Your Environment

Spark NLP version sparknlp.version(): Spark NLP version: 3.3.1

Apache Spark version spark.version: Apache Spark version: 3.0.3

Setup and installation (Pypi, Conda, Maven, etc.): Google Colab, via !wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

maziyarpanahi commented 3 years ago

There is no issue with BertEmbeddings; please increase .setMaxSentenceLength(512): https://colab.research.google.com/drive/1JLezGC5LcY_NkGIlSWzPkZHQdllWbtqe?usp=sharing
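Applied to the snippet above, that amounts to the following (a sketch; 512 wordpieces is BERT's architectural maximum):

embeddings = BertEmbeddings.pretrained("bert_base_uncased", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setBatchSize(8) \
    .setMaxSentenceLength(512)  # default is lower, so long sentences get truncated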

Aditi00a commented 3 years ago

There is no issue with BertEmbeddings; please increase .setMaxSentenceLength(512): https://colab.research.google.com/drive/1JLezGC5LcY_NkGIlSWzPkZHQdllWbtqe?usp=sharing

Despite setting the .setMaxSentenceLength(512) parameter, I am still encountering the same issue with the following text:

text='ntr new hindi dubbed movie 2021, jr ntr new hindi dubbed movie 2021, sauth new movie 2021 hindi dubbed ntr, new south movie 2021 hindi dubbed ntr, ntr new movie 2021 hindi dubbed love story, ntr new movie 2021 hindi dubbed ravan, ntr new movie 2021 hindi dubbed rrr, ntr new movie 2021 hindi dubbed hd, ntr new hindi dubbed movies, ntr new hindi dubbed movie 2019, ntr new release movie hindi dubbed, n t r new movie hindi dubbed, ntr all new movie hindi dubbed, ntr new hindi dubbed, ntr new movie 2021 hindi dubbed comedy, ntr new hindi dubbed movie, new released full hindi dubbed movie 2021 ntr, jr ntr new released full hindi dubbed movie 2021, jr ntr 2021 new hindi dubbed blockbuster movie, jr. ntr 2020 new telugu hindi dubbed blockbuster movie 2021 south hindi dubbed movies, jr.ntr 2021 new telugu hindi dubbed movie, ntr new movie 2021 hindi dubbed hindi, jr ntr new movie hindi dubbed 2021, ntr new movie 2021 hindi dubbed kajal agarwal, ntr new movie 2020 hindi dubbed new 2021, ntr new movie 2021 hindi dubbed new, ntr new movie 2021 hindi dubbed rakul preet, ntr new movie 2021 hindi dubbed police, ntr new movie 2021 hindi dubbed triple role, ntr new movie 2021 hindi dubbed trailer, ntr new movie 2020 hindi dubbed 2021, ntr new movie 2017 hindi dubbed, new south Movies,South Indian Suspense Thriller Movies In Hindi, south full movie in hindi dubbed, South Murder mystery Thriller Movies In Hindi, Psychological Suspense Thriller Films, Best South Indian Suspense Thriller Movie in Hindi, Top 10 Biggest South New Release Crime Suspense Thriller Movies In Hindi, dubbed movies, new,hindi dubbed movie, latest movies, new full hindi movie 2020,south indian movies dubbed in hindi full movie 2020 new, south indian movie in hindi dubbed full 2020 hd, new tamil 2020 full movie hindi dubbed, doraemon movie 2020 in hindi full movie, new hollywood movies 2020 full movie in hindi hd, south indian movies dubbed in hindi full, new bollywood movies 2020, hindi and dubbed movie, new full hindi movie 2020, South th indian movies dubbed in hindi full movie 2020 new, south indian movie in hindi dubbed full 2020new hd, new tamil movies 2020 full movie in hindi dubbed, new movie 2020 in hindi full movie, new hollywood movies 2020 full movie in hindi hd, south indian movies dubbed in hindi full movie 2020 new hd, new bollywood movies 2020 full movie in hindi hd, new hindi dubbed movies, hindi movies, dubbed movies, new hindi movie, south indian movies dubbed in7 hindi full movie new, new south indian movies dubbed, new south indian movies dubbed, hindi dubbed movies, hindi dubbed movie 2020 new hindi dubbed movies, hindi dubbed movie 2020 new release, hindi dubbed movie 2020 new, hindi dubbed movie 2020 allu arjun, south indian movie hindi dubbed movie 2020, south indian movies dubbed in hindi full movie 2019 new, 2020 new hindi dubbed movies, new south indian movies dubbed in hindi 2020 full, new south indian movie hindi dubbed movie 2020, new hindi dubbed movie 2020 south indian movies dubbed in hindi full movie new, new south indian movie in hindi dubbed, south indian dubbed in hindi full movie 2018 new, 2019 new hindi dubbed movies, movies hindi dubbed 2020 new hindi dubbed movies, allu arjun, south indian indian movies dubbed in hindi full movie 2020 new, 2020 south indian movie hindi dubbed, action movies, bollywood movies, new hindi movie 2020, latest movie 2020, action movies, new hindi movie 2020, latest movie 2020, action hindi movie, movies, in hindi dubbed, Mahesh babu movies, new south indian movies, full hindi dubbed movie, dubbed hindi full movie, hindi dubbed full movie, hindi dubbed movies 2020, hindi dubbed movies, full hindi dubbed movie 2020, 2020 new hindi dubbed movies, new dubbed movie 2020, 2020 new hindi movies, new blockbuster Hindi Dubbed Movie, 2020 south indian full hindi action movies, south indian movies 2020, latest dubbed movies in hindi full 2020 full hindi dubbed movie, south action movies, new released hindi dubbed movie, blockbuster Hindi Dubbed Movie, south superstar movies,2019 dubbed hindi movies, movie dubbed, south dubbed movies 2020, action dubbed movies, latest dubbed movies, 2020 dubbed movies, 2020 dubbed movie, south dubbed movies, dubbed movie, goldmine telefilms new, new blockbuster Hindi Dubbed Movie, dubbed action movie, bollywood full movie 2019 new, south movies 2020, south movies in hindi, hindi dubbed movies 2019 full movie, action romantic movies in hindi dubbed, dubbed movies, hindi movies, 2020 south movies in hindi, full movies in hindi dubbed, new south movieBharat full movie in hindi dubbed, hindi dubbed, south movie, south movies, south indian movies dubbed in hindi full movie 2020 new, new south movie 2020 hindi dubbed, Mahesh babu, new south movie 2019, movies 2020 full movies, south movie in hindi 2020, movies 2020 full movie, new hindi movie 2020, new south indian movies dubbed in hindi 2020 full,New South Indian movies Dubbed in Hindi full'

maziyarpanahi commented 3 years ago

There is no bug here: this is just one giant sentence, which is obviously longer than 512 tokens. Anything longer than 512 tokens will be trimmed/truncated, as with all other transformer models.

You can use other sentence detector annotators/modes for better sentence boundary detection, but by the looks of it this text is just a simple copy/paste for testing purposes. If you have a reasonable text with detectable sentences, each under 512 tokens, you should get the same number of tokens as vectors; otherwise the input will be trimmed, and that is not a bug.
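For example, Spark NLP's deep-learning sentence detector can be swapped in as a drop-in replacement for the SentenceDetector stage in the pipeline above (a sketch; "sentence_detector_dl" is assumed to be the English default model):

from sparknlp.annotator import SentenceDetectorDLModel

# DL-based detector often finds better sentence boundaries in noisy text
sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")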
