BertEmbeddings doesn't generate an embedding for every token

alex2awesome commented 3 years ago

Hello, this might have a simple answer —

Under what circumstances will BertEmbeddings not generate an embedding for a token?

Description

I am tokenizing the following document, and noticing that the number of embeddings outputted by BertEmbeddings is less than the number of tokens outputted by SparkNLP's tokenizer.

Expected Behavior

I expect that the number of embeddings returned by BERTEmbeddings would be exactly equal to the number of tokens returned by the tokenizer.

Current Behavior

In the example sentence I'll give below, BertEmbeddings returns a list of 906 embeddings for 916 tokens.

Possible Solution

If there is a sensible reason for a token not to be embedded, it would at least be nice for a placeholder to be returned, so the length of the list of embeddings is equal to the length of the list of tokens.

Steps to Reproduce

Input document:

df = pd.DataFrame({'summary': ['Why is this man smiling? President Obama’s chosen successor suffered a devastating loss last week to a man who made a primary campaign issue of Obama’s “disastrous” management of the country. The Democratic Party is in a shambles, outnumbered in state legislatures, governors’ mansions, the House and the Senate. Conservative control of the Supreme Court seems likely for another generation. Obama’s legacy is in tatters, as his trade policy, his foreign policy and his beloved Obamacare are set to be dismantled. And yet when Obama entered the White House briefing room for a post-election news conference Monday afternoon, everything was, if not awesome, then pretty darned good. “We are indisputably in a stronger position today than we were when I came in eight years ago,” he began. “Jobs have been growing for 73 straight months, incomes are rising, poverty is falling, the uninsured rate is at the lowest level on record, carbon emissions have come down without impinging on our growth…” The happy talk kept coming: “Unemployment rate is low as it has been in eight, nine years, incomes and wages have both gone up over the last year faster than they have in a decade or two… The financial systems are stable. The stock market is hovering around its all-time high and 401(k)s have been restored. The housing market has recovered… We are seeing significant progress in Iraq. .. Our alliances are in strong shape. ..And gas is two bucks a gallon.” It’s all true enough. But Obama’s post-election remarks seemed utterly at odds with the national mood. Half the country is exultant because Donald Trump has promised to undo everything Obama has done over the last year. The other half of the country is alarmed that a new age of bigotry and inwardness has seized the country. And here’s the outgoing president, reciting what a fine job he has done. This has been Obama’s pattern. At times when passion is called for, he’s cerebral and philosophical and taking the long view — so long that it frustrates those living in the present. A week after an election has left his supporters reeling, Obama’s focus seemed to be squarely on his own legacy. He didn’t mention Hillary Clinton’s name once in his news conference, and he went out of his way to praise Trump. On a day when the country was digesting the news that Trump has named as his top White House strategist Stephen K. Bannon, a man who has boasted of his ties to the racist “alt-right,” Obama was generous to the “carnival barker” who led the campaign questioning his American birth. Of the Bannon appointment, Obama said “it would not be appropriate for me to comment,” and “those who didn’t vote for him have to recognize that that’s how democracy works.” Of Trump himself, Obama noted “his gifts that obviously allowed him to execute one of the biggest political upsets in history.” He praised Trump as “gregarious” and “pragmatic,” a man who favors “a vigorous debate” and was “impressive” during the campaign. “That connection that he was able to make with his supporters,” Obama said, was “powerful stuff.” Obama’s above-the-fray response to the election result may well be that of a man who believes his approach will be vindicated by history. It may well be, but that is of little comfort now. As Obama retires to a life of speaking fees and good works, he sounded less concerned about what will happen next than with what he had achieved — including a mention, for those who forgot, that he won the Iowa caucus in 2008. He took a bow for his “smartest, hardest-working” staff, his “good decisions,” the absence of “significant scandal” during his tenure. And he speculated that Trump would ultimately find it wise to leave intact the key achievements of his administration: Obamacare, the Iran nuclear deal, the Paris climate accord, trade and immigration. The deep disenchantment among white, blue-collar voters that propelled Trump won only a passing mention. “Obviously there are people out there who are feeling deeply disaffected,” the president said with his cool detachment. In an election this close — Clinton, let’s not forget, won the popular vote — any factor could have made the difference: being a candidate of the establishment in a time of change, resistance to a woman as president and backlash against the first black president, and James Comey’s last-minute intervention in the election. But millions of Americans are justifiably anxious about their economic well-being. And if Clinton and Obama had limited the build-on-success theme during the campaign in favor of a more populist vision and policies, they really would have something to smile about this week. Twitter: @Milbank Read more from Dana Milbank’s archive, follow him on Twitter or subscribe to his updates on Facebook.']})

documenter = (
    sb.DocumentAssembler()
        .setInputCol("summary")
        .setOutputCol("document")
)

sentencer = (
    sa.SentenceDetector()
        .setInputCols(["document"])
        .setOutputCol("sentences")            
)

tokenizer = (
    sa.Tokenizer()
        .setInputCols(["sentences"])
        .setOutputCol("token")
)

word_embeddings = (
    sa.BertEmbeddings
        .load('s3://aspangher/spark-nlp/small_bert_L4_128_en_2.6.0_2.4')
        .setInputCols(["sentences", "token"])
        .setOutputCol("embeddings")
        .setMaxSentenceLength(512)
        .setBatchSize(100)
)

tok_finisher = (
    sb.Finisher()
    .setInputCols(["token"])
    .setIncludeMetadata(True)
)

embeddings_finisher = (
    sb.EmbeddingsFinisher()
            .setInputCols("embeddings")
            .setOutputCols("embeddings_vectors")
            .setOutputAsVector(True)
)

sparknlp_processing_pipeline = sb.RecursivePipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    word_embeddings,
    embeddings_finisher,
    tok_finisher
  ]
)

sdf = spark.createDataFrame(df)
spark_processed_df = sparknlp_processing_pipeline.fit(sdf).transform(sdf)
t = spark_processed_df.toPandas()
len(t['embeddings_vectors'].iloc[0])
>>> 906

len(t['finished_token_metadata'][0])
>>> 916

Context

I am trying to match sentences by overall word similarity, and am zipping up tokens and embeddings. Because the number of embeddings is different than the number of tokens, the last few tokens are given a None embedding

Your Environment

Spark NLP version sparknlp.version(): com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.5
Apache NLP version spark.version: '2.4.7-ds-0.5'
Setup and installation (Pypi, Conda, Maven, etc.): SparkNLP: Maven

alex2awesome commented 3 years ago

Update:

If I process the sentence first locally using python's unidecode package, then the number of embeddings equals the number of tokens. Based on this, I believe that BERTEmbeddings skips some unicode characters that are not in-vocabulary.

Understandable, but still unexpected. Might be worth noting in documentation.

df = df.assign(summary=lambda df: df['summary'].apply(unidecode.unidecode)) sdf = spark.createDataFrame(df) spark_processed_df = sparknlp_processing_pipeline.fit(sdf).transform(sdf) t = spark_processed_df.toPandas() len(t['embeddings_vectors'].iloc[0])

939

len(t['finished_token_metadata'][0])

939

maziyarpanahi commented 3 years ago

Hi @alex2awesome

Thanks for reporting this and the update. Let me take a closer look at this and see what happens:

with the newer version of Spark NLP
on the Scala side (is it only Python encoding or does it happen on the Scala side too)
and see why BERT skips those characters. (It's hard to say since BERT uses subword vocabularies, I debug to see what goes in and what comes out internally).

I'll update this issue with more information.

alex2awesome commented 3 years ago

Thank you @maziyarpanahi :) always nice when the issues I report are actually useful and unknown rather than simply me missing something in the documentation.

Hopefully it is an out-of-vocab issue!

maziyarpanahi commented 3 years ago

@alex2awesome

The Scala tests:

val smallCorpus = Seq(
      "Why is this man smiling? President Obama’s chosen successor suffered a devastating loss last week to a man who made a primary campaign issue of Obama’s “disastrous” management of the country. The Democratic Party is in a shambles, outnumbered in state legislatures, governors’ mansions, the House and the Senate. Conservative control of the Supreme Court seems likely for another generation. Obama’s legacy is in tatters, as his trade policy, his foreign policy and his beloved Obamacare are set to be dismantled. And yet when Obama entered the White House briefing room for a post-election news conference Monday afternoon, everything was, if not awesome, then pretty darned good. “We are indisputably in a stronger position today than we were when I came in eight years ago,” he began. “Jobs have been growing for 73 straight months, incomes are rising, poverty is falling, the uninsured rate is at the lowest level on record, carbon emissions have come down without impinging on our growth…” The happy talk kept coming: “Unemployment rate is low as it has been in eight, nine years, incomes and wages have both gone up over the last year faster than they have in a decade or two… The financial systems are stable. The stock market is hovering around its all-time high and 401(k)s have been restored. The housing market has recovered… We are seeing significant progress in Iraq. .. Our alliances are in strong shape. ..And gas is two bucks a gallon.” It’s all true enough. But Obama’s post-election remarks seemed utterly at odds with the national mood. Half the country is exultant because Donald Trump has promised to undo everything Obama has done over the last year. The other half of the country is alarmed that a new age of bigotry and inwardness has seized the country. And here’s the outgoing president, reciting what a fine job he has done. This has been Obama’s pattern. At times when passion is called for, he’s cerebral and philosophical and taking the long view — so long that it frustrates those living in the present. A week after an election has left his supporters reeling, Obama’s focus seemed to be squarely on his own legacy. He didn’t mention Hillary Clinton’s name once in his news conference, and he went out of his way to praise Trump. On a day when the country was digesting the news that Trump has named as his top White House strategist Stephen K. Bannon, a man who has boasted of his ties to the racist “alt-right,” Obama was generous to the “carnival barker” who led the campaign questioning his American birth. Of the Bannon appointment, Obama said “it would not be appropriate for me to comment,” and “those who didn’t vote for him have to recognize that that’s how democracy works.” Of Trump himself, Obama noted “his gifts that obviously allowed him to execute one of the biggest political upsets in history.” He praised Trump as “gregarious” and “pragmatic,” a man who favors “a vigorous debate” and was “impressive” during the campaign. “That connection that he was able to make with his supporters,” Obama said, was “powerful stuff.” Obama’s above-the-fray response to the election result may well be that of a man who believes his approach will be vindicated by history. It may well be, but that is of little comfort now. As Obama retires to a life of speaking fees and good works, he sounded less concerned about what will happen next than with what he had achieved — including a mention, for those who forgot, that he won the Iowa caucus in 2008. He took a bow for his “smartest, hardest-working” staff, his “good decisions,” the absence of “significant scandal” during his tenure. And he speculated that Trump would ultimately find it wise to leave intact the key achievements of his administration: Obamacare, the Iran nuclear deal, the Paris climate accord, trade and immigration. The deep disenchantment among white, blue-collar voters that propelled Trump won only a passing mention. “Obviously there are people out there who are feeling deeply disaffected,” the president said with his cool detachment. In an election this close — Clinton, let’s not forget, won the popular vote — any factor could have made the difference: being a candidate of the establishment in a time of change, resistance to a woman as president and backlash against the first black president, and James Comey’s last-minute intervention in the election. But millions of Americans are justifiably anxious about their economic well-being. And if Clinton and Obama had limited the build-on-success theme during the campaign in favor of a more populist vision and policies, they really would have something to smile about this week. Twitter: @Milbank Read more from Dana Milbank’s archive, follow him on Twitter or subscribe to his updates on Facebook."
    ).toDF("text")

    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

    val sentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")

    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")

    val embeddings = BertEmbeddings.pretrained("small_bert_L2_128", "en")
      .setInputCols("sentence", "token")
      .setOutputCol("embeddings")
      .setCaseSensitive(false)
      .setMaxSentenceLength(512)

    val embedFinisher = new EmbeddingsFinisher()
      .setInputCols("embeddings")
      .setOutputCols("embeddings_vectors")
      .setOutputAsVector(true)
      .setCleanAnnotations(false)

    val tokenFinisher = new Finisher()
      .setInputCols("token")
      .setOutputCols("finished_token")
      .setIncludeMetadata(true)
      .setCleanAnnotations(false)

    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        sentence,
        tokenizer,
        embeddings,
        embedFinisher,
        tokenFinisher
      ))

    val pipelineDF = pipeline.fit(smallCorpus).transform(smallCorpus)

    println("missing tokens/embeddings: ")
    pipelineDF.withColumn("sentence_size", size(col("sentence")))
      .withColumn("token_size", size(col("token")))
      .withColumn("embed_size", size(col("embeddings")))
      .where(col("token_size") =!= col("embed_size"))
      .select("sentence_size", "token_size", "embed_size", "token.result", "embeddings.result")
      .show(false)

    println("total sentences: ", pipelineDF.select(explode($"sentence.result")).count)
    val totalTokens = pipelineDF.select(explode($"token.result")).count.toInt
    val totalEmbeddings = pipelineDF.select(explode($"embeddings.embeddings")).count.toInt

    val totalTokensFinisher = pipelineDF.select(explode($"finished_token")).count.toInt
    val totalEmbeddingsFinisher = pipelineDF.select(explode($"embeddings_vectors")).count.toInt

    println(s"total tokens: $totalTokens")
    println(s"total embeddings: $totalEmbeddings")

    println(s"total tokens finisher: $totalTokensFinisher")
    println(s"total embeddings: $totalEmbeddingsFinisher")

    assert(totalTokens == totalEmbeddings)
    assert(totalTokensFinisher == totalEmbeddingsFinisher)

The results (nothing is missing):

missing tokens/embeddings: 
+-------------+----------+----------+------+------+
|sentence_size|token_size|embed_size|result|result|
+-------------+----------+----------+------+------+
+-------------+----------+----------+------+------+

(total sentences: ,42)
total tokens: 888
total embeddings: 888
total tokens finisher: 888
total embeddings: 888

The Python test:

https://colab.research.google.com/drive/1JLezGC5LcY_NkGIlSWzPkZHQdllWbtqe?usp=sharing

print(len(t['embeddings_vectors'].iloc[0]))
print(len(t['finished_token_metadata'][0]))
888
888

This must have something to do with your system-wide encoding (Python Environment default encoding). In any case, I cannot reproduce this to debug it.

alex2awesome commented 3 years ago

Ackkk I'm so sorry. The rows in t (output from spark) were not in the same order as rows from df (my input), so I accidentally sent you text from the wrong input row. I was wondering why the number of tokens, 888, was different.

Please try again with the following text:

df = pd.DataFrame({'summary': ['Why is this man smiling? President Obama’s chosen successor suffered a devastating loss last week to a man who made a primary campaign issue of Obama’s “disastrous” management of the country. The Democratic Party is in a shambles, outnumbered in state legislatures, governors’ mansions, the House and the Senate. Conservative control of the Supreme Court seems likely for another generation. Obama’s legacy is in tatters, as his trade policy, his foreign policy and his beloved Obamacare are set to be dismantled. And yet when Obama entered the White House briefing room for a post-election news conference Monday afternoon, everything was, if not awesome, then pretty darned good. “We are indisputably in a stronger position today than we were when I came in eight years ago,” he began. “Jobs have been growing for 73 straight months, incomes are rising, poverty is falling, the uninsured rate is at the lowest level on record, carbon emissions have come down without impinging on our growth .\u2009.\u2009.” The happy talk kept coming: “Unemployment rate is low as it has been in eight, nine years, incomes and wages have both gone up over the last year faster than they have in a decade or two. .\u2009.\u2009. The financial systems are stable. The stock market is hovering around its all-time high and 401(k)s have been restored. The housing market has recovered. .\u2009.\u2009. We are seeing significant progress in Iraq .\u2009.\u2009. Our alliances are in strong shape. .\u2009.\u2009. And gas is two bucks a gallon.” It’s all true enough. But Obama’s post-election remarks seemed utterly at odds with the national mood. Half the country is exultant because Donald Trump has promised to undo everything Obama has done over the past eight years. The other half of the country is alarmed that a new age of bigotry and inwardness has seized the country. And here’s the outgoing president, reciting what a fine job he has done. This has been Obama’s pattern. At times when passion is called for, he’s cerebral and philosophical and taking the long view — so long that it frustrates those living in the present. A week after an election has left his supporters reeling, Obama’s focus seemed to be squarely on his own legacy. He didn’t mention Hillary Clinton’s name once in his news conference, and he went out of his way to praise Trump. On a day when the country was digesting the news that Trump has named as his top White House strategist Stephen K. Bannon, a man who has boasted of his ties to the racist “alt-right,” Obama was generous to the “carnival barker” who led the campaign questioning his American birth. Of the Bannon appointment, Obama said “it would not be appropriate for me to comment,” and “those who didn’t vote for him have to recognize that that’s how democracy works.” Of Trump himself, Obama noted “his gifts that obviously allowed him to execute one of the biggest political upsets in history.” He praised Trump as “gregarious” and “pragmatic,” a man who favors “a vigorous debate” and was “impressive” during the campaign. “That connection that he was able to make with his supporters,” Obama said, was “powerful stuff.” Obama’s above-the-fray response to the election result may well be that of a man who believes his approach will be vindicated by history. It may well be, but that is of little comfort now. As Obama retires to a life of speaking fees and good works, he sounded less concerned about what will happen next than with what he had achieved — including a mention, for those who forgot, that he won the Iowa caucuses in 2008. He took a bow for his “smartest, hardest-working” staff, his “good decisions,” the absence of “significant scandal” during his tenure. And he speculated that Trump would ultimately find it wise to leave intact the key achievements of his administration: Obamacare, the Iran nuclear deal, the Paris climate accord, trade and immigration. The deep disenchantment among white, blue-collar voters that propelled Trump won only a passing mention. “Obviously there are people out there who are feeling deeply disaffected,” the president said with his cool detachment. In an election this close — Clinton, let’s not forget, won the popular vote — any factor could have made the difference: being a candidate of the establishment in a time of change, resistance to a woman as president and backlash against the first black president, and FBI Director James B. Comey’s last-minute intervention in the election. But millions of Americans are justifiably anxious about their economic well-being. And if Clinton and Obama had limited the build-on-success theme during the campaign in favor of a more populist vision and policies, they really would have something to smile about this week. Twitter: @Milbank Read more from Dana Milbank’s archive, follow him on Twitter or subscribe to his updates on Facebook.']})

I am able to replicate my earlier reported Python results in your collaboratory notebook with this input.

danilojsl commented 3 years ago

Hi @alex2awesome

I reproduced the error. The issue is with unicode spaces which are sent through the pipeline when a spark dataframe is created from a pandas dataframe (i.e sdf = spark.createDataFrame(df)). Tokenizer annotator is not able to handle it and does not consider unicode white space as a character for tokenization. However, you can use RegexTokenizer to make this work. So, you just need to define a Regex Tokenizer in a way that can handle unicode white spaces like this:

val tokenizer = new RegexTokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
      .setPattern("\\p{Zs}")

I tested adding it as a tokenizer for the pipeline above and got the same number of tokens in Scala and Python as well

total sentences: 48
total tokens: 817
total embeddings: 817
total tokens finisher: 817
total embeddings finisher: 817

alex2awesome commented 3 years ago

This is great —

My suggestion is to somehow bake this into the Tokenizer annotator. This took hours to diagnose on my end before I opened this issue, and I was also lucky to have looked at an example where this issue was present. I don't think any user is going to know going into building their pipeline that this will be an issue.

I think that when the Tokenizer doesn't know how to annotate any character, it should either throw an error or a warning and output a NaN or something that downstream components know how to deal with. The fact that Tokenizer just silently drops these characters probably causes a lot of bugs for people interested in doing word-level analysis like I was (i.e. needing to match embeddings to words).

maziyarpanahi commented 2 years ago

This is actually an issue inside BertEmbeddings annotators when it tries to re-align the custom tokens back to the token pieces with their vectors.

I'll see if we can fix this in 3.3.1 release

Aditi00a commented 2 years ago

I am still getting this issue with spark nlp v 3.3.1

Description

I am tokenizing the following document, and noticing that the number of embeddings outputted by BertEmbeddings is less than the number of tokens outputted by SparkNLP's tokenizer.

Expected Behavior

I expect that the number of embeddings returned by BERTEmbeddings would be exactly equal to the number of tokens returned by the tokenizer.

Current Behavior

In the example sentence I'll give below, BertEmbeddings returns a list of 98 embeddings for 124 tokens. I have also tried a solution given in the thread above applying unidecode, however that isn't working either

Possible Solution

If there is a sensible reason for a token not to be embedded, it would at least be nice for a placeholder to be returned, so the length of the list of embeddings is equal to the length of the list of tokens.

Steps to Reproduce

pretrained_model = "bert_base_uncased"
text= ['White Hill Music presents the Latest Punjabi Songs 2021 ""Jeena Paauni Aa"" from the album Jug ni by Maninder Buttar Lyrics by Maninder Buttar Music by MixSingh A Film By Drip Films Produced By Gunbir Singh Sidhu & Manmord Sidhu Listen Full Album Jug ni on Spotify Apple Music iTunes Gaana Amazon Music Amazon Prime Music YouTube Music Wynk Hungama Resso Operator Codes * Airtel & Airtel Hellotune Link Airtel Subscribers to Set as Hello tune Click on Wynk music link * Set Vi CRBT Click Vi Subscribers for Caller Tune Direct Dial 9 * BSNL N & W Direct Dial BSNL N & W Subscribers Direct Dial 89 Credits Song Jeena Paauni Aa Singer/Lyrics/Composer Maninder Buttar Music Director MixSingh Music Composed']
test_df = spark.createDataFrame(text, "string").toDF("text")
test_df.show()
document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(['document']) \
    .setOutputCol('sentence') 

token = Tokenizer() \
    .setInputCols(['sentence']) \
    .setOutputCol('token')

embeddings = BertEmbeddings.pretrained(pretrained_model, 'en'). \
    setInputCols(["sentence", 'token']). \
    setOutputCol("embeddings"). \
    setBatchSize(8)

ner_prediction_pipeline = Pipeline(
    stages=[
        document,
        sentence,
        token,
        embeddings
    ])

columns = StructType([StructField('text',
                                  StringType(), True)])

empty_data = spark.createDataFrame(data=[],
                            schema=columns)

prediction_model = ner_prediction_pipeline.fit(empty_data)
predictions = prediction_model.transform(test_df)

predictions_df = predictions. \
    selectExpr("document", "sentence","token", "embeddings.result embeddings",'token.result bert_token'). \
    withColumn('embeddings_count',sf.size('embeddings')). \
    withColumn('token_count', sf.size('bert_token')). \
    withColumn('count_difference', sf.col('token_count') - sf.col('embeddings_count'))
predictions_df.show()

OUTPUT - 
+--------------------+--------------------+--------------------+--------------------+--------------------+----------------+-----------+----------------+
|            document|            sentence|               token|          embeddings|          bert_token|embeddings_count|token_count|count_difference|
+--------------------+--------------------+--------------------+--------------------+--------------------+----------------+-----------+----------------+
|[[document, 0, 70...|[[document, 0, 70...|[[token, 0, 4, Wh...|[white, hill, mus...|[White, Hill, Mus...|              98|        124|              26|
+--------------------+--------------------+--------------------+--------------------+--------------------+----------------+-----------+----------------+

Your Environment

Spark NLP version sparknlp.version(): Spark NLP version: 3.3.1

Apache NLP version spark.version: Apache Spark version: 3.0.3

Setup and installation (Pypi, Conda, Maven, etc.): google colabs !wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

maziyarpanahi commented 2 years ago

This issue was regarding Unicode characters missing from BertEmbeddings output. It was resolved in Spark NLP 3.3.1 release. (I address your question in your issue: https://github.com/JohnSnowLabs/spark-nlp/issues/6367)

JohnSnowLabs / spark-nlp