JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

Output of Spark NLP's RobertaTokenizer differs from HuggingFace's RobertaTokenizer #14005

Closed LIN-Yu-Ting closed 11 months ago

LIN-Yu-Ting commented 1 year ago

Is there an existing issue for this?

Who can help?

No response

What are you working on?

I am working with the HuggingFace model deepset/roberta-base-squad2. I cannot obtain the same tokenized output from HuggingFace and Spark NLP.

Current Behavior

I executed the following code and obtained this tokenizer output:

from transformers import RobertaTokenizerFast 

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base", add_prefix_space=False)
inputs = tokenizer("The report ATGOncoX (ID: WW-23-02011_ONC) for the patient 吳天恩 (patient ID K112-00000_PP23039) is issued on Apr 21, 2023.")
inputs

{'input_ids': [0, 133, 266, 3263, 534, 4148, 876, 1000, 36, 2688, 35, 15584, 12, 1922, 12, 288, 22748, 1215, 2191, 347, 43, 13, 5, 3186, 47111, 16948, 15264, 49429, 37127, 10172, 15375, 36, 23846, 4576, 229, 17729, 12, 25034, 1215, 5756, 20352, 3416, 43, 16, 1167, 15, 14830, 733, 6, 291, 1922, 4, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
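
To compare the two outputs piece by piece, the IDs above can be mapped back to their BPE token strings (a small sketch using the tokenizer's standard convert_ids_to_tokens method):

# Map the IDs back to their BPE token strings (e.g. 'ĠAT', 'G', 'On', ...)
# so they can be compared with Spark NLP's TokenPiece output below.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"])
print(list(zip(tokens, inputs["input_ids"])))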

However, this result differs from the output of Spark NLP's RobertaTokenizer when I call the tokenizeDocument function:

0 = {TokenPiece@18528} TokenPiece(The,The,133,true,0,2)
1 = {TokenPiece@18529} TokenPiece(Ġreport,report,266,true,3,9)
2 = {TokenPiece@18530} TokenPiece(ĠAT,ATGOncoX,3263,true,10,12)
3 = {TokenPiece@18531} TokenPiece(G,ATGOncoX,534,false,13,13)
4 = {TokenPiece@18532} TokenPiece(On,ATGOncoX,4148,false,14,15)
5 = {TokenPiece@18533} TokenPiece(co,ATGOncoX,876,false,16,17)
6 = {TokenPiece@18534} TokenPiece(X,ATGOncoX,1000,false,18,18)
7 = {TokenPiece@18535} TokenPiece(Ġ(,(,36,true,19,20)
8 = {TokenPiece@18536} TokenPiece(ID,ID,4576,true,21,22)
9 = {TokenPiece@18537} TokenPiece(:,:,4832,true,23,23)
10 = {TokenPiece@18538} TokenPiece(ĠWW,WW,15584,true,24,26)
11 = {TokenPiece@18539} TokenPiece(-,-,111,true,27,27)
12 = {TokenPiece@18540} TokenPiece(23,23,883,true,28,29)
13 = {TokenPiece@18541} TokenPiece(-,-,111,true,30,30)
14 = {TokenPiece@18542} TokenPiece(0,02011,321,true,31,31)
15 = {TokenPiece@18543} TokenPiece(2011,02011,22748,false,32,35)
16 = {TokenPiece@18544} TokenPiece(_,_,18134,true,36,36)
17 = {TokenPiece@18545} TokenPiece(ON,ONC,5121,true,37,38)
18 = {TokenPiece@18546} TokenPiece(C,ONC,347,false,39,39)
19 = {TokenPiece@18547} TokenPiece(),),4839,true,40,40)
20 = {TokenPiece@18548} TokenPiece(Ġfor,for,13,true,41,44)
21 = {TokenPiece@18549} TokenPiece(Ġthe,the,5,true,45,48)
22 = {TokenPiece@18550} TokenPiece(Ġpatient,patient,3186,true,49,56)
23 = {TokenPiece@18551} TokenPiece( 吳天恩, 吳天恩,3,true,57,60)
24 = {TokenPiece@18552} TokenPiece(Ġ(,(,36,true,61,62)
25 = {TokenPiece@18553} TokenPiece(patient,patient,3186,true,63,69)
26 = {TokenPiece@18554} TokenPiece(ĠID,ID,4576,true,70,72)
27 = {TokenPiece@18555} TokenPiece(ĠK,K,229,true,73,74)
28 = {TokenPiece@18556} TokenPiece(112,112,12730,true,75,77)
29 = {TokenPiece@18557} TokenPiece(-,-,111,true,78,78)
30 = {TokenPiece@18558} TokenPiece(00000,00000,3,true,79,83)
31 = {TokenPiece@18559} TokenPiece(_,_,18134,true,84,84)
32 = {TokenPiece@18560} TokenPiece(PP,PP,18390,true,85,86)
33 = {TokenPiece@18561} TokenPiece(230,23039,16242,true,87,89)
34 = {TokenPiece@18562} TokenPiece(39,23039,3416,false,90,91)
35 = {TokenPiece@18563} TokenPiece(),),4839,true,92,92)
36 = {TokenPiece@18564} TokenPiece(Ġis,is,16,true,93,95)
37 = {TokenPiece@18565} TokenPiece(Ġissued,issued,1167,true,96,102)
38 = {TokenPiece@18566} TokenPiece(Ġon,on,15,true,103,105)
39 = {TokenPiece@18567} TokenPiece(ĠApr,Apr,14830,true,106,109)
40 = {TokenPiece@18568} TokenPiece(Ġ21,21,733,true,110,112)
41 = {TokenPiece@18569} TokenPiece(,,,,2156,true,113,113)
42 = {TokenPiece@18570} TokenPiece(Ġ20,2023,291,true,114,116)
43 = {TokenPiece@18571} TokenPiece(23,2023,1922,false,117,118)
44 = {TokenPiece@18572} TokenPiece(.,.,479,true,119,119)

I believe this is the root cause of why I cannot reproduce the same predictions after importing a HuggingFace model into Spark NLP by following the tutorial HuggingFace in Spark NLP - RoBertaForQuestionAnswering.ipynb.

Expected Behavior

I expect both of them to return the same list of token IDs, although I do not know which one is correct. This does impact prediction performance when I run a HuggingFace model on Spark NLP.

Steps To Reproduce

You can use the Spark NLP unit test RoBertaForQuestionAnsweringTestSpec to reproduce this issue; a Python sketch of an equivalent reproduction is shown below.
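
A minimal Python sketch of such a reproduction (the model path is illustrative and assumes the model was exported per the import tutorial; the column names follow the documented question-answering setup):

import sparknlp
from sparknlp.base import MultiDocumentAssembler
from sparknlp.annotator import RoBertaForQuestionAnswering
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Question and context go into separate document columns for the QA annotator.
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

# Hypothetical local path to the model exported from HuggingFace.
qa = RoBertaForQuestionAnswering.load("./roberta_qa_spark_nlp") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")

data = spark.createDataFrame([[
    "In which exon does the variant occur?",
    "The report ATGOncoX (ID: WW-23-02011_ONC) for the patient 吳天恩 "
    "(patient ID K112-00000_PP23039) is issued on Apr 21, 2023."
]]).toDF("question", "context")

Pipeline(stages=[document_assembler, qa]) \
    .fit(data).transform(data) \
    .select("answer.result").show(truncate=False)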

Spark NLP version and Apache Spark

Spark NLP 4.4.4

Type of Spark Application

No response

Java Version

No response

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

No response

maziyarpanahi commented 1 year ago

Spark NLP uses a custom Tokenizer/RegexTokenizer where you control how the text is first tokenized/split into meaningful tokens; internally, each token then gets encoded/decoded into what needs to be fed to the model.

Hugging Face doesn't have a standalone Tokenizer; it uses the model's own tokenization, which in NLP pipelines is absolutely meaningless, as users have custom tokenizers. Since the tokens are not the same in these two libraries, some IDs might be slightly different.

The closest you can get is to use RegexTokenizer with whitespace as the rule, to avoid having any default tokenization rules, for example:
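
A minimal sketch of that setup in the Python API (the whitespace pattern and column names are the usual defaults):

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import RegexTokenizer

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split on whitespace only, so no other default tokenization rules apply.
regex_tokenizer = RegexTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setPattern("\\s+")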

LIN-Yu-Ting commented 1 year ago

@maziyarpanahi Thank you for the explanation. Does this mean that all tutorials provided at https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/transformers are inaccurate, because Spark NLP uses a custom Tokenizer, as you say, instead of the tokenizer provided inside the HuggingFace models?

As I am working with the RobertaQuestionAnswering model, I have checked the implementation of RobertaClassification, which uses

val bpeTokenizer = BpeTokenizer.forModel("roberta", merges, vocabulary)

internally. Do you mean that I need to replace this BpeTokenizer with a RegexTokenizer to obtain the same performance that HuggingFace transformers show?

maziyarpanahi commented 1 year ago

No, all of the transformers need their own tokenization, but it is used to encode and decode right before and after we feed the model. Before all of that, however, the text itself also gets tokenized into human-readable chunks/tokens. These tokens are then internally encoded/decoded by the official tokenizer of that transformer; that's why you import vocab.txt, SentencePiece models, merges, etc. (we need them, or else none of those transformers would work).

But these tokenizers are not useful in NLP pipelines; users have their own tokenizers, and those tokens must then be mapped to the internal BPE, SentencePiece, etc. That's why they are internal. The sketch below illustrates this two-stage flow.
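
An illustration of that flow for the embeddings case (a sketch; the pretrained model is the library default):

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, RoBertaEmbeddings
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Stage 1: user-controlled tokenization into human-readable tokens.
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Stage 2: the annotator internally maps those tokens to BPE pieces
# (via the imported vocabulary and merges) before feeding the model.
embeddings = RoBertaEmbeddings.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings])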


That said, I just realized you are using RoBertaForQuestionAnswering, which doesn't need any tokenization beforehand. (Sorry, I thought you were using RoBERTa for embeddings or classification tasks; those need a Tokenizer/RegexTokenizer in the pipeline, which is why I was asking if you could adjust it.)

So if the text goes directly into our own internal BpeTokenizer and the results are not the same as the AutoTokenizer's, then that must be an error on our side due to the mixed languages. Let me assign someone to look into this closely, and we will update here. (Sorry for the confusion.)

LIN-Yu-Ting commented 1 year ago

I am guessing that the problem might be in this file:

class RobertaTokenizer(
    merges: Map[(String, String), Int],
    vocab: Map[String, Int],
    specialTokens: SpecialTokens,
    padWithSentenceTokens: Boolean = false,
    addPrefixSpace: Boolean = false)
    extends Gpt2Tokenizer(
      merges,
      vocab,
      specialTokens,
      padWithSentenceTokens,
      prependString = "Ġ",
      addPrefixSpace)

As prependString = "Ġ" is added, the results of RobertaTokenizer differ from those of the HuggingFace tokenizer. However, I am not sure whether it is an error in HuggingFace's AutoTokenizer or, as you say, an error in Spark NLP. Thanks for taking the time to look into this issue.
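
One quick way to observe the effect of the prefix space on the HuggingFace side (a small sketch, not part of the original report):

from transformers import RobertaTokenizerFast

# Without a prefix space the first word is encoded as 'report'; with
# add_prefix_space=True it is encoded as 'Ġreport', yielding a different ID.
tok = RobertaTokenizerFast.from_pretrained("roberta-base")
print(tok("report", add_special_tokens=False).input_ids)

tok_prefix = RobertaTokenizerFast.from_pretrained("roberta-base", add_prefix_space=True)
print(tok_prefix("report", add_special_tokens=False).input_ids)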

LIN-Yu-Ting commented 1 year ago

@DevinTDHa Could you please have a look at this issue?

DevinTDHa commented 1 year ago

@DevinTDHa Could you please have a look at this issue?

Hi! Yes, I'm going to take a look. I'll keep this issue updated.

maziyarpanahi commented 1 year ago

@LIN-Yu-Ting while Devin is looking into the RoBERTa tokenization differences between Spark NLP and HF, could you please provide an example of what made you look into the tokenization? Did the answer start and end at the wrong positions? (That's what matters in question answering, so I am wondering if you are seeing a wrong answer.)

LIN-Yu-Ting commented 1 year ago

@maziyarpanahi Yes. I have a test sentence. Context: "The report ATGOncoX (ID: WW-23-02011_ONC) for the patient 吳天恩 (patient ID K112-00000_PP23039) is issued on Apr 21, 2023." Question: "In which exon does the variant occur?"

I have trained a QA model myself, available at https://huggingface.co/LinYuting/atgx-roberta-base-squad2. As you can see, the predicted answer is ONC, but with very low confidence.
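
For reference, a sketch of how that HuggingFace prediction can be obtained directly (the exact score depends on the checkpoint):

from transformers import pipeline

# Run the published checkpoint through the standard QA pipeline.
qa = pipeline("question-answering", model="LinYuting/atgx-roberta-base-squad2")
result = qa(
    question="In which exon does the variant occur?",
    context="The report ATGOncoX (ID: WW-23-02011_ONC) for the patient 吳天恩 "
            "(patient ID K112-00000_PP23039) is issued on Apr 21, 2023.",
)
print(result)  # {'answer': ..., 'score': ..., 'start': ..., 'end': ...}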

[Screenshot: 2023-09-28 08:48]

If I save this model following the tutorial and run it on Spark NLP using RoBertaForQuestionAnsweringTestSpec, I obtain an answer with start -> 34 and end -> 34, with confidences 0.609 and 0.839.

[Screenshot: 2023-09-28 09:30]

LIN-Yu-Ting commented 1 year ago

@DevinTDHa Any updates from your side?

DevinTDHa commented 1 year ago

Hi @LIN-Yu-Ting

I have found a bug and am actively working on a fix. This indeed has something to do with our implementation of the tokenizer. I'll update this issue once I've created a PR.

In the meantime, could you post the code that reproduces the issue? I think you are using one of our tests with your custom model. I tried to do the same, but I didn't get exactly the same result (for me, it extracted patient).

LIN-Yu-Ting commented 1 year ago

@DevinTDHa, you are right. I obtained the screenshot above because of other changes, shown below. I had tried to investigate on my own some possible causes of this difference.

In RoBertaClassification.scala, I forced addPrefixSpace to be true:

  def tokenizeDocument(
      docs: Seq[Annotation],
      maxSeqLength: Int,
      caseSensitive: Boolean): Seq[WordpieceTokenizedSentence] = {
    // we need the original form of the token
    // let's lowercase if needed right before the encoding
    val bpeTokenizer = BpeTokenizer.forModel("roberta", merges, vocabulary, addPrefixSpace = true) // forced to true (default: false)
    val sentences = docs.map { s => Sentence(s.result, s.begin, s.end, 0) }

and in RobertaTokenizer.scala, I removed the argument prependString = "Ġ":

class RobertaTokenizer(
    merges: Map[(String, String), Int],
    vocab: Map[String, Int],
    specialTokens: SpecialTokens,
    padWithSentenceTokens: Boolean = false,
    addPrefixSpace: Boolean = false)
    extends Gpt2Tokenizer(merges, vocab, specialTokens, padWithSentenceTokens, addPrefixSpace = addPrefixSpace)

Once I roll back these two modifications, I obtain the same result as you did. However, this is still different from the HuggingFace result. Anyway, thanks for your efforts on this issue.

[Screenshot: 2023-10-11 09:51]

DevinTDHa commented 1 year ago

@LIN-Yu-Ting Thanks for the info and thanks for your help!

I am finalizing a fix, as there are multiple factors that influenced the wrong prediction and score. It will be available in the next release. I will link the PR in this issue once it is ready.