Closed · LIN-Yu-Ting closed this issue 11 months ago
Spark NLP uses a custom Tokenizer/RegexTokenizer where you control how the text is first tokenized/split into meaningful tokens; internally, each token then gets encoded/decoded into what needs to be fed into the model.
Hugging Face doesn't have a separate Tokenizer stage; it uses the model's own tokenization, which in NLP pipelines is not meaningful, as users bring their own tokenizers. Since the tokens are not the same in these two libraries, some IDs might be slightly different.
The closest you can get is to use RegexTokenizer with whitespace as the only rule, to avoid any default tokenization rules.
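To illustrate the difference, here is a minimal, self-contained Python sketch (not Spark NLP or Hugging Face code) contrasting a whitespace-only split, like a RegexTokenizer whose single rule is a whitespace pattern, with a rule-based split that also separates punctuation:

```python
import re

def whitespace_tokenize(text):
    """Split purely on whitespace, mimicking a RegexTokenizer whose only
    rule is a whitespace pattern: punctuation stays attached to words."""
    return [t for t in re.split(r"\s+", text) if t]

def default_tokenize(text):
    """A rough stand-in for a default rule-based tokenizer that also
    splits off punctuation characters."""
    return re.findall(r"\w+|[^\w\s]", text)

text = "ID: WW-23-02011_ONC"
print(whitespace_tokenize(text))  # ['ID:', 'WW-23-02011_ONC']
print(default_tokenize(text))     # ['ID', ':', 'WW', '-', '23', '-', '02011_ONC']
```

With whitespace as the only rule, the user's tokens pass through unchanged, which keeps the pipeline's token boundaries under the user's control.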
@maziyarpanahi Thank you for the explanation. Does this mean that all the tutorials provided at this link https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/transformers are inaccurate, because Spark NLP uses a custom Tokenizer, as you say, instead of the tokenizer provided inside the Hugging Face models?
As I am experimenting with the RoBertaForQuestionAnswering model, I checked the implementation of RoBertaClassification, which uses
val bpeTokenizer = BpeTokenizer.forModel("roberta", merges, vocabulary)
internally. Do you mean that I need to replace this BpeTokenizer with RegexTokenizer to obtain the same performance that Hugging Face transformers show?
No, all of the transformers need their own tokenization, but it is only used to encode and decode before and after we feed the model. Before any of that, the text itself also gets tokenized into human-readable chunks/tokens. These tokens are internally encoded/decoded by the official tokenizer of that transformer; that's why you import vocab.txt, SentencePiece models, merges, etc. (we need them, or else none of those transformers would work).
But these tokenizers are not useful on their own in NLP pipelines; users have their own tokenizers, and those tokens must then be mapped to the internal BPE, SentencePiece, etc. That's why they are internal.
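As a toy illustration of that internal mapping, the sketch below does a greedy longest-match-first subword split in the WordPiece style (the vocabulary is made up, and RoBERTa actually uses byte-level BPE with merges; this only shows the word-to-subword step conceptually):

```python
# Toy vocabulary; real models ship vocab.txt / merges files with tens of
# thousands of entries.
VOCAB = {"report", "AT", "##G", "##Onco", "##X", "patient", "[UNK]"}

def wordpiece(token):
    """Greedy longest-match-first subword split (WordPiece-style sketch)."""
    pieces, start = [], 0
    while start < len(token):
        end = len(token)
        while end > start:
            # continuation pieces carry the '##' marker
            piece = token[start:end] if start == 0 else "##" + token[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                break
            end -= 1
        if end == start:          # nothing in the vocab matched
            return ["[UNK]"]
        start = end
    return pieces

# The pipeline's human-readable tokens get mapped to internal subwords:
for tok in ["report", "ATGOncoX"]:
    print(tok, "->", wordpiece(tok))
# report -> ['report']
# ATGOncoX -> ['AT', '##G', '##Onco', '##X']
```

The subword IDs that feed the model come from this internal step, which is why the pipeline-level tokens and the model-level tokens can legitimately differ.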
That said, I just realized you are using RoBertaForQuestionAnswering, which doesn't need any tokenization beforehand. (Sorry, I thought you were using RoBERTa for embeddings or classification tasks; those need a Tokenizer/RegexTokenizer in the pipeline, which is why I asked if you could adjust that.)
So if the text goes directly into our own internal BpeTokenizer and the results are not the same as the AutoTokenizer's, then that must be an error on our side, possibly due to the mixed languages. Let me assign someone to look into this closely, and I will update here. (Sorry for the confusion.)
I am guessing that the problem might be in this file:
class RobertaTokenizer(
merges: Map[(String, String), Int],
vocab: Map[String, Int],
specialTokens: SpecialTokens,
padWithSentenceTokens: Boolean = false,
addPrefixSpace: Boolean = false)
extends Gpt2Tokenizer(
merges,
vocab,
specialTokens,
padWithSentenceTokens,
prependString = "Ġ",
addPrefixSpace)
Since prependString = "Ġ" is added, the results of RobertaTokenizer are different from those of the Hugging Face tokenizer. However, I am not sure whether this is an error in Hugging Face's AutoTokenizer or, as you say, an error in Spark NLP. Thanks for your time looking into this issue.
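For context, here is a simplified Python sketch of what the "Ġ" marker does in byte-level BPE pre-tokenization (the real tokenizer works on bytes; this only illustrates the word-boundary marking): a token preceded by a space gets a "Ġ" prefix, and addPrefixSpace decides whether the very first token is treated as if a space preceded it, which changes its ID.

```python
def mark_word_starts(text, add_prefix_space=False):
    """Sketch of byte-level BPE word-boundary marking: any word preceded
    by a space is prefixed with 'Ġ'. With add_prefix_space=True the text
    behaves as if it started with a space, so the first word is marked too."""
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    words = [w for w in text.split(" ") if w]
    marked = []
    for i, w in enumerate(words):
        first_and_unspaced = (i == 0 and not text.startswith(" "))
        marked.append(w if first_and_unspaced else "Ġ" + w)
    return marked

print(mark_word_starts("The report"))                         # ['The', 'Ġreport']
print(mark_word_starts("The report", add_prefix_space=True))  # ['ĠThe', 'Ġreport']
```

Because "The" and "ĠThe" are different vocabulary entries, any mismatch in how the prefix space is applied shows up as different token IDs for the first word, even when the rest of the sequence agrees.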
@DevinTDHa Could you please have a look on this issue ?
Hi! Yes, I'm going to take a look. I'll keep this issue updated.
@LIN-Yu-Ting While Devin is comparing the RoBERTa tokenization between Spark NLP and HF, could you please provide an example of what made you look into the tokenizations? Did the answer start and end at the wrong positions? (That's what matters in question answering, so I am wondering if you are seeing a wrong answer.)
@maziyarpanahi Yes. I have a test example — context: "The report ATGOncoX (ID: WW-23-02011_ONC) for the patient 吳天恩 (patient ID K112-00000_PP23039) is issued on Apr 21, 2023." question: "In which exon does the variant occur?"
I trained a QA model myself, available here: https://huggingface.co/LinYuting/atgx-roberta-base-squad2. As you can see, the predicted answer is ONC, but with very low confidence.
If I save this model following the tutorial and run it on Spark NLP using RoBertaForQuestionAnsweringTestSpec, then I obtain an answer with start -> 34 and end -> 34, with confidences 0.609 and 0.839.
@DevinTDHa Any updates from your side?
Hi @LIN-Yu-Ting
I have found a bug and am actively working on a fix. This indeed has to do with our implementation of the tokenizer. I'll update this issue once I've created a PR.
In the meantime, could you post the code that reproduces the issue? I think you are using one of our tests with your custom model. I tried to do the same but didn't get exactly the same result (for me it extracted patient).
@DevinTDHa, you are right. I obtained the screenshot above due to other changes, shown in the following. I tried to investigate some possible causes of this difference myself.
In RoBertaClassification.scala, I forced addPrefixSpace to be true:
def tokenizeDocument(
docs: Seq[Annotation],
maxSeqLength: Int,
caseSensitive: Boolean): Seq[WordpieceTokenizedSentence] = {
// we need the original form of the token
// let's lowercase if needed right before the encoding
val bpeTokenizer = BpeTokenizer.forModel("roberta", merges, vocabulary, addPrefixSpace = true)
val sentences = docs.map { s => Sentence(s.result, s.begin, s.end, 0) }
and in RobertaTokenizer.scala, I removed the prependString = "Ġ" argument:
class RobertaTokenizer(
merges: Map[(String, String), Int],
vocab: Map[String, Int],
specialTokens: SpecialTokens,
padWithSentenceTokens: Boolean = false,
addPrefixSpace: Boolean = false)
extends Gpt2Tokenizer(merges, vocab, specialTokens, padWithSentenceTokens, addPrefixSpace = addPrefixSpace)
Once I rolled back these two modifications, I obtained the same result as you did. However, this is still different from the Hugging Face result. Anyway, thanks for your efforts on this issue.
@LIN-Yu-Ting Thanks for the info and thanks for your help!
I am finalizing a fix, as there are multiple factors that influenced the wrong prediction and score. It will be available in the next release. I will link the PR in this issue once it is ready.
Is there an existing issue for this?
Who can help?
No response
What are you working on?
I am working with the Hugging Face model deepset/roberta-base-squad2. I cannot obtain the same tokenized output from Hugging Face and Spark NLP.
Current Behavior
I executed the following code and obtained an output from the Tokenizer.
However, this result is different from the output of RobertaTokenizer in Spark NLP when I call the tokenizeDocument function.
I believe this is the root cause of why I cannot reproduce the same predictions after importing a Hugging Face model into Spark NLP by following the tutorial HuggingFace in Spark NLP - RoBertaForQuestionAnswering.ipynb.
Expected Behavior
I expect both of them to return the same list of token_ids. However, I do not know which one is correct. But this does impact the prediction performance when I run a Hugging Face model on Spark NLP.
Steps To Reproduce
You can use the Spark NLP Unit Test RoBertaForQuestionAnsweringTestSpec to reproduce this issue.
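When comparing the two tokenizations, a small hypothetical helper like the following (the token IDs shown are made up for illustration) can locate the first position where the two ID sequences diverge:

```python
def first_divergence(ids_a, ids_b):
    """Return the index of the first position where two token-id sequences
    differ; if one is a strict prefix of the other, return the shorter
    length; return None when the sequences are identical."""
    for i, (a, b) in enumerate(zip(ids_a, ids_b)):
        if a != b:
            return i
    return None if len(ids_a) == len(ids_b) else min(len(ids_a), len(ids_b))

# e.g. compare ids from Hugging Face's tokenizer against Spark NLP's output
hf_ids    = [0, 133, 266, 83, 2]    # illustrative values only
spark_ids = [0, 133, 266, 590, 2]
print(first_divergence(hf_ids, spark_ids))  # 3
```

Knowing the first divergent position makes it easier to tell whether the mismatch is at the start of the text (a prefix-space issue) or in the middle (a merge/vocab issue).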
Spark NLP version and Apache Spark
Spark NLP 4.4.4
Type of Spark Application
No response
Java Version
No response
Java Home Directory
No response
Setup and installation
No response
Operating System and Version
No response
Link to your project (if available)
No response
Additional Information
No response