JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

Load and apply XLM-RoBERTa tokenizer #13960

Closed yc0619 closed 1 year ago

yc0619 commented 1 year ago

Link to the documentation pages (if available)

No response

How could the documentation be improved?

Hi,

Can anybody help with loading the XLM-RoBERTa tokenizer offline and applying it to a DataFrame?

I'd appreciate the help, since I didn't find anything in the documentation.

Cheers.

maziyarpanahi commented 1 year ago

Hi,

Spark NLP has its own Tokenizer and RegexTokenizer. It uses BPE, SentencePiece, etc. internally, depending on your pipeline/models.

Would you mind sharing some code and your use case? (For instance, for word embeddings you can use any of these models: they transform your text into vectors while still allowing custom tokenization, using SentencePiece internally to encode/decode the pieces.) https://sparknlp.org/models?annotator=XlmRoBertaEmbeddings
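As a minimal sketch of the usual flow (the pretrained name "xlm_roberta_base" with language "xx" is the default checkpoint on the hub; adjust as needed):

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, XlmRoBertaEmbeddings

spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Rule-based tokenization; the SentencePiece encoding happens inside the annotator
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")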

yc0619 commented 1 year ago

Hi,

thanks for the quick reply.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import XlmRoBertaEmbeddings

spark_session = SparkSession.builder \
    .appName("SparkMongoDB") \
    .config("spark.mongodb.input.uri", ...) \
    ...
    .getOrCreate()

df = spark_session.read.format("mongo").load()
# udf_clean_split: user-defined UDF that cleans and splits the text column
df = df.withColumn("split_clean_text", explode(udf_clean_split(df["text"])))

# document assembler
document_assembler = DocumentAssembler() \
    .setInputCol("...") \
    .setOutputCol("...")
# xlm-roberta
xlm_roberta_loaded = XlmRoBertaEmbeddings.load("./models/xlm-roberta-base") \
    .setInputCols(["a", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)
# Can I use the xlm-roberta-base tokenizer here?
# Under this folder there are:
# sentencepiece.bpe.model, special_tokens_map.json, tokenizer_config.json

This script is for prediction. Since I trained my classifier with Transformers (Hugging Face) in Python, I want to use the same tokenization method as in training. Thanks!

maziyarpanahi commented 1 year ago

As you can see, you also have to copy your sentencepiece.bpe.model into the assets folder in order to import it. (This is the same for all models, whether or not the tokenizer was changed.)
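A minimal sketch of that export-and-copy step, assuming a TensorFlow export of xlm-roberta-base (the paths and model name are illustrative):

import shutil
from transformers import TFXLMRobertaModel, XLMRobertaTokenizer

MODEL_NAME = "xlm-roberta-base"
EXPORT_PATH = f"./models/{MODEL_NAME}"

# Save the tokenizer files (sentencepiece.bpe.model, tokenizer_config.json, ...)
tokenizer = XLMRobertaTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(EXPORT_PATH)

# saved_model=True writes a TensorFlow SavedModel under {EXPORT_PATH}/saved_model/1
model = TFXLMRobertaModel.from_pretrained(MODEL_NAME)
model.save_pretrained(EXPORT_PATH, saved_model=True)

# Spark NLP reads the SentencePiece model from the SavedModel's assets folder,
# so copy it there before importing
shutil.copy(f"{EXPORT_PATH}/sentencepiece.bpe.model",
            f"{EXPORT_PATH}/saved_model/1/assets/")

# The import itself then goes through loadSavedModel; the result can be saved
# once and reused later via XlmRoBertaEmbeddings.load():
# XlmRoBertaEmbeddings.loadSavedModel(f"{EXPORT_PATH}/saved_model/1", spark) \
#     .setInputCols(["document", "token"]) \
#     .setOutputCol("embeddings") \
#     .save(f"./models/{MODEL_NAME}_spark_nlp")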

yc0619 commented 1 year ago

Yes, I have seen that before, and the following code works for me. Still, I have one question.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = XlmRoBertaEmbeddings.load("/path/to/huggingface/model") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["embeddings"]) \
    .setOutputCols("finished_embeddings") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      tokenizer,
      embeddings,
      embeddingsFinisher
    ])
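For reference, a quick usage sketch of applying the fitted pipeline (assuming df has a "text" column):

model = pipeline.fit(df)
result = model.transform(df)
result.select("finished_embeddings").show(truncate=False)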

Since I instantiate the Tokenizer here and then load xlm_roberta_loaded: if I leave the Tokenizer out of the pipeline, I don't get proper results. What I'd like to know is whether using the default Tokenizer with xlm_roberta_loaded (xlm-roberta-base) is equivalent to what I did in Transformers: tokenizing with XLMRobertaTokenizer and sending the tokenized data to XLMRoberta. Thanks. :)

maziyarpanahi commented 1 year ago

No, it's not. This is an NLP library, so skipping tokenization makes no sense. You need to have a Tokenizer; if you don't have any rules, you can use simple whitespace tokenization via RegexTokenizer instead. (You do need this, since all the NLP/NLU tasks in any pipeline rely on the actual tokenization.)

Your XLM-RoBERTa tokenizer will be used internally on those tokens after the Tokenizer. (It's common practice to map a custom tokenizer onto SentencePiece or BPE.)
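If you have no custom rules, a whitespace-only sketch could look like this (the pattern is an assumption; adjust it to your data):

from sparknlp.annotator import RegexTokenizer

regex_tokenizer = RegexTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setPattern("\\s+")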