Closed: yc0619 closed this issue 1 year ago
Hi,
Spark NLP has its own Tokenizer and RegexTokenizer. It will use BPE, SentencePiece, etc. internally depending on your pipeline/models.
Would you mind sharing some code and your use case? (For instance, for word embeddings you can use any of these models; they transform your text into vectors and handle the model-specific tokenization for you, using SentencePiece internally to encode/decode pieces.) https://sparknlp.org/models?annotator=XlmRoBertaEmbeddings
Hi,
thanks for the quick reply.
spark_session = SparkSession.builder \
.appName("SparkMongoDB") \
.config("spark.mongodb.input.uri", ...) \
...
.getOrCreate()
df = spark_session.read.format("mongo").load()
df = df.withColumn("split_clean_text", explode(udf_clean_split(df["text"])))
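(Note for readers: udf_clean_split is the poster's own helper and isn't shown in the thread. Purely as a hypothetical sketch, a plain-Python function such a UDF might wrap could look like this:)

```python
import re
from typing import List

def clean_split(text: str) -> List[str]:
    # Hypothetical stand-in for the poster's (unshown) udf_clean_split:
    # collapse whitespace, then split the text into sentence-like chunks.
    normalized = re.sub(r"\s+", " ", text or "").strip()
    return [s for s in re.split(r"(?<=[.!?])\s", normalized) if s]
```

Wrapped with pyspark.sql.functions.udf and an ArrayType(StringType()) return type, explode would then produce one row per chunk, as in the line above.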
# document assembler
document_assembler = DocumentAssembler() \
.setInputCol("...") \
.setOutputCol("...")
# xlm-roberta
xlm_roberta_loaded = XlmRoBertaEmbeddings.load("./models/xlm-roberta-base") \
.setInputCols(["a", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
# Can I use the xlm-roberta-base tokenizer here?
# Under this folder there are:
# sentencepiece.bpe.model, special_tokens_map.json, tokenizer_config.json
This script is for prediction. Since I trained my classifier with Transformers (Hugging Face) in Python, I want to use the same tokenization method as in training. Thanks!
XlmRoBertaEmbeddings: https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/HuggingFace%20in%20Spark%20NLP%20-%20XLM-RoBERTa.ipynb
As you can see, you also have to copy your sentencepiece.bpe.model to the assets folder in order to import it. (This is the same for all the models, whether or not the tokenizer was changed.)
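Following the notebook, staging the tokenizer file is just a file copy into the SavedModel's assets/ folder. A minimal stdlib-only sketch; the directory names are assumptions, adjust them to your own export layout (the demo below uses a throwaway temp directory in place of the real folders):

```python
import shutil
import tempfile
from pathlib import Path

def stage_sentencepiece(tokenizer_dir: str, saved_model_dir: str) -> Path:
    """Copy sentencepiece.bpe.model into the SavedModel's assets/ folder,
    where the Spark NLP import notebook expects to find it.

    tokenizer_dir:   folder holding the Hugging Face tokenizer files
    saved_model_dir: the exported TF SavedModel folder (e.g. .../saved_model/1)
    """
    assets = Path(saved_model_dir) / "assets"
    assets.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(Path(tokenizer_dir) / "sentencepiece.bpe.model", assets))

# Demo on a throwaway layout standing in for the real export folders:
tmp = Path(tempfile.mkdtemp())
(tmp / "tok").mkdir()
(tmp / "tok" / "sentencepiece.bpe.model").write_bytes(b"\x00spm")
staged = stage_sentencepiece(str(tmp / "tok"), str(tmp / "saved_model" / "1"))
```

Once the file is in place, the notebook's import step (XlmRoBertaEmbeddings.loadSavedModel) picks the SentencePiece model up from assets/.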
Yes, I have seen that before, and the following code works for me. Still, I have one question.
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = XlmRoBertaEmbeddings.load("/path/to/huggingface/model") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)
embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols(["embeddings"]) \
.setOutputCols("finished_embeddings") \
.setOutputAsVector(True) \
.setCleanAnnotations(False)
pipeline = Pipeline() \
.setStages([
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
])
Here I instantiated the Tokenizer and then loaded xlm_roberta_loaded. If I leave the Tokenizer out of the pipeline, I don't get proper results. So I would like to know: if I use the default Tokenizer together with the loaded xlm-roberta-base embeddings, is that equivalent to what I did in Transformers, i.e. tokenizing with XLMRobertaTokenizer and feeding the tokenized data to XLMRoberta? Thanks. :)
No, it's not. This is an NLP library; that kind of standalone tokenization makes no sense here. You need a Tokenizer; if you don't have any rules, you can use simple whitespace tokenization via RegexTokenizer instead. (You do need one, since all the NLP/NLU tasks in any pipeline rely on the actual tokenization.)
Your XLM-RoBERTa tokenizer will be applied internally on top of those tokens after the Tokenizer. (It's common practice to map a custom tokenizer onto SentencePiece or BPE.)
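To make the whitespace fallback concrete: a RegexTokenizer configured with setPattern("\\s+") splits on runs of whitespace. The effect can be checked in plain Python (this is a sketch of the behaviour, not Spark NLP itself):

```python
import re

def whitespace_tokenize(text: str) -> list:
    # Mirrors what RegexTokenizer with setPattern("\\s+") does to a document:
    # every run of whitespace separates two tokens.
    return [t for t in re.split(r"\s+", text.strip()) if t]

tokens = whitespace_tokenize("Das ist  ein\tTest")
```

The XLM-RoBERTa SentencePiece model is then applied to these coarse tokens internally by the embeddings annotator.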
Link to the documentation pages (if available)
No response
How could the documentation be improved?
Hi,
can anybody help with loading the xlm-roberta tokenizer offline and applying it to a DataFrame?
I'd appreciate the help, since I didn't find anything in the documentation.
Cheers.