JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

SparkNLP Embeddings inference 3X slower than with pandas_udf #14184

Open captify-sivakhno opened 4 months ago

captify-sivakhno commented 4 months ago

Is there an existing issue for this?

Who can help?

No response

What are you working on?

I am trying to optimize a workflow that creates sentence embeddings for a large dataset, to be used in a vector database. I compute the embeddings in two ways (code below): with a pandas_udf wrapping SentenceTransformer, and with sparknlp.annotator.embeddings.BGEEmbeddings, on the same g5.xlarge instance and with the same model.

Current Behavior

I observe that the code runs three times slower with BGEEmbeddings than with the pandas_udf.

Expected Behavior

I would have expected BGEEmbeddings to run about twice as fast, since I believe the Spark NLP model has been exported to ONNX format.

I wonder what the main bottleneck is, since the GPU trace shows the GPU is about 85% loaded while running BGEEmbeddings inference.

Could the bottleneck be saving the BGEEmbeddings output to Delta Lake? (One way to test this is sketched after the Spark NLP code below.)

Any suggestions on how to optimize further, or how to investigate this, would be much appreciated.

Steps To Reproduce

keyword_embeddings is in-house data that I upsample to roughly 5M rows from an original size of 500K.

Compute: "cluster_instance": "g5.xlarge"

pandas_udf

import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType
from sentence_transformers import SentenceTransformer

# Load the model once on the driver and broadcast it to the executors.
model = SentenceTransformer("BAAI/bge-small-en")
broadcast_model = spark.sparkContext.broadcast(model)

@pandas_udf(returnType=ArrayType(FloatType()))
def embed_text(x: pd.Series) -> pd.Series:
    # Encode a batch of strings, returning one embedding list per row.
    return pd.Series(broadcast_model.value.encode(x).tolist())

keyword_embeddings.sample(withReplacement=True, fraction=10.0).select("keywords")\
    .filter((F.length(F.col("keywords")) > 9) & (F.length(F.col("keywords")) < 80))\
    .withColumn("keyphrase_embedded", embed_text(F.col("keywords")))\
    .write.format("delta").mode("overwrite").saveAsTable("qa.tv_segmentation_bronze.semantic_embeddings_test_bge_3")

SparkNLP

from sparknlp.annotator.embeddings import BGEEmbeddings
from sparknlp.base import DocumentAssembler
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

documentAssembler = DocumentAssembler() \
    .setInputCol("keywords") \
    .setOutputCol("document")

# Pretrained BGE model from the Spark NLP models hub.
embeddings = BGEEmbeddings.pretrained("bge_small", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("keyphrase_embedded")

pipeline = Pipeline().setStages([documentAssembler, embeddings])

tmp = keyword_embeddings.sample(withReplacement=True, fraction=10.0).select("keywords") \
    .filter((F.length(F.col("keywords")) > 9) & (F.length(F.col("keywords")) < 80))

pipeline.fit(tmp).transform(tmp) \
    .select("keywords", "keyphrase_embedded") \
    .write.format("delta").mode("overwrite") \
    .saveAsTable("qa.bronze.semantic_embeddings_test_bge_3_sparknlp_v3")
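
One way to probe the Delta Lake question raised above (a sketch of mine, not from the original report): Spark 3.x ships a built-in noop data source that fully materializes every row but writes nothing, so inference time can be separated from write time. The timing harness below is illustrative.

import time

result = pipeline.fit(tmp).transform(tmp).select("keywords", "keyphrase_embedded")

# Inference only: the noop sink computes every row but discards the output.
start = time.time()
result.write.format("noop").mode("overwrite").save()
print(f"inference only: {time.time() - start:.1f}s")

# Inference plus the Delta write, for comparison (recomputed from scratch).
start = time.time()
result.write.format("delta").mode("overwrite") \
    .saveAsTable("qa.bronze.semantic_embeddings_test_bge_3_sparknlp_v3")
print(f"inference + Delta write: {time.time() - start:.1f}s")

If the two timings are close, the Delta write is not the bottleneck.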

Spark NLP version and Apache Spark

com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3
spark-nlp==5.2.3
"spark_version": "14.3.x-gpu-ml-scala2.12" (https://docs.databricks.com/en/release-notes/runtime/14.3lts-ml.html)
Apache Spark 3.5.0

Type of Spark Application

Python Application

Java Version

No response

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

Databricks runtime "14.3.x-gpu-ml-scala2.12" (https://docs.databricks.com/en/release-notes/runtime/14.3lts-ml.html) was used.

maziyarpanahi commented 4 months ago

Hi @captify-sivakhno

Thanks for the report, I will have a look. That said, when the GPU is not fully utilized it is usually a matter of throughput: how many rows are fed in at once.

I suggest these resources while I have a look into BGEEmbeddings in particular:

If you look closely, unlike a UDF, Spark NLP integrates the DL inference natively. If you tune the pipeline with the right batchSize and the correct number of partitions for your GPUs, and configure the cluster appropriately for the amount of data, Spark NLP will be 30%-40% faster and more efficient than the exact same solution wrapped in a UDF (pure ONNX inference, for instance).
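
A minimal sketch of the tuning being described (the batch size and partition count below are illustrative placeholders, not recommendations from this thread):

# Illustrative values only: tune batchSize for your GPU memory and model.
embeddings = BGEEmbeddings.pretrained("bge_small", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("keyphrase_embedded") \
    .setBatchSize(32)  # rows handed to the inference session per call

# On a single-GPU node, fewer and larger partitions keep the device fed
# without the scheduling overhead of many small tasks.
tmp = tmp.repartition(4)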

captify-sivakhno commented 3 months ago

@maziyarpanahi thanks for the prompt reply and comprehensive answer. Just to confirm: I tested different batch sizes (via .setBatchSize()) up to the point of GPU RAM errors, but found no difference between the variations. All are still 3X slower than the pandas_udf (note that I am not using a plain Python UDF but a pandas_udf): specifically, 3 min for the pandas_udf versus roughly three times that for Spark NLP. I have also confirmed that GPU usage reaches 90% in both cases. The experimental setup is the same as above. Could it be an issue with the model implementation, or should I keep optimising parameters? Thanks in advance for your suggestions.
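
For reference, a batch-size sweep along these lines could look as follows (a sketch; the timing harness and the noop sink are my additions, not from the comment):

import time

for batch_size in [8, 16, 32, 64, 128]:
    embeddings.setBatchSize(batch_size)
    pipeline = Pipeline().setStages([documentAssembler, embeddings])
    start = time.time()
    # The noop sink forces full computation without any write overhead.
    pipeline.fit(tmp).transform(tmp).write.format("noop").mode("overwrite").save()
    print(f"batchSize={batch_size}: {time.time() - start:.1f}s")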

maziyarpanahi commented 3 months ago

Hi @captify-sivakhno