Hi @captify-sivakhno
Thanks for the report, I will have a look. That said, when the GPU is under-utilized it is usually a throughput question: how many rows go in at once.
I suggest these resources while I look into BGEEmbeddings in particular:
If you take a closer look, unlike a UDF, Spark NLP integrates the DL part natively. If you tune the pipeline with the right batchSize, the correct number of partitions for the GPUs, and configure the cluster for the amount of data, Spark NLP will be 30%-40% faster and more efficient than the exact same solution wrapped in a UDF (pure ONNX inference, for instance).
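Not from the original comment, but to make those two levers concrete, a minimal sketch (the values are illustrative assumptions to tune per cluster, not recommendations; spark is the usual Databricks session and keyword_embeddings is the reporter's table):

```python
from sparknlp.annotator import BGEEmbeddings

# Lever 1: rows per inference call on the annotator itself.
embeddings = BGEEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings") \
    .setBatchSize(64)  # raise until GPU memory becomes the limit

# Lever 2: partition count. A few large partitions per GPU keep the
# device saturated; many tiny partitions starve it.
df = spark.table("keyword_embeddings").repartition(4)  # assumed single-GPU node
```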
@maziyarpanahi thanks for the prompt reply and comprehensive answers. Just to confirm: I have tested different batch sizes (via .setBatchSize()) up to the point of GPU RAM errors, but have not found a difference between variations. All are still 3x slower than pandas_udf (just to note, I am not using a plain udf but pandas_udf): specifically 3 min for pandas_udf vs. for Spark NLP. I have also confirmed that GPU usage reaches 90% in both cases.
The experimental set-up is the same as above.
Could it be an issue with the model implementation, or should I keep trying to optimise parameters?
Thanks in advance for your suggestions.
Hi @captify-sivakhno
Could you run the same test with BertEmbeddings (see the sketch after this comment)? I know it's for word embeddings as opposed to sentence embeddings, but BGE is a very new annotator and I just want to be sure it's not the annotator implementation, since you have the env ready to run the same test between Spark NLP and pandas_udf.
This issue is stale because it has been open 180 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
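For reference, a minimal sketch of the comparison being asked for, assuming the default pretrained model; note that BertEmbeddings works at the word level, so unlike BGEEmbeddings it also needs a Tokenizer stage:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertEmbeddings

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# BertEmbeddings consumes tokens, so a Tokenizer sits in front of it
# (BGEEmbeddings, by contrast, consumes documents directly).
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

bert = BertEmbeddings.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setBatchSize(32)

pipeline = Pipeline(stages=[document_assembler, tokenizer, bert])
```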
Is there an existing issue for this?
Who can help?
No response
What are you working on?
I am trying to optimize a workflow that creates sentence embeddings for a large dataset, to be used in a vector database. I am using two approaches (code below) to compute the embeddings: pandas_udf with SentenceTransformer, and sparknlp.annotator.embeddings.BGEEmbeddings, on the same g5.xlarge instance with the same model.
Current Behavior
I observe that the code runs three times slower with BGEEmbeddings than with pandas_udf.
Expected Behavior
I would have expected BGEEmbeddings to run twice as fast, since I believe the Spark NLP model has been exported to ONNX format.
I wonder what the main bottleneck is here, since the GPU trace shows the GPU is 85% loaded when running BGEEmbeddings inference.
Maybe it's saving the BGEEmbeddings output to Delta Lake?
Any suggestions on how to optimize further or investigate additionally would be most appreciated.
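One way to check the Delta Lake hypothesis (not from the thread; the noop sink is a standard Spark 3 benchmarking trick) is to time the job with the write taken out of the equation:

```python
# 'result' is the transformed DataFrame from the Spark NLP pipeline below.
# The noop sink materializes every row but writes nothing, so a large gap
# between this run and the Delta run points at the write, not inference.
result.write.format("noop").mode("overwrite").save()
```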
Steps To Reproduce
keyword_embeddings is in-house data that I upsample roughly 5x, to a size of 500K.
Compute: "cluster_instnace": "g5.xlarge",
pandas_udf
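The snippet itself did not survive the copy; below is a minimal sketch of a pandas_udf + SentenceTransformer variant as described. The model name, table name, and column names are assumptions, not the reporter's actual code:

```python
from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType
from sentence_transformers import SentenceTransformer

df = spark.table("keyword_embeddings")  # assumed source table

@pandas_udf(ArrayType(FloatType()))
def embed(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Iterator form: the model loads once per executor Python process
    # and is reused for every Arrow batch it receives.
    model = SentenceTransformer("BAAI/bge-base-en", device="cuda")
    for texts in batches:
        yield pd.Series(model.encode(texts.tolist()).tolist())

embedded = df.withColumn("embeddings", embed("text"))
```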
SparkNLP
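Likewise a sketch of the Spark NLP variant, with the same assumed names and an assumed Delta write at the end (BGEEmbeddings.pretrained() with no arguments loads the default BGE model):

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BGEEmbeddings

df = spark.table("keyword_embeddings")  # assumed source table

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

bge = BGEEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings") \
    .setBatchSize(32)  # the knob swept in the experiments above

pipeline = Pipeline(stages=[document_assembler, bge])
result = pipeline.fit(df).transform(df)

# Assumed final step: persist the embeddings to Delta Lake.
result.select("text", "embeddings") \
    .write.format("delta").mode("overwrite") \
    .saveAsTable("keyword_embeddings_bge")
```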
Spark NLP version and Apache Spark
com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3
spark-nlp==5.2.3
"spark_version": "14.3.x-gpu-ml-scala2.12" (https://docs.databricks.com/en/release-notes/runtime/14.3lts-ml.html)
Spark 3.5.0
Type of Spark Application
Python Application
Java Version
No response
Java Home Directory
No response
Setup and installation
No response
Operating System and Version
No response
Link to your project (if available)
No response
Additional Information
The Databricks runtime "14.3.x-gpu-ml-scala2.12" (https://docs.databricks.com/en/release-notes/runtime/14.3lts-ml.html) was used.