JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

Requesting a feature to find the similarity between text columns #1023

Closed kaniska closed 2 years ago

kaniska commented 4 years ago

Is your feature request related to a problem? Please describe.

Problem: I need to calculate the similarity between texts stored in two columns of the same or different DataFrames. For example, there are two DataFrames, DataFrame1 and DataFrame2, each of which contains a product_type_list column:

DataFrame1[product_type_list1]
DataFrame2[product_type_list2]

I want to calculate the similarity between these two columns, product_type_list1 and product_type_list2.

I explored all the API docs and samples, but couldn't find any working example of calculating text similarity with the Spark NLP library in Scala!

Describe the solution you'd like

Nice to have: a built-in way to compute a similarity score between two text columns, so that I can create the following output DataFrame: DataFrame3[product_type_list1, product_type_list2, similarityScore]

Please let me know if a solution already exists.

Describe alternatives you've considered

Here is an example of text similarity using spaCy:

```python
# pip install spacy
# python -m spacy download en_core_web_lg
import spacy
import en_core_web_lg

nlp = en_core_web_lg.load()

doc1 = nlp("Wall Decals Lamp Shades Armchairs Bed Sheets Night Lights Necklaces Decorative Pillow Covers Table Lamps Decorative Boxes Lamps Slumber Bags Figurines Tableware Plates Decorative Pillows Fancy-Dress Costumes Curtains Canvas Art Prints")

doc2 = nlp("Curtains & Valances Wall Decals & Stickers Beds Area Rugs Bedding Sets Activity Tables Lamps Doll Playsets Interlocking Block Building Sets Night Lights Armchairs & Accent Chairs Organizing Racks Table Lamps Desks Bed Sheets Bookcases")

print("output:", doc1.similarity(doc2))  # --> 0.8
```
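For reference, spaCy's `doc.similarity` is the cosine similarity between the two documents' averaged word vectors. A minimal, library-free sketch of that computation (the 3-dimensional vectors below are made-up toy embeddings for illustration, not real spaCy vectors):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def doc_vector(token_vectors):
    # spaCy-style document vector: element-wise mean of the token vectors
    dims = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dims)]

# Toy "embeddings" for two short documents
doc1 = doc_vector([[1.0, 0.0, 1.0], [0.5, 0.5, 1.0]])
doc2 = doc_vector([[1.0, 0.2, 0.9], [0.4, 0.6, 1.1]])

print(round(cosine_similarity(doc1, doc2), 3))
```

The score is close to 1.0 because the two toy documents' mean vectors point in nearly the same direction, which is exactly how a 0.8 comes out of the spaCy call above.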

maziyarpanahi commented 4 years ago

Thanks for the feature request. Up until now, users would combine our SentenceEmbeddings or UniversalSentenceEncoder to create features out of the text column, and then use EmbeddingsFinisher to prepare the data for Spark ML's locality-sensitive hashing: https://spark.apache.org/docs/2.4.6/ml-features.html#locality-sensitive-hashing
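As a rough intuition for the locality-sensitive hashing linked above: similar vectors are bucketed together so that only vectors sharing a bucket need to be compared. Here is a library-free sketch of the random-hyperplane variant for cosine similarity (the hyperplanes and vectors are toy values, not what Spark ML actually generates internally):

```python
import random

def lsh_signature(vec, hyperplanes):
    # Each random hyperplane contributes one bit: which side the vector falls on.
    # Vectors with a small angle between them tend to agree on most bits.
    bits = []
    for plane in hyperplanes:
        dot = sum(x * y for x, y in zip(vec, plane))
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)

random.seed(42)
dims, n_planes = 4, 8
hyperplanes = [[random.gauss(0, 1) for _ in range(dims)] for _ in range(n_planes)]

v1 = [1.0, 0.9, 0.1, 0.0]
v2 = [0.9, 1.0, 0.0, 0.1]    # close in direction to v1
v3 = [-1.0, 0.0, 0.9, -0.8]  # points elsewhere

sig1, sig2, sig3 = (lsh_signature(v, hyperplanes) for v in (v1, v2, v3))
print(sum(a == b for a, b in zip(sig1, sig2)),
      sum(a == b for a, b in zip(sig1, sig3)))
```

Signatures are scale-invariant (only the direction of the vector matters), which is why this scheme approximates cosine similarity rather than Euclidean distance.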

There is an example here of how to go from Spark NLP to Spark ML for regression, classification, or similarity calculation: https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/EmbeddingsFinisherTestSpec.scala

However, it is a good idea to have a built-in document similarity since we offer so many embeddings, especially in the upcoming 2.6.0 with 24 new BERT sentence embeddings. I'll see how we can plan this to make it easy within the Spark NLP ecosystem.

kaniska commented 4 years ago

Hi @maziyarpanahi ,

I got a chance to successfully implement a similarity test pipeline by following your suggestion, both on Google Cloud (PySpark) and Databricks (Scala Spark):

https://gist.github.com/kaniska/3ae7368c7566456df1e78ae72b2ed751

Please review the code, let me know if any fixes are needed, and whether it can be simplified for both small and large corpora.

QQ: if I have 2 columns (text1, text2, as demonstrated in the example), do I always need to fit the pipeline on text1, or can I fit it against any large corpus to build the model?

If it looks OK, please suggest which spark-nlp project I should create the merge request in, so I can add this example.

Thanks very much, Kaniska

maziyarpanahi commented 4 years ago

Hi @kaniska

This is great! To answer your question, the actual calculation of embeddings (both token and sentence) happens during .transform(), not .fit(). If anything is required during .fit(), it will use the DataFrame passed to it; otherwise, it will ignore it.

The way you do it is correct. In your pipeline, the fit() is effectively a no-op since none of the annotators uses fit() for training, so execution goes directly to your two transforms.

So regardless of the size of either DataFrame, it will always transform both of them and then run approxSimilarityJoin between the two.
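To make that last step concrete: Spark ML's approxSimilarityJoin takes two datasets of vectors plus a distance threshold and returns the cross pairs whose distance falls within the threshold. A plain-Python sketch of that contract, exact rather than approximate, with made-up keys and vectors (the real Spark version prunes candidate pairs via LSH buckets instead of scanning every pair):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity_join(rows_a, rows_b, threshold):
    # Emit (key_a, key_b, distance) for every cross pair within the threshold.
    # approxSimilarityJoin has the same output shape, but only compares rows
    # that land in the same LSH bucket.
    out = []
    for key_a, vec_a in rows_a:
        for key_b, vec_b in rows_b:
            d = euclidean(vec_a, vec_b)
            if d <= threshold:
                out.append((key_a, key_b, d))
    return out

# Two tiny "DataFrames" of (text, embedding) rows with toy 2-d embeddings
df1 = [("lamps", [0.9, 0.1]), ("curtains", [0.1, 0.9])]
df2 = [("table lamps", [0.85, 0.15]), ("desks", [0.5, 0.5])]

print(similarity_join(df1, df2, threshold=0.2))
```

Only the "lamps" / "table lamps" pair survives the 0.2 threshold here, which is the shape of result the pipeline above produces for product_type_list1 vs product_type_list2.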

This will be a good place to have your example: https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/scala/annotation

Thank you again for this very nice demonstration. We do have lots of new native BertSentenceEmbeddings in 2.6.0; you can try those as well.

kaniska commented 4 years ago

Hi @maziyarpanahi ,

Here go the PRs:

https://github.com/JohnSnowLabs/spark-nlp-workshop/pull/107 (Python)
https://github.com/JohnSnowLabs/spark-nlp-workshop/pull/106 (Scala)

Thanks Kaniska

maziyarpanahi commented 4 years ago

This is fantastic! Many thanks @kaniska

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 120 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.