
Semantic Search System for NTUST Big Data Analysis Course (EE5327701)

This repository provides a semantic search system tailored for the NTUST Big Data Analysis course (EE5327701). It allows users to generate embeddings for e-commerce product descriptions using either a distilled, quantized EcomBERT model or the CKIP BERT model. The system then utilizes FAISS for efficient similarity search, enabling semantic search over large datasets.

Special thanks to William Wu (clw8998) for creating the distilled and quantized models used in this project.

Installation

  1. Clone the repository:

    git clone https://github.com/agbld/semantic-search-for-EE5327701.git
    cd semantic-search-for-EE5327701
  2. Install PyTorch: Make sure PyTorch is installed; if not, install it by following the instructions on the official PyTorch website.

  3. Install Python Dependencies: Ensure you have Python 3.7 or higher installed. Install the required packages using:

    pip install -r requirements.txt

    Note: The key packages include transformers, sentence-transformers, faiss-cpu, optimum, huggingface_hub, numpy, pandas, and tqdm. (A quick import check appears after this list.)

  4. Download Datasets and Pre-computed Embeddings:

    Note: You do not need to run a separate script to download the datasets and embeddings. When you run example_search.py, it automatically imports and executes get_dataset.py, which downloads and sets up the necessary files.
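
Optional: to confirm that the environment from steps 2 and 3 is set up, you can run a quick import check such as the one below. This snippet is not part of the repository; it only verifies that the key packages import and prints their versions.

    # Optional sanity check (not part of this repository): verify the key packages import.
    import torch, faiss, transformers, sentence_transformers

    print("torch:", torch.__version__)
    print("faiss:", faiss.__version__)
    print("transformers:", transformers.__version__)
    print("sentence-transformers:", sentence_transformers.__version__)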

Usage

Performing Semantic Search

Use the example_search.py script to perform semantic search over the pre-computed embeddings.

Note: The first time you run the script with a particular model (e.g., semantic_model or ckipbert), it will automatically download the model from Hugging Face Hub. This may take some time depending on your internet connection speed. Subsequent runs will load the model from the local cache, which will be much faster.

  1. Run the example_search.py script:

    python example_search.py --model_type semantic_model --top_k 5
    • --model_type: Choose between semantic_model (EcomBERT) or ckipbert (CKIP BERT).
    • --top_k: Specify the number of top results to return.
  2. Interactive Search: After running the script, you can enter queries interactively:

    Enter query (type "exit" to quit): Your search query here

    The script will output the top matching product descriptions along with their similarity scores. (A minimal sketch of the underlying search flow follows this list.)
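
For reference, the search flow that example_search.py implements can be approximated as follows. This is a minimal sketch, not the repository's actual code: the embedding file path, model id, and product labels are placeholders, and it assumes the pre-computed embeddings are L2-normalized so that inner-product search corresponds to cosine similarity.

    # Minimal sketch of the search flow (placeholder names; example_search.py is authoritative).
    import numpy as np
    import faiss
    from sentence_transformers import SentenceTransformer

    # Load pre-computed product embeddings (placeholder path) and build an inner-product index.
    embeddings = np.load("embeddings/semantic_model/products.npy").astype("float32")
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)

    # Encode the query with the same model that produced the product embeddings (placeholder id).
    model = SentenceTransformer("placeholder/model-id")
    query = model.encode(["Wireless Bluetooth Headphones"], normalize_embeddings=True).astype("float32")

    # Retrieve the top-k most similar products.
    scores, ids = index.search(query, 5)
    for rank, (score, idx) in enumerate(zip(scores[0], ids[0]), start=1):
        print(f"[Rank {rank} | Score: {score:.4f}] product index {idx}")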

Examples:

EcomBERT Model:

$ python example_search.py 
Number of products: 1000000
Number of pre-computed embeddings: 1000000
FAISS index built with 1000000 vectors.
Enter query (type "exit" to quit): Wireless Bluetooth Headphones
Took 0.1734 seconds to search
[Rank 1 | Score: 0.7561] Wireless 2矽膠耳機殼+扣, 單品, 白色(案例)
[Rank 2 | Score: 0.7478] NS_AIR4 WIRELESS EARBUDS 藍芽耳機(黑), 1個
[Rank 3 | Score: 0.7199] Kinyo 藍牙耳機 60 x 27 x 53mm 充電盒36g 單支耳機4g, BTE-3905, 1個
[Rank 4 | Score: 0.7138] 登山扣修身耳機盒, 海軍, 谷歌像素芽 2
[Rank 5 | Score: 0.7121] DEKONI AUDIO Deco -Bluetooth耳機耳朵尖端TWS泡沫提示6p, 交易平台_M, 單色
Enter query (type "exit" to quit): 寵物玩具
Took 0.1423 seconds to search
[Rank 1 | Score: 0.9329] 寵物狗玩具 3入 S號, 混色, 1套
[Rank 2 | Score: 0.9323] 動物造型寵物玩具組 3入, 隨機發貨, 1套
[Rank 3 | Score: 0.9270] SUPER PET 寵物用玩具組, 隨機發貨(紫薯), 1套
[Rank 4 | Score: 0.9227] multipet 絨毛寵物玩具 L號, 1個, 隨機發貨
[Rank 5 | Score: 0.9208] 青年商城寵物玩具耐用4件套, 混色, 1組
Enter query (type "exit" to quit): 洗衣精
Took 0.1622 seconds to search
[Rank 1 | Score: 0.8855] 茶樹莊園 超濃縮洗衣精補充包 天然抗菌, 1.5kg, 4包
[Rank 2 | Score: 0.8850] 茶樹莊園 超濃縮洗衣精 純淨消臭, 1.8kg, 5瓶
[Rank 3 | Score: 0.8834] 茶樹莊園 茶樹洗衣精組合包, 茶樹洗衣精2000g+茶樹洗衣精補充包1500g, 1組
[Rank 4 | Score: 0.8829] 茶樹莊園 超濃縮洗衣精 純淨消臭, 1.8kg, 3瓶
[Rank 5 | Score: 0.8755] 茶樹莊園 超濃縮洗衣精補充包 天然抗菌, 1.5kg, 3包

CKIP BERT Model:

$ python example_search.py --model_type ckipbert
Number of products: 1000000
Number of pre-computed embeddings: 1000000
FAISS index built with 1000000 vectors.
Enter query (type "exit" to quit): Wireless Bluetooth Headphones
Took 0.1463 seconds to search
[Rank 1 | Score: 0.8343] Foot-On Jaguar Fine Pattern 消聲器
[Rank 2 | Score: 0.8301] VRS Dewallet Hybrid Origin MagSafe 卡儲存支架可拆卸手機殼
[Rank 3 | Score: 0.8263] EXPEAK Tracking Climbing 休閒智能手機袋 黃色
[Rank 4 | Score: 0.8239] LEADCOOL ARGB記憶體散熱器 4入, RH-1 EVO
[Rank 5 | Score: 0.8236] Rykel Allround Grip 2 磁性支架
Enter query (type "exit" to quit): 寵物玩具
Took 0.1451 seconds to search
[Rank 1 | Score: 0.8563] 寵物餵食器玩具, 白色的
[Rank 2 | Score: 0.8274] jw 寵物活動玩具四足小鳥玩具, 1個
[Rank 3 | Score: 0.8230] 寵物睡墊, 綠色
[Rank 4 | Score: 0.8188] 動物造型變形機器人玩具, 狼
[Rank 5 | Score: 0.8184] 動物造型變形機器人玩具, 豹
Enter query (type "exit" to quit): 洗衣精
Took 0.1454 seconds to search
[Rank 1 | Score: 0.7846] 洗髮精 直接擦鞋劑
[Rank 2 | Score: 0.7517] 洗衣烘乾架, 1個
[Rank 3 | Score: 0.7513] 洗衣劑, 2個, 1.7L
[Rank 4 | Score: 0.7503] 洗衣機防塵罩, 5
[Rank 5 | Score: 0.7491] 洗碗機用液體洗滌劑, 1L, 1個

Generating Embeddings (Optional)

Note: Pre-computed embeddings are already provided and downloaded when you run example_search.py. Generating embeddings is optional and only necessary if you wish to practice or experiment with the embedding generation process.

You can generate embeddings using either the EcomBERT model (semantic_model.py) or the CKIP BERT model (ckipbert.py).

Using the EcomBERT Model

  1. Run the semantic_model.py script:

    python semantic_model.py

    This script will:

    • Load the distilled EcomBERT model.
    • Process CSV files under ./random_samples_1M/.
    • Generate embeddings and save them as .npy files under ./embeddings/semantic_model/.

    Note: The first time you run this script, it will automatically download the EcomBERT model from Hugging Face Hub. This may take some time. (A rough sketch of the generate-and-save step follows below.)
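
For orientation, the generate-and-save step looks roughly like the following. This is a sketch only: it assumes a sentence-transformers style encoder, and the model id, CSV file name, and column name are placeholders. semantic_model.py is the authoritative implementation, including how the distilled, quantized model is actually loaded.

    # Rough sketch of generating and saving embeddings (placeholder names throughout).
    import os
    import numpy as np
    import pandas as pd
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("placeholder/ecombert-model-id")
    os.makedirs("embeddings/semantic_model", exist_ok=True)

    df = pd.read_csv("random_samples_1M/products_part_0.csv")   # placeholder file name
    texts = df["product_name"].astype(str).tolist()             # placeholder column name
    vectors = model.encode(texts, batch_size=256, show_progress_bar=True, normalize_embeddings=True)
    np.save("embeddings/semantic_model/products_part_0.npy", vectors)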

Using the CKIP BERT Model

  1. Run the ckipbert.py script:

    python ckipbert.py

    This script will:

    • Load the CKIP BERT model.
    • Process CSV files under ./random_samples_1M/.
    • Generate embeddings and save them as .npy files under ./embeddings/ckipbert/.

    Note: The first time you run this script, it will automatically download the CKIP BERT model from Hugging Face Hub. This may take some time. (A rough sketch using plain transformers follows below.)
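
Likewise, mean-pooled CKIP BERT embeddings can be sketched with plain transformers as below. The checkpoint name, pooling strategy, and output file name are assumptions made for illustration; ckipbert.py is the authoritative implementation.

    # Rough sketch of CKIP BERT embeddings via mean pooling (assumed checkpoint and pooling).
    import os
    import numpy as np
    import torch
    from transformers import BertTokenizerFast, AutoModel

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
    model = AutoModel.from_pretrained("ckiplab/bert-base-chinese")
    model.eval()

    texts = ["Wireless Bluetooth Headphones", "寵物玩具"]  # example queries from above
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = model(**batch).last_hidden_state                 # (batch, seq_len, hidden_dim)
        mask = batch["attention_mask"].unsqueeze(-1).float()
        vectors = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # mean over non-padding tokens

    os.makedirs("embeddings/ckipbert", exist_ok=True)
    np.save("embeddings/ckipbert/example.npy", vectors.numpy())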

Scripts Overview

example_search.py: Performs interactive semantic search over the pre-computed embeddings, building a FAISS index and returning the top-k matches for each query. Downloads the datasets and embeddings via get_dataset.py on first run.

ckipbert.py: Generates embeddings for the product descriptions with the CKIP BERT model and saves them under ./embeddings/ckipbert/.

semantic_model.py: Generates embeddings with the distilled, quantized EcomBERT model and saves them under ./embeddings/semantic_model/.

get_dataset.py: Downloads and sets up the datasets and pre-computed embeddings; imported and executed automatically by example_search.py.

This repository was created for academic purposes as part of the NTUST Big Data Analysis course (EE5327701). Feel free to modify and extend it for other projects!