binaryinspace commented 5 months ago

I am a student from China, and I really appreciate your project. I am working on topic modeling with product images and their text reviews, and I have run into a problem. Since the clip-ViT-B-32 encoder does not support Chinese, I use another CLIP model trained on Chinese data to generate image_features and text_features. I then concatenate them into combined_features, which I pass to BERTopic as the embeddings, with each image's corresponding review as the docs. The good news is that the model runs, but the topic representations are wrong: they contain only meaningless English words and numbers. Since I am not an expert in multimodal computing, I don't know which part of the pipeline has gone wrong.

MaartenGr commented 5 months ago

Most likely, you are not using the right processor in the CountVectorizer. Could you share your full code? Also, please check out the FAQ.
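One plausible reason for topics made of English words and numbers can be shown with the standard library alone. The sketch below assumes the vectorizer is left at scikit-learn's default `token_pattern`, `(?u)\b\w\w+\b`, and applies that pattern to a made-up review; the helper name `default_tokens` and the sample text are illustrative only.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer:
# runs of two or more word characters between word boundaries.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def default_tokens(text):
    """Mimic CountVectorizer's default lowercasing and tokenization."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())


# Hypothetical review mixing Chinese, English, and a number.
review = "物流很快，质量不错 good 100"
tokens = default_tokens(review)
print(tokens)
```

Chinese has no spaces between words, so each clause between punctuation marks comes out as a single token, while "good" and "100" are split off cleanly. Whole clauses rarely repeat across reviews, so the English words and numbers can end up dominating the topic representation.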

binaryinspace commented 5 months ago

import os
import pandas as pd
import torch
from PIL import Image
import cn_clip.clip as clip
from cn_clip.clip import load_from_name

# Load the CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-H-14", device=device, download_root='./')

# Paths to the image folder and to the Excel file with the captions
image_folder = "C:/soft/pycharm/file11111111/爬虫/合并后的图片"
excel_file = "C:/soft/pycharm/file11111111/爬虫/合并后的文档.xlsx"

# Read the captions from the Excel file
captions_df = pd.read_excel(excel_file, names=['index', 'text','usefulVoteCount'])

# Collect the feature vectors of all images and all texts
all_image_features = []
all_text_features = []

# Iterate over every image in the image folder
for filename in os.listdir(image_folder):
    if filename.endswith(".jpg"):  # Assume all images are in JPG format
        # Load and preprocess the image
        image_path = os.path.join(image_folder, filename)
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)

        # Encode the image with the CLIP model
        with torch.no_grad():
            image_features = model.encode_image(image)
            # Normalize the features
            image_features /= image_features.norm(dim=-1, keepdim=True)

        # Store the image feature vector
        all_image_features.append(image_features)

        # Encode the matching caption (matched by file name)
        index = int(filename.split("_")[1].split(".")[0]) - 1  # Extract the index from the file name
        text = clip.tokenize([captions_df.loc[index, 'text']]).to(device)

        # Encode the text with the CLIP model
        with torch.no_grad():
            text_features = model.encode_text(text)
            # Normalize the features
            text_features /= text_features.norm(dim=-1, keepdim=True)

        # Store the text feature vector
        all_text_features.append(text_features)

import numpy as np

# Concatenate the image and text features into one embedding per sample
combined_image_features = torch.cat(all_image_features, dim=0)
combined_text_features = torch.cat(all_text_features, dim=0)
combined_features = torch.cat((combined_image_features, combined_text_features), dim=1)

# Convert combined_features to a NumPy array
combined_features = combined_features.cpu().numpy()

# Convert the captions to a Python list
docs = captions_df['text'].tolist()

# Check the shape of combined_features
print(f"combined_features shape: {combined_features.shape}")
print(f"Number of documents: {len(docs)}")

# combined_features already has the right shape (num_samples, embedding_dim),
# so assign it to embeddings directly
embeddings = combined_features

import jieba
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic.vectorizers import ClassTfidfTransformer

# Step 1 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=10, min_dist=0.0, metric='cosine', random_state=42)

# Step 2 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', prediction_data=True)

# Step 3 - Create topic representation
from sklearn.feature_extraction.text import CountVectorizer
stoplists = list(pd.read_csv('停用词.txt', names=['w'], sep='\t', encoding='utf-8').w)
vectorizer_model = CountVectorizer(stop_words=stoplists, ngram_range=(1,1))
ctfidf_model = ClassTfidfTransformer()
topic_model = BERTopic(
    umap_model=umap_model,              # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,        # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,  # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,          # Step 5 - Extract topic words
    nr_topics='none',
    top_n_words=10,
)

# Train model
topics, probs = topic_model.fit_transform(docs, embeddings)

binaryinspace commented 5 months ago

> Most likely, you are not using the right processor in the CountVectorizer. Could you share your full code? Also, please check out the FAQ.

Here is my full code. Thanks for your help.

MaartenGr commented 5 months ago

Thanks! Definitely check out the FAQ, it should solve your problem since your input is Chinese text.

binaryinspace commented 5 months ago

> Thanks! Definitely check out the FAQ, it should solve your problem since your input is Chinese text.

Thank you very much for your response. I had in fact already tried the tokenize_zh(text) tokenizer from the FAQ, but it still failed, so I think the cause may be more complex. When I used sentence embeddings generated by a sentence transformer, the model worked very well; the problems only appeared when I used CLIP to generate the embeddings. My current workaround is to take the clustering results from the model and feed the documents in each topic to an LLM to extract topic words. Also, more and more Chinese scholars are using your model for research and applications, because it is really great!
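The workaround described here, reusing the cluster assignments and asking an LLM for topic words, might look roughly like the sketch below. It assumes `docs` and `topics` are aligned lists such as those returned by `fit_transform`, and that `-1` marks outliers. The helper names, the prompt wording, and the toy data are hypothetical, and the actual LLM call is left out.

```python
from collections import defaultdict


def group_docs_by_topic(docs, topics, skip_outliers=True):
    """Collect the documents assigned to each topic id."""
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        if skip_outliers and topic == -1:
            continue  # -1 is the outlier label
        grouped[topic].append(doc)
    return dict(grouped)


def build_prompt(topic_docs, max_docs=20):
    """Build a keyword-extraction prompt from a sample of one topic's documents."""
    sample = "\n".join(f"- {doc}" for doc in topic_docs[:max_docs])
    return (
        "The following reviews belong to one topic:\n"
        f"{sample}\n"
        "List up to 10 keywords that describe this topic."
    )


# Toy data standing in for the real reviews and cluster labels.
docs = ["物流很快", "包装破损", "发货迅速", "无关内容"]
topics = [0, 1, 0, -1]

grouped = group_docs_by_topic(docs, topics)
prompts = {topic: build_prompt(topic_docs) for topic, topic_docs in grouped.items()}
```

Each entry in `prompts` can then be sent to whichever LLM is available. Capping the sample with `max_docs` keeps the prompt short for large topics.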

MaartenGr commented 4 months ago

Sorry for the late reply!

> Thank you very much for your response. In fact, I had tried the def tokenize_zh(text) method before, but it still failed. I think there might be more complex reasons. Because when I used the sentence embeddings generated by the sentence transformer for analysis, the model was extremely successful, but when I used CLIP to generate embeddings, problems arose.

I fixed some things in BERTopic v0.16.1 that might relate to the problem you had. You should indeed still use tokenize_zh, but the problems with CLIP might now be resolved.

> Also, more and more Chinese scholars are using your model for research and applications, because it is really great!

Thank you for sharing this! Wonderful to hear that more Chinese scholars are using BERTopic. If you ever have any feedback, feel free to reach out!

MaartenGr / BERTopic

multimodal problem #1918
