Open binaryinspace opened 5 months ago
Most likely, you are not using the right processor in the CountVectorizer. Could you share your full code? Also, please check out the FAQ.
import os
import pandas as pd
import torch
from PIL import Image
import cn_clip.clip as clip
from cn_clip.clip import load_from_name

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-H-14", device=device, download_root='./')

image_folder = "C:/soft/pycharm/file11111111/爬虫/合并后的图片"
excel_file = "C:/soft/pycharm/file11111111/爬虫/合并后的文档.xlsx"

captions_df = pd.read_excel(excel_file, names=['index', 'text','usefulVoteCount'])

all_image_features = []
all_text_features = []
for filename in os.listdir(image_folder):
    if filename.endswith(".jpg"):  # assume all images are in JPG format
        image_path = os.path.join(image_folder, filename)
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        # Encode the image with the CLIP model
        with torch.no_grad():
            image_features = model.encode_image(image)
        # Normalize the features
        image_features /= image_features.norm(dim=-1, keepdim=True)
        # Store the image feature vector
        all_image_features.append(image_features)
        # Encode the corresponding caption (matched by filename)
        index = int(filename.split("_")[1].split(".")[0]) - 1  # extract the index from the filename
        text = clip.tokenize([captions_df.loc[index, 'text']]).to(device)
        # Encode the text with the CLIP model
        with torch.no_grad():
            text_features = model.encode_text(text)
        # Normalize the features
        text_features /= text_features.norm(dim=-1, keepdim=True)
        # Store the text feature vector
        all_text_features.append(text_features)
import numpy as np

combined_image_features = torch.cat(all_image_features, dim=0)
combined_text_features = torch.cat(all_text_features, dim=0)
combined_features = torch.cat((combined_image_features, combined_text_features), dim=1)
combined_features = combined_features.cpu().numpy()

docs = captions_df['text'].tolist()

print(f"combined_features shape: {combined_features.shape}")
print(f"Number of documents: {len(docs)}")

embeddings = combined_features

import jieba
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic.vectorizers import ClassTfidfTransformer
umap_model = UMAP(n_neighbors=15, n_components=10, min_dist=0.0, metric='cosine',random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', prediction_data=True)
from sklearn.feature_extraction.text import CountVectorizer

stoplists = list(pd.read_csv('停用词.txt', names=['w'], sep='\t', encoding='utf-8').w)
vectorizer_model = CountVectorizer(stop_words=stoplists, ngram_range=(1,1))
ctfidf_model = ClassTfidfTransformer()
topic_model = BERTopic(
    umap_model=umap_model,  # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,  # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,  # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,  # Step 5 - Extract topic words
    nr_topics='none',
    top_n_words=10,
)
topics, probs = topic_model.fit_transform(docs, embeddings)
Here is my full code. Thanks for your help.
Thanks! Definitely check out the FAQ. It should solve your problem, since your input is Chinese text.
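For context on the symptom, here is a minimal sketch (not taken from the thread) of what the default token pattern of scikit-learn's CountVectorizer does to unsegmented Chinese text. The pattern string is the documented default; the sample review and the function name `default_tokens` are made up for illustration. Because Chinese is written without spaces between words, each clause comes out as one long token, while space-separated English words and numbers come out as clean, reusable tokens. That may be one reason they dominate the topic representation.

```python
import re

# Documented default `token_pattern` of scikit-learn's CountVectorizer:
# runs of two or more word characters between word boundaries.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def default_tokens(text):
    """Mimic CountVectorizer's default behavior: lowercase, then regex-split."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())


# Hypothetical review text; only the punctuation and spaces act as separators.
review = "屏幕很清晰， iPhone 13 性价比高"
tokens = default_tokens(review)
print(tokens)  # ['屏幕很清晰', 'iphone', '13', '性价比高']
```

Whole clauses such as '屏幕很清晰' rarely repeat across documents, so they carry little weight when topic words are extracted, which is why a word-level Chinese tokenizer is usually needed.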
Thank you very much for your response. In fact, I had tried the def tokenize_zh(text) method before, but it still failed. I think there might be more complex reasons. Because when I used the sentence embeddings generated by the sentence transformer for analysis, the model was extremely successful, but when I used CLIP to generate embeddings, problems arose. My current approach is to use the clustering results generated by the model, and feed the documents under different topics to the LLM for topic word extraction. Also, more and more Chinese scholars are using your model for research and applications, because it is really great!
Sorry for the late reply!
Thank you very much for your response. In fact, I had tried the def tokenize_zh(text) method before, but it still failed. I think there might be more complex reasons. Because when I used the sentence embeddings generated by the sentence transformer for analysis, the model was extremely successful, but when I used CLIP to generate embeddings, problems arose.
I fixed some things in BERTopic v0.16.1 that might relate to the problem you had. You should indeed still use tokenize_zh, but the problems with CLIP should/might be resolved.
Also, more and more Chinese scholars are using your model for research and applications, because it is really great!
Thank you for sharing this! Wonderful to hear that more Chinese scholars are using BERTopic. If you ever have any feedback, feel free to reach out!
I am a student from China, and I really appreciate your project. I am now trying to do some interesting work, but I have encountered some problems. My idea is to perform topic modeling using product images and text reviews. Since the clip-ViT-B-32 encoder does not support Chinese, I am using another CLIP model trained on Chinese data to generate image_features and text_features. Then, I perform a concatenation operation to generate combined_features as the embeddings for BERTopic, and pass each image's corresponding review as the docs to the model. The good news is that the model works, but there is a problem with the topic representation: it only produces some meaningless English words and numbers. Since I am not an expert in the field of multimodal computing, I don't know which part of the model has gone wrong.
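The setup described above depends on row i of the embeddings and entry i of docs referring to the same review. The sketch below is an illustration only, not a confirmed diagnosis. It assumes filenames of the form `<prefix>_<n>.jpg` with a 1-based number, as the posted code implies, and uses made-up filenames and captions. It builds the docs list in the same order in which the images are encoded, rather than in spreadsheet order. The helper names `caption_index` and `aligned_docs` are hypothetical.

```python
def caption_index(filename):
    """Return the zero-based caption row encoded in a name like 'img_12.jpg'."""
    return int(filename.split("_")[1].split(".")[0]) - 1


def aligned_docs(filenames, captions):
    """Return captions in the same order in which the .jpg files are visited."""
    jpgs = [name for name in filenames if name.endswith(".jpg")]
    return [captions[caption_index(name)] for name in jpgs]


# Made-up directory listing: the order is arbitrary and includes a non-image file.
filenames = ["img_10.jpg", "img_2.jpg", "notes.txt", "img_1.jpg"]
captions = [f"review {i}" for i in range(1, 11)]

docs = aligned_docs(filenames, captions)
print(docs)  # ['review 10', 'review 2', 'review 1']
```

With docs built this way, each document lines up with the embedding row computed for the same file, regardless of the order in which the operating system lists the directory.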