fxsjy / jieba

Jieba Chinese word segmentation

MIT License

33.31k stars 6.72k forks

Danmaku (bullet comment) analysis #1003

Open ubh0927 opened 11 months ago

ubh0927 commented 11 months ago

纪录片弹幕.csv: analyze the Chinese text in column F and remove the duplicated text.

ubh0927 commented 11 months ago

import pandas as pd

# Load the CSV file
danmu_df = pd.read_csv('path_to_your_file.csv')

# Remove duplicates in column 'F' and keep the first occurrence
danmu_df_unique = danmu_df.drop_duplicates(subset=['F'])

# Save the DataFrame with duplicates removed to a new CSV file
danmu_df_unique.to_csv('path_to_your_new_file.csv', index=False)
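One caveat about the snippet above: `subset=['F']` only works if the CSV header literally contains a column named `F`. If "column F" means the sixth column as displayed in a spreadsheet, selecting it by position may be safer. The following is a minimal sketch under that assumption; the sample data and its column names are made up for illustration and are not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV; column names are invented.
sample_csv = io.StringIO(
    "a,b,c,d,e,text\n"
    "1,1,1,1,1,太好看了\n"
    "2,2,2,2,2,太好看了 \n"
    "3,3,3,3,3,泪目\n"
)
danmu_df = pd.read_csv(sample_csv)

# Spreadsheet column "F" is the sixth column, i.e. positional index 5.
column_f = danmu_df.columns[5]

# Strip surrounding whitespace so "太好看了" and "太好看了 " count as the same text.
danmu_df[column_f] = danmu_df[column_f].str.strip()

# Keep the first occurrence of each distinct comment.
danmu_df_unique = danmu_df.drop_duplicates(subset=[column_f], keep="first")

print(danmu_df_unique[column_f].tolist())
```

Whether whitespace should be stripped before comparing is a judgment call; drop that line if exact matches are what you want.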

ubh0927 commented 11 months ago

import pandas as pd

# Load the CSV file and remove duplicate entries in column 'F'
# Attempt to load the CSV file, trying different encodings if necessary
try:
    # Try the default encoding (UTF-8) first
    danmu_df = pd.read_csv('/mnt/data/纪录片弹幕.csv')
except UnicodeDecodeError:
    # If the default encoding fails, try 'gbk', which is commonly used for Chinese text
    danmu_df = pd.read_csv('/mnt/data/纪录片弹幕.csv', encoding='gbk')

# Remove duplicates in column 'F' and keep the first occurrence
danmu_df_unique = danmu_df.drop_duplicates(subset=['F'])

# Save the DataFrame with duplicates removed to a new CSV file
output_path = '/mnt/data/纪录片弹幕_no_duplicates.csv'
danmu_df_unique.to_csv(output_path, index=False)

print(output_path)
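The encoding fallback above can be generalized a little. The sketch below is one possible variant, not the only way to do it: the helper name `read_csv_with_fallback` and the temporary file names are hypothetical. It tries `utf-8-sig` (UTF-8 with or without a byte order mark) and then `gb18030` (a superset of GBK), and writes the result as `utf-8-sig`, which spreadsheet programs such as Excel usually need in order to display Chinese text correctly.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8-sig", "gb18030")):
    """Try each encoding in order and return the first DataFrame that decodes."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding)
        except UnicodeDecodeError as error:
            last_error = error
    raise last_error


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small GBK-encoded file so the fallback path is actually exercised.
    source_path = os.path.join(tmp_dir, "danmu_gbk.csv")
    with open(source_path, "w", encoding="gbk", newline="") as f:
        f.write("F\n纪录片真好看\n纪录片真好看\n泪目\n")

    danmu_df = read_csv_with_fallback(source_path)
    danmu_df_unique = danmu_df.drop_duplicates(subset=["F"])

    # Write with a BOM so the output opens cleanly in spreadsheet software.
    output_path = os.path.join(tmp_dir, "danmu_no_duplicates.csv")
    danmu_df_unique.to_csv(output_path, index=False, encoding="utf-8-sig")

    result = pd.read_csv(output_path, encoding="utf-8-sig")["F"].tolist()

print(result)
```

If the real file uses some other encoding, add it to the `encodings` tuple; the helper re-raises the last `UnicodeDecodeError` when none of them work.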