MMMMMz opened this issue 2 years ago (status: Open)
Could you please provide the code for obtaining "cat_properties_tense.pt" and "user_tweets_dict.npy"?
Have you got the code for obtaining "user_tweets_dict.npy"?
# uid_index, user_idx, path1, and path2 are defined in the preprocessing
# script (the user-id-to-index mapping and the data paths).
import json
from tqdm import tqdm

id_tweet = {i: [] for i in range(len(user_idx))}
for i in range(9):
    name = 'tweet_' + str(i) + '.json'
    user_tweets = json.load(open(path1 / name, 'r'))
    for each in tqdm(user_tweets):
        uid = 'u' + str(each['author_id'])
        text = each['text']
        try:
            index = uid_index[uid]
            id_tweet[index].append(text)
        except KeyError:
            # tweet author is not in the user set
            continue
We originally saved id_tweet as user_tweets_dict.npy; I don't know whether this helps. Maybe we'll submit an updated version of the code later. Thanks for noticing BIC!
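For completeness, the save/load step would look something like this (a minimal sketch, assuming numpy; np.save pickles the dict, so loading it back needs allow_pickle=True):

import numpy as np

# Save the {user_index: [tweet texts]} dict.
np.save(path2 / 'user_tweets_dict.npy', id_tweet)

# np.save wraps the dict in a 0-d object array, so recover it with .item().
user_tweets_dict = np.load(path2 / 'user_tweets_dict.npy', allow_pickle=True).item()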
Thanks for sharing! But it seems this is the code for the Twibot-22 dataset, while the original dataset is Twibot-20. Could you please provide the code for processing the Twibot-20 dataset? :)
import pandas as pd
import json
from tqdm import tqdm
from pathlib import Path

print('loading raw data')
path1 = Path('20_22/data2/whr/TwiBot22-baselines/datasets/Twibot-20')
path2 = Path('lzy/bot-detection/src/data')

node = pd.read_json(path1 / 'node.json')
user = node[node.id.str.contains('^u', na=False)]  # user nodes ('u...' ids)
edge = pd.read_csv(path1 / 'edge.csv')

user_idx = user['id']
uid_index = {uid: index for index, uid in enumerate(user_idx.values)}

tweets = node[node.id.str.contains('^t', na=False)].reset_index(drop=True)  # tweet nodes ('t...' ids)
tweet_idx = tweets['id']
tid_index = {tid: index for index, tid in enumerate(tweet_idx.values)}

# Re-index the 'post' edges from raw ids to integer indices.
post = edge[edge.relation == 'post'].reset_index(drop=True)
post.loc[:, 'source_id'] = list(map(lambda x: uid_index[x], post.source_id))
post.loc[:, 'target_id'] = list(map(lambda x: tid_index[x], post.target_id))

# print('extracting labels and splits')
# split = pd.read_csv(path1 / "split.csv")
# label = pd.read_csv(path1 / "label.csv")

print("extracting each user's tweets")
id_tweet = {i: [] for i in range(len(user_idx))}
for i, uidx in tqdm(enumerate(post.source_id.values)):
    tidx = post.target_id[i]
    try:
        id_tweet[uidx].append(tweets.text[tidx])
    except KeyError:
        print('wrong')
        break

json.dump(id_tweet, open(path2 / 'id_tweet.json', 'w'))
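One caveat if you reload id_tweet.json later: json.dump turns the integer user-index keys into strings, so something like this restores them (a minimal sketch):

import json

with open(path2 / 'id_tweet.json') as f:
    id_tweet = {int(k): v for k, v in json.load(f).items()}  # back to int keys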
Ok, would this code help?
Could you please provide all the pre-training files? Thanks!
We are organizing them now; in the meantime, you can generate them with our code and the original datasets. For the original datasets, I recommend this website: https://botometer.osome.iu.edu/bot-repository/datasets.html
Hi, I wonder which files you have? Maybe I used an older version of Twibot-20.
I could not find 'node.json' or 'edge.csv' in the raw Twibot-20 dataset, so how do I get them?
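My current guess is to build them from the raw release myself. A rough sketch, assuming the raw Twibot-20 files (train.json/dev.json/test.json/support.json) where each record has ID, profile, tweet, and neighbor fields; the relation names and tweet-id scheme in the official processed version may differ:

import json
import csv

# Collect all records from the raw Twibot-20 splits.
records = []
for split in ['train', 'dev', 'test', 'support']:
    records += json.load(open(split + '.json'))

nodes, edges = [], []
tweet_counter = 0
for rec in records:
    uid = 'u' + str(rec['ID'])
    nodes.append({**(rec.get('profile') or {}), 'id': uid})
    # One 'post' edge per tweet; tweets get synthetic 't...' ids.
    for text in rec.get('tweet') or []:
        tid = 't' + str(tweet_counter)
        tweet_counter += 1
        nodes.append({'id': tid, 'text': text})
        edges.append((uid, 'post', tid))
    # Follow edges from the neighbor lists.
    for nid in (rec.get('neighbor') or {}).get('following') or []:
        edges.append((uid, 'follow', 'u' + str(nid)))

json.dump(nodes, open('node.json', 'w'))
with open('edge.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['source_id', 'relation', 'target_id'])
    writer.writerows(edges)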
The data-processing code under the Cresci15 folder actually processes the Twibot-22 dataset. Did the authors upload the wrong code? Could you provide the processing file for the Cresci15 dataset again?