LzyFischer / BIC

BIC: Twitter Bot Detection with Text-Graph Interaction and Semantic Consistency
https://arxiv.org/abs/2208.08320

how to get some files? #1

Open MMMMMz opened 2 years ago

MMMMMz commented 2 years ago

Could you please provide the code for obtaining "cat_properties_tense.pt", "user_tweets_dict.npy", "cat_properties_tense.pt"?

Mr-lonely0 commented 1 year ago

> Could you please provide the code for obtaining "cat_properties_tense.pt", "user_tweets_dict.npy", "cat_properties_tense.pt"?

Have you got the code for obtaining "user_tweets_dict.npy"?

LzyFischer commented 1 year ago
import json
from pathlib import Path
from tqdm import tqdm

# Assumes path1 points at the Twibot-22 folder holding tweet_0.json ... tweet_8.json,
# and that uid_index / user_idx come from the user table (built as in the snippet below).
id_tweet = {i: [] for i in range(len(user_idx))}
for i in range(9):
    name = 'tweet_' + str(i) + '.json'
    user_tweets = json.load(open(path1 / name, 'r'))
    for each in tqdm(user_tweets):
        uid = 'u' + str(each['author_id'])
        text = each['text']
        try:
            index = uid_index[uid]
            id_tweet[index].append(text)
        except KeyError:
            # tweet author is not among the indexed users
            continue

This is where we originally save `id_tweet` as `user_tweets_dict.npy`. I don't know whether this helps; we may submit an updated version of the code later. Thanks for noticing BIC!
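
For reference, a minimal sketch of the saving step described above. The `np.save` call itself is not shown in the snippet, so this is just the standard numpy idiom for a dict, assumed here rather than taken from the repo. `np.save` pickles the dict inside a 0-d object array, so loading it back needs `allow_pickle=True` plus `.item()`:

import numpy as np

# save the {user_index: [tweet texts]} dict built above
np.save('user_tweets_dict.npy', id_tweet)

# restore: the .npy file wraps the dict in a 0-d object array
user_tweets_dict = np.load('user_tweets_dict.npy', allow_pickle=True).item()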

Mr-lonely0 commented 1 year ago

Thanks for sharing! But it seems this is the code for the Twibot-22 dataset, while the original dataset is Twibot-20. Could you please provide the code for processing the Twibot-20 dataset? :)

LzyFischer commented 1 year ago
import pandas as pd
import numpy as np
import torch
from tqdm import tqdm
from pathlib import Path
from datetime import datetime as dt
import json

print('loading raw data')
path1 = Path('20_22/data2/whr/TwiBot22-baselines/datasets/Twibot-20')
path2 = Path('lzy/bot-detection/src/data')

node = pd.read_json(path1 / 'node.json')
user = node[node.id.str.contains('^u') == True]   # user node ids start with 'u'
edge = pd.read_csv(path1 / 'edge.csv')
user_idx = user['id']

uid_index = {uid: index for index, uid in enumerate(user_idx.values)}

tweets = node[node.id.str.contains('^t') == True].reset_index(drop=True)  # tweet node ids start with 't'
tweet_idx = tweets['id']
tid_index = {tid: index for index, tid in enumerate(tweet_idx.values)}

# keep only 'post' edges and map both endpoints to integer indices
post = edge[edge.relation == 'post'].reset_index(drop=True)
post.loc[:, 'source_id'] = list(map(lambda x: uid_index[x], post.source_id))
post.loc[:, 'target_id'] = list(map(lambda x: tid_index[x], post.target_id))

# print('extracting labels and splits')
# split = pd.read_csv(path1 / "split.csv")
# label = pd.read_csv(path1 / "label.csv")

print("extracting each user's tweets")
id_tweet = {i: [] for i in range(len(user_idx))}

for i, uidx in tqdm(enumerate(post.source_id.values)):
    tidx = post.target_id[i]
    try:
        id_tweet[uidx].append(tweets.text[tidx])
    except KeyError:
        # should not happen after re-indexing; bail out if it does
        print('wrong')
        break
json.dump(id_tweet, open(path2 / 'id_tweet.json', 'w'))

Ok, would this code help?
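
One caveat worth noting about the snippet above (standard `json` behavior, not something specific to this repo): `json.dump` converts the integer keys of `id_tweet` to strings, so code that later indexes the dict by integer should convert the keys back on load. A minimal sketch, assuming the file path used above:

import json

# keys come back as '0', '1', ... after a round-trip through JSON
with open('id_tweet.json') as f:
    id_tweet = {int(k): v for k, v in json.load(f).items()}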

kekew85 commented 1 year ago

Could you please provide all the pre-training files? Thanks!

LzyFischer commented 1 year ago

We are organizing them now; in the meantime, you can generate them with our code and the original datasets. For the original datasets, I recommend this website: https://botometer.osome.iu.edu/bot-repository/datasets.html

LzyFischer commented 1 year ago

Hi, I wonder which files you do have? Maybe I used an older version of Twibot-20.


> I could not find `node.json` or `edge.csv` in the raw Twibot-20 dataset, so how can I get them?


QEureka commented 11 months ago

The data-processing code under the Cresci15 folder is actually the processing for the Twibot-22 dataset. Did the authors upload the wrong code? Could you provide the processing file for the Cresci15 dataset again?