LuoUndergradXJTU / TwiBot-22

Official repository of TwiBot-22 @ NeurIPS 2022, Datasets and Benchmarks Track.
MIT License

Data collection process #3

Closed by brunocvs7 2 years ago

brunocvs7 commented 2 years ago

Hi, TwiBot-22 team. First of all, congratulations on the work you have done!

I have some questions about the data collection process. I have copied section A.2 of the paper below so we can go through it together.

"For the first stage of user network collection, we adopt @NeurIPSConf as the starting user. We use the Twitter API to retrieve 1,000 followers and 1,000 followees as the user’s neighborhood for BFS expansion. We randomly adopt one of the two sampling strategies (distribution diversity or value diversity) and randomly select one metadata from Table 6 to include 6 users from its neighborhood into the TwiBot-22 dataset. We then randomly select one unexpanded user in TwiBot-22 for a new round of neighborhood expansion. "

1) My understanding is that, out of the 2k neighbors (followers + followees) of @NeurIPSConf, 6 were chosen, and their tweets, retweets, hashtags, and all the other data TwiBot-22 provides were retrieved. The next round then started from 1 of the 1,994 remaining users in the neighborhood of @NeurIPSConf: again, 6 more neighbors were selected and their information was retrieved. Is that correct? Sorry, but I could not figure out how you reached 1MM users with this process. Could you please provide more details, or even a picture to illustrate the process?

whr000001 commented 2 years ago

Thank you for your interest in our work.
For each crawling user (e.g., @NeurIPSConf), we collect 2k neighbors (followers and followees) and crawl all of their related information (profile, timeline, related lists). To extend the graph, we sample 6 followers and 6 followees and crawl their neighbors; that is, we treat these 12 users as new crawling users. We call the steps above an extension. Each extension yields 2k users and 12 new crawling users. The figure may help you.

figure

As the figure shows, T0 has 1 user, T1 has 2k users, and T2 has around 24k users. In this way, we can get 1MM users with around 1000 extensions (around 5 tiers).
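
The tier sizes quoted above can be checked with a short calculation. This is only a rough sketch, not the crawler used for TwiBot-22. It assumes every extension returns exactly 2,000 previously unseen users and exactly 12 new crawling users, so the totals are upper bounds. The function and parameter names (`tier_sizes`, `min_extensions`, `neighbors_per_user`, `seeds_per_extension`) are made up for illustration.

```python
import math


def tier_sizes(neighbors_per_user=2000, seeds_per_extension=12, tiers=5):
    """Upper-bound user counts per tier, assuming no overlap between neighborhoods."""
    # T0 is the single starting user; no extension has been run yet.
    rows = [{"tier": 0, "extensions": 0, "new_users": 1,
             "total_extensions": 0, "total_users": 1}]
    crawling_users = 1
    total_extensions = 0
    total_users = 1
    for tier in range(1, tiers):
        extensions = crawling_users                  # one extension per crawling user
        new_users = extensions * neighbors_per_user  # 2k neighbors per extension
        total_extensions += extensions
        total_users += new_users
        rows.append({"tier": tier, "extensions": extensions, "new_users": new_users,
                     "total_extensions": total_extensions, "total_users": total_users})
        crawling_users = extensions * seeds_per_extension  # 12 new crawling users each
    return rows


def min_extensions(target_users, neighbors_per_user=2000):
    """Fewest extensions that could reach target_users if no user were ever repeated."""
    return math.ceil(target_users / neighbors_per_user)


if __name__ == "__main__":
    for row in tier_sizes():
        print(row)
    print("minimum extensions for 1,000,000 users:", min_extensions(1_000_000))
```

Under these assumptions T1 adds 2,000 users and T2 adds 24,000, matching the numbers above, and the running total passes one million during T4, the fifth tier when counting from T0. The no-overlap minimum is 500 extensions. The figure of around 1000 extensions given above is higher, which would be expected if neighborhoods overlap or some accounts cannot be retrieved, though that reason is a guess and is not stated in this thread.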

brunocvs7 commented 2 years ago

Thank you for your explanation. Now everything is clear to me.