fani-lab / SEERa

A framework to predict the future user communities in a text streaming social network based on the users’ topics of interest.

Crawl newer tweets #21

Closed soroush-ziaeinejad closed 2 years ago

soroush-ziaeinejad commented 2 years ago

@jalalshabo For this task, please write code to extract tweets. Try to make it configurable so we can adjust the arguments easily. For now, I think the most important arguments are the date range (e.g., 2020-1-1 to 2020-5-1) and the number of users.
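As a rough illustration of the two arguments mentioned above, here is a minimal sketch, not the project's actual crawler. It assumes tweets have already been fetched as dictionaries with hypothetical `user_id` and `created_at` (ISO 8601) fields; the flag names `--start`, `--end`, and `--max-users` are also made up for this example.

```python
import argparse
from datetime import datetime


def parse_day(text):
    # Accepts both "2020-1-1" and "2020-01-01".
    return datetime.strptime(text, "%Y-%m-%d").date()


def parse_args(argv=None):
    # Hypothetical flags: the issue only asks for a date range and a user count.
    parser = argparse.ArgumentParser(description="Select tweets by date range and user count.")
    parser.add_argument("--start", type=parse_day, required=True)
    parser.add_argument("--end", type=parse_day, required=True)
    parser.add_argument("--max-users", type=int, default=1000)
    return parser.parse_args(argv)


def select_tweets(tweets, start, end, max_users):
    """Keep tweets dated within [start, end] from the first `max_users` distinct users seen."""
    users = set()
    selected = []
    for tweet in tweets:
        created = datetime.fromisoformat(tweet["created_at"]).date()
        if not (start <= created <= end):
            continue
        user_id = tweet["user_id"]
        if user_id not in users:
            if len(users) >= max_users:
                continue  # user quota reached; skip tweets from new users
            users.add(user_id)
        selected.append(tweet)
    return selected
```

Keeping the filtering separate from the fetching means the same arguments could be applied whether the tweets come from an API or from an existing dump.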

hosseinfani commented 2 years ago

Hi @jalalshabo and @soroush-ziaeinejad, please give this a lower priority than the other tasks.

jalalshabo commented 2 years ago

I am working on adapting some code to crawl tweets. The only issue is that the Twitter API returns tweets very slowly, a trickle at a time, so I am working on increasing the rate so we can get a larger subset at once. Access to the historical tweets API would make queries with more specific arguments easier, but it is currently open only to projects on the Academic Research product track, so applying for and getting access to that would be beneficial.

hosseinfani commented 2 years ago

@jalalshabo @soroush-ziaeinejad I think for our research purposes, we don't need new tweets. We have another dataset with a huge number of tweets (~200GB). It's already annotated with TagMe too. I suggest starting to work on that.

soroush-ziaeinejad commented 2 years ago

@hosseinfani Great! Is it in SQL or CSV format? Yes, we can take a look and start working on it. I am a little bit worried about the loading process!
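One way to ease the loading concern, if the data ends up in a SQL database, is to stream rows in fixed-size batches instead of loading everything into memory. The sketch below uses the standard-library `sqlite3` module purely as a stand-in; the actual database engine, the `tweets` table, and its columns are assumptions for illustration.

```python
import sqlite3


def iter_batches(conn, query, batch_size=10_000):
    """Yield lists of at most `batch_size` rows so memory use stays bounded."""
    cursor = conn.cursor()
    cursor.execute(query)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


def count_rows(conn, query, batch_size=10_000):
    # Example consumer: processes one batch at a time and keeps only a running total.
    return sum(len(batch) for batch in iter_batches(conn, query, batch_size))


if __name__ == "__main__":
    # Tiny in-memory table standing in for the real dataset (hypothetical schema).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tweets (id INTEGER, user_id TEXT, text TEXT)")
    conn.executemany(
        "INSERT INTO tweets VALUES (?, ?, ?)",
        [(i, f"user{i % 3}", "hello") for i in range(25)],
    )
    print(count_rows(conn, "SELECT id, user_id, text FROM tweets", batch_size=10))
```

The same pattern applies to other DB-API drivers, since `fetchmany` is part of the Python DB-API 2.0 cursor interface.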

hosseinfani commented 2 years ago

@soroush-ziaeinejad @jalalshabo The RAR archive is 200GB, and the data is in SQL format.

A large-scale collection consisting of approximately 300M tweets in English posted by 34,725,054 unique users between January 1 and June 31, 2012. Figure 7 shows the number of different types of tweets per day and Figure 8 depicts the number of tweets per user in this dataset. As shown in Figure 8, this dataset also reveals that the distribution of tweets per user follows a power law, i.e., a minority of users contribute the most while the others just free-ride.
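A quick way to check the skew described above on any sample is to count tweets per user and measure what share of all tweets comes from the most active users. This is a minimal sketch with hypothetical function names; it only summarizes concentration and does not fit a power law.

```python
import math
from collections import Counter


def tweets_per_user(user_ids):
    """Count how many tweets each user posted, given one user id per tweet."""
    return Counter(user_ids)


def top_user_share(counts, fraction=0.1):
    """Share of all tweets posted by the most active `fraction` of users."""
    if not counts:
        return 0.0
    ordered = sorted(counts.values(), reverse=True)
    k = max(1, math.ceil(fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)
```

A share close to 1 for a small `fraction` would be consistent with a minority of users contributing most of the tweets.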
