Closed: soroush-ziaeinejad closed this issue 2 years ago
Hi @jalalshabo and @soroush-ziaeinejad , please put this in the lower priority among the other tasks.
I am working on adapting some code to crawl tweets. The only issue is that the Twitter API returns tweets very slowly, trickled in, so I am working on increasing the throughput so we can get a larger subset at once. Access to the historical tweets API would make queries with more specific arguments easier, but it is currently open only to projects on the Academic Research product track, so applying for and getting that access would be beneficial.
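Since the search endpoints return tweets a page at a time, one way to collect a larger subset per run is a paginated crawl loop with request spacing. A minimal sketch, using a generic `fetch_page` callable in place of the real API client (the fake fetcher and its data are purely illustrative):

```python
import time

def crawl_paginated(fetch_page, max_pages=10, delay=1.0):
    """Collect tweets across pages until the API stops returning a next token.

    fetch_page(next_token) must return (tweets, next_token), with next_token
    None on the last page. `delay` spaces out requests to respect rate limits.
    """
    tweets, token, pages = [], None, 0
    while pages < max_pages:
        page, token = fetch_page(token)
        tweets.extend(page)
        pages += 1
        if token is None:
            break
        time.sleep(delay)  # fixed spacing; swap in exponential backoff if throttled
    return tweets

# Hypothetical fetcher standing in for a real API call
def fake_fetch(token):
    pages = {None: (["t1", "t2"], "p2"), "p2": (["t3"], None)}
    return pages[token]

print(crawl_paginated(fake_fetch, delay=0))  # ['t1', 't2', 't3']
```

In practice `fetch_page` would wrap the real client call and pass the token through as the pagination cursor.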
@jalalshabo @soroush-ziaeinejad I think for our research purposes, we don't need new tweets. We have another dataset with a huge number of tweets (~200GB), and it's already annotated with tagme. I suggest starting to work on that.
@hosseinfani Great! Is it in SQL or CSV format? Yes, we can take a look and start working on it. I'm a little worried about the loading process, though!
@soroush-ziaeinejad @jalalshabo The RAR is 200GB, in SQL format.
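Regarding the loading worry: a 200GB dump shouldn't be read into memory at once; it can be streamed and parsed statement by statement. A minimal sketch, with a deliberately naive parser (single-row `INSERT`s, no embedded parentheses; a real dump may need a proper SQL parser) and hypothetical table/column names:

```python
def iter_insert_rows(lines):
    """Yield raw value tuples from INSERT statements in a SQL dump, one line
    at a time, so the full file never has to fit in memory."""
    for line in lines:
        line = line.strip()
        if line.upper().startswith("INSERT INTO"):
            # take everything between the VALUES(...) parentheses
            start = line.index("(", line.upper().index("VALUES"))
            yield line[start + 1 : line.rindex(")")]

# Tiny stand-in for the real dump file (open(...) yields lines the same way)
dump = [
    "CREATE TABLE tweets (id INT, text VARCHAR(280));",
    "INSERT INTO tweets VALUES (1, 'hello');",
    "INSERT INTO tweets VALUES (2, 'world');",
]
print(list(iter_insert_rows(dump)))  # ["1, 'hello'", "2, 'world'"]
```

The same generator works unchanged on a file handle, so rows can be filtered or written out incrementally instead of loading the whole dump first.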
A large-scale collection consisting of approximately 300M English tweets posted by 34,725,054 unique users between January 1 and June 30, 2012. Figure 7 shows the number of different types of tweets per day, and Figure 8 depicts the number of tweets per user in this dataset. As shown in Figure 8, this dataset also reveals that the distribution of tweets per user follows a power law, i.e., a minority of users contribute most of the tweets while the rest largely free-ride.
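The power-law skew described above is easy to sanity-check once we have user IDs: count tweets per user and measure what share of all tweets the top fraction of users contributes. A small sketch on synthetic data (the user IDs are made up for illustration):

```python
from collections import Counter

def top_share(user_ids, frac=0.01):
    """Fraction of all tweets contributed by the top `frac` of users,
    a quick proxy for the power-law skew in tweets-per-user."""
    counts = sorted(Counter(user_ids).values(), reverse=True)
    k = max(1, int(len(counts) * frac))
    return sum(counts[:k]) / sum(counts)

# Synthetic skew: one heavy user plus ten one-tweet users
users = ["u0"] * 90 + [f"u{i}" for i in range(1, 11)]
print(top_share(users, frac=0.1))  # 0.9 — one user posted 90 of 100 tweets
```

On the real dataset, a heavily skewed `top_share` (most tweets from a small minority) would match what Figure 8 shows.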
@jalalshabo For this task, please write code to extract tweets. Try to write it in a dynamic way so we can adjust arguments easily. For now, I think the most important arguments are the date range (e.g., 2020-1-1 to 2020-5-1) and the number of users.
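One way the extraction could be parameterized as requested, with the date range and user cap as arguments. The field names (`user`, `created_at`) are placeholders and would need to match the actual schema of the dump:

```python
from datetime import date

def extract_tweets(tweets, start, end, max_users=None):
    """Keep tweets inside [start, end] and, if max_users is set, only tweets
    from the first max_users distinct users encountered."""
    selected, users = [], set()
    for t in tweets:
        if not (start <= t["created_at"] <= end):
            continue
        if max_users is not None and t["user"] not in users:
            if len(users) >= max_users:
                continue  # user cap reached; skip tweets from new users
            users.add(t["user"])
        selected.append(t)
    return selected

# Illustrative records, not real data
tweets = [
    {"user": "a", "created_at": date(2020, 1, 15), "text": "hi"},
    {"user": "b", "created_at": date(2020, 6, 1), "text": "late"},
    {"user": "c", "created_at": date(2020, 2, 2), "text": "ok"},
]
print(extract_tweets(tweets, date(2020, 1, 1), date(2020, 5, 1), max_users=1))
```

Adding further filters (language, tweet type, etc.) would just mean extra keyword arguments, keeping the adjustable-arguments design asked for here.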