Open MarionMeyers opened 5 years ago
Hi @MarionMeyers, it generally takes time because Twitter imposes constraints on how often certain APIs can be called for data collection. Since the dataset has a large number of tweets, it generally takes quite some time to download the entire dataset. Following are some of the API limits from the Twitter docs:
- Tweet collection API - 900 tweets / 15 mins (https://developer.twitter.com/en/docs/tweets/post-and-engage/api-reference/get-statuses-show-id)
- Retweet collection API - 75 retweets / 15 mins (https://developer.twitter.com/en/docs/tweets/post-and-engage/api-reference/get-statuses-retweets-id.html)
- User profile collection - 900 users / 15 mins
- User timeline posts - 900 users / 15 mins
Some of the ways to download faster are as follows:
1) Try to get more Twitter API keys and configure them in the keys file so that more resources are available.
2) If more keys are added, increase `num_process` so that more parallel processes are used to collect data.

The user-suspended issue is common, as some users can be marked as suspicious by Twitter and suspended. When we try to collect such user profiles, this error occurs; this is normal.
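For illustration, a minimal sketch of such a configuration, assuming the field names used later in this thread (`num_twitter_keys`, `num_process`); the repo's actual config and keys file layout may differ:

```python
import json

# Write a small config: one key set per collection process is a
# reasonable starting point, since the keys are the bottleneck.
config = {
    "num_twitter_keys": 3,  # number of API key sets in the keys file
    "num_process": 3,       # parallel collection processes
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```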
Hi, I've been running the updated version of the script for 5 days now and the download is still not complete. How long will it take, approximately?
@rrichajalota so how long is it taking, approximately?
@SaschaStenger just a quick question. @mdepak says that if we add more keys, we can increase `num_process`. I thought this was just dependent on how many parallel processes our CPU could run. So how would this work out approximately? Suppose I have 3 sets of API keys (`"num_twitter_keys": 3`). Should I set `"num_process": 12`?
@rlleshi The number of processes is more or less limited by your number of keys. You can only make a certain number of calls every 15 minutes, and the keyserver will manage them. So every time one of your processes wants to download something, it asks the keyserver whether there are any keys left whose limit has not yet been reached. Usually the bottleneck is not the number of processes you can run, but the fact that your keys will reach their temporal download limit.
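A toy sketch of that idea, not the repo's actual keyserver code (class and method names here are made up):

```python
import time

# Illustrative key pool: each key may make `limit` calls per 15-minute
# window; a worker asks for any key with quota left before each request.
class KeyPool:
    WINDOW = 15 * 60  # seconds

    def __init__(self, keys, limit):
        self.limit = limit
        self.state = {key: (0, time.time()) for key in keys}  # key -> (calls, window_start)

    def acquire(self):
        """Return a key with remaining quota, or None if all are exhausted."""
        now = time.time()
        for key, (calls, start) in self.state.items():
            if now - start >= self.WINDOW:   # window elapsed: reset the counter
                calls, start = 0, now
            if calls < self.limit:
                self.state[key] = (calls + 1, start)
                return key
        return None  # every key hit its limit; the caller has to wait

pool = KeyPool(["key_a", "key_b"], limit=75)  # e.g. 75 retweet calls / 15 min
print(pool.acquire())
```

With limits like these, adding processes beyond the combined key quota just means more workers waiting for a free key.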
If you need the data in the JSON format that this code provides, I can't tell you how long the download takes. Because the code checks every tweet in the database, it also spends API calls on tweets that have already been deleted from Twitter (due to terms-of-service violations, or because the user hid or deleted them). I modified the code for myself, as I need the data in another format, and it took me around 3 days with 8 keys and 8 threads to download everything.
Wow, 3 days. Is this your repository for the modified code?
Thanks a lot for the help!
No, it's not the current code. I have to modify mine a bit, as it's a bit messy and still has some of my keys in it. At the moment it's not something I would share and feel comfortable about. But I can take out my access keys and upload it, so that you at least have a working version.
Be advised, though, that I download to CSV format and do not download all the information that the tweet objects provide. It's possible to change the downloaded content, though.
Yeah sure. When do you plan to upload the updated version?
I just plastered together a version that should work. Like I said, it's not pretty, and maybe I made a mistake with the refactoring somewhere. But I uploaded it and wrote a short description of the changes. Link to repo
Thanks a lot for sharing this! One question though: there you mention that the code checks which files have already been downloaded so that it can be stopped and resumed at will. Is this not the case with the original repository as well? I thought it was.
I would have to check again, but I’m pretty sure there is no check on what is already downloaded. Or I worked with a version that was a little older and didn’t have it.
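For what it's worth, such a resume check can be very small. A sketch, assuming each tweet is dumped to `<tweet_id>.json` in an output directory:

```python
import os

# Skip tweet IDs whose output file already exists, so a run can be
# stopped and restarted without re-downloading anything.
def pending_ids(tweet_ids, out_dir):
    return [tid for tid in tweet_ids
            if not os.path.exists(os.path.join(out_dir, f"{tid}.json"))]

todo = pending_ids(["111", "222", "333"], "tweets")
```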
@SaschaStenger are you sure this is the working version? I have tried downloading twice now, and the collection abruptly stops after it collects the news articles. Thanks
Does it give you any error messages? Try downloading tweets only. I haven't changed much outside the tweet or retweet code.
The problem was with the `news_dir` path in `retweet_collection.py`: it was saving the content inside the politifact/ folder and not politifact/fake/ as specified in the code (I am crawling only fake news, but I am not sure whether that caused it). I have fixed it now, though. Thanks again for sharing your modified version!
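For anyone hitting the same thing, the fix amounts to including the label subfolder when building the dump path. A hypothetical sketch (the function and argument names are made up, not the repo's actual code):

```python
import os

# Build the dump directory so content lands in e.g. politifact/fake/
# rather than politifact/.
def make_news_dir(dump_location, news_source, label):
    news_dir = os.path.join(dump_location, news_source, label)
    os.makedirs(news_dir, exist_ok=True)
    return news_dir

print(make_news_dir("data", "politifact", "fake"))  # -> data/politifact/fake on POSIX
```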
@SaschaStenger but this updated code will not work for the rest of the features (user profile, user timeline tweets, etc) right?
Yes, I have not remodeled the rest of the code. To remodel the user profile and user timeline collection, have a look at twarc. But I don't know if and by how much the twarc library would improve efficiency.
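For anyone going that route, a rough sketch with the twarc 1.x client (credentials are placeholders; the library handles the rate limiting and waiting internally):

```python
from twarc import Twarc  # pip install twarc (1.x, Twitter API v1.1)

# Placeholder credentials; substitute your own keys.
t = Twarc("consumer_key", "consumer_secret", "access_token", "access_token_secret")

# User timeline posts for one account:
for tweet in t.timeline(screen_name="some_user"):
    print(tweet["id_str"])
```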
Hi @SaschaStenger and @rlleshi ,
Thanks for the updated code, it makes life so much easier. However, the code for retweets is not working as it should. The "complete" folder contains far fewer files than the calculated number of tweets with retweets, and each file's content is identical to the corresponding file in the "tweets" folder (meaning none of the retweet information is being added).
In short:
Am I missing something? Any help in the right direction will be highly appreciated and I can change the code accordingly.
I just plastered together a version that should work. Like I said, it's not pretty, and maybe I made a mistake with the refactoring somewhere. But I uploaded it and wrote a short description of the changes. Link to repo
Hello, is it possible for you to share your downloaded data?
Hey
I’m sorry, but the Twitter guidelines prohibit sharing datasets. That’s the reason the original dataset consists only of tweet IDs, as that’s the only legal way of sharing this data.
This one is brutal:

- user_followers: 15 requests / 15 minutes
- user_following: 15 requests / 15 minutes (https://developer.twitter.com/en/docs/twitter-api/v1/accounts-and-users/follow-search-get-users/api-reference/get-followers-ids)

Currently there are a total of 262,935 users. That's 60/hr => 1,440/day per API key. With one key, that's 182 days.
Twitter only allows 3 API keys to be created per day, for a total of 10 API keys per developer account. So with 10 keys it will take 18.2 days just for followers. Rinse and repeat for following. Enjoy.
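A quick back-of-the-envelope check of those numbers:

```python
# Sanity check of the throughput estimate above.
users = 262_935
per_key_per_hour = 15 * 4                 # 15 requests per 15-min window => 60/hr
per_key_per_day = per_key_per_hour * 24   # => 1,440/day per key

print(users / per_key_per_day)            # ~182.6 days with 1 key
print(users / (per_key_per_day * 10))     # ~18.3 days with 10 keys
```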
Hi @SaschaStenger, there is no code/resource folder in your repo, so where should I add the tweet_keys_file.txt? From the original FakeNewsNet, I am able to download the PolitiFact and GossipCop news, but there is no social media data. Kindly help! Thanks in advance.
o my
Oh my god, this is quite a sad thing!!! So where can we download it quickly...? I mean, I really need the full dataset to finish my homework....... QAQ
Have you finished your homework? I mean, have you already downloaded the full dataset? Because I am experiencing the same trouble as you.
How can I get the full dataset? Can anybody help?
Hello!
I am trying to download the dataset. I have managed to download the news content (it already took a long time), and I am now in the process of retrieving tweets. However, I have been running it for 10 hours now and it only says 35% downloaded; is that normal?
Also, this error keeps recurring:
Can somebody help me out?
Thank you!
Marion