KaiDMML / FakeNewsNet

This is a dataset for fake news detection research.

Download Time #12

Open MarionMeyers opened 5 years ago

MarionMeyers commented 5 years ago

Hello!

I am trying to download the dataset. I have managed to download the news_content (which already took a long time), and I am now in the process of retrieving tweets. However, I have been running it for 10 hours now and it only says 35% downloaded. Is that normal?

Also, this error keeps recurring:

Can somebody help me out?

Thank you!

Marion

mdepak commented 5 years ago

Hi @MarionMeyers, it generally takes time because Twitter imposes limits on how often certain APIs can be called for data collection. Since the dataset has a large number of tweets, it generally takes quite some time to download the entire dataset. The following are some of the API limits from the Twitter docs:

  - Tweet collection API - 900 tweets / 15 mins (https://developer.twitter.com/en/docs/tweets/post-and-engage/api-reference/get-statuses-show-id)
  - Retweet collection API - 75 retweets / 15 mins (https://developer.twitter.com/en/docs/tweets/post-and-engage/api-reference/get-statuses-retweets-id.html)
  - User profile collection - 900 users / 15 mins
  - User timeline posts - 900 users / 15 mins

Some methods to download faster are as follows:

  1. Try to get more Twitter API keys and configure them in the keys file so that more resources are available.
  2. If more keys are added, increase num_process so that more parallel processes are used to collect data.

The user-suspended issue is common: some users are marked as suspicious by Twitter and can be suspended. When we try to collect such user profiles, this error occurs, and it is normal.
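For reference, a minimal sketch of how those two settings might be wired up. The field names num_twitter_keys and num_process come from this thread; the exact config schema and the keys-file path are assumptions and may differ between versions of the repo.

```python
# Sketch only: "num_twitter_keys" and "num_process" are the knobs named in
# this thread; the config schema and keys-file path are assumptions and may
# differ between versions of the repo.
import json

config = {
    "num_twitter_keys": 4,   # one entry per API key set in the keys file
    "num_process": 4,        # number of parallel download processes
    "tweet_keys_file": "code/resources/tweet_keys_file.txt",  # assumed path
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```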

rrichajalota commented 5 years ago

Hi, I've been running the updated version of the script for 5 days now and the download is still not complete. Approximately how long will it take?

rlleshi commented 4 years ago

@rrichajalota so how long did it take in the end, approximately?

rlleshi commented 4 years ago

@SaschaStenger just a quick question. @mdepak says that if we add more keys, we can increase num_process. I thought this was dependent only on how many parallel processes our CPU could run. So how would this work out in practice? Suppose I have 3 sets of API keys ("num_twitter_keys": 3). Would I set "num_process": 12?

SaschaStenger commented 4 years ago

@rlleshi So the number of processes is more or less limited by your number of keys. You can only make a certain number of calls every 15 minutes, and the key server manages them: every time one of your processes wants to download something, it asks the key server whether any key is left whose limit has not yet been reached. Usually the bottleneck is not the number of processes you can run, but the fact that your keys reach their temporal download limit.
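To illustrate the scheme just described, here is a minimal single-process sketch of such a key pool. The names are hypothetical; the repo's actual key server coordinates across processes and differs in detail.

```python
# Minimal sketch of the key-server idea described above: a pool of API keys,
# each allowed a fixed number of calls per 15-minute window. Names and
# structure are illustrative, not the repo's actual implementation.
import time

WINDOW = 15 * 60  # Twitter rate-limit window in seconds

class KeyPool:
    def __init__(self, keys, calls_per_window):
        self.limit = calls_per_window
        # per-key list of timestamps of recent calls
        self.history = {key: [] for key in keys}

    def acquire(self):
        """Block until some key has budget left, then return it."""
        while True:
            now = time.time()
            for key, calls in self.history.items():
                # drop calls that have aged out of the 15-minute window
                calls[:] = [t for t in calls if now - t < WINDOW]
                if len(calls) < self.limit:
                    calls.append(now)
                    return key
            time.sleep(1)  # every key is exhausted; wait for the window to roll

# usage: pool = KeyPool(["key_a", "key_b"], calls_per_window=900)
#        key = pool.acquire()  # then issue one API call with this key
```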

If you need the data in the JSON format this code provides it in, I can't tell you how long the download takes. Because the code checks every tweet in the database, it also spends API calls on tweets that have already been deleted from Twitter (due to terms-of-service violations, or because the user hid or deleted them). I modified the code for myself, as I need the data in another format, and it took me around 3 days with 8 keys and 8 threads to download everything.

rlleshi commented 4 years ago

Wow, 3 days. Is this your repository for the modified code?
Thanks a lot for the help!

SaschaStenger commented 4 years ago

No, it's not the current code. I have to modify mine a bit, as it's messy and still has some of my keys in it. At the moment it's not something I would feel comfortable sharing. But I can take out my access keys and upload it, so that you at least have a working version.

Be advised that I download to CSV format and do not download all the information that the tweet objects provide. It is possible to change the downloaded content, though.
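For context, a sketch of that kind of field selection, assuming Twitter's v1.1 tweet JSON layout. The chosen columns are illustrative, not the actual ones in the modified repo.

```python
# Sketch: flattening a subset of a v1.1 tweet object into a CSV row.
# The selected columns are illustrative; the modified repo may keep
# different fields.
import csv

COLUMNS = ["id_str", "created_at", "text", "retweet_count", "user_screen_name"]

def tweet_to_row(tweet):
    """Pick a few fields out of a full tweet JSON object."""
    return [
        tweet["id_str"],
        tweet["created_at"],
        tweet["text"],
        tweet["retweet_count"],
        tweet["user"]["screen_name"],
    ]

def write_tweets(tweets, path):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        for tweet in tweets:
            writer.writerow(tweet_to_row(tweet))
```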

rlleshi commented 4 years ago

Yeah sure. When do you plan to upload the updated version?

SaschaStenger commented 4 years ago

I just plastered together a version that should work. Like I said, it's not pretty, and maybe I made a mistake somewhere in the refactoring. But I uploaded it and wrote a short description of the changes. Link to repo

rlleshi commented 4 years ago

Thanks a lot for sharing this! One question though: you mention there that the code checks which files have already been downloaded, so that it can be stopped and resumed at will. Is that not the case with the original repository as well? I thought it was.

SaschaStenger commented 4 years ago

I would have to check again, but I’m pretty sure there is no check on what is already downloaded. Or I worked with a version that was a little older and didn’t have it.
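For what it's worth, such a resume check can be as simple as skipping IDs whose output file already exists. A sketch under that assumption (the directory layout is hypothetical):

```python
# Sketch of a stop/resume check: skip any tweet whose output file already
# exists on disk. The directory layout here is hypothetical.
import os

def pending_tweet_ids(tweet_ids, out_dir):
    """Yield only the IDs that have not been written to out_dir yet."""
    for tweet_id in tweet_ids:
        if not os.path.exists(os.path.join(out_dir, f"{tweet_id}.json")):
            yield tweet_id
```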

rlleshi commented 4 years ago

@SaschaStenger are you sure this is the working version? I have tried downloading twice so far, and the collection abruptly stops after it collects the news articles. Thanks

SaschaStenger commented 4 years ago

Does it give you any error messages? Try downloading tweets only. I haven't changed much outside the tweet or retweet code.

rlleshi commented 4 years ago

The problem was with the news_dir path in retweet_collection.py: it was saving the content inside the politifact/ folder rather than politifact/fake/ as specified in the code (I am crawling only fake news, but I am not sure whether that caused it). I have fixed it now, though. Thanks again for sharing your modified version!

rlleshi commented 4 years ago

@SaschaStenger but this updated code will not work for the rest of the features (user profiles, user timeline tweets, etc.), right?

SaschaStenger commented 4 years ago

Yes, I have not remodeled the rest of the code. To remodel the user profile and user timeline collection, have a look at twarc. But I don't know if and by how much the twarc library would improve efficiency.
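For reference, a minimal sketch of what user profile and timeline collection might look like with twarc 1.x. The credentials and the screen name are placeholders; check twarc's docs for the current API. Twarc sleeps through Twitter's rate-limit windows automatically, which is the main convenience here.

```python
# Sketch of collecting user profiles and timelines with twarc (v1 API).
# Credentials and IDs are placeholders; twarc handles Twitter's rate
# limits by sleeping until the window resets.
from twarc import Twarc

t = Twarc(consumer_key="...", consumer_secret="...",
          access_token="...", access_token_secret="...")

# hydrate full user objects from a list of user IDs
for user in t.user_lookup(["12", "783214"]):
    print(user["screen_name"])

# fetch a user's recent timeline posts
for tweet in t.timeline(screen_name="jack"):
    print(tweet["id_str"])
```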

shaanchandra commented 4 years ago

Hi @SaschaStenger and @rlleshi ,

Thanks for the updated code, it makes life so much easier. However, the code for retweets is not working as it should. The "complete" folder contains far fewer files than my count of tweets that have retweets, and the content of each file is identical to the corresponding file in the "tweets" folder (meaning none of the retweet information is being added).

In short:

  1. Far fewer files appear in the "complete" folder than expected (ideally we would expect the same number of files in "complete" as in "tweets", since it should hold all the tweets AND the retweets).
  2. The files that are present in the "complete" folder are exactly the same as the ones in "tweets", meaning no additional retweet information was added.

Am I missing something? Any help in the right direction would be highly appreciated, and I can change the code accordingly.
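A quick way to confirm the symptom described above (folder names as used in this thread; the layout is assumed):

```python
# Quick check of the symptom above: compare file counts and per-file
# contents between the "tweets" and "complete" folders. The folder
# layout is assumed from this thread.
import filecmp
import os

tweets_dir, complete_dir = "tweets", "complete"

tweet_files = set(os.listdir(tweets_dir))
complete_files = set(os.listdir(complete_dir))
print(f"{len(tweet_files)} tweet files, {len(complete_files)} complete files")

# files present in both but byte-identical => no retweet info was added
same = [f for f in complete_files & tweet_files
        if filecmp.cmp(os.path.join(tweets_dir, f),
                       os.path.join(complete_dir, f), shallow=False)]
print(f"{len(same)} files identical in both folders")
```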

iBibek commented 4 years ago

> I just plastered together a version that should work. [...] I uploaded it and wrote a short description of the changes. Link to repo

Hello, is it possible for you to share your downloaded data?

SaschaStenger commented 4 years ago

> Hello, is it possible for you to share your downloaded data?

Hey

I'm sorry, but the Twitter guidelines prohibit sharing the collected dataset. That's the reason the original dataset consists only of tweet IDs, as that is the only legal way of sharing this data.

kyleiwaniec commented 3 years ago

> Hi @MarionMeyers, it generally takes time because Twitter imposes limits on how often certain APIs can be called for data collection. [...] Tweet collection API - 900 tweets / 15 mins; retweet collection API - 75 retweets / 15 mins; user profile collection - 900 users / 15 mins; user timeline posts - 900 users / 15 mins. [...] Try to get more Twitter API keys [...] and increase num_process so that more parallel processes are used.

This one is brutal:

Currently there are a total of 262,935 users. The followers endpoint allows 15 calls per 15 minutes, i.e. 60/hr => 1,440/day per API key. With one key, that's 182 days.

Twitter only allows 3 API keys to be created per day for a total of 10 API keys per developer account. So with 10 keys, it will take 18.2 days just for followers. Rinse and repeat for following. Enjoy.
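Spelled out, the arithmetic in that comment (using the rates as quoted):

```python
# The arithmetic above, spelled out (rates as quoted in this thread).
users = 262_935
calls_per_hour = 60              # 15 calls / 15 min for the followers endpoint
per_day = calls_per_hour * 24    # 1,440 users/day per API key

print(round(users / per_day, 1))       # ~182.6 days with one key
print(round(users / per_day / 10, 1))  # ~18.3 days with ten keys
```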

asc111 commented 3 years ago

Hi @SaschaStenger, there is no code/resource folder in your repo, so where should I add the tweet_keys_file.txt? With the original FakeNewsNet, I am able to download the Politifact and GossipCop news, but there is no social media data. Kindly help! Thanks in advance.

Rilkia commented 2 years ago

Oh my.

> Currently there are a total of 262,935 users. That's 60/hr => 1,440/day per API key. With one key, that's 182 days. [...] So with 10 keys, it will take 18.2 days just for followers. Rinse and repeat for following. Enjoy.

Oh my god, it is quite a sad thing!!! So where can we download the data in a quicker way...? I mean, I really need the full dataset to finish my homework....... QAQ

boyang-x commented 1 year ago

> Oh my god, it is quite a sad thing!!! So where can we download the data in a quicker way...? I mean, I really need the full dataset to finish my homework....... QAQ

Have you finished your homework? I mean, have you already downloaded the full dataset? I am experiencing the same trouble as you.

rasel3413 commented 7 months ago

How can I get the full dataset? Can anybody help?