sagefuentes opened this issue 3 years ago
Hello @sagefuentes, I'm dealing with the exact same issue. I have also been downloading tweets for the past weeks, and it suddenly stopped working, giving me error 404 even though the link is valid.
I've no idea what might be the cause...
Same for me. I also suddenly encountered this problem today, even though everything worked fine yesterday.
I am dealing with the same issue here. This started today and is caused by some change or bug on Twitter's server side. Running the command with debug=True shows that the URL used to get tweets is no longer available. Seeking a solution now.
Also started having the same issue today.
I'm having the same issue as well! Does anyone have a solution for it?
Yes, I am having the same issue. I guess everyone is having it.
I'm not sure if it is related to this issue, but some of the user_agents seem to be out of date:
user_agents = [
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:63.0) Gecko/20100101 Firefox/63.0',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:62.0) Gecko/20100101 Firefox/62.0',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:61.0) Gecko/20100101 Firefox/61.0',
'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15',
]
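As a stopgap, one hedged workaround (a sketch, not a confirmed fix) is to swap in newer strings and pick one at random per request, the way the library already does. The UA strings below are illustrative examples from circa-2020 browsers, not an authoritative list:

```python
import random

# Newer desktop UA strings (versions are illustrative examples, not
# an authoritative or complete list)
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.2 Safari/605.1.15',
]

def random_user_agent():
    """Pick one UA string per request so the traffic looks less uniform."""
    return random.choice(user_agents)
```

As later comments note, updating the UA alone does not fix the 404, since the endpoint itself appears to be gone.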
same
Seems to be a "bigger" problem? Other scrapers are having problems as well: https://github.com/twintproject/twint/issues/915#issue-704034135
Here is debug enabled. It shows the actual URL being called, and it seems that Twitter has removed the /i/search/timeline endpoint. :(
https://twitter.com/i/search/timeline?vertical=news&q=from%3AREDACTED&src=typd&&include_available_features=1&include_entities=1&max_position=&reset_error_state=false
Host: twitter.com
User-Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36
Accept: application/json, text/javascript, */*; q=0.01
Accept-Language: en-US,en;q=0.5
X-Requested-With: XMLHttpRequest
Referer: https://twitter.com/i/search/timeline?vertical=news&q=from%3AREDACTED&src=typd&&include_available_features=1&include_entities=1&max_position=&reset_error_state=false
Connection: keep-alive
An error occured during an HTTP request: HTTP Error 404: Not Found
Try to open in browser: https://twitter.com/search?q=%20from%3AREDACTED&src=typd
Same problem, damn
I'm not sure if it is related to this issue, but some of the user_agents seem to be out of date
I forked and created a branch to allow a user-specified UA; using samples from my current browser doesn't fix the problem.
I notice the search and referrer URL shown in --debug output (https://twitter.com/i/search/timeline) returns a 404 error:
$ GetOldTweets3 --username twitter --debug
/home/inactivist/.local/bin/GetOldTweets3 --username twitter --debug
GetOldTweets3 0.0.11
Downloading tweets...
https://twitter.com/i/search/timeline?f=tweets&vertical=news&q=from%3Atwitter&src=typd&&include_available_features=1&include_entities=1&max_position=&reset_error_state=false
Host: twitter.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:63.0) Gecko/20100101 Firefox/63.0
Accept: application/json, text/javascript, */*; q=0.01
Accept-Language: en-US,en;q=0.5
X-Requested-With: XMLHttpRequest
Referer: https://twitter.com/i/search/timeline?f=tweets&vertical=news&q=from%3Atwitter&src=typd&&include_available_features=1&include_entities=1&max_position=&reset_error_state=false
Connection: keep-alive
An error occured during an HTTP request: HTTP Error 404: Not Found
Try to open in browser: https://twitter.com/search?q=%20from%3Atwitter&src=typd
$ curl -I https://twitter.com/i/search/timeline
HTTP/2 404
[snip]
EDIT The URL used for the internal search and the one shown in the exception message aren't the same...
I tried replacing https://twitter.com/i/search/timeline with https://twitter.com/search?. The 404 error is gone, but now there is a 400 Bad Request error.
Unfortunately I have the same problem. I hope we find a solution as soon as possible.
I tried replacing https://twitter.com/i/search/timeline with https://twitter.com/search?. The 404 error is gone, but now there is a 400 Bad Request error.

Switching to mobile.twitter.com/search and using a modern User-Agent header seems to get us past the 400 Bad Request error, but then we get Error parsing JSON
...
Same thing for me. I get an error 404 but the URL is working.
I have the same issue.
I am experiencing the same issue. Any plan to fix the issue?
same issue, somebody help.
Same issue. The same code was working a day ago; now it's giving error 404 with a valid link.
Maybe it has something to do with this: https://blog.twitter.com/developer/en_us/topics/tips/2020/understanding-the-new-tweet-payload.html
I am having the same issue. It was more robust than Tweepy. I hope we find a solution as soon as possible.
Maybe it has something to do with this: https://blog.twitter.com/developer/en_us/topics/tips/2020/understanding-the-new-tweet-payload.html
Unfortunately the Twitter API does not fully meet our needs, because we need full-history search without limitations. You can search only 5,000 tweets a month with the Twitter API.
I hope GetOldTweets starts to work as soon as possible; otherwise I cannot complete my master's thesis.
I have same issue. Need some help here
Maybe it has something to do with this: https://blog.twitter.com/developer/en_us/topics/tips/2020/understanding-the-new-tweet-payload.html
Unfortunately the Twitter API does not fully meet our needs, because we need full-history search without limitations. You can search only 5,000 tweets a month with the Twitter API. I hope GetOldTweets starts to work as soon as possible; otherwise I cannot complete my master's thesis.
I see! I'm fairly new to scraping, but I'm working on an end-of-course thesis about sentiment analysis and could really use some newer tweets.
I've been tinkering with GOT3's code a bit and got it to read the HTML of the search timeline; however, it's mostly unformatted. Like I said, I have little experience with scraping, so I'm struggling to format it correctly. I will note my changes here for reference, and for someone with more experience to pick up if they wish:
- updated user_agents (replaced with the ones used by TWINT);
- updated endpoint (/search?);
- some updates to the URL structure:
url = "https://twitter.com/search?"
url += ("q=%%20%s&src=typd%s"
        "&include_available_features=1&include_entities=1&max_position=%s"
        "&reset_error_state=false")
if not tweetCriteria.topTweets:
    url += "&f=live"
Edit: Forgot to say this. Sometimes the application gives me a 400 Bad Request; if I run it again, it outputs the HTML as described above.
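To make the placeholder filling concrete, here is a sketch of how the %-format string above might be populated. The function name and the url_get_data/max_position parameters are hypothetical stand-ins for whatever GOT3 actually passes in:

```python
import urllib.parse

def build_search_url(query, url_get_data="", max_position="", top_tweets=False):
    """Assemble a /search? URL in the same shape as the snippet above."""
    url = "https://twitter.com/search?"
    url += ("q=%%20%s&src=typd%s"
            "&include_available_features=1&include_entities=1&max_position=%s"
            "&reset_error_state=false") % (
                urllib.parse.quote(query), url_get_data, max_position)
    if not top_tweets:
        # "latest" tab instead of "top" results
        url += "&f=live"
    return url
```

For example, `build_search_url("from:twitter")` yields a URL beginning with `https://twitter.com/search?q=%20from%3Atwitter`. Whether the /search? endpoint returns parseable data is exactly what the thread is still debating.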
same issue here, anyone has any idea how to solve this?
same issue here, It stopped working from past three days.
The HTML problem can be easily solved with some BeautifulSoup manipulation. However, I cannot get the BeautifulSoup functions to work, as it constantly raises an error:
Error parsing JSON: <?xml version="1.0" encoding="utf-8"?>
and it does not allow me to continue using the obtained HTML.
Any idea on how to avoid this error?
same issue here, I think this is because Twitter has removed the https://twitter.com/i/search/timeline endpoint
Same issue with me and some of my other colleagues. I noticed other scrapers are also running into this problem like Twint https://github.com/twintproject/twint/issues/918
It seems to follow along the same logic mentioned above that it's likely an update on Twitter's part that is making these scraping libraries not work.
Same issue, probably API-related changes on Twitter's end
Same problem, it stopped working today. Any solution? Did they change the endpoint? How do we work around that?
I am having the same problem, but it was working 2 months ago. Has anyone been able to solve this?
anyone know how to solve this?
Are there other ways to crawl Twitter now?
So far the only method of scraping tweets that still seems to work is snscrape's jsonl method. A comment in this Twint issue explains how to do this. Please note you will need Python 3.8 and the latest development version of snscrape. This doesn't export the .json result to .csv, though; for that I used an online converter at first, and later the pandas library in Python.
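For the .json-to-.csv conversion step, here is a minimal stdlib-only sketch, assuming one JSON object per line as snscrape's jsonl mode produces. The field names (date, content, username) are illustrative; adjust them to whatever keys your output actually contains:

```python
import csv
import json

def jsonl_to_csv(jsonl_path, csv_path, fields):
    """Convert a JSON-lines file (one tweet object per line) to CSV,
    keeping only the requested fields."""
    with open(jsonl_path, encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fields)
        writer.writeheader()
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            # missing keys become empty cells rather than raising
            writer.writerow({f: tweet.get(f, "") for f in fields})

# e.g. jsonl_to_csv("mytweets.json", "mytweets.csv",
#                   ["date", "content", "username"])
```

A pandas one-liner (`pd.read_json(path, lines=True).to_csv(...)`) does the same job if pandas is available, as a later comment in this thread shows.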
@nalam002 thank you for sharing.
this was in the issues for 'taspinar/twitterscraper', which also stopped working recently:
Any alternative solution for it? My master's thesis is on hold because of it. I tried snscrape as mentioned in the comment above, but it does not return results based on a search query string.
Found this in issues for Twint: https://github.com/twintproject/twint/pull/917#issuecomment-697361036
The trick is to encapsulate the call to the scraper in a loop, and then each time decrement c.Until. I'm using something like this:
for x in range(0, number_of_skips):
    days = x * -7
    end = start + timedelta(days)
    time.sleep(10)
    scrape(str(end))
The time.sleep (using the time module; here, ten seconds) helps avoid getting blocked on Twitter's end. number_of_skips is computed by the encapsulating program: it takes the length of time I want to scrape, in days, and divides it by the window size (in this case, a week).
"scrape" is just:
def scrape(u_date):
    u_date += " 00:00:00"
    c = twint.Config()
    c.Search = st
    c.Store_object = True
    c.Limit = 40
    c.Until = u_date
    c.Lang = "fi"
    twint.run.Search(c)
    tlist = c.search_tweet_list
(and then print, store, whatever; and that's it for the loop)
Worked for me
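The windowing logic described above can be sketched independently of Twint. The start date, window width, and count below are illustrative values, not ones from the original comment:

```python
from datetime import date, timedelta

def weekly_until_dates(start, number_of_skips, step_days=7):
    """Yield 'until' cut-off dates, walking backwards one window at a time,
    newest first (mirrors the days = x * -7 trick above)."""
    for x in range(number_of_skips):
        yield start + timedelta(days=x * -step_days)

# e.g. four weekly cut-offs ending at 2020-09-25, newest first;
# each date would become one scrape(...) call, with a sleep in between
dates = list(weekly_until_dates(date(2020, 9, 25), 4))
```

Each yielded date plays the role of c.Until for one scraper call; sleeping between calls, as the comment suggests, reduces the chance of being blocked.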
Any alternative solution for it? My master's thesis is on hold because of it. I tried snscrape as mentioned in the comment above, but it does not return results based on a search query string.
I used the search query below and it returns the links of the tweets.
snscrape twitter-search "#XRP since:2019-12-31 until:2020-09-25" > XRP_Sept_tweets.txt
I obtained the tweet_id values and then used tweepy to extract the tweets, as I needed more attributes (may not be the best way to do it):
def get_tweets(tweet_ids, currency):
    # global api
    statuses = api.statuses_lookup(tweet_ids, tweet_mode="extended")
    data = get_df()  # define your own dataframe
    # processing the statuses
    for status in statuses:
        # print(status.lang)
        if status.lang == "en":
            mined = {
                "tweet_id": status.id,
                "name": status.user.name,
                "screen_name": status.user.screen_name,
                "retweet_count": status.retweet_count,
                "text": status.full_text,
                "mined_at": datetime.datetime.now(),
                "created_at": status.created_at,
                "favourite_count": status.favorite_count,
                "hashtags": status.entities["hashtags"],
                "status_count": status.user.statuses_count,
                "followers_count": status.user.followers_count,
                "location": status.place,
                "source_device": status.source,
                "coin_symbol": currency
            }
            last_tweet_id = status.id
            data = data.append(mined, ignore_index=True)
    print(currency, "outputting to tweets", len(data))
    data.to_csv(
        "Extracted_TWEETS.csv", mode="a",
        header=not os.path.exists("Extracted_TWEETS.csv"), index=False
    )
    print("..... going to sleep 20s")
    time.sleep(20)
Note that tweet_ids is a list of 100 tweet ids.
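Since statuses_lookup accepts at most 100 ids per call, a longer id list has to be chunked first. A small sketch (the chunked helper and the example ids are illustrative, not part of the original snippet):

```python
def chunked(items, size=100):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

all_ids = list(range(250))       # stand-in for the collected tweet ids
batches = chunked(all_ids, 100)  # 3 batches: 100, 100, 50

# for batch in batches:
#     get_tweets(batch, "XRP")   # hypothetical call into the snippet above
```

The 20-second sleep at the end of get_tweets then acts as a crude rate limiter between batches.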
Any alternative solution for it? My master's thesis is on hold because of it. I tried snscrape as mentioned in the comment above, but it does not return results based on a search query string.
You can use this.
I am facing the same issue as well. Is there any alternative to collect tweets apart from GetOldTweets3? Could anyone help me out?
@HuifangYeo, if you really need to get data from Twitter, try the Twitter API. I am using it like this:
import tweepy
import pandas as pd
import datetime
from datetime import timedelta
from ratelimit import limits, sleep_and_retry
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

esperar = 900   # in seconds (15 minutes)
llamadas = 15   # calls to the API; per Twitter's documentation, 300 searches are allowed per 15 minutes

@sleep_and_retry
@limits(calls=llamadas, period=esperar)
def buscar_tweets(pais, query, idioma, date_since=None, maxItems=None):
    tweetContenido = []
    tweetUsuario = []
    tweetUbicacion = []
    tweetPlaceName = []
    tweetCD = []
    tweetHashtag = []
    places = api.geo_search(query=pais, granularity="country")
    # Get the country's place ID
    place_id = places[0].id
    if date_since is None:
        # Search from today's date
        date_since = datetime.date.today()
    try:
        # Combine the text query with the place filter (an `and` between the
        # two strings would silently drop the query text)
        cursor = tweepy.Cursor(api.search, q="%s place:%s" % (query, place_id),
                               lang=idioma, since=date_since,
                               tweet_mode='extended')
        tweets = cursor.items() if maxItems is None else cursor.items(maxItems)
    except Exception as e:
        print("There was an error. Details: " + str(e))
    for tweet in tweets:
        tweetContenido.append(tweet.full_text)
        tweetUsuario.append(tweet.user.name)
        tweetUbicacion.append(tweet.user.location)
        tweetPlaceName.append(tweet.place.name)
        tweetCD.append(tweet.created_at)
        tweetHashtag.append(query)
    return tweetContenido, tweetUsuario, tweetUbicacion, tweetPlaceName, tweetCD, tweetHashtag
I did it like that, but you have to limit the number of tweets, otherwise you will get error 429. I also tried Twint, but it is not working currently; right now I think this is the best approach. I use rate limiting to wait 15 minutes after every 15 calls to the Twitter API, which works well, but if you try to pull a lot of data Twitter will still give you error 429. I hope this helps, and good luck. :smiley:
I tried your script; do we need to save the CSV file somewhere on our system?
Yes, the CSV file will be generated in the same directory as the script. line 25:
tweets_df.to_csv('{}-tweets.csv'.format(text_query), sep=',', index=False)
Yes, the CSV file will be generated in the same directory as the script. line 25:
tweets_df.to_csv('{}-tweets.csv'.format(text_query), sep=',', index=False)
but I am getting an error: failed on_status, [Errno 13] Permission denied: 'COVID-tweets.csv', where COVID is the text_query
Please check this: https://stackoverflow.com/questions/40984791/python-errno-13-permission-denied
Any alternative solution for it? My master's thesis is on hold because of it. I tried snscrape as mentioned in the comment above, but it does not return results based on a search query string.
You can get the results by running a code like this:
snscrape --jsonl twitter-search "YOURSEARCHQUERY @USERTODLFROM #HASHTAGTODLFROM since:2020-09-01 until:2020-09-25"> mytweets.json
I then ran the .json file through this tiny Python snippet to get my .csv, which is enough for me right now. You might want to check out the other answers if you're looking for something more elegant with more info.
import pandas as pd
from io import StringIO

with open('mytweets.json', 'r', encoding='utf-8-sig') as f:
    data = f.readlines()

data = map(lambda x: x.rstrip(), data)
data_json_str = "[" + ','.join(data) + "]"
newdf = pd.read_json(StringIO(data_json_str))
newdf.to_csv("mytweets.csv", encoding='utf-8-sig')
Hi, I had a script running over the past weeks, and earlier today it stopped working. I keep receiving HTTPError 404, but the link provided in the error still brings me to a valid page. The code is below (all mentioned variables are established, and when I debug, the error specifically happens in the Manager):
tweetCriteria = got.manager.TweetCriteria().setQuerySearch(term)\
    .setMaxTweets(max_count)\
    .setSince(begin_timeframe)\
    .setUntil(end_timeframe)
scraped_tweets = got.manager.TweetManager.getTweets(tweetCriteria)
The error message for this is the standard 404 error, "An error occured during an HTTP request: HTTP Error 404: Not Found Try to open in browser:", followed by the valid link.
As I have changed nothing about the folder, I am wondering if something has happened with my configuration more than anything else, but also whether others are experiencing this.