Mottl / GetOldTweets3

A Python 3 library and a corresponding command line utility for accessing old tweets
MIT License

HTTP Error, Gives 404 but the URL is working #98

Open sagefuentes opened 3 years ago

sagefuentes commented 3 years ago

Hi, I had a script running over the past few weeks and earlier today it stopped working. I keep receiving HTTPError 404, but the link provided in the error still brings me to a valid page. The code is (all mentioned variables are defined, and the error specifically happens in the Manager when I check via debugging):

    tweetCriteria = got.manager.TweetCriteria().setQuerySearch(term)\
        .setMaxTweets(max_count)\
        .setSince(begin_timeframe)\
        .setUntil(end_timeframe)
    scraped_tweets = got.manager.TweetManager.getTweets(tweetCriteria)

The error message is the standard 404 error, "An error occured during an HTTP request: HTTP Error 404: Not Found Try to open in browser:", followed by the valid link.

As I have changed nothing in the folder, I wonder whether something has happened with my configuration more than anything else, but I'm also wondering whether others are experiencing this.

alberto-valdes commented 3 years ago

Hello @sagefuentes, I'm dealing with the exact same issue. I had also been downloading tweets for the past few weeks when it suddenly stopped working, giving me error 404 with a valid link.

I've no idea what might be the cause...

caiyishu commented 3 years ago

Same here. I also suddenly ran into this problem today, though everything worked fine yesterday.

taoyudong commented 3 years ago

I am dealing with the same issue here. This is something new today and is caused by some change or bug on Twitter's server side. If you run the command with debug=True, you can see that the URL used to get tweets is no longer available. Seeking a solution now.
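
For anyone trying to reproduce this, a minimal sketch (assuming getTweets accepts the debug flag that the CLI's --debug option suggests):

    import GetOldTweets3 as got

    criteria = got.manager.TweetCriteria().setQuerySearch('twitter').setMaxTweets(1)
    # with debug enabled, TweetManager prints the request URL and headers it uses
    tweets = got.manager.TweetManager.getTweets(criteria, debug=True)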

mwaters166 commented 3 years ago

Also started having the same issue today.

MithilaGuha commented 3 years ago

I'm having the same issue as well! Does anyone have a solution for it?

baraths92 commented 3 years ago

Yes, I am having the same issue. I guess everyone is having it.

alastairrushworth commented 3 years ago

I'm not sure if it is related to this issue, but some of the user_agents seem to be out of date

    user_agents = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:63.0) Gecko/20100101 Firefox/63.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:62.0) Gecko/20100101 Firefox/62.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:61.0) Gecko/20100101 Firefox/61.0',
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15',
    ]

https://github.com/Mottl/GetOldTweets3/blob/54a8e73d56953c7f664e073e121fa1413de4fbff/GetOldTweets3/manager/TweetManager.py#L13
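
If anyone wants to experiment with fresher strings, a minimal monkey-patch sketch (this assumes the list is reachable as an attribute on the TweetManager class; adjust the target if it actually lives at module level):

    import GetOldTweets3 as got

    # hypothetical patch: replace the stale list with one current UA string
    got.manager.TweetManager.user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0',
    ]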

stevedwards commented 3 years ago

same

Sebastokratos42 commented 3 years ago

Seems to be a "bigger" problem? Other scrapers are having problems too. https://github.com/twintproject/twint/issues/915#issue-704034135

Daviey commented 3 years ago

Here is the output with debug enabled. It shows the actual URL being called, and it seems that Twitter has removed the /i/search/timeline endpoint. :(

https://twitter.com/i/search/timeline?vertical=news&q=from%3AREDACTED&src=typd&&include_available_features=1&include_entities=1&max_position=&reset_error_state=false
Host: twitter.com
User-Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36
Accept: application/json, text/javascript, */*; q=0.01
Accept-Language: en-US,en;q=0.5
X-Requested-With: XMLHttpRequest
Referer: https://twitter.com/i/search/timeline?vertical=news&q=from%3AREDACTED&src=typd&&include_available_features=1&include_entities=1&max_position=&reset_error_state=false
Connection: keep-alive
An error occured during an HTTP request: HTTP Error 404: Not Found
Try to open in browser: https://twitter.com/search?q=%20from%3AREDACTED&src=typd
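
You can confirm the endpoint is gone independently of the library with a few lines of stdlib Python (a quick sketch; the UA string is arbitrary):

    import urllib.error
    import urllib.request

    req = urllib.request.Request('https://twitter.com/i/search/timeline',
                                 headers={'User-Agent': 'Mozilla/5.0'})
    try:
        urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        print(e.code)  # 404, matching the debug output above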

danielo93 commented 3 years ago

Same problem, damn

inactivist commented 3 years ago

I'm not sure if it is related to this issue, but some of the user_agents seem to be out of date

I forked and created a branch to allow a user-specified UA; using samples from my current browser doesn't fix the problem.

I notice the search and referrer URL shown in --debug output (https://twitter.com/i/search/timeline) returns a 404 error:

$ GetOldTweets3 --username twitter --debug 
/home/inactivist/.local/bin/GetOldTweets3 --username twitter --debug
GetOldTweets3 0.0.11
Downloading tweets...
https://twitter.com/i/search/timeline?f=tweets&vertical=news&q=from%3Atwitter&src=typd&&include_available_features=1&include_entities=1&max_position=&reset_error_state=false
Host: twitter.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:63.0) Gecko/20100101 Firefox/63.0
Accept: application/json, text/javascript, */*; q=0.01
Accept-Language: en-US,en;q=0.5
X-Requested-With: XMLHttpRequest
Referer: https://twitter.com/i/search/timeline?f=tweets&vertical=news&q=from%3Atwitter&src=typd&&include_available_features=1&include_entities=1&max_position=&reset_error_state=false
Connection: keep-alive
An error occured during an HTTP request: HTTP Error 404: Not Found
Try to open in browser: https://twitter.com/search?q=%20from%3Atwitter&src=typd
$ curl -I https://twitter.com/i/search/timeline
HTTP/2 404 
[snip]

EDIT: The URL used for the internal search and the one shown in the exception message aren't the same...

baraths92 commented 3 years ago

I tried replacing https://twitter.com/i/search/timeline with https://twitter.com/search?. The 404 error is gone, but now there is a 400 Bad Request error.

herdemo commented 3 years ago

Unfortunately I have the same problem. I hope we find a solution as soon as possible.

inactivist commented 3 years ago

I tried replacing https://twitter.com/i/search/timeline with https://twitter.com/search?. The 404 error is gone, but now there is a 400 Bad Request error.

Switching to mobile.twitter.com/search and using a modern User-Agent header seems to get us past the 400 bad request error, but then we get Error parsing JSON...
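
The probe I'm using looks roughly like this (a sketch only; the query and UA strings are placeholders, and the response body is what then fails to parse as JSON):

    import urllib.parse
    import urllib.request

    url = 'https://mobile.twitter.com/search?' + urllib.parse.urlencode({'q': 'from:twitter'})
    req = urllib.request.Request(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) '
                      'Gecko/20100101 Firefox/80.0',
    })
    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.read(200))  # a 200 with an HTML body, not JSON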

rennanharo commented 3 years ago

Same thing for me. I get an error 404 but the URL is working.

shelu16 commented 3 years ago

I have the same issue.

maldil commented 3 years ago

I am experiencing the same issue. Are there any plans to fix it?

alifzl commented 3 years ago

Same issue; somebody help.

chinmuxmaximus commented 3 years ago

Same issue. The same code was working a day back; now it's giving error 404 with a valid link.

GabrielEspeschit commented 3 years ago

Maybe it has something to do with this: https://blog.twitter.com/developer/en_us/topics/tips/2020/understanding-the-new-tweet-payload.html

rsafa commented 3 years ago

I am having the same issue. It was more robust than Tweepy. I hope we find a solution as soon as possible.

herdemo commented 3 years ago

Maybe it has something to do with this: https://blog.twitter.com/developer/en_us/topics/tips/2020/understanding-the-new-tweet-payload.html

Unfortunately the Twitter API does not fully meet our needs, because we need full-history search without any limitations. You can search only 5,000 tweets a month with the Twitter API.
I hope GetOldTweets starts working again as soon as possible; otherwise I cannot complete my master's thesis.

Fiyinfoluwa6 commented 3 years ago

I have the same issue. Need some help here.

GabrielEspeschit commented 3 years ago

Maybe it has something to do with this: https://blog.twitter.com/developer/en_us/topics/tips/2020/understanding-the-new-tweet-payload.html

Unfortunately the Twitter API does not fully meet our needs, because we need full-history search without any limitations. You can search only 5,000 tweets a month with the Twitter API. I hope GetOldTweets starts working again as soon as possible; otherwise I cannot complete my master's thesis.

I see! I'm fairly new to scraping, but I'm working on an end-of-course thesis about sentiment analysis and could really use some newer tweets to help me out.

I've been tinkering with GOT3's code a bit and got it to read the HTML of the search timeline, but it's mostly unformatted. Like I said, I have little experience with scraping, so I'm really struggling to format it correctly. I will note my changes here for reference, and for someone with more experience to pick up if they so wish:

  • updated user_agents (updated with the ones used by TWINT);
  • updated endpoint (/search?)
  • some updates to the URL structure:

        url = "https://twitter.com/search?"

        url += ("q=%%20%s&src=typd%s"
                "&include_available_features=1&include_entities=1&max_position=%s"
                "&reset_error_state=false")

        if not tweetCriteria.topTweets:
            url += "&f=live"

Edit: Forgot to say this. Sometimes the application gives me a 400: Bad Request; I run it again, and it outputs the HTML as described above.

brianleegit commented 3 years ago

Same issue here. Anyone have any idea how to solve this?

afghanpasha commented 3 years ago

Same issue here. It stopped working three days ago.

JacoboGuijar commented 3 years ago

Maybe it has something to do with this: https://blog.twitter.com/developer/en_us/topics/tips/2020/understanding-the-new-tweet-payload.html

Unfortunately the Twitter API does not fully meet our needs, because we need full-history search without any limitations. You can search only 5,000 tweets a month with the Twitter API. I hope GetOldTweets starts working again as soon as possible; otherwise I cannot complete my master's thesis.

I see! I'm fairly new to scraping, but I'm working on an end-of-course thesis about sentiment analysis and could really use some newer tweets to help me out.

I've been tinkering with GOT3's code a bit and got it to read the HTML of the search timeline, but it's mostly unformatted. Like I said, I have little experience with scraping, so I'm really struggling to format it correctly. I will note my changes here for reference, and for someone with more experience to pick up if they so wish:

  • updated user_agents (updated with the ones used by TWINT);
  • updated endpoint (/search?)
  • some updates to the URL structure:
      url = "https://twitter.com/search?"

        url += ("q=%%20%s&src=typd%s"
                "&include_available_features=1&include_entities=1&max_position=%s"
                "&reset_error_state=false")

        if not tweetCriteria.topTweets:
            url += "&f=live"`

Edit: Forgot to say this. Sometimes the application gives me a 400: Bad Request; I run it again, and it outputs the HTML as described above.

The HTML problem can easily be solved with some BeautifulSoup manipulation. However, I cannot get the BeautifulSoup functions to work, as it constantly gets an error, Error parsing JSON: <?xml version="1.0" encoding="utf-8"?>, and that does not allow me to continue using the obtained HTML.

Any idea how to avoid this error?
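
For reference, the direction I'm trying once the HTML does come back cleanly (a sketch only; the selector is an assumption based on Twitter's legacy markup and will need adjusting against the actual response):

    from bs4 import BeautifulSoup

    def extract_tweets(html):
        soup = BeautifulSoup(html, 'html.parser')
        # legacy Twitter search pages wrapped each tweet's text in a <p> with a
        # 'tweet-text' class; treat this selector as a starting point only
        return [p.get_text(strip=True) for p in soup.select('p.tweet-text')]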

007sumitsingh commented 3 years ago

Same issue here. I think this is because Twitter has removed the https://twitter.com/i/search/timeline endpoint.

MartinBeckUT commented 3 years ago

Same issue with me and some of my colleagues. I noticed other scrapers are also running into this problem, like Twint: https://github.com/twintproject/twint/issues/918

It follows the same logic mentioned above: it's likely a change on Twitter's part that is breaking these scraping libraries.

RobViren commented 3 years ago

Same issue, probably API-related changes on Twitter's end.

marcondesgio commented 3 years ago

Same problem; it stopped working today. Any solution? Did they change the endpoint? How do we deal with that?

alexisqc92 commented 3 years ago

I am having the same problem, though it had been working for the past 2 months. Has anyone been able to solve this?

007sumitsingh commented 3 years ago

Anyone know how to solve this?

Solin1998 commented 3 years ago

Are there any other ways to crawl Twitter now?

nalam002 commented 3 years ago

So far the only scraping method that still seems to work is snscrape's jsonl method. A comment in this Twint issue explains how to do it. Note that you will need Python 3.8 and the latest development version of snscrape. This doesn't export the .json result to .csv, though; for that I used an online converter at first, and later the pandas library in Python.
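
For reference, installing the development version is usually done straight from the repository (a sketch; check the snscrape README for the current instructions):

    pip3 install --upgrade git+https://github.com/JustAnotherArchivist/snscrape.git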

stevedwards commented 3 years ago

@nalam002 thank you for sharing.

eatyofood commented 3 years ago

This was in the issues for taspinar/twitterscraper, which also stopped working recently:

https://github.com/taspinar/twitterscraper/issues/344

divya171990 commented 3 years ago

Any alternative solution for this? My master's thesis is on hold because of it. I tried snscrape as mentioned in the comment above, but it does not return results based on a search query string.

Atharva4899 commented 3 years ago

Found this in issues for Twint: https://github.com/twintproject/twint/pull/917#issuecomment-697361036

The trick is to wrap the call to the scraper in a loop and decrement c.Until on each pass. I'm using something like this:

    for x in range(0, number_of_skips):
        days = x * -7
        end = start + timedelta(days)
        time.sleep(10)
        scrape(str(end))

The time.sleep call (using the time module; here, ten seconds) helps avoid getting blocked on the Twitter end. The number_of_skips value is computed by the encapsulating program: it takes the length of time I want to scrape, in days, and divides it by the step size (in this case, a week).

"scrape" is just

    def scrape(u_date):
        u_date += " 00:00:00"
        c = twint.Config()
        c.Search = st
        c.Store_object = True
        c.Limit = 40
        c.Until = u_date
        c.Lang = "fi"
        twint.run.Search(c)
        tlist = c.search_tweet_list

(and then print, store, whatever; and that's it for the loop)

Worked for me
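
Putting the two fragments together, a self-contained version would look roughly like this (a sketch under the same assumptions: st, start, and number_of_skips come from the surrounding program; here the placeholders are defined inline, and results are read from twint.output.tweets_list, which twint fills when Store_object is set):

    import time
    from datetime import date, timedelta

    import twint

    st = "#example"            # search term (placeholder)
    start = date(2020, 9, 1)   # anchor date (placeholder)
    number_of_skips = 4        # number of week-sized steps back (placeholder)

    def scrape(u_date):
        c = twint.Config()
        c.Search = st
        c.Store_object = True
        c.Limit = 40
        c.Until = u_date + " 00:00:00"
        c.Lang = "fi"
        twint.run.Search(c)
        return twint.output.tweets_list

    for x in range(number_of_skips):
        end = start + timedelta(days=x * -7)
        time.sleep(10)  # pause between batches to avoid getting blocked
        tweets = scrape(str(end))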

HuifangYeo commented 3 years ago

Any alternative solution for this? My master's thesis is on hold because of it. I tried snscrape as mentioned in the comment above, but it does not return results based on a search query string.

I used the query search below, and it returns the links of the tweets:

    snscrape twitter-search "#XRP since:2019-12-31 until:2020-09-25" > XRP_Sept_tweets.txt
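
To go from that file of links to IDs, something like this works (a sketch assuming one tweet URL per line, with the numeric ID as the last path segment):

    with open("XRP_Sept_tweets.txt") as f:
        tweet_ids = [line.rsplit("/", 1)[-1].strip() for line in f if line.strip()]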

I obtain the tweet IDs and then use tweepy to extract the tweets, as I needed more attributes (this may not be the best way to do it):

import datetime
import os
import time

def get_tweets(tweet_ids, currency):
    # assumes a tweepy API object named `api` is defined at module level
    statuses = api.statuses_lookup(tweet_ids, tweet_mode="extended")
    data = get_df() # define your own dataframe
    # printing the statuses
    for status in statuses:
        # print(status.lang)

        if status.lang == "en":
            mined = {
                "tweet_id": status.id,
                "name": status.user.name,
                "screen_name": status.user.screen_name,
                "retweet_count": status.retweet_count,
                "text": status.full_text,
                "mined_at": datetime.datetime.now(),
                "created_at": status.created_at,
                "favourite_count": status.favorite_count,
                "hashtags": status.entities["hashtags"],
                "status_count": status.user.statuses_count,
                "followers_count": status.user.followers_count,
                "location": status.place,
                "source_device": status.source,
                "coin_symbol": currency
            }

            last_tweet_id = status.id
            data = data.append(mined, ignore_index=True)

    print(currency, "outputing to tweets", len(data))
    data.to_csv(
        f"Extracted_TWEETS.csv", mode="a", header=not os.path.exists("Extracted_TWEETS.csv"), index=False
    )
    print("..... going to sleep 20s")
    time.sleep(20)

Note that tweet_ids is a list of 100 tweet ids.
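
Since statuses_lookup accepts at most 100 IDs per call, a simple driver for a longer list might look like this (a sketch; "XRP" is a placeholder for the currency label):

    for i in range(0, len(tweet_ids), 100):
        get_tweets(tweet_ids[i:i + 100], "XRP")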

rsafa commented 3 years ago

Any alternative solution for this? My master's thesis is on hold because of it. I tried snscrape as mentioned in the comment above, but it does not return results based on a search query string.

You can use this.

saideep7997 commented 3 years ago

Even I am facing the same issue. Is there any alternative for collecting tweets apart from GetOldTweets3? Could anyone help me out?

rsafa commented 3 years ago

Try Python Script to Download Tweets.

alexisqc92 commented 3 years ago

@HuifangYeo, if you really need to get data from Twitter, try the Twitter API. I am using it like this:

import tweepy
import pandas as pd
import datetime
from datetime import timedelta
from ratelimit import limits, sleep_and_retry

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

esperar = 900  # in seconds (15 minutes)
llamadas = 15  # calls to the API; per Twitter's documentation, 300 searches are allowed per 15 minutes
@sleep_and_retry
@limits(calls=llamadas, period=esperar)
def buscar_tweets(pais,query,idioma,date_since=None, maxItems = None):
    tweetContenido=[]
    tweetUsuario = []
    tweetUbicacion = []
    tweetPlaceName = []
    tweetCD = [] 
    tweetHashtag = []
    places = api.geo_search(query=pais, granularity="country")
    # get the place ID for the country
    place_id = places[0].id
    if date_since == None:
        if maxItems == None:
            try:
                # search from today's date
                tweets = tweepy.Cursor(api.search, q="%s place:%s" % (query, place_id),
                                       lang=idioma, since=datetime.date.today(),
                                       extended=True, tweet_mode='extended').items()
            except Exception as e:
                print("There was an error. Details: " + str(e))
        else:
            try:
                # search from today's date
                tweets = tweepy.Cursor(api.search, q="%s place:%s" % (query, place_id),
                                       lang=idioma, since=datetime.date.today(),
                                       extended=True, tweet_mode='extended').items(maxItems)
            except Exception as e:
                print("There was an error. Details: " + str(e))
    else:
        if maxItems == None:
            try:
                tweets = tweepy.Cursor(api.search, q="%s place:%s" % (query, place_id),
                                       lang=idioma, since=date_since,
                                       extended=True, tweet_mode='extended').items()
            except Exception as e:
                print("There was an error. Details: " + str(e))
        else:
            try:
                tweets = tweepy.Cursor(api.search, q="%s place:%s" % (query, place_id),
                                       lang=idioma, since=date_since,
                                       extended=True, tweet_mode='extended').items(maxItems)
            except Exception as e:
                print("There was an error. Details: " + str(e))
    for tweet in tweets:
        tweetContenido.append(tweet.full_text)
        tweetUsuario.append(tweet.user.name)
        tweetUbicacion.append(tweet.user.location)
        tweetPlaceName.append(tweet.place.name)
        tweetCD.append(tweet.created_at)
        tweetHashtag.append(query)
    return tweetContenido,tweetUsuario, tweetUbicacion, tweetPlaceName,tweetCD, tweetHashtag

I did it like that, but you have to limit the number of tweets; otherwise you will get error 429. I also tried twint, but it is not working at the moment, so right now I think this is the best approach. I am using the rate limiter to wait 15 minutes after every 15 calls to the Twitter API. This works well, but if you try to pull a lot of data, Twitter will still give you error 429. I hope this helps you, and good luck. :smiley:
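
A call then looks like this (a sketch; the country, query, and language values are placeholders):

    contenido, usuarios, ubicaciones, lugares, fechas, hashtags = buscar_tweets(
        "Peru", "#covid", "es", maxItems=100)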

divya171990 commented 3 years ago

Try Python Script to Download Tweets.

I tried your script; do we need to save the CSV file somewhere on our system?

rsafa commented 3 years ago

Yes, the CSV file will be generated in the same directory as the script. Line 25: tweets_df.to_csv('{}-tweets.csv'.format(text_query), sep=',', index=False)

divya171990 commented 3 years ago

Yes, the CSV file will be generated in the same directory as the script. Line 25: tweets_df.to_csv('{}-tweets.csv'.format(text_query), sep=',', index=False)

But I am getting an error: failed on_status, [Errno 13] Permission denied: 'COVID-tweets.csv', where COVID is the text_query.

rsafa commented 3 years ago

Yes, the CSV file will be generated in the same directory as the script. Line 25: tweets_df.to_csv('{}-tweets.csv'.format(text_query), sep=',', index=False)

But I am getting an error: failed on_status, [Errno 13] Permission denied: 'COVID-tweets.csv', where COVID is the text_query.

Please check this: https://stackoverflow.com/questions/40984791/python-errno-13-permission-denied

nalam002 commented 3 years ago

Any alternative solution for this? My master's thesis is on hold because of it. I tried snscrape as mentioned in the comment above, but it does not return results based on a search query string.

You can get the results by running a command like this:

    snscrape --jsonl twitter-search "YOURSEARCHQUERY @USERTODLFROM #HASHTAGTODLFROM since:2020-09-01 until:2020-09-25" > mytweets.json

I then ran the .json file through this tiny Python snippet to get my .csv, which is enough for me right now. You might want to check out the other answers if you're looking for something more elegant with more info.

import pandas as pd
from io import StringIO
with open('mytweets.json', 'r', encoding ='utf-8-sig') as f:
    data = f.readlines()
data = map(lambda x: x.rstrip(), data)
data_json_str = "[" + ','.join(data) + "]"
newdf = pd.read_json(StringIO(data_json_str))
newdf.to_csv("mytweets.csv", encoding='utf-8-sig')