bpb27 / twitter_scraping

Grab all a user's tweets (and get past 3200 limit)

Can't get the IDs #23

Open shenyizy opened 4 years ago

shenyizy commented 4 years ago

When I run scrape.py, the final JSON file it creates is blank, with no ids in it. It was working last week. Does anyone know how to solve this?

jaackland commented 4 years ago

I also have this problem; it used to work and now doesn't. My suspicion is that the problem is with the css selector. Maybe Twitter recently changed the way tweet ids appear in the page markup? I also don't really know what I'm talking about because I'm pretty new to Python. If you figure it out, please let me know!

jaackland commented 4 years ago

@shenyizy I think I've fixed it, but I'm not entirely confident this is free of logic errors. It's a bit messy, but the trick is to use a new and less precise css selector. I've noticed three problems so far, but I've been able to work around them:

  1. The new selector also picks up hyperlinks on the names of users being replied to, so to work around that I remove all the list items that aren't fully numeric. But if your user was replying to someone with a fully numeric handle, that data point would slip through. There may be a better way to fix this.
  2. It also tends to pick up a lot of duplicate tweet ids, but this doesn't really matter because duplicates are removed at the end of the script.
  3. The json file doesn't get wiped at any point, so if you run it for two users in a row, the second user will inherit all of the first user's tweets. My solution is to manually delete all_ids.json between runs, which is clunky but functional (see the sketch right after this list for one way to automate it).
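
A sketch of automating point 3, assuming it runs at the very top of scrape.py before anything is scraped:

import os

twitter_ids_filename = 'all_ids.json'
if os.path.exists(twitter_ids_filename):    # leftovers from a previous run
    os.remove(twitter_ids_filename)         # start each scrape with a clean slate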

New selector:

twitter_ids_filename = 'all_ids.json'
days = (end - start).days + 1
tweet_selector = 'article > div > div > div > div > div > div > a'
user = user.lower()
ids = []

New loop:

for day in range(days):
    d1 = format_day(increment_day(start, 0))
    d2 = format_day(increment_day(start, 1))
    url = form_url(d1, d2)
    print(url)
    print(d1)
    driver.get(url)
    sleep(delay)
    try:
        found_tweets = driver.find_elements_by_css_selector(tweet_selector)
        all_tweets = found_tweets[:]
        increment = 0

        while len(found_tweets) >= increment:
            print('scrolling down to load more tweets')
            driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
            sleep(delay)
            found_tweets = driver.find_elements_by_css_selector(tweet_selector)
            all_tweets += found_tweets[:]

            print('{} tweets found, {} collected so far'.format(len(found_tweets), len(all_tweets)))
            increment += 10

        for tweet in all_tweets:
            try:
                id = tweet.get_attribute('href').split('/')[-1]
                ids.append(id)
            except StaleElementReferenceException as e:
                print('lost element reference', tweet)

        print(ids)

    except NoSuchElementException:
        print('no tweets on this day')
    start = increment_day(start, 1)

finalids = [tweetid for tweetid in ids if tweetid.isdigit()]

New writetofile:

try:
    with open(twitter_ids_filename) as f:
        all_ids = finalids + json.load(f)
        data_to_write = list(set(all_ids))
        print('tweets found on this scrape: ', len(finalids))
        print('total tweet count: ', len(data_to_write))
except FileNotFoundError:
    with open(twitter_ids_filename, 'w') as f:
        all_ids = finalids[-]
        data_to_write = list(set(all_ids))
        print('tweets found on this scrape: ', len(finalids))
        print('total tweet count: ', len(data_to_write))

with open(twitter_ids_filename, 'w') as outfile:
    json.dump(data_to_write, outfile)

Hope that works for you too!

ghost commented 4 years ago

@jaackland Thanks for sharing a solution. However, when I run this code, it doesn't get out of the "while len(found_tweets) >= increment:" loop. The problem is coming from the "all_tweets" variable. There is nothing added to that variable and so it never gets out of that loop. Any alternative solution?
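
One way to guarantee that loop terminates, as a sketch (using the driver, delay, and tweet_selector already defined above): exit as soon as a scroll stops producing new tweets, instead of racing a counter.

last_count = -1
while len(found_tweets) != last_count:    # stop once a scroll adds nothing new
    last_count = len(found_tweets)
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    sleep(delay)
    found_tweets = driver.find_elements_by_css_selector(tweet_selector)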

jaackland commented 4 years ago

@AhsanCode Sorry, I should have made it clearer that that isn't a full script. Are you substituting it into the original scrape.py?

ghost commented 4 years ago

@jaackland No worries, I am aware that this isn't the full script. The original version was also working for me until now, and I made the same adjustments as you did. It's the tweet_selector that is giving me problems at the moment.

ghost commented 4 years ago

@jaackland My mistake, I had an indentation problem. The code is running fine now. There are still a couple of issues: 1) I tried to retrieve all tweets since 1st Jan 2020 and noticed that after 20 days the code doesn't retrieve any new IDs anymore, so I had to rerun it multiple times. 2) Once all IDs are retrieved, there is still a significant number of tweets unaccounted for.

jaackland commented 4 years ago

@AhsanCode Yes unfortunately I think Twitter have managed to rate-limit Selenium now (the original post implies this wasn't always the case). If you increase the delay variable it will scrape more tweets (but take longer, obviously). I went up to 5 and got all the tweets I needed, but you might be able to get away with less than that.
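
For reference, that is the delay variable used with sleep(delay) throughout the loop; the only change is the value (5 worked here, smaller may suffice for smaller accounts):

delay = 5  # seconds to wait after each page load / scroll before reading the DOM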

Glad it was just an indent problem because as far as I can tell that tweet_selector is universal (if a bit sloppy).

shenyizy commented 4 years ago

@jaackland Thanks so much for sharing the code. However, when I substitute your code into the original, I get a syntax error in the New writetofile part, as shown below.

all_ids = finalids[-]
                    ^

SyntaxError: invalid syntax

I am also pretty new to Python, so sorry for the stupid question.

rougetimelord commented 4 years ago

I found a selector which seems to only select the time-posted link, which links to the full tweet's page:

tweet_selector = 'article > div > div > div:nth-child(2) > div > div:nth-child(1) > div > div > div > a'

abotmaker commented 4 years ago

I found a selector which seems to only select the time-posted link, which links to the full tweet's page:

tweet_selector = 'article > div > div > div:nth-child(2) > div > div:nth-child(1) > div > div > div > a'

Hi, could you please tell me how you got this CSS selector for tweet_selector? Thanks

lebanj12 commented 4 years ago

Twitter changed its CSS styling, so in the current code you need to change id_selector and tweet_selector to:

id_selector = "div > div > :nth-child(2) > :nth-child(2) > :nth-child(1) > div > div > :nth-child(1) > a"
tweet_selector = 'article'
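
A sketch of how those two selectors slot into the existing loop (assuming each matched href is a .../status/<id> permalink, so the last path segment is the id):

found_tweets = driver.find_elements_by_css_selector(tweet_selector)   # one <article> per tweet
for tweet in found_tweets:
    try:
        href = tweet.find_element_by_css_selector(id_selector).get_attribute('href')
        ids.append(href.split('/')[-1])
    except (NoSuchElementException, StaleElementReferenceException):
        pass  # promoted tweets may lack the link, or the DOM re-rendered mid-read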

rougetimelord commented 4 years ago

id_selector = "div > div > :nth-child(2) > :nth-child(2) > :nth-child(1) > div > div > :nth-child(1) > a"
tweet_selector = 'article'

I would combine the article part with the rest of the selector, for code neatness. Your selector seems to grab a few more tweets for whatever reason.

Hi, could you please tell me how you got this CSS selector for tweet_selector? Thanks

I used Chrome DevTools to generate a selector and stripped out class names, etc.
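
A hypothetical illustration of that process (these example selectors are made up): DevTools' right-click > Copy > Copy selector emits generated class names like css-1dbjc4n, which change between Twitter builds, so stripping them leaves only the stable structure.

raw_selector = 'article > div > div.css-1dbjc4n.r-18u37iz > div:nth-child(2) > a'   # as copied
stripped_selector = 'article > div > div > div:nth-child(2) > a'                    # classes removed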

jakobberndt commented 4 years ago

I've tried the changes suggested to id_selector and tweet_selector; however, I'm not getting the IDs with this.

I've changed the line collecting the id (line 65) to this:

id = tweet.find_element_by_css_selector(id_selector).get_attribute('href').split('/')[-3]

This gives me some ids, but nowhere near the number of tweets I'm finding. Any suggestions on what the problem might be?

namuan commented 4 years ago

I've started writing some scripts for a project I'm working on. https://github.com/namuan/twitter-utils

Currently tweets_between.py generates a text file, but I'll see if I can generate JSON so that the get_metadata.py script can be used without any changes.
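
In the meantime, a small bridge sketch (the text file name is an assumption): turn a one-id-per-line text file into the all_ids.json list that get_metadata.py reads.

import json

with open('tweet_ids.txt') as f:               # hypothetical tweets_between.py output
    ids = [line.strip() for line in f if line.strip()]

with open('all_ids.json', 'w') as out:         # same format the scraper writes above
    json.dump(list(set(ids)), out)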

rougetimelord commented 4 years ago

I'm back with a new selector! article > div > div > div > div > div > div > div.r-1d09ksm > a. The class name on the last div may change depending on platform; I've only tested it on Chrome 81. Removing it will collect some extra links to profiles, but those should be easy to filter out (see the sketch below).

I have also run into a new Twitter search page which will not work with this selector. Simple fix: just restart the script; I think they're A/B testing it.
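
The filtering mentioned above, as a sketch with the platform-dependent class dropped (selector and filter are my guesses, not tested on every layout):

loose_selector = 'article > div > div > div > div > div > div > div > a'   # class removed
links = driver.find_elements_by_css_selector(loose_selector)
ids += [
    el.get_attribute('href').split('/')[-1]
    for el in links
    if '/status/' in (el.get_attribute('href') or '')   # drop the profile links
]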

abiaus commented 4 years ago

@rougetimelord that selector worked... kinda. It found the tweets, but I don't know why it didn't save them.

abiaus commented 4 years ago

I was able to get a list of ids... the thing is that now I can't transform it into a JSON file. I always get the error "unhashable type: 'list'".

I tried transforming it into a dict or a tuple... but that didn't work.

My try code looks like this now:

    try:
        found_tweets = driver.find_elements_by_css_selector(tweet_selector)
        increment = 10

        while len(found_tweets) >= increment:
            print('scrolling down to load more tweets')
            driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
            sleep(delay)
            found_tweets = driver.find_elements_by_css_selector(tweet_selector)
            increment += 10

        print('{} tweets found, {} total'.format(len(found_tweets), len(ids)))

        for tweet in found_tweets:
            try:
                id_user = tweet.find_element_by_css_selector(id_selector).get_attribute('href')
                id = [id_user.split('/')[-1] for x in id_user if id_user.split('/')[-3] == user]
                ids.append(id)
                print(id_user)
            except StaleElementReferenceException as e:
                print('lost element reference', tweet)

These are my selectors:

id_selector = 'article > div > div.css-1dbjc4n.r-18u37iz > div.css-1dbjc4n.r-1iusvr4.r-16y2uox.r-1777fci.r-1mi0q7o ' \
                 '> div:nth-child(1) > div > div > div.css-1dbjc4n.r-1d09ksm.r-18u37iz.r-1wbh5a2 > a'

tweet_selector = 'div > div > div > main > div > div > div > div > div > div > div > div > section > div > div > div > div > div > div > div > div > article'
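
The likely cause of that "unhashable type: 'list'" error: the comprehension on the id = ... line builds a list for every tweet (looping over the characters of id_user), so ids becomes a list of lists, and the later list(set(all_ids)) can't hash them. A minimal sketch of the intended check, assuming the same id_user, user, and ids variables:

if id_user.split('/')[-3] == user:        # keep only tweets from the scraped user
    ids.append(id_user.split('/')[-1])    # append the id string itself, not a list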

abiaus commented 4 years ago

I finally got it to work! But I hit the Twitter limit every time... I started at 2 seconds of sleep and did about 4 months. Then I changed it to 4 and did a couple more months... it's going to be a loooooong journey.

I will try to upload my version.

camilogiga commented 4 years ago

I was able to get a list of ids... the thing is that now I can't transform it into a JSON file. I always get the error "unhashable type: 'list'".

Hi! Were you able to fix this error? I was trying what you recommended and it also shows the 'unhashable type: list' error.

I'd appreciate any help you could give me with this. I'm trying to retrieve tweets of specific accounts from November and December 2019.

abiaus commented 4 years ago

I submitted a Pull Request with my working version

aaronamerica commented 4 years ago

Hi, I used the updated .py file and it still doesn't work. Is there anything else I have to change?

Joseph-D-Bradshaw commented 3 years ago

@rougetimelord Just wanted to let you know that your css selector can be condensed into something way smaller: article div.r-1d09ksm > a

Joseph-D-Bradshaw commented 3 years ago

I have fixed this in my own version. As I see PRs are not being sorted out, I'm not sure if I should make one here. If others are interested in me opening a PR, just let me know; there are quite a few improvements, from making sure the csv writer uses utf-8 encoding to simply using Selenium's ability to execute JavaScript to fetch the id information straight from a tweet.
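
Not the actual PR, but a sketch of the JavaScript route described above: let the browser collect and parse every status link in one execute_script call, instead of round-tripping per element.

ids = driver.execute_script("""
    return Array.from(document.querySelectorAll('a[href*="/status/"]'))
                .map(a => (a.href.match(/\\/status\\/(\\d+)/) || [])[1])  // numeric id
                .filter(Boolean);                                         // drop misses
""")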

orhanozbek commented 3 years ago

I submitted a Pull Request with my working version

bro, I tried your version and I'm getting results like this:

0 tweets found, 0 total
https://twitter.com/search?f=tweets&vertical=default&q=from%3Aelonmusk%20since%3A2014-04-07%20until%3A2014-04-08include%3Aretweets&src=typd
2014-04-07

wendeljuliao commented 3 years ago

Why don't I get all tweets from a user? For example, for @lulaoficial I only scrape 8k out of 22k.

nandezgarcia commented 2 years ago

The Twitter webpage was modified years after this code was written. The modifications needed are as follows: id_selector and tweet_selector are not required now; you only need to change the try-except code inside the for loop:

try:
    found_tweets = driver.find_elements_by_css_selector('a[href*="/'+user+'/status/"]')
    for i in found_tweets:
        print("Founded: ", i.get_attribute('href'))

    increment = 10

    while len(found_tweets) >= increment:
        print('scrolling down to load more tweets')
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        sleep(delay)
        found_tweets = driver.find_elements_by_css_selector('a[href*="/'+user+'/status/"]')
        increment += 10

    print('{} tweets found, {} total'.format(len(found_tweets), len(ids)))

    for tweet in found_tweets:
        try:
            id = tweet.get_attribute('href').split('/')[-1]
            ids.append(id)
        except StaleElementReferenceException as e:
            print('lost element reference', tweet)

except NoSuchElementException:
    print('no tweets on this day')

start = increment_day(start, 1)
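
One possible refinement (my assumption, not from the comment above): matched hrefs can carry suffixes like .../status/123/photo/1, where split('/')[-1] is not the tweet id, so taking the segment right after 'status' is safer.

for tweet in found_tweets:
    parts = tweet.get_attribute('href').split('/')
    if 'status' in parts:                             # .../<user>/status/<id>[/...]
        ids.append(parts[parts.index('status') + 1])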