GeneralMills / pytrends

Pseudo API for Google Trends

Is PyTrends working for ANYONE right now? #594

Open · nicktba opened 11 months ago

nicktba commented 11 months ago

Is anyone having success with pytrends currently?

If so, comment what parameters you are using (timeframe, region, etc)

If not, comment what errors you are receiving.

bluefinch83 commented 11 months ago

No, it's not. I keep getting 429 errors, even when only calling "interest_over_time()" a few times with 10 seconds of spacing. Even when I do get a valid response, two-thirds of the dataframe is empty. I really hope they fix this.
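
For reference, here is a minimal sketch of the call pattern that fails (retries and backoff_factor are real TrendReq parameters, though they don't seem to help against this ban):

import time
from pytrends.request import TrendReq

# retries/backoff_factor are forwarded to urllib3's retry machinery
pytrends = TrendReq(hl="en-US", tz=360, retries=3, backoff_factor=1.0)
pytrends.build_payload(["bitcoin"], timeframe="now 7-d", geo="US")

for attempt in range(3):
    df = pytrends.interest_over_time()  # 429s surface here as an exception
    print(df.tail())                    # and "successes" come back mostly empty
    time.sleep(10)                      # 10 seconds of spacing between calls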

Helldez commented 11 months ago

No, it's not working; the latest reports are all about this problem.

EonAndahalf commented 11 months ago

Right now I'm not even getting a 429, just an empty DataFrame.

nicktba commented 11 months ago

Has anyone tried reaching out to Google?

Not about the PyTrends library itself, but about the fact that the underlying API isn't working at all, including for the embedding widget.

If anyone wants to embed a Google Trends chart on their website or share it to socials, it's missing data or just doesn't work at all.

This API issue is why libraries such as PyTrends return 429 and 500 errors, empty dataframes, etc.

(Two screenshots attached, taken 2023-07-25 at 3:43:02 PM and 3:41:45 PM.)
Raidus commented 11 months ago

@nicktba I reached out to Google, and the next day it worked a bit better for a few hours, but I guess that was just a coincidence :-)

I have written a browser-based crawler (stable Chrome with installed extensions and a logged-in user) that mimics real user behaviour and intercepts the API requests to get the desired data, but even this approach got blocked very quickly.

Also, iframes are not working, as you mentioned.

nicktba commented 11 months ago

> @nicktba I reached out to Google, and the next day it worked a bit better for a few hours, but I guess that was just a coincidence :-)
>
> I have written a browser-based crawler (stable Chrome with installed extensions and a logged-in user) that mimics real user behaviour and intercepts the API requests to get the desired data, but even this approach got blocked very quickly.
>
> Also, iframes are not working, as you mentioned.

Would you mind sharing that code?

Raidus commented 11 months ago

@nicktba I am allowed to share the part of the code related to intercepting the requests. It's quick-and-dirty Node.js code (using Puppeteer) that was intended as a proof of concept, not production code :-)

const setupInterceptor = async function (page) {
    await page.setRequestInterception(true);

    page.on("request", async (request) => {
        // Resource types to block; empty here, but entries like "image",
        // "stylesheet", "font" or "media" can be added to lighten the crawl
        const ignore = [];

        // Abort any request whose resource type is in the ignore array
        if (ignore.includes(request.resourceType())) {
            return request.abort();
        } else {
            return request.continue();
        }
    });

    page.on("response", async (response) => {
        if (response.url().includes("api")) {
            const url = response.url();
            const status = response.status();

            if (status === 429) {
                console.log("BANNED URL", url);
            }

            // For related keywords and related topics
            if (url.includes("related")) {
                try {
                    const payload = await response.text();

                    if (payload.includes("link")) {
                        // Trends prefixes its JSON with ")]}'," as an anti-hijacking guard
                        const data = JSON.parse(payload.replace(")]}',", "")).default.rankedList[0].rankedKeyword;

                        // Related-keyword entries carry a "query" field, topic entries do not
                        const type = data[0].query ? "RELATED_KEYWORDS" : "RELATED_TOPICS";

                        if (type === "RELATED_KEYWORDS") {
                            const keywords = data.map((item) => ({ keyword: item.query, value: item.value }));

                            console.log(keywords);
                        }

                        if (type === "RELATED_TOPICS") {
                            console.log(url);
                            const topics = data.map((item) => ({
                                mid: item.topic.mid,
                                topic: item.topic.title,
                                type: item.topic.type,
                                value: item.value,
                            }));

                            console.log(topics);
                        }
                    }
                } catch (error) {
                    console.error("FAILED FOR RELATED", error.message);
                }
            }

            // For interest over time (the "multiline" widget)
            if (url.includes("multiline")) {
                try {
                    const payload = await response.text();

                    const data = JSON.parse(payload.replace(")]}',", "")).default.timelineData;
                    console.log(data); // each entry carries time, formattedTime and value
                } catch (error) {
                    console.error("FAILED FOR MULTILINE", error.message);
                }
            }
        }
    });
};
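
For context, a minimal harness to wire this up could look like the following (hypothetical glue code, not part of the original snippet):

const puppeteer = require("puppeteer");

(async () => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await setupInterceptor(page);

    // Loading an explore page fires the widget API calls the interceptor listens for
    await page.goto("https://trends.google.com/trends/explore?q=bitcoin&geo=US", { waitUntil: "networkidle2" });

    await browser.close();
})();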

2uanDM commented 11 months ago

I went with a Selenium script to crawl instead, but as @Raidus said, it gets blocked very fast if you use a single IP address, so my script needs to use proxies. Instead of intercepting API requests, I just mimic a normal user, clicking the download-CSV button and then processing that CSV to get my desired results.
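
For anyone trying the same thing, pointing Chrome at a proxy is a single flag; the host and port below are placeholders:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

proxy = "203.0.113.10:8080"  # placeholder; rotate real proxies between runs

chrome_options = Options()
chrome_options.add_argument(f"--proxy-server=http://{proxy}")
driver = webdriver.Chrome(options=chrome_options)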

artkochnev commented 11 months ago

> I went with a Selenium script to crawl instead, but as @Raidus said, it gets blocked very fast if you use a single IP address, so my script needs to use proxies. Instead of intercepting API requests, I just mimic a normal user, clicking the download-CSV button and then processing that CSV to get my desired results.

I did the same, but noticed frequent errors when retrieving data by topic, e.g. getting 0 for every value except the last observation, which is 100.
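
A cheap guard for that failure mode (a sketch, assuming the values sit in a pandas Series) is to reject any series that is all zeros except for a final 100:

import pandas as pd

def looks_degenerate(values: pd.Series) -> bool:
    # broken responses come back as 0 everywhere except the last point, which is 100
    return bool(len(values) > 1 and (values.iloc[:-1] == 0).all() and values.iloc[-1] == 100)

print(looks_degenerate(pd.Series([0, 0, 0, 0, 100])))  # True -> refetch instead of trusting it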

yokaja commented 7 months ago

@2uanDM Would you share your Selenium script?

2uanDM commented 7 months ago

> @2uanDM Would you share your Selenium script?

Okay @yokaja, I will make it public and hope everyone can contribute to fixing the problems that still exist.

Helldez commented 7 months ago

Here is my code:

To repeat, I am not a professional developer, but I wrote this code, which downloads the CSV directly from Trends and then prints the times and values. The problem is that it is very heavy to run (on PythonAnywhere it uses a lot of CPU). Help me develop it to make it more usable.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import urllib.parse
import os
import json
import time
from curl_cffi import requests as cffi_requests

MAX_RETRIES = 5

def trend_selenium(keywords):
    browser_versions = ["chrome99", "chrome100", "chrome101", "chrome104", "chrome107", "chrome110"]

    chrome_options = Options()
    chrome_options.add_argument("--headless=new")  # new headless mode supports file downloads
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--window-size=1920,1080")
    chrome_options.add_argument("--user-data-dir=./user_data")
    # Make the CSV land in the working directory, where it is checked for below
    chrome_options.add_experimental_option("prefs", {"download.default_directory": os.getcwd()})

    driver = webdriver.Chrome(options=chrome_options)

    encoded_keywords = urllib.parse.quote_plus(keywords)

    # Selenium only accepts cookies for the domain currently loaded,
    # so visit the Trends host once before injecting any
    driver.get("https://trends.google.com")

    retries = 0
    file_downloaded = False
    while retries < MAX_RETRIES and not file_downloaded:
        # Grab fresh Google cookies with curl_cffi, impersonating a rotating Chrome build
        response = cffi_requests.get("https://www.google.com", impersonate=browser_versions[retries % len(browser_versions)])
        for cookie in response.cookies.jar:
            try:
                driver.add_cookie({
                    'name': cookie.name,
                    'value': cookie.value,
                    'domain': cookie.domain
                })
            except Exception:
                pass  # skip cookies Selenium rejects for this domain

        trends_url = f'https://trends.google.com/trends/explore?date=now%207-d&geo=US&q={encoded_keywords}'
        print(trends_url)
        driver.get(trends_url)

        # CSS path of the export (download CSV) button on the interest-over-time widget
        excel_button_selector = "body > div.trends-wrapper > div:nth-child(2) > div > md-content > div > div > div:nth-child(1) > trends-widget > ng-include > widget > div > div > div > widget-actions > div > button.widget-actions-item.export > i"

        try:
            WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, excel_button_selector)))
            driver.find_element(By.CSS_SELECTOR, excel_button_selector).click()
            time.sleep(5)  # pause to let the download finish

            if os.path.exists('multiTimeline.csv'):
                file_downloaded = True
            else:
                print(f"File not downloaded. Attempt {retries + 1} of {MAX_RETRIES}...")
                retries += 1
                time.sleep(retries)  # wait a little longer after each failed attempt
                driver.refresh()

        except Exception as e:
            print(f"Error during download attempt: {str(e)}")
            retries += 1
            time.sleep(retries)  # wait a little longer after each failed attempt

    trend_data = {}
    if file_downloaded:
        try:
            # The first two lines of the export are a title and a blank line
            trend_df = pd.read_csv('multiTimeline.csv', skiprows=2)
            trend_df['Time'] = pd.to_datetime(trend_df['Time']).dt.strftime('%Y-%m-%d %H:%M:%S')
            data_column = [col for col in trend_df.columns if col != 'Time'][0]
            trend_data = dict(zip(trend_df['Time'], trend_df[data_column]))
            os.remove('multiTimeline.csv')
        except Exception as e:
            print(f"Error in reading or deleting the file 'multiTimeline.csv': {str(e)}")
    else:
        print("File not downloaded after the maximum number of attempts.")

    driver.quit()
    return json.dumps(trend_data)

keywords = "test"
trends_str = trend_selenium(keywords)
print(trends_str)
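
One idea for the CPU problem: block image loading through the same prefs dict the script already sets (a standard Chrome preference, though I have not profiled it on PythonAnywhere):

chrome_options.add_experimental_option("prefs", {
    "download.default_directory": os.getcwd(),              # where the CSV lands
    "profile.managed_default_content_settings.images": 2,   # 2 = block images
})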