deedy5 / duckduckgo_search

Search for words, documents, images, videos, news, maps and text translation using the DuckDuckGo.com search engine. Downloading files and images to a local hard drive.
MIT License

403 forbidden error when hosting script #101

Closed by Nevrai 1 year ago

Nevrai commented 1 year ago

Describe the bug

I love duckduckgo-search, but I’ve been having issues with fetching images when hosting my script on Cybrancee. My script uses Python 3.10.12.

Whilst using the duckduckgo-search library to fetch images from DuckDuckGo, I get the error HTTPError 403 Client Error: Forbidden for url. This does not occur when running the bot locally, only when it is hosted on Cybrancee, which uses a Pterodactyl panel. Scraping web pages or search engines works fine, and fetching text search results with duckduckgo-search works fine, too. Fetching images is the only thing that fails.

I also tried proxies, headers, and a user agent. However, I still have the same problem.

For some odd reason, I’m able to scrape DuckDuckGo search results with duckduckgo-search just fine on my host:

ddg_link = DDGS(headers=new_headers, proxies=proxies, timeout=15).text(q)

However, when scraping image results instead, it does not work. Code:

rand_ua = get_ua()

logging.debug(f'[ddg_img.py] User agent: {rand_ua}')

headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
        "Dnt": "1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": rand_ua,
        }

ddgs = DDGS(headers=headers, proxies=proxies, timeout=15)

async def get_ddg_image(query, first=False):
    # Remove all punctuation marks except hyphens from query
    query = re.sub(r'[^\w\s-]', '', query)
    logging.debug(f'[get_ddg_image()] query: {query}')

    keywords = query
    ddgs_images_gen = ddgs.images(
        keywords,
        region='wt-wt',
        safesearch='On'
    )
    # Take up to the first 10 results from the generator
    images = list(itertools.islice(ddgs_images_gen, 10))

    if first:
        # Get the first image
        image = images[0] if images else None
    else:
        # Get random image
        image = random.choice(images) if images else None

May be related to #100; however, unlike that issue, it does not happen periodically for me. It happens with every attempt – but only when hosting, not when running the script locally.

I was using version 3.2.0 of duckduckgo-search, then updated to the latest version, 3.8.3. However, the issue still occurs in the same way it did before.

I have seen #84 and #98. However, you (@deedy5) said that updating might fix it, but it did not. You also said that it’s not a library problem and that a proxy or increasing the time between requests might fix the issue, but in my case, it occurs every time, even if I haven’t made any recent requests, and I have tried both with and without proxies.

The strange part is that it works perfectly locally but not when hosted on Cybrancee (I have not tried other hosts), and that using the same library to scrape DDG search results works perfectly with the same headers, UA, and proxies, while fetching images does not. I’m not sure what is causing this, but if you could offer some assistance in fixing this issue, it would be much appreciated, as I am quite lost!
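Editorial sketch, not part of the original report: until the cause of the 403 is found, the failure can at least be kept from propagating into on_message as an unhandled error. The helper below is hypothetical (pick_image is not a duckduckgo-search function). It assumes, as the traceback under Errors suggests, that the HTTP request is made lazily while the generator is consumed and that the exception raised is requests.exceptions.HTTPError.

```python
import itertools
import logging
import random

try:
    from requests.exceptions import HTTPError
except ImportError:
    # Assumption: fall back to a broad catch so the sketch still runs
    # in an environment where requests is not installed.
    HTTPError = Exception


def pick_image(results, first=False, limit=10):
    """Return one item from an iterable of image results, or None on failure.

    `results` stands in for the generator returned by ddgs.images(...).
    """
    try:
        # The try block wraps the consumption of the generator, because
        # that is where the traceback shows the 403 being raised.
        images = list(itertools.islice(results, limit))
    except HTTPError as exc:
        logging.warning("Image search failed: %s", exc)
        return None

    if not images:
        return None
    # Either the first result or a random one of the first `limit` results
    return images[0] if first else random.choice(images)
```

Inside get_ddg_image this would replace the islice and random.choice lines, for example `image = pick_image(ddgs_images_gen, first=first)`, with the caller checking for None before unpacking a URL and title.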

Errors

WARNING:duckduckgo_search.duckduckgo_search:_get_url() https://duckduckgo.com/i.js HTTPError 403 Client Error: Forbidden for url: https://duckduckgo.com/i.js?l=wt-wt&o=json&s=0&q=Potato+picture&vqd=4-7287769708002951745556569444305599608&f=%2C%2C%2C%2C%2C&p=1
ERROR:__main__:Unhandled error in on_message
Traceback (most recent call last):
  File "/home/container/.local/lib/python3.10/site-packages/discord/client.py", line 441, in _run_event
    await coro(*args, **kwargs)
  File "/home/container/script.py", line 6134, in on_message
    image_query, image_url, image_title = await fetch_image(msg, ai_response, server_id, channel_id, should_fetch, fetch_image_type)
  File "/home/container/script.py", line 2799, in fetch_image
    image_url, image_title = await get_ddg_image(image_query)
  File "/home/container/ddg_img.py", line 73, in get_ddg_image
    images = list(itertools.islice(ddgs_images_gen, 10))
  File "/home/container/.local/lib/python3.10/site-packages/duckduckgo_search/duckduckgo_search.py", line 230, in images
    resp = self._get_url("GET", "https://duckduckgo.com/i.js", params=payload)
  File "/home/container/.local/lib/python3.10/site-packages/duckduckgo_search/duckduckgo_search.py", line 69, in _get_url
    raise ex
  File "/home/container/.local/lib/python3.10/site-packages/duckduckgo_search/duckduckgo_search.py", line 64, in _get_url
    resp.raise_for_status()
  File "/home/container/.local/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://duckduckgo.com/i.js?l=wt-wt&o=json&s=0&q=Potato+picture&vqd=4-7287769708002951745556569444305599608&f=%2C%2C%2C%2C%2C&p=1
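Editorial sketch, not part of the original report: decoding the rejected URL with the standard library makes it easier to see what was actually sent. The function name describe_request is made up for illustration. One cautious reading of the output is that the request already carried a vqd value when it was refused at /i.js, which would point at the image endpoint itself rather than an earlier step; the thread does not confirm this.

```python
from urllib.parse import parse_qs, urlsplit

# URL copied from the traceback above
FAILED_URL = (
    "https://duckduckgo.com/i.js?l=wt-wt&o=json&s=0&q=Potato+picture"
    "&vqd=4-7287769708002951745556569444305599608&f=%2C%2C%2C%2C%2C&p=1"
)


def describe_request(url):
    """Split a URL into its path and a dict of decoded query parameters."""
    parts = urlsplit(url)
    # parse_qs returns a list per key; each key appears once here,
    # so keep only the first value.
    params = {
        name: values[0]
        for name, values in parse_qs(parts.query, keep_blank_values=True).items()
    }
    return parts.path, params


path, params = describe_request(FAILED_URL)
print(path)
for name, value in params.items():
    print(f"{name} = {value!r}")
```

The l and q values match the region='wt-wt' argument and the cleaned query passed to ddgs.images() in the snippet above.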

Information

Nevrai commented 1 year ago

I solved the issue myself by making sure to use the latest version of duckduckgo-search, 3.8.3, and making sure I was using version 23.1.0 of aiofiles. I also made sure I was using the latest versions of click, httpx, and lxml. Thankfully, that solved it!
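Editorial sketch, not part of the original comment: since the fix came down to package versions, it can help to print what is actually installed on the host and compare it with the local machine. The package list below is taken from the comment above; the function name installed_versions is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages named in the comment above
PACKAGES = ("duckduckgo-search", "aiofiles", "click", "httpx", "lxml")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this both locally and on the host gives two lists that can be compared line by line.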

max3ndeavour commented 11 months ago

Hi, I got the same error with version 3.8.5. Is this still working, or are the restrictions too strict? I'm only trying to fetch one image, so query frequency isn't the issue here.

deedy5 commented 11 months ago

Use the latest version

max3ndeavour commented 11 months ago

I'm on Kaggle, and 3.8.5 seems to be the newest version that can be installed there. Too bad. Thanks for the feedback. I'm writing this in case someone has a workaround.