deedy5 / duckduckgo_search

Search for words, documents, images, videos, news, maps and text translation using the DuckDuckGo.com search engine. Downloading files and images to a local hard drive.
MIT License

httpx.HTTPError on 3.9.5 #134

Closed GianfrancoCorrea closed 7 months ago

GianfrancoCorrea commented 7 months ago

Yesterday someone reported this bug but then deleted the issue, so I don't know whether a solution exists.

code:

import asyncio
from duckduckgo_search import AsyncDDGS

async def async_search(query):
    try:
        async with AsyncDDGS() as ddgs:
            results = [r async for r in ddgs.text(query, max_results=5)]
            return results
    except Exception as e:
        print(e)
        return []

async def search_queries(queries):
    # One task per query; all searches run concurrently.
    tasks = []
    for query in queries:
        tasks.append(asyncio.create_task(async_search(query)))
    results = await asyncio.gather(*tasks)
    return results

Debug log

2023-11-15 09:52:35.202 Uncaught app exception
Traceback (most recent call last):
  File "/Users/gianjsx/Documents/fuentes/.venv/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 534, in _run_script
    exec(code, module.__dict__)
  File "/Users/gianjsx/Documents/fuentes/app.py", line 49, in <module>
    asyncio.run(main())
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/Users/gianjsx/Documents/fuentes/app.py", line 46, in main
    for result in results:
  File "/Users/gianjsx/Documents/fuentes/.venv/lib/python3.11/site-packages/duckduckgo_search/duckduckgo_search.py", line 96, in text
    for i, result in enumerate(results, start=1):
  File "/Users/gianjsx/Documents/fuentes/.venv/lib/python3.11/site-packages/duckduckgo_search/duckduckgo_search.py", line 148, in _text_api
    resp = self._get_url("GET", "https://links.duckduckgo.com/d.js", params=payload)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gianjsx/Documents/fuentes/.venv/lib/python3.11/site-packages/duckduckgo_search/duckduckgo_search.py", line 55, in _get_url
    raise ex
  File "/Users/gianjsx/Documents/fuentes/.venv/lib/python3.11/site-packages/duckduckgo_search/duckduckgo_search.py", line 48, in _get_url
    raise httpx._exceptions.HTTPError("")
httpx.HTTPError


deedy5 commented 7 months ago

Try using a proxy. The package works, all tests pass. Maybe your ip is blocked by the site.

fedjabosnic commented 7 months ago

Same issue here except I'm not using the async variant - and it seems very intermittent.

Admittedly not a large sample size but it mainly occurs when I'm handling multiple inbound http requests - some resource sharing issue maybe but what do I know :)

@deedy5 if it is actually to do with being blocked, could we have a nice way to handle this? I might be a bit of a noob but not exactly sure how to catch these exceptions...

Awesome work btw :)

deedy5 commented 7 months ago

If you send multiple requests in parallel, the site will block your ip for a while.

The solution is simple - either send requests sequentially in one stream, or use a proxy so that your ip is different for each request. https://github.com/deedy5/duckduckgo_search#using-proxy
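As a minimal sketch of the "sequential" option, the loop below awaits each search before starting the next one, in contrast to the asyncio.gather call in the original report. The names search_sequentially and search_one, and the optional pause parameter, are illustrative assumptions and not part of the library.

```python
import asyncio


async def search_sequentially(search_one, queries, pause=0.0):
    """Await search_one(query) for each query in turn, never in parallel."""
    results = []
    for index, query in enumerate(queries):
        if index and pause:
            # Optional gap between consecutive requests (value is an assumption).
            await asyncio.sleep(pause)
        results.append(await search_one(query))
    return results
```

With the code from the report, search_one could be the async_search coroutine function, for example asyncio.run(search_sequentially(async_search, queries)).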

fedjabosnic commented 7 months ago

Okay, is there a way to catch and handle these exceptions? At the moment it's happening behind the scenes.
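One possible way to catch the error, sketched under assumptions: in the traceback above, the exception surfaces while the results are being iterated (the "for result in results:" line), not when text() is called, so the try block has to wrap the iteration itself. The helper name safe_search and the stand-in exception class below are hypothetical; with the library one would pass the real exception type seen in the traceback.

```python
import logging

logger = logging.getLogger(__name__)


class SearchBlockedError(Exception):
    """Hypothetical stand-in for the library's error (httpx.HTTPError in the traceback)."""


def safe_search(search_fn, query, catch=(SearchBlockedError,)):
    """Run search_fn(query); return [] instead of raising when a listed error occurs."""
    try:
        # list() forces a lazy generator to run here, inside the try block.
        return list(search_fn(query))
    except catch as exc:
        logger.warning("search for %r failed: %r", query, exc)
        return []
```

With the library this might look like safe_search(lambda q: ddgs.text(q, max_results=5), query, catch=(httpx.HTTPError,)), assuming ddgs is a DDGS() instance and httpx is imported.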

dmzio commented 7 months ago

Try using a proxy. The package works, all tests pass. Maybe your ip is blocked by the site.

Today I started to see the same errors with sequential queries using DDGS() (not async). After 2-3 requests at intervals of <10s, the API starts to respond with 202, and this causes the HTTPError.
The logs don't mention why _is_500_in_url(str(resp.url)) or resp.status_code == 202 was added in the first place, and I can't find what 202 means at DDG. Is there a more graceful way to handle it than raising an error after 2 quick retries?

KharchenkoDmitriy commented 7 months ago

I have the same issue, which happens occasionally without frequent requests (1 request per 10-20 min). Edit: I was wrong, it sends 4-5 requests in a row once every 10-20 min. I also tested over the CLI: DDG responds with 202 on the 3rd or 4th request in a row.

KharchenkoDmitriy commented 7 months ago

I've built a workaround for the limit. It looks like the limit is ~2 requests per 10 sec.

import asyncio
from duckduckgo_search import AsyncDDGS

class AsyncRateLimitedActionWrapper:
    def __init__(self, rate_limit: int, time_period: float):
        """
        :param rate_limit: The maximum number of requests allowed per time period.
        :param time_period: Time period in seconds over which the rate limit applies.
        """
        self.rate_limit = rate_limit
        self.time_period = time_period
        self.slots = asyncio.Queue(maxsize=rate_limit)
        self.history = asyncio.Queue()
        self.generator = None

    async def _generate_slots(self):
        """
        Generate slots for the rate limit.
        """
        while True:
            if self.slots.empty() or self.history.empty():
                # no calls in the time period window
                await asyncio.sleep(self.time_period)
            else:
                # first call in the time period window
                first_time_call = await self.history.get()
                self.history.task_done()
                current_time = asyncio.get_running_loop().time()
                await asyncio.sleep(self.time_period - current_time + first_time_call)
                await self.slots.get()  # free the slot once the call has left the time window
                self.slots.task_done()

    async def _consume_slot(self):
        """
        Consume a slot.
        """
        if self.generator is None:
            # Keep a reference to the task so it is not garbage-collected.
            self.generator = asyncio.create_task(self._generate_slots())

        await self.slots.put(1)
        await self.history.put(asyncio.get_running_loop().time())

    async def perform(self, action: callable, *args, **kwargs):
        """
        Asynchronously make an action, respecting the rate limit.
        """
        await self._consume_slot()

        return await action(*args, **kwargs)

limit_wrapper = AsyncRateLimitedActionWrapper(2, 10)

async def _search(search_query):
    async with AsyncDDGS() as ddgs:
        results = [r async for r in ddgs.text(search_query)]
        return results

async def search(search_queries) -> list[dict]:
    return await limit_wrapper.perform(_search, search_queries)
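For the synchronous DDGS() case mentioned earlier in the thread, the same idea can be sketched without asyncio. This is a single-threaded sliding-window limiter, not a thread-safe one; the class name and the injectable clock and sleep parameters are illustrative assumptions, and the figure of 2 calls per 10 seconds is only the estimate given above, not a documented limit.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls in any `period`-second window."""

    def __init__(self, limit, period, clock=time.monotonic, sleep=time.sleep):
        self.limit = limit
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # start times of recent calls, oldest first

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Forget calls that have already left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.limit:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            self.calls.popleft()
            now = self.clock()
        self.calls.append(now)
```

A caller would create limiter = SlidingWindowLimiter(2, 10) once and call limiter.wait() immediately before each search.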

deedy5 commented 7 months ago

Thank you all for finding the error. The site sometimes makes changes. Fixed in version v3.9.6.

dmzio commented 7 months ago

Thanks, it mostly works. It started to fail after ~10 min (at a rate of 3 useful reqs per 10s):

11:11:00: HTTP Request: POST https://duckduckgo.com "HTTP/2 200 OK"
11:11:00: HTTP Request: GET https://links.duckduckgo.com/d.js?q=Officny&kl=wt-wt&l=wt-wt&bing_market=wt-WT&s=0&df=y&vqd=4-46289096192774020376654725798572440979&o=json&sp=0&ex=-1 "HTTP/2 202 Accepted"
11:11:01: HTTP Request: GET https://links.duckduckgo.com/d.js?q=Officny&kl=wt-wt&l=wt-wt&bing_market=wt-WT&s=50&df=y&vqd=4-46289096192774020376654725798572440979&o=json&sp=0&ex=-1 "HTTP/2 202 Accepted"
11:11:01: HTTP Request: GET https://links.duckduckgo.com/d.js?q=Officny&kl=wt-wt&l=wt-wt&bing_market=wt-WT&s=100&df=y&vqd=4-46289096192774020376654725798572440979&o=json&sp=0&ex=-1 "HTTP/2 202 Accepted"
11:11:01: HTTP Request: GET https://links.duckduckgo.com/d.js?q=Officny&kl=wt-wt&l=wt-wt&bing_market=wt-WT&s=150&df=y&vqd=4-46289096192774020376654725798572440979&o=json&sp=0&ex=-1 "HTTP/2 202 Accepted"
11:11:01: HTTP Request: GET https://links.duckduckgo.com/d.js?q=Officny&kl=wt-wt&l=wt-wt&bing_market=wt-WT&s=200&df=y&vqd=4-46289096192774020376654725798572440979&o=json&sp=0&ex=-1 "HTTP/2 202 Accepted"
11:11:01: HTTP Request: GET https://links.duckduckgo.com/d.js?q=Officny&kl=wt-wt&l=wt-wt&bing_market=wt-WT&s=250&df=y&vqd=4-46289096192774020376654725798572440979&o=json&sp=0&ex=-1 "HTTP/2 202 Accepted"
11:11:01: HTTP Request: GET https://links.duckduckgo.com/d.js?q=Officny&kl=wt-wt&l=wt-wt&bing_market=wt-WT&s=300&df=y&vqd=4-46289096192774020376654725798572440979&o=json&sp=0&ex=-1 "HTTP/2 202 Accepted"
11:11:01: HTTP Request: GET https://links.duckduckgo.com/d.js?q=Officny&kl=wt-wt&l=wt-wt&bing_market=wt-WT&s=350&df=y&vqd=4-46289096192774020376654725798572440979&o=json&sp=0&ex=-1 "HTTP/2 202 Accepted"
11:11:01: HTTP Request: GET https://links.duckduckgo.com/d.js?q=Officny&kl=wt-wt&l=wt-wt&bing_market=wt-WT&s=400&df=y&vqd=4-46289096192774020376654725798572440979&o=json&sp=0&ex=-1 "HTTP/2 202 Accepted"
11:11:01: HTTP Request: GET https://links.duckduckgo.com/d.js?q=Officny&kl=wt-wt&l=wt-wt&bing_market=wt-WT&s=450&df=y&vqd=4-46289096192774020376654725798572440979&o=json&sp=0&ex=-1 "HTTP/2 202 Accepted"
11:11:01: HTTP Request: GET https://links.duckduckgo.com/d.js?q=Officny&kl=wt-wt&l=wt-wt&bing_market=wt-WT&s=500&df=y&vqd=4-46289096192774020376654725798572440979&o=json&sp=0&ex=-1 "HTTP/2 202 Accepted"

What does that s mean? Perhaps we need a smarter backoff instead of just 10 immediate repeats?
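A "smarter backoff" could look roughly like the sketch below: exponential delays with jitter, wrapped around the whole search call from the outside. The Blocked exception, the function name, and all delay values are assumptions for illustration; the thread does not say how long a block lasts, so no particular schedule is known to be sufficient.

```python
import random
import time


class Blocked(Exception):
    """Hypothetical error raised by the caller's fetch() when a block (e.g. 202) is seen."""


def retry_with_backoff(fetch, retries=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call fetch(); on Blocked, wait base_delay * 2**attempt (capped, jittered) and retry."""
    for attempt in range(retries):
        try:
            return fetch()
        except Blocked:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps parallel clients from retrying at the same instant.
            sleep(delay * random.uniform(0.5, 1.0))
```

The sleep parameter is injectable only so the schedule can be checked without real waiting.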

deedy5 commented 7 months ago

Try using a proxy. The 202 response code is their way of blocking an IP at the moment. s is pagination.
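The pagination reading can be checked against the log itself. The sketch below parses the s parameter from three of the logged URLs, shortened here to the parameters that matter, and shows it advancing in equal steps. Treating s as a result offset is an interpretation, not something the thread states.

```python
from urllib.parse import parse_qs, urlparse

# Shortened from the log above; only the relevant query parameters are kept.
logged_urls = [
    "https://links.duckduckgo.com/d.js?q=Officny&s=0&o=json",
    "https://links.duckduckgo.com/d.js?q=Officny&s=50&o=json",
    "https://links.duckduckgo.com/d.js?q=Officny&s=100&o=json",
]

# Extract the integer value of "s" from each URL's query string.
offsets = [int(parse_qs(urlparse(url).query)["s"][0]) for url in logged_urls]
# Difference between consecutive offsets.
steps = [b - a for a, b in zip(offsets, offsets[1:])]
print(offsets, steps)
```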

deedy5 commented 7 months ago

I tested v3.9.6, text() function with 100 random keywords. There are no 202 responses at all. Working in a single thread does not cause errors. But if you work in multithreaded mode, errors will occur. In this case you should use a proxy.
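For code that already runs in several threads, such as a server handling multiple inbound requests, one way to approximate single-threaded behaviour without a proxy is to funnel all searches through one lock. This is a sketch under assumptions: the helper name is hypothetical, the lock only serializes calls within one process, and nothing in the thread guarantees that serialized calls are never blocked.

```python
import threading

# One process-wide lock shared by every thread that performs a search.
_search_lock = threading.Lock()


def search_one_at_a_time(search_fn, query):
    """Run search_fn(query) while holding the lock, so searches never overlap."""
    with _search_lock:
        # list() consumes a lazy generator before the lock is released.
        return list(search_fn(query))
```

Each worker thread would call search_one_at_a_time(ddgs.text, query) instead of calling ddgs.text directly, assuming ddgs is a DDGS() instance.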