aio-libs / aiodns

Simple DNS resolver for asyncio
https://pypi.python.org/pypi/aiodns
MIT License

ProactorEventLoop on Windows #36

Closed josalhor closed 6 years ago

josalhor commented 6 years ago

Hi,

I've had a few problems using asyncio and aiohttp in my script: with SelectorEventLoop it runs out of sockets when opening connections. I then tried ProactorEventLoop on Windows, which doesn't seem to have this limitation. However, when I try:

import asyncio
import aiohttp

async def getHeaders(url, session, sema):
    async with session:
        async with sema:
            try:
                async with session.head(url) as response:
                    try:
                        if "html" in response.headers["Content-Type"]:
                            return url, True
                        else:
                            return url, False
                    except:
                        return url, False
            except:
                return url, False

def removeUrlsWithoutHtml(setOfUrls, MAXitems):
    listOfUrls = list(setOfUrls)
    while(len(listOfUrls) != 0):
        blockurls = []
        print("URLS left to process: " + str(len(listOfUrls)))
        items = 0
        for num in range(0, len(listOfUrls)):
            if num < MAXitems:
                blockurls.append(listOfUrls[num - items])
                listOfUrls.remove(listOfUrls[num - items])
                items += 1
        loop = asyncio.ProactorEventLoop()
        asyncio.set_event_loop(loop)
        semaphoreHeaders = asyncio.Semaphore(50)
        session = aiohttp.ClientSession()
        data = loop.run_until_complete(asyncio.gather(*(getHeaders(url, session, semaphoreHeaders) for url in blockurls)))
        for header in data:
            if False == header[1]:
                setOfUrls.remove(header[0])

MAXitems = 10
setOfUrls = {'http://www.google.com', 'http://www.reddit.com'}
removeUrlsWithoutHtml(setOfUrls, MAXitems)

for link in list(setOfUrls):
    print(link)

Note the use of a semaphore and chunking to try to get around the selector limit issue that I face if I replace

        loop = asyncio.ProactorEventLoop()
        asyncio.set_event_loop(loop)

with: loop = asyncio.get_event_loop()

With the current configuration it raises:

Exception ignored in: <bound method DNSResolver._sock_state_cb of <aiodns.DNSResolver object at 0x0616F830>>
Traceback (most recent call last):
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\site-packages\aiodns\__init__.py", line 85, in _sock_state_cb
    self.loop.add_reader(fd, self._handle_event, fd, READ)
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\events.py", line 453, in add_reader
    raise NotImplementedError
NotImplementedError:

Note that my directory path has been manually changed to USER.

Python documentation says: https://docs.python.org/3/library/asyncio-eventloops.html#asyncio.ProactorEventLoop

add_reader() and add_writer() only accept file descriptors of sockets

Is aiodns not supported with ProactorEventLoop? Is this some type of weird bug? Is aiodns fully supported on Windows?

I can provide more info. In case you need a little more background, I was directed here by @asvetlov in the following Stack Overflow question: https://stackoverflow.com/questions/47675410/python-asyncio-aiohttp-valueerror-too-many-file-descriptors-in-select-on-win

asvetlov commented 6 years ago

I'm sorry, but the latest aiohttp doesn't use aiodns by default. You have to create an async DNS resolver explicitly before creating the client session, so your snippet looks incomplete.

josalhor commented 6 years ago

You're right, for whatever reason I was running it on a machine with aiohttp 1.0.whatever. I still think this problem can be reproduced with the latest version; I'll try later.

asvetlov commented 6 years ago

Async DNS resolver was disabled by default in aiohttp 1.1: aio-libs/aiohttp#559

It is not 100% compatible with the standard threaded one.

To reproduce the behavior in newer versions, explicitly install the async resolver: https://docs.aiohttp.org/en/stable/client.html#resolving-using-custom-nameservers

josalhor commented 6 years ago

Code:

import asyncio
import aiohttp
from aiohttp.resolver import AsyncResolver

async def getHeaders(url, sema):
    async with aiohttp.ClientSession(headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}, connector=aiohttp.TCPConnector(verify_ssl=False, resolver= AsyncResolver(nameservers=["8.8.8.8", "8.8.4.4"]))) as session:
        async with sema:
            try:
                async with session.head(url) as response:
                    try:
                        if "html" in response.headers["Content-Type"]:
                            return url, True
                        else:
                            return url, False
                    except:
                        return url, False
            except:
                return url, False

def removeUrlsWithoutHtml(setOfUrls, MAXitems):
    listOfUrls = list(setOfUrls)
    while(len(listOfUrls) != 0):
        blockurls = []
        print("URLS left to process: " + str(len(listOfUrls)))
        items = 0
        for num in range(0, len(listOfUrls)):
            if num < MAXitems:
                blockurls.append(listOfUrls[num - items])
                listOfUrls.remove(listOfUrls[num - items])
                items += 1
        loop = asyncio.ProactorEventLoop()
        asyncio.set_event_loop(loop)
        semaphoreHeaders = asyncio.Semaphore(50)
        data = loop.run_until_complete(asyncio.gather(*(getHeaders(url, semaphoreHeaders) for url in blockurls)))
        for header in data:
            if False == header[1]:
                setOfUrls.remove(header[0])

MAXitems = 10
setOfUrls = {'http://www.google.com', 'http://www.reddit.com'}
removeUrlsWithoutHtml(setOfUrls, MAXitems)

for link in list(setOfUrls):
    print(link)

Compared with the original code, I added a proper user agent. The AsyncResolver setup is basically a copy-paste from the documentation.

Got error:

Exception ignored in: <bound method DNSResolver._sock_state_cb of <aiodns.DNSResolver object at 0x06C17CF0>>
Traceback (most recent call last):
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\site-packages\aiodns\__init__.py", line 85, in _sock_state_cb
    self.loop.add_reader(fd, self._handle_event, fd, READ)
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\events.py", line 453, in add_reader
    raise NotImplementedError
NotImplementedError:

Edit: Compared with the original code, I also moved the session creation inside the coroutine.

josalhor commented 6 years ago

@asvetlov If I try to input a lot of URLs, I also get another error. Here is the code with the input in setOfUrls (note that I've disabled the async resolver):

import asyncio
import aiohttp
from aiohttp.resolver import AsyncResolver

async def getHeaders(url, sema):#resolver= AsyncResolver(nameservers=["8.8.8.8", "8.8.4.4"])
    async with aiohttp.ClientSession(headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}, connector=aiohttp.TCPConnector(verify_ssl=False)) as session:
        async with sema:
            try:
                async with session.head(url) as response:
                    try:
                        if "html" in response.headers["Content-Type"]:
                            return url, True
                        else:
                            return url, False
                    except:
                        return url, False
            except:
                return url, False

def removeUrlsWithoutHtml(setOfUrls, MAXitems):
    listOfUrls = list(setOfUrls)
    while(len(listOfUrls) != 0):
        blockurls = []
        print("URLS left to process: " + str(len(listOfUrls)))
        items = 0
        for num in range(0, len(listOfUrls)):
            if num < MAXitems:
                blockurls.append(listOfUrls[num - items])
                listOfUrls.remove(listOfUrls[num - items])
                items += 1
        loop = asyncio.ProactorEventLoop()
        asyncio.set_event_loop(loop)
        semaphoreHeaders = asyncio.Semaphore(50)
        data = loop.run_until_complete(asyncio.gather(*(getHeaders(url, semaphoreHeaders) for url in blockurls)))
        for header in data:
            if False == header[1]:
                setOfUrls.remove(header[0])

MAXitems = 10
setOfUrls = {'https://apis.google.com', 'https://www.google.com/calendar?tab=wc', 'https://accounts.google.com/ServiceLogin?hl=es&amp;passive=true&amp;continue=https://www.google.es/%3Fgfe_rd%3Dcr%26dcr%3D0%26ei%3DzBwtWszREYGZX47QkKgI%26gws_rd%3Dssl', 'https://www.google.com/gen_204?', 'https://www.google.es/webhp?hl=es&amp;dcr=0&amp;sa=X&amp;ved=0ahUKEwiprMzlrf_XAhXJaxQKHffvC9UQPAgD', 'https://play.google.com/?hl=es&amp;tab=w8', 'https://www.google.es/setprefs?sig=0_fm9MOZRAXNmSEF8OkKdOwopqi2M%3D&amp;hl=eu&amp;source=homepage&amp;sa=X&amp;ved=0ahUKEwiprMzlrf_XAhXJaxQKHffvC9UQ2ZgBCAo', 'https://books.google.es/bkshp?hl=es&amp;tab=wp', 'https://www.youtube.com/?gl=ES', 'https://adservice.google.es/adsid/google/ui', 'https://play.google.com/log?format=json', 'https://keep.google.com/', 'https://www.google.es/intl/es/options/', 'https://www.google.es/setprefs?sig=0_fm9MOZRAXNmSEF8OkKdOwopqi2M%3D&amp;hl=gl&amp;source=homepage&amp;sa=X&amp;ved=0ahUKEwiprMzlrf_XAhXJaxQKHffvC9UQ2ZgBCAk', 'https://translate.google.es/?hl=es&amp;tab=wT', 'https://consent.google.com?hl\\u003des\\u0026origin\\u003dhttps://www.google.es\\u0026continue\\u003dhttps://www.google.es/?gfe_rd%3Dcr%26dcr%3D0%26ei%3DzBwtWszREYGZX47QkKgI%26gws_rd%3Dssl\\u0026if\\u003d1\\u0026l\\u003d0\\u0026m\\u003d0\\u0026pc\\u003ds\\u0026wp\\u003d71', 'https://www.google.es/services/?subid=ww-ww-et-g-awa-a-g_hpbfoot1_1!o2&amp;utm_source=google.com&amp;utm_medium=referral&amp;utm_campaign=google_hpbfooter&amp;fg=1', 'https://www.google.com/?gfe_rd=cr&amp;dcr=0&amp;ei=zBwtWszREYGZX47QkKgI&amp;gws_rd=ssl,cr&amp;fg=1', 'https://www.blogger.com/?tab=wj', 'https://www.google.es/webhp?tab=ww', 'https://www.google.es/preferences?hl=es', 'https://www.google.es/preferences?hl=es&amp;fg=1', 'https://mail.google.com/mail/?tab=wm', 
'https://consent.google.es?hl\\u003des\\u0026origin\\u003dhttps://www.google.es\\u0026continue\\u003dhttps://www.google.es/?gfe_rd%3Dcr%26dcr%3D0%26ei%3DzBwtWszREYGZX47QkKgI%26gws_rd%3Dssl\\u0026if\\u003d1\\u0026l\\u003d0\\u0026m\\u003d0\\u0026pc\\u003ds\\u0026wp\\u003d71', 'https://consent.google.com/status?continue=https://www.google.es&amp;pc=s&amp;timestamp=1512905932', 'https://www.google.es/setprefs?sig=0_fm9MOZRAXNmSEF8OkKdOwopqi2M%3D&amp;hl=ca&amp;source=homepage&amp;sa=X&amp;ved=0ahUKEwiprMzlrf_XAhXJaxQKHffvC9UQ2ZgBCAg', 'https://www.google.es/intl/es_es/about/?utm_source=google.com&amp;utm_medium=referral&amp;utm_campaign=hp-footer&amp;fg=1', 'https://consent.google.com?hl\\\\u003des\\\\u0026origin\\\\u003dhttps://www.google.es\\\\u0026continue\\\\u003dhttps://www.google.es/?gfe_rd%3Dcr%26dcr%3D0%26ei%3DzBwtWszREYGZX47QkKgI%26gws_rd%3Dssl\\\\u0026if\\\\u003d1\\\\u0026l\\\\u003d0\\\\u0026m\\\\u003d0\\\\u0026pc\\\\u003ds\\\\u0026wp\\\\u003d71\\', 'https://www.google.com/contacts/?hl=es&amp;tab=wC', 'https://maps.google.es/maps?hl=es&amp;tab=wl', 'http://schema.org/WebPage', 'https://docs.google.com/document/?usp=docs_alc', 'https://hangouts.google.com/', 'https://www.google.es/imghp?hl=es&amp;tab=wi', 'https://jmt17.google.com/log', 'http://www.google.es/shopping?hl=es&amp;tab=wf', 'https://www.google.es/intl/es_es/ads/?subid=ww-ww-et-g-awa-a-g_hpafoot1_1!o2&amp;utm_source=google.com&amp;utm_medium=referral&amp;utm_campaign=google_hpafooter&amp;fg=1'}
removeUrlsWithoutHtml(setOfUrls, MAXitems)

for link in list(setOfUrls):
    print(link)

Error:

Exception ignored in: <bound method _ProactorBasePipeTransport.__del__ of <_ProactorSocketTransport closing fd=-1 read=<_OverlappedFuture cancelled>>>
Traceback (most recent call last):
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\proactor_events.py", line 97, in __del__
    self.close()
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\proactor_events.py", line 84, in close
    self._loop.call_soon(self._call_connection_lost, None)
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\base_events.py", line 574, in call_soon
    self._check_closed()
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\base_events.py", line 357, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

I'm well aware that this is a separate error that should be discussed elsewhere because it's outside the scope of aiodns. I'm just pointing it out here in case both errors are on my end and are somehow (although unlikely) correlated.

saghul commented 6 years ago

The API that c-ares provides deals with low-level fds, and aiodns in turn relies on it to function. If a given event loop implementation doesn't support the methods for watching fds (add_reader() and add_writer()), then aiodns cannot work.
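
As a rough illustration of the point above, here is a minimal sketch, using only the standard library, of how one might probe whether a given event loop accepts add_reader() on a socket fd before handing it to aiodns. The helper name supports_add_reader is hypothetical and is not part of aiodns or asyncio. It relies on the behavior visible in the tracebacks above, where the unsupported call raises NotImplementedError.

```python
import asyncio
import socket


def supports_add_reader(loop):
    """Return True if loop.add_reader() accepts a socket file descriptor.

    Hypothetical helper for illustration only; not part of aiodns or asyncio.
    """
    # A connected socket pair gives us a real socket fd to register.
    rsock, wsock = socket.socketpair()
    try:
        try:
            loop.add_reader(rsock.fileno(), lambda: None)
        except NotImplementedError:
            # Same failure as in the tracebacks above: the loop has no
            # fd-watching support, so aiodns cannot register its sockets.
            return False
        # Registration worked; undo it so the loop is left unchanged.
        loop.remove_reader(rsock.fileno())
        return True
    finally:
        rsock.close()
        wsock.close()


if __name__ == "__main__":
    loop = asyncio.SelectorEventLoop()
    try:
        print("add_reader supported:", supports_add_reader(loop))
    finally:
        loop.close()
```

If the check fails, one option is to create an asyncio.SelectorEventLoop() for the code that needs aiodns, since that loop does accept socket fds on Windows. The trade-off is that it brings back the select() socket limit described in the original question.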