Closed josalhor closed 6 years ago
I'm sorry, but the latest aiohttp doesn't use aiodns by default. You should create an async DNS resolver before creating the client session, so your snippet looks incomplete.
You're right, for whatever reason I was running it on a machine with aiohttp 1.0.whatever. I still think this problem can be reproduced in the latest version; I'll try later.
Async DNS resolver was disabled by default in aiohttp 1.1: aio-libs/aiohttp#559
It is not 100% compatible with the standard threaded one.
To reproduce the functionality in newer versions, explicitly install the async resolver: https://docs.aiohttp.org/en/stable/client.html#resolving-using-custom-nameservers
Code:
import asyncio
import aiohttp
from aiohttp.resolver import AsyncResolver

async def getHeaders(url, sema):
    async with aiohttp.ClientSession(headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}, connector=aiohttp.TCPConnector(verify_ssl=False, resolver=AsyncResolver(nameservers=["8.8.8.8", "8.8.4.4"]))) as session:
        async with sema:
            try:
                async with session.head(url) as response:
                    try:
                        if "html" in response.headers["Content-Type"]:
                            return url, True
                        else:
                            return url, False
                    except:
                        return url, False
            except:
                return url, False

def removeUrlsWithoutHtml(setOfUrls, MAXitems):
    listOfUrls = list(setOfUrls)
    while(len(listOfUrls) != 0):
        blockurls = []
        print("URLS left to process: " + str(len(listOfUrls)))
        items = 0
        for num in range(0, len(listOfUrls)):
            if num < MAXitems:
                blockurls.append(listOfUrls[num - items])
                listOfUrls.remove(listOfUrls[num - items])
                items += 1
        loop = asyncio.ProactorEventLoop()
        asyncio.set_event_loop(loop)
        semaphoreHeaders = asyncio.Semaphore(50)
        data = loop.run_until_complete(asyncio.gather(*(getHeaders(url, semaphoreHeaders) for url in blockurls)))
        for header in data:
            if False == header[1]:
                setOfUrls.remove(header[0])

MAXitems = 10
setOfUrls = {'http://www.google.com', 'http://www.reddit.com'}
removeUrlsWithoutHtml(setOfUrls, MAXitems)
for link in list(setOfUrls):
    print(link)
Compared with the original code, I added a proper user agent; the AsyncResolver setup is basically a copy-paste from the documentation.
Got error:
Exception ignored in: <bound method DNSResolver._sock_state_cb of <aiodns.DNSResolver object at 0x06C17CF0>>
Traceback (most recent call last):
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\site-packages\aiodns\__init__.py", line 85, in _sock_state_cb
    self.loop.add_reader(fd, self._handle_event, fd, READ)
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\events.py", line 453, in add_reader
    raise NotImplementedError
NotImplementedError:
Edit: Compared with the original code, I also moved the session creation inside the coroutine.
@asvetlov If I try to input a lot of URLs, I also get another error. Here is the code with the input in setOfUrls. Note that I've disabled the async resolver:
import asyncio
import aiohttp
from aiohttp.resolver import AsyncResolver

async def getHeaders(url, sema):  # resolver=AsyncResolver(nameservers=["8.8.8.8", "8.8.4.4"])
    async with aiohttp.ClientSession(headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}, connector=aiohttp.TCPConnector(verify_ssl=False)) as session:
        async with sema:
            try:
                async with session.head(url) as response:
                    try:
                        if "html" in response.headers["Content-Type"]:
                            return url, True
                        else:
                            return url, False
                    except:
                        return url, False
            except:
                return url, False

def removeUrlsWithoutHtml(setOfUrls, MAXitems):
    listOfUrls = list(setOfUrls)
    while(len(listOfUrls) != 0):
        blockurls = []
        print("URLS left to process: " + str(len(listOfUrls)))
        items = 0
        for num in range(0, len(listOfUrls)):
            if num < MAXitems:
                blockurls.append(listOfUrls[num - items])
                listOfUrls.remove(listOfUrls[num - items])
                items += 1
        loop = asyncio.ProactorEventLoop()
        asyncio.set_event_loop(loop)
        semaphoreHeaders = asyncio.Semaphore(50)
        data = loop.run_until_complete(asyncio.gather(*(getHeaders(url, semaphoreHeaders) for url in blockurls)))
        for header in data:
            if False == header[1]:
                setOfUrls.remove(header[0])
MAXitems = 10
setOfUrls = {'https://apis.google.com', 'https://www.google.com/calendar?tab=wc', 'https://accounts.google.com/ServiceLogin?hl=es&passive=true&continue=https://www.google.es/%3Fgfe_rd%3Dcr%26dcr%3D0%26ei%3DzBwtWszREYGZX47QkKgI%26gws_rd%3Dssl', 'https://www.google.com/gen_204?', 'https://www.google.es/webhp?hl=es&dcr=0&sa=X&ved=0ahUKEwiprMzlrf_XAhXJaxQKHffvC9UQPAgD', 'https://play.google.com/?hl=es&tab=w8', 'https://www.google.es/setprefs?sig=0_fm9MOZRAXNmSEF8OkKdOwopqi2M%3D&hl=eu&source=homepage&sa=X&ved=0ahUKEwiprMzlrf_XAhXJaxQKHffvC9UQ2ZgBCAo', 'https://books.google.es/bkshp?hl=es&tab=wp', 'https://www.youtube.com/?gl=ES', 'https://adservice.google.es/adsid/google/ui', 'https://play.google.com/log?format=json', 'https://keep.google.com/', 'https://www.google.es/intl/es/options/', 'https://www.google.es/setprefs?sig=0_fm9MOZRAXNmSEF8OkKdOwopqi2M%3D&hl=gl&source=homepage&sa=X&ved=0ahUKEwiprMzlrf_XAhXJaxQKHffvC9UQ2ZgBCAk', 'https://translate.google.es/?hl=es&tab=wT', 'https://consent.google.com?hl\\u003des\\u0026origin\\u003dhttps://www.google.es\\u0026continue\\u003dhttps://www.google.es/?gfe_rd%3Dcr%26dcr%3D0%26ei%3DzBwtWszREYGZX47QkKgI%26gws_rd%3Dssl\\u0026if\\u003d1\\u0026l\\u003d0\\u0026m\\u003d0\\u0026pc\\u003ds\\u0026wp\\u003d71', 'https://www.google.es/services/?subid=ww-ww-et-g-awa-a-g_hpbfoot1_1!o2&utm_source=google.com&utm_medium=referral&utm_campaign=google_hpbfooter&fg=1', 'https://www.google.com/?gfe_rd=cr&dcr=0&ei=zBwtWszREYGZX47QkKgI&gws_rd=ssl,cr&fg=1', 'https://www.blogger.com/?tab=wj', 'https://www.google.es/webhp?tab=ww', 'https://www.google.es/preferences?hl=es', 'https://www.google.es/preferences?hl=es&fg=1', 'https://mail.google.com/mail/?tab=wm', 'https://consent.google.es?hl\\u003des\\u0026origin\\u003dhttps://www.google.es\\u0026continue\\u003dhttps://www.google.es/?gfe_rd%3Dcr%26dcr%3D0%26ei%3DzBwtWszREYGZX47QkKgI%26gws_rd%3Dssl\\u0026if\\u003d1\\u0026l\\u003d0\\u0026m\\u003d0\\u0026pc\\u003ds\\u0026wp\\u003d71', 
'https://consent.google.com/status?continue=https://www.google.es&pc=s&timestamp=1512905932', 'https://www.google.es/setprefs?sig=0_fm9MOZRAXNmSEF8OkKdOwopqi2M%3D&hl=ca&source=homepage&sa=X&ved=0ahUKEwiprMzlrf_XAhXJaxQKHffvC9UQ2ZgBCAg', 'https://www.google.es/intl/es_es/about/?utm_source=google.com&utm_medium=referral&utm_campaign=hp-footer&fg=1', 'https://consent.google.com?hl\\\\u003des\\\\u0026origin\\\\u003dhttps://www.google.es\\\\u0026continue\\\\u003dhttps://www.google.es/?gfe_rd%3Dcr%26dcr%3D0%26ei%3DzBwtWszREYGZX47QkKgI%26gws_rd%3Dssl\\\\u0026if\\\\u003d1\\\\u0026l\\\\u003d0\\\\u0026m\\\\u003d0\\\\u0026pc\\\\u003ds\\\\u0026wp\\\\u003d71\\', 'https://www.google.com/contacts/?hl=es&tab=wC', 'https://maps.google.es/maps?hl=es&tab=wl', 'http://schema.org/WebPage', 'https://docs.google.com/document/?usp=docs_alc', 'https://hangouts.google.com/', 'https://www.google.es/imghp?hl=es&tab=wi', 'https://jmt17.google.com/log', 'http://www.google.es/shopping?hl=es&tab=wf', 'https://www.google.es/intl/es_es/ads/?subid=ww-ww-et-g-awa-a-g_hpafoot1_1!o2&utm_source=google.com&utm_medium=referral&utm_campaign=google_hpafooter&fg=1'}
removeUrlsWithoutHtml(setOfUrls, MAXitems)
for link in list(setOfUrls):
    print(link)
Error:
Exception ignored in: <bound method _ProactorBasePipeTransport.__del__ of <_ProactorSocketTransport closing fd=-1 read=<_OverlappedFuture cancelled>>>
Traceback (most recent call last):
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\proactor_events.py", line 97, in __del__
    self.close()
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\proactor_events.py", line 84, in close
    self._loop.call_soon(self._call_connection_lost, None)
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\base_events.py", line 574, in call_soon
    self._check_closed()
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\base_events.py", line 357, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
I'm well aware this is a separate error that should be discussed elsewhere because it's outside the scope of aiodns. I'm just pointing it out here in case both errors are on my end and are somehow (although unlikely) correlated.
The API that c-ares provides deals with low-level fds, which is what aiodns in turn uses to function. If a given event loop implementation doesn't support those methods, then aiodns cannot work.
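As a rough illustration of the point above, the following standard-library sketch probes whether a given event loop implements add_reader() for socket file descriptors, which is the method that fails in the traceback earlier in this thread. The helper name supports_add_reader is hypothetical and is not part of aiodns or asyncio; it only assumes that an unsupported loop raises NotImplementedError, as the traceback shows.

```python
import asyncio
import socket


def supports_add_reader(loop):
    """Return True if `loop` accepts add_reader() on a socket fd.

    Hypothetical helper for illustration; not part of aiodns or asyncio.
    """
    rsock, wsock = socket.socketpair()
    try:
        # Loops without fd-watching support raise NotImplementedError here.
        loop.add_reader(rsock.fileno(), lambda: None)
    except NotImplementedError:
        return False
    else:
        # Unregister the probe callback before the socket is closed.
        loop.remove_reader(rsock.fileno())
        return True
    finally:
        rsock.close()
        wsock.close()


if __name__ == "__main__":
    loop = asyncio.SelectorEventLoop()
    try:
        print("add_reader supported:", supports_add_reader(loop))
    finally:
        loop.close()
```

Running the same probe against a loop whose add_reader() raises NotImplementedError would return False, which is the situation in which aiodns cannot register its sockets.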
Hi,
I've had a few problems trying to use asyncio and aiohttp in my script: it runs out of sockets to perform the connections with SelectorEventLoop. I then tried ProactorEventLoop on Windows, which doesn't seem to have this limitation. However, when I try:
Note the use of a semaphore and chunking to try to get around the selector limit issue that I face if I replace the ProactorEventLoop setup with:
loop = asyncio.get_event_loop()
With the current configuration it raises:
Note: my directory path has been manually replaced with USER.
Python documentation says: https://docs.python.org/3/library/asyncio-eventloops.html#asyncio.ProactorEventLoop
add_reader() and add_writer() only accept file descriptors of sockets
Is aiodns not supported with ProactorEventLoop? Is this some type of weird bug? Is aiodns fully supported on Windows?
I can provide more info, but in case you need a little more background, I was directed here by @asvetlov in the following Stack Overflow question: https://stackoverflow.com/questions/47675410/python-asyncio-aiohttp-valueerror-too-many-file-descriptors-in-select-on-win
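For reference, the semaphore-and-chunking pattern described in this post can be sketched with the standard library alone. This is a minimal sketch, not the code from the post: chunked, fake_check and process_in_blocks are hypothetical names, and fake_check is a placeholder that stands in for the real HEAD request so that the example runs without aiohttp. It also assumes Python 3.7 or newer for asyncio.run().

```python
import asyncio
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    it = iter(items)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


async def fake_check(url, sema):
    # Placeholder for the real HEAD request; only the concurrency
    # pattern matters here, so the "result" is a trivial string test.
    async with sema:
        await asyncio.sleep(0)
        return url, url.startswith("http")


async def process_in_blocks(urls, block_size, max_concurrency):
    # The semaphore caps how many checks run at once; chunking caps
    # how many coroutines are scheduled in each gather() call.
    sema = asyncio.Semaphore(max_concurrency)
    results = []
    for block in chunked(urls, block_size):
        results.extend(
            await asyncio.gather(*(fake_check(url, sema) for url in block))
        )
    return results


if __name__ == "__main__":
    urls = ["http://www.google.com", "http://www.reddit.com"]
    for url, ok in asyncio.run(process_in_blocks(urls, 10, 50)):
        print(url, ok)
```

Unlike the code in the post, this sketch runs every block on a single event loop rather than creating a new loop per block; that is a design choice of the sketch, not a claim about what fixes the errors reported here.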