Closed: zar3bski closed this issue 1 year ago.
`Extractor.__init__` is initializing a session factory. This already requires an event loop, and you should ideally be calling this from within an async function. See also https://docs.aiohttp.org/en/stable/faq.html#why-is-creating-a-clientsession-outside-of-an-event-loop-dangerous

I suggest instead one of two possibilities:
```python
from aiohttp import ClientSession


class Connector:
    def __init__(self, conf: dict):
        self.conf = conf
        self._session = None

    def session_factory(self) -> ClientSession:
        # Create the session lazily, on the worker, inside its event loop
        if not self._session:
            self._session = ClientSession(
                headers=self.conf.get("headers", {}),
            )
        return self._session

    def __reduce__(self):
        # Pickle only the config; a fresh Connector (without a session)
        # is rebuilt on each worker
        return (Connector, (self.conf,))


def get_connector(conf):
    return Connector(conf)


connector = client.submit(get_connector, conf)


def do_stuff(foo, connector):
    ...


client.map(do_stuff, range(10), connector=connector)
```
Dask will then replicate `Connector` to every worker, where it will act as a session cache.
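To see what `__reduce__` buys you here, consider a simplified stand-in for `Connector` (no aiohttp, a plain dict standing in for the session) and run it through a pickle round trip: the config survives, but the session cache is deliberately left behind, so each worker rebuilds its own.

```python
import pickle


class Connector:
    # Simplified stand-in for the Connector above: no aiohttp,
    # a plain dict stands in for the session object.
    def __init__(self, conf: dict):
        self.conf = conf
        self._session = None  # created lazily, never pickled

    def session_factory(self):
        if self._session is None:
            self._session = {"headers": self.conf.get("headers", {})}
        return self._session

    def __reduce__(self):
        # Serialize only the config; the session is rebuilt on the worker
        return (Connector, (self.conf,))


conn = Connector({"headers": {"X-Token": "abc"}})
conn.session_factory()  # populate the local session cache

copy = pickle.loads(pickle.dumps(conn))
print(copy.conf)      # the config survives the round trip
print(copy._session)  # None: each worker builds its own session
```

The key point is that `__reduce__` returns the class and the constructor arguments, so unpickling calls `Connector(conf)` fresh rather than trying to serialize the live session.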
Thanks for your answer, @fjetter. I already explored option 2: these custom objects are not easy to serialize. Is the following implementing option 1 properly?
```python
from functools import cache

from aiohttp import ClientSession
from dask.distributed import Client


class Connector:
    def __init__(self, conf: dict):
        self.conf = conf

    @cache  # one session per Connector instance
    def session_factory(self) -> ClientSession:
        session = ClientSession(
            headers=self.conf.get("headers", {}),
        )
        return session


# No changes on Extractor


async def main():
    client = await Client(asynchronous=True)
    extractor = Extractor(Connector({}))
    futures = client.map(
        extractor.job,
        URLS,
        retries=5,
        pure=False,
    )
    _ = await client.gather(futures)
```
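One property of `functools.cache` worth noting in this design: applied to a method, the cache is keyed on `self`, so each `Connector` instance builds its session exactly once, while separate instances each get their own. A minimal stand-in (hypothetical `Factory` class, no aiohttp involved) demonstrates this:

```python
from functools import cache


class Factory:
    # Hypothetical stand-in for Connector: counts how often the
    # "session" is actually built.
    def __init__(self):
        self.builds = 0

    @cache  # keyed on `self`: one cached result per instance
    def session_factory(self):
        self.builds += 1
        return object()


f = Factory()
a = f.session_factory()
b = f.session_factory()
print(a is b, f.builds)  # True 1: built once, then reused

g = Factory()
print(g.session_factory() is a)  # False: another instance gets its own
```

A known caveat (noted in the `functools` documentation) is that the cache holds a reference to every instance it has seen, so instances decorated this way are not garbage-collected while the class is alive; for a long-running worker holding a handful of `Connector` objects that is usually acceptable.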
Sorry for the late reply. I think that should work. I'll close this ticket now, but if you need further assistance, please reopen or consider posting a question in https://dask.discourse.group/, which is where we handle most usage-related questions.
Describe the issue: I am trying to integrate some legacy code into dask.distributed. It involves instantiating `aiohttp.ClientSession` on each worker only once, to get multiple URLs. `ClientSession` objects are not easily serializable, so I tried to implement an actor pattern to instantiate my `Extractor` on the workers. In my local / dev context (I haven't tried on a real cluster yet), I am facing errors: I tried several approaches, but asyncio's `get_running_loop` never gets its loop.

Minimal Complete Verifiable Example:
Anything else we need to know?: I wondered whether setting up the `Worker` more explicitly could overcome this difficulty, but I haven't been lucky with this attempt.

Environment:
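As a side note, the `get_running_loop` symptom described above can be reproduced without dask at all: `asyncio.get_running_loop` only succeeds while an event loop is actually running in the current thread, which is why a session factory invoked from synchronous worker code cannot find one. A dask-free sketch:

```python
import asyncio


def try_get_loop():
    # asyncio.get_running_loop succeeds only while an event loop is
    # running in the current thread; otherwise it raises RuntimeError.
    try:
        return asyncio.get_running_loop()
    except RuntimeError:
        return None


# In plain synchronous code there is no running loop:
print(try_get_loop())  # None

async def probe():
    return try_get_loop()

# Inside a coroutine driven by asyncio.run, there is one:
print(asyncio.run(probe()) is not None)  # True
```

This is the same constraint the aiohttp FAQ entry linked above describes: the `ClientSession` has to be created from code already executing inside the loop.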