ReactiveX / RxPY

ReactiveX for Python
https://rxpy.rtfd.io
MIT License

[Question] shipping data between asyncio and rxpy #669

Closed boholder closed 1 year ago

boholder commented 1 year ago

My goal is to constantly switch between rxpy and asyncio, so that they are connected together and the data stream goes through the entire chain.

In my project (an automation command-line tool for Twitter), the initial data flows to the asynchronous HTTP client as an Observable or a plain list, and the client's response flows into the processing chain as a new Observable, which may in turn need to call the asynchronous HTTP client again... https://github.com/boholder/puntgun/issues/17 The data stream is therefore constantly switching between the two frameworks. For example, the data flow looks like this:

usernames: list[str] -> **async client** -> list[User] -> Observable[User]  -> (1)
(1) -> filters -> Observable[decision] (2)
(1) -> filters that need to query API -> **async client** -> Observable[decision] (2)
(2) -> actions -> **async client** -> exit

After searching, I learned that:

  1. Sadly, aioreactive doesn't meet my needs: it doesn't implement several operators that I already use (share(), buffer_with_count()...). https://github.com/dbrattli/aioreactive

  2. It is impossible to hand data to an asynchronous function directly from an operator, because on_next() is a synchronous callback and you can't await an async function inside it. https://github.com/ReactiveX/RxPY/issues/649#issuecomment-1159512451

  3. But we can store the data somewhere else (an asyncio.Queue) and write a custom async operator that runs an infinite loop to perform the async work. We can even add more operators to process the result of the async work. https://github.com/ReactiveX/RxPY/issues/592#issue-1105185161 https://github.com/ReactiveX/RxPY/issues/571#issue-878054692 https://blog.oakbits.com/rxpy-and-asyncio.html
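The queue idea in point 3 can be sketched with the standard library alone. This is a minimal illustration, not RxPY code: `on_next` and `on_completed` here are plain functions standing in for an observer's callbacks, `fetch` is a hypothetical async API call, and `_DONE` is an assumed sentinel used to stop the worker.

```python
import asyncio

_DONE = object()  # hypothetical sentinel that on_completed() puts on the queue


async def fetch(name: str) -> str:
    """Hypothetical async API call; stands in for the HTTP client."""
    await asyncio.sleep(0.01)
    return name.upper()


async def main() -> list:
    queue = asyncio.Queue()
    results = []

    def on_next(value: str) -> None:
        # Synchronous callback: it cannot await, but it can enqueue without blocking.
        queue.put_nowait(value)

    def on_completed() -> None:
        # Tell the worker that no more items will arrive.
        queue.put_nowait(_DONE)

    async def worker() -> None:
        # Runs on the event loop and does the awaiting on behalf of on_next().
        while True:
            value = await queue.get()
            if value is _DONE:
                return
            results.append(await fetch(value))

    task = asyncio.create_task(worker())
    for name in ("alice", "bob"):
        on_next(name)  # what a synchronous pipeline would do for each item
    on_completed()
    await task  # returns once the sentinel has been consumed
    return results


results = asyncio.run(main())
print(results)  # ['ALICE', 'BOB']
```

Because the worker exits on the sentinel instead of being canceled, awaiting the worker task is enough to know that every queued item was processed.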

I experimented a bit based on this friendly answer, but there are still some things I'm not sure how to implement:

  1. When asyncio is in use at the same time, can rxpy guarantee that every operator runs on every item, all the way to the observer, before the program exits?

  2. (asyncio related) The code that pushes data from the "first segment" chain into the "second segment" chain cannot wait for the Future, so some tasks in the "second" chain are canceled before they are executed. How do I wait for all tasks to finish before exiting? (A search shows that asyncio.gather() only waits for the coroutines passed to it, so might manual control of the event loop solve the problem?)

  3. (still asyncio related) With asyncio, coroutines have to be chained together manually using await (see the call to add()). Is there a way to automatically pass the data stream from rxpy into the asyncio.Queue? (This still has the problem of not being able to await the Future.) Is this example a solution? https://github.com/ReactiveX/RxPY/blob/master/examples/asyncio/toasyncgenerator.py
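For question 2, one possible approach is to keep a registry of every task that a synchronous callback schedules, and to keep gathering until the registry is empty. The sketch below uses only asyncio; `submit`, `handle`, `drain`, `pending`, and `call_api` are hypothetical names, and `submit` plays the role of something an on_next() callback could call.

```python
import asyncio

pending = set()  # every task scheduled via submit() that has not finished yet
log = []


async def call_api(param: str) -> str:
    """Hypothetical async API call."""
    await asyncio.sleep(0.01)
    return f"[{param}]"


def submit(stage: int, param: str) -> None:
    """Synchronous entry point: schedules the work instead of awaiting it."""
    task = asyncio.get_running_loop().create_task(handle(stage, param))
    pending.add(task)
    task.add_done_callback(pending.discard)


async def handle(stage: int, param: str) -> None:
    result = await call_api(param)
    log.append(f"{stage}:{result}")
    if stage == 1:
        submit(2, result)  # a first-level result feeds the second level


async def drain() -> None:
    # gather() only waits for the tasks it is given, so repeat until the
    # tasks spawned in the meantime have finished as well.
    while pending:
        await asyncio.gather(*pending)


async def main() -> None:
    for param in ("0a", "0b"):
        submit(1, param)
    await drain()


asyncio.run(main())
print(sorted(log))  # ['1:[0a]', '1:[0b]', '2:[[0a]]', '2:[[0b]]']
```

The loop in `drain` is the important part: a single gather() over the first-level tasks would return while the second-level tasks are still running, which matches the cancellation seen in the question.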

Thanks for reading my question.

import asyncio
import time
from collections import namedtuple

import reactivex as rx
from reactivex import operators as op
from reactivex.disposable import Disposable
from reactivex.scheduler.eventloop import AsyncIOScheduler
from reactivex.subject import Subject

start = time.time()

def ts():
    return f"{time.time() - start:.3f}"

ACTION_DURATION = 1.0

first_subject = Subject()
first_async_action = Subject()
second_subject = Subject()

Data = namedtuple("Data", ["api", "param", "future"])

async def async_calling_api(data: Data):
    """Some async processing, like sending/writing data."""
    print(f"{ts()} [A]sync action started  api:{data.api} param:{data.param}")
    # process the data with async function
    await asyncio.sleep(ACTION_DURATION)
    print(f"{ts()} [A]sync action finished api:{data.api} param:{data.param}")
    # process finished, return the response
    return f"[{data.param}]"

def serialize_map_async(mapper):
    def _serialize_map_async(source):
        def on_subscribe(observer, scheduler):
            # separate different api callings into different task queues
            queues = {k: asyncio.Queue() for k in range(0, 3)}

            async def infinite_loop(q: asyncio.Queue[Data]):
                try:
                    while True:
                        data = await q.get()
                        resp = await mapper(data)
                        observer.on_next(resp)
                        data.future.set_result(resp)
                except Exception as e:
                    observer.on_error(e)

            def on_next(data: Data):
                # take data from upstream ( calls on subject.on_next() trigger it )
                # synchronous -> asynchronous by putting elements into queue
                try:
                    queues[data.api].put_nowait(data)
                except Exception as e:
                    observer.on_error(e)

            tasks = [asyncio.create_task(infinite_loop(q)) for q in queues.values()]

            d = source.subscribe(
                on_next=on_next,
                on_error=observer.on_error,
                on_completed=observer.on_completed,
            )

            def dispose():
                d.dispose()
                # cancel the worker loops; a plain for loop avoids building a throwaway list
                for task in tasks:
                    task.cancel()

            return Disposable(dispose)

        return rx.create(on_subscribe)

    return _serialize_map_async

async def setup():
    # setup() runs inside the event loop, so get_running_loop() is the safe call here
    loop = asyncio.get_running_loop()
    first_subject.pipe(
        serialize_map_async(async_calling_api),
        # The futures created here are never awaited, so they are not part of what
        # asyncio.gather() waits on. gather() therefore only guarantees the first-level
        # tasks, and some second-level tasks are canceled before they are executed.
        op.do_action(lambda x: second_subject.on_next(Data(2, x, asyncio.Future()))),
    ).subscribe(
        on_next=lambda param: print(f"{ts()} [O]bserver [1] received: {param}"),
        scheduler=AsyncIOScheduler(loop)
    )

    second_subject.pipe(serialize_map_async(async_calling_api)).subscribe(
        on_next=lambda param: print(f"{ts()} [O]bserver [2] received: {param}"),
        scheduler=AsyncIOScheduler(loop)
    )

async def add(api: int, param: str):
    future = asyncio.Future()
    first_subject.on_next(Data(api, param, future))
    return await future

async def main():
    await setup()
    # I wonder if there is a way to write "await rx.from..."
    #
    # rx.from_iterable("a", "b").pipe(
    #  op.do(await ...)
    # )

    a = await asyncio.gather(add(0, "0a"), add(0, "0b"), add(1, "1a"), add(1, "1b"))
    print(f"---> {a}")

asyncio.run(main())

Output:

0.003 [A]sync action started  api:0 param:0a
0.003 [A]sync action started  api:1 param:1a
1.007 [A]sync action finished api:0 param:0a
1.007 [O]bserver [1] received: [0a]
1.007 [A]sync action started  api:0 param:0b
1.007 [A]sync action finished api:1 param:1a
1.007 [O]bserver [1] received: [1a]
1.007 [A]sync action started  api:1 param:1b
1.007 [A]sync action started  api:2 param:[0a]
2.021 [A]sync action finished api:0 param:0b
2.021 [O]bserver [1] received: [0b]
2.021 [A]sync action finished api:1 param:1b
2.021 [O]bserver [1] received: [1b]
2.021 [A]sync action finished api:2 param:[0a]
2.021 [O]bserver [2] received: [[0a]]
2.021 [A]sync action started  api:2 param:[1a]
---> ['[0a]', '[0b]', '[1a]', '[1b]']

boholder commented 1 year ago

Since we can't smoothly connect rxpy and asyncio together, we just need to let them work separately (in the same thread), producing data for each other, and exit normally (cleaning up the context) once all the data has been processed.

There are four subtasks to implement:

  1. Make rxpy and asyncio work in the same thread.
  2. Find a way to pass the data flow from rxpy to asyncio (asyncio.Queue()?).
  3. Find a way to pass the data flow from asyncio to another rxpy pipe (already achieved in the example code).
  4. Guarantee that all elements in the rxpy pipes and all coroutines in the asyncio event loop are finished before exiting.
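Subtask 4 might be approached with `asyncio.Queue.join()`, joining the stages in pipeline order. The following is a rough sketch under assumptions, using only the standard library: `stage_worker` and `call_api` are hypothetical, and the two queues stand in for the two rxpy segments from the example code.

```python
import asyncio


async def call_api(param: str) -> str:
    """Hypothetical async API call."""
    await asyncio.sleep(0.01)
    return f"[{param}]"


async def stage_worker(inbox: asyncio.Queue, on_result) -> None:
    """Consume items forever; on_result is a synchronous callback."""
    while True:
        param = await inbox.get()
        try:
            # Forward the result before marking the item done, so that
            # join() on this stage implies the next stage already has it.
            on_result(await call_api(param))
        finally:
            inbox.task_done()


async def main() -> list:
    first, second = asyncio.Queue(), asyncio.Queue()
    received = []
    workers = [
        asyncio.create_task(stage_worker(first, second.put_nowait)),
        asyncio.create_task(stage_worker(second, received.append)),
    ]
    for param in ("0a", "0b"):
        first.put_nowait(param)

    await first.join()   # all first-level work done, second queue fully fed
    await second.join()  # all second-level work done
    for worker in workers:
        worker.cancel()  # only now is it safe to stop the infinite loops
    await asyncio.gather(*workers, return_exceptions=True)
    return received


received = asyncio.run(main())
print(received)  # ['[[0a]]', '[[0b]]']
```

Joining in pipeline order matters: joining `second` first could return while `first` still holds items that have not yet produced their second-level work.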