AlexandreSenpai / Enma

Enma is a Python library designed to fetch and download manga and doujinshi data from many sources including Manganato and NHentai.
MIT License

Async Generator Search Pages #14

Closed · NadieFiind closed this 3 years ago

NadieFiind commented 3 years ago

Fetch multiple search pages in a single call, up to a maximum number of pages, using an async generator.
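
For reference, here is a minimal, self-contained sketch of the async-generator approach (the `search` coroutine below is a hypothetical stand-in for the library's real method; names and signatures are illustrative):

    import asyncio
    from typing import AsyncIterator, List, Optional

    async def search(query: str, sort: Optional[str] = None, page: int = 1) -> List[str]:
        # Stand-in for the real HTTP request that fetches one search page.
        await asyncio.sleep(0.1)
        return [f"{query} result on page {page}"]

    async def search_pages(query: str, sort: Optional[str] = None, max_pages: int = 1) -> AsyncIterator[List[str]]:
        # Fetch pages sequentially and yield each one as soon as it arrives,
        # so at most one request is in flight at a time.
        for page in range(1, max_pages + 1):
            yield await search(query=query, sort=sort, page=page)

Consumed as `async for page in search_pages(query="a", max_pages=10): ...`, this trades raw speed for a steady one-request-at-a-time load on the server.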

AlexandreSenpai commented 3 years ago

Hello @NadieFiind o7 Thank you very much for your contribution.

Why don't we use ensure_future to build asynchronous tasks instead of building an async generator?

something like:

    async def search_pages(self, query: str, sort: Optional[str] = None, max_pages: int = 1) -> List[SearchPage]:
        # Schedule one search task per page so the requests run concurrently.
        tasks = []

        for page in range(1, max_pages + 1):
            task = asyncio.ensure_future(self.search(query=query, sort=sort, page=page))
            tasks.append(task)

        # gather waits for every task and returns the results together.
        return await asyncio.gather(*tasks)
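
(For context: asyncio.gather runs the scheduled tasks concurrently and returns their results in the order the tasks were passed in, so the pages still come back in page order even though the requests overlap.)
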
NadieFiind commented 3 years ago

I did some testing to see which approach is faster. As expected, ensure_future is faster, but it might strain your network because it makes many requests at the same time.

This is my code:

import time
import asyncio
from NHentai.nhentai_async import NHentaiAsync

async def main():
    pages = 2
    nhentai = NHentaiAsync()

    print(f"Pages: {pages}")

    # test the speed of async generator
    start_time = time.time()
    async for page in nhentai.search_pages(query="a", max_pages=pages):
        pass
    print(f"Async Generator: {time.time() - start_time}")

    # test the speed of ensure future
    start_time = time.time()
    for page in await nhentai.list_search_pages(query="a", max_pages=pages):
        pass
    print(f"Ensure Future  : {time.time() - start_time}")

asyncio.run(main())

First, I requested 10 pages and then this happened:

Pages: 10
Async Generator: 9.494192361831665
Traceback (most recent call last):
  File "/home/nadie/MyFiles/Others/Playground/NHentai-API/run.py", line 21, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/home/nadie/MyFiles/Others/Playground/NHentai-API/run.py", line 17, in main
    for page in await nhentai.list_search_pages(query="a", max_pages=pages):
  File "/home/nadie/MyFiles/Others/Playground/NHentai-API/NHentai/nhentai_async.py", line 210, in list_search_pages
    return await asyncio.gather(*TASKS)
  File "/home/nadie/MyFiles/Others/Playground/NHentai-API/NHentai/nhentai_async.py", line 162, in search
    total_results = soup.find('div', id='content').find('h1').text.strip().split()[0]
AttributeError: 'NoneType' object has no attribute 'find'

Then, I tried it again but with a smaller number:

Pages: 2
Async Generator: 1.8263447284698486
Ensure Future  : 1.04958176612854

Pages: 2
Async Generator: 1.7395751476287842
Ensure Future  : 0.6928927898406982

As you can see, it worked fine with a smaller number of pages. I have no idea why the soup is returning None with a bigger number of pages.

NadieFiind commented 3 years ago

I added a print statement in the NHentaiAsync.search method to see which soups are actually returning None, and this is what I got:

Pages: 10
Not None
Not None
Not None
Not None
Not None
Not None
Not None
Not None
Not None
Not None
Async Generator: 9.730764865875244
None
None
None
Not None
Not None
Not None
Not None
Not None
Not None
Not None
Ensure Future  : 1.2536113262176514
AlexandreSenpai commented 3 years ago

I'll investigate why the search method sometimes returns None. Thanks for the report.

I'm working on migrating the API from a plain web scraper to an nhentai API wrapper. That will improve the consistency of the methods' results.

About ensure_future vs. the async generator: I understand your points, and I agree with your change to use the async generator. Let's continue with this strategy.

୧☉□☉୨

NadieFiind commented 3 years ago

I made a new commit. Please read the description. I played around with the concurrent_tasks argument. With fewer than 7 concurrent tasks I don't get any None errors, but with 7 or more, I start getting them.
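
For anyone running into the same limit, a common way to cap in-flight requests is an asyncio.Semaphore. A minimal sketch under the thread's assumptions (fetch_page is a hypothetical placeholder for the real search request; concurrent_tasks mirrors the commit's argument name):

    import asyncio
    from typing import List

    async def fetch_page(page: int) -> str:
        # Hypothetical placeholder for the real search request.
        await asyncio.sleep(0.1)
        return f"page {page}"

    async def fetch_all_pages(max_pages: int, concurrent_tasks: int = 6) -> List[str]:
        # The semaphore allows at most `concurrent_tasks` requests in flight,
        # throttling the burst that triggered the None errors at 7+ requests.
        semaphore = asyncio.Semaphore(concurrent_tasks)

        async def bounded_fetch(page: int) -> str:
            async with semaphore:
                return await fetch_page(page)

        return await asyncio.gather(*(bounded_fetch(page) for page in range(1, max_pages + 1)))

This keeps the speed advantage of overlapping requests while staying under whatever per-client limit the server enforces.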