apify / actor-templates

This project is the :house: home of Apify actor template projects to help users quickly get started.
https://apify.com/

Improve performance of the Python Actor templates #182

Closed vdusek closed 11 months ago

vdusek commented 1 year ago

In the JS SDK we have AutoscaledPool for executing tasks in parallel. We don't have similar functionality in the Python SDK yet; however, it's planned for the upcoming months.

For now, users could just write some simple utility themselves, using asyncio.Queue to get what they need (https://docs.python.org/3/library/asyncio-queue.html#examples).

Writing AutoscaledPool might take a while. In the meantime, we could update our templates with some very simple parallelism using asyncio.Queue; it would probably take about 10 extra lines.
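To make the idea concrete, here is a minimal sketch of a fixed-size worker pool built on asyncio.Queue, following the pattern from the Python docs linked above. It is not the template code and not AutoscaledPool: `process_url`, `run_pool`, and the `concurrency` parameter are hypothetical names chosen for illustration, and the network call is simulated with `asyncio.sleep` so the snippet runs offline.

```python
import asyncio


async def process_url(url: str) -> dict:
    """Hypothetical per-URL task; a real Actor would fetch and parse here."""
    await asyncio.sleep(0.01)  # stands in for network I/O
    return {"url": url, "length": len(url)}


async def worker(queue: asyncio.Queue, results: list) -> None:
    """Pull URLs from the queue until cancelled."""
    while True:
        url = await queue.get()
        try:
            results.append(await process_url(url))
        except Exception as exc:
            # Record the failure so one bad URL does not kill the worker.
            results.append({"url": url, "error": repr(exc)})
        finally:
            # Always mark the item as done, otherwise queue.join() never returns.
            queue.task_done()


async def run_pool(urls: list[str], concurrency: int = 4) -> list[dict]:
    """Process all URLs with at most `concurrency` tasks in flight."""
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)

    results: list[dict] = []
    workers = [
        asyncio.create_task(worker(queue, results)) for _ in range(concurrency)
    ]

    await queue.join()  # wait until every queued URL has been processed

    # The workers loop forever, so cancel them once the queue is drained.
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


urls = [f"https://example.com/{i}" for i in range(10)]
results = asyncio.run(run_pool(urls, concurrency=4))
```

Results arrive in completion order, not input order. Unlike AutoscaledPool, the concurrency here is fixed and does not adapt to available CPU or memory.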

The issue is based on the Discord question:

Hi, I have a custom Python + requests Actor that works great. It's pretty simple, it works against a list of starting URLs and pulls out a piece of information per URL.

My question is: If (for example) one run of 1,000 input URLs takes an hour to complete, I would like to parallelize it 4 ways so that I can run 4,000 URLs in an hour.

What's the best way to do this? I could kick off 4 copies of the run with segmented data, but this seems like something Apify could support natively.

I saw that if I was using Crawlee (and therefore JS) I could use autoscaling: https://docs.apify.com/platform/actors/running/usage-and-resources . But is there a way to build a single Python-based Actor that uses more threads/CPU cores if needed?

vdusek commented 11 months ago

I did some initial experiments with the BeautifulSoup Template Actor, comparing the use of the synchronous requests library to the asynchronous httpx library. The Actor code and the input remained the same for both scenarios.

When using requests, the execution time was 43 seconds (Actor run - https://console.apify.com/actors/runs/CtmF9T6KRGC5FnJud#log).

When using httpx, the execution time improved to 26 seconds (Actor run - https://console.apify.com/actors/runs/VDcmt5tVlvmd9ree1#log).

Considering that our SDK is already asynchronous-only, there should be no problem in transitioning from the synchronous requests to the asynchronous httpx. So this will be a good first step.
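As a rough illustration of what that transition looks like in code, the sketch below replaces a blocking `requests.get(url)` call with an awaited `client.get(url)`. To keep it runnable offline, it uses `FakeAsyncClient` and `FakeResponse`, hypothetical stand-ins that mimic only the small part of the `httpx.AsyncClient` / `httpx.Response` interface used here (`get`, `raise_for_status`, `text`). With httpx installed, the real client would presumably be swapped in as noted in the comments. The helper name `fetch_html` is also an assumption, not the template's code.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Stand-in for httpx.Response; only the members used below."""

    text: str
    status_code: int = 200

    def raise_for_status(self) -> None:
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP {self.status_code}")


class FakeAsyncClient:
    """Hypothetical stand-in for httpx.AsyncClient so the sketch runs offline."""

    async def __aenter__(self) -> "FakeAsyncClient":
        return self

    async def __aexit__(self, *exc_info) -> None:
        return None

    async def get(self, url: str) -> FakeResponse:
        await asyncio.sleep(0.01)  # simulated network latency
        return FakeResponse(text=f"<html><title>{url}</title></html>")


async def fetch_html(client, url: str) -> str:
    # Before (blocks the event loop): response = requests.get(url)
    response = await client.get(url)
    response.raise_for_status()
    return response.text


async def main(urls: list[str]) -> list[str]:
    # With httpx installed, this line would become:
    #     async with httpx.AsyncClient() as client:
    async with FakeAsyncClient() as client:
        return [await fetch_html(client, url) for url in urls]


pages = asyncio.run(main(["https://example.com/a", "https://example.com/b"]))
```

The loop still handles one URL at a time; the difference is that waiting on the response no longer blocks the event loop that the asynchronous SDK runs on.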

vdusek commented 11 months ago

After a discussion with @B4nan, we decided to keep the templates as simple as possible and not to add the "parallelism using asyncio.Queue". So the only optimization made as part of this issue is the use of HTTPX instead of Requests; further performance improvements will come from implementing AutoscaledPool in the Python SDK.