apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev/python/
Apache License 2.0
4.04k stars, 259 forks

Improve storage manager and merge it with `creation_management` module #147

Open vdusek opened 4 months ago

vdusek commented 4 months ago

The current Crawlee / StorageClientManager is more or less just copied from the Python SDK / StorageClientManager and is extremely simple. Its primary role is to maintain and provide access to storage client instances based on specific input parameters.

The Crawlee TS / StorageManager is more complex and handles more: it creates storage instances and caches them.

Currently, a helper module "creation_management" in storages/ handles storage creation and caching.

Let's move logic from storages/creation_management to StorageClientManager and improve the creation & caching process.

The functions get_or_create, find_or_create_client_by_id_or_name, and create_*_from_directory should be refactored.
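For illustration, a minimal sketch of what a manager that both creates and caches storage instances could look like. This is not the current Crawlee code: the `StorageManager` class, its `open_storage` method, and the `Dataset` stand-in are all hypothetical names chosen for this example.

```python
from __future__ import annotations

from typing import Any, TypeVar

T = TypeVar("T")


class Dataset:
    """Hypothetical stand-in for a storage class (not the real Crawlee Dataset)."""

    def __init__(self, name: str) -> None:
        self.name = name


class StorageManager:
    """Hypothetical manager: creates storage instances on demand and caches them."""

    def __init__(self) -> None:
        # Cache key is (storage class, storage name).
        self._cache: dict[tuple[type, str], Any] = {}

    def open_storage(self, storage_class: type[T], name: str = "default") -> T:
        key = (storage_class, name)
        if key not in self._cache:
            # Assumption: every storage class can be built from a name alone.
            self._cache[key] = storage_class(name)
        return self._cache[key]


manager = StorageManager()
first = manager.open_storage(Dataset, "results")
second = manager.open_storage(Dataset, "results")
other = manager.open_storage(Dataset, "errors")
```

Here `first` and `second` refer to the same cached object, while `other` is a separate instance because its name differs.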

janbuchar commented 4 months ago

I could also imagine putting the functionality into a module instead of a singleton class, so basically StorageManager -> creation_management, not vice versa.
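A rough sketch of that module-based alternative, where the cache lives at module level and plain functions replace the singleton class. The names `open_storage`, `clear_cache`, and the `factory` parameter are assumptions made for this example, not existing functions in `creation_management`.

```python
from __future__ import annotations

from typing import Any, Callable

# Module-level cache shared by all callers; key is (storage kind, storage name).
_storage_cache: dict[tuple[str, str], Any] = {}


def open_storage(kind: str, name: str, factory: Callable[[str], Any]) -> Any:
    """Return the cached storage for (kind, name), creating it via `factory` if needed."""
    key = (kind, name)
    if key not in _storage_cache:
        _storage_cache[key] = factory(name)
    return _storage_cache[key]


def clear_cache() -> None:
    """Drop all cached storages (useful between tests)."""
    _storage_cache.clear()


created: list[str] = []


def make_dataset(name: str) -> dict:
    # Hypothetical factory; records each call so the caching is observable.
    created.append(name)
    return {"name": name}


a = open_storage("dataset", "results", make_dataset)
b = open_storage("dataset", "results", make_dataset)
```

The factory runs only once for the two calls above. The trade-off is that module-level state is implicitly global, so something like `clear_cache` is needed to reset it between tests.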

janbuchar commented 4 months ago

This code should not check the implementation in use - it's a generic storage manager that should not be concerned with the concrete implementation.
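One way to read this comment, sketched under assumptions: the manager depends only on an interface, and each concrete client supplies the behavior, so no `isinstance` checks are needed. The `StorageClient` protocol, the two client classes, and the `open_dataset` signature are hypothetical and only illustrate the idea.

```python
from __future__ import annotations

from typing import Protocol


class StorageClient(Protocol):
    """Hypothetical interface every storage client is assumed to satisfy."""

    def open_dataset(self, name: str) -> dict: ...


class MemoryClient:
    def open_dataset(self, name: str) -> dict:
        return {"name": name, "backend": "memory"}


class FileSystemClient:
    def open_dataset(self, name: str) -> dict:
        return {"name": name, "backend": "filesystem"}


def open_dataset(client: StorageClient, name: str) -> dict:
    # The manager never inspects the concrete client type; it just calls the interface.
    return client.open_dataset(name)


in_memory = open_dataset(MemoryClient(), "results")
on_disk = open_dataset(FileSystemClient(), "results")
```

Adding a new backend then means adding a class that satisfies the protocol, with no changes to the manager function.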