Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
The current Crawlee StorageClientManager is more or less copied from the Python SDK's StorageClientManager and is extremely simple. Its primary role is to maintain and provide access to storage client instances based on specific input parameters.
The Crawlee TS StorageManager is more complex and takes care of more things, such as creating storage instances and caching them.
Currently, we have a helper module "creation_management" in storages/ which handles this.
Let's move logic from storages/creation_management to StorageClientManager and improve the creation & caching process.
The functions get_or_create, find_or_create_client_by_id_or_name, and create_*_from_directory should be refactored.
I could also imagine putting the functionality into a module instead of a singleton class, so basically StorageManager -> creation_management, not vice versa.
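As a rough illustration of the module-based direction, the creation-and-caching logic could live in plain module-level functions rather than a singleton class. The sketch below is hypothetical: the Dataset stand-in, the cache key shape, and the get_or_create signature are assumptions for illustration, not the actual Crawlee API.

```python
# Hypothetical sketch: module-level creation & caching instead of a
# singleton StorageClientManager. All names here are illustrative.
from __future__ import annotations

from typing import TypeVar

T = TypeVar('T')


class Dataset:
    """Stand-in for a Crawlee storage class (illustration only)."""

    def __init__(self, id: str, name: str | None = None) -> None:
        self.id = id
        self.name = name


# Cache keyed by (storage class, id-or-name), mirroring the kind of
# caching the TS StorageManager performs.
_cache: dict[tuple[type, str], object] = {}


def get_or_create(storage_class: type[T], *, id: str | None = None, name: str | None = None) -> T:
    """Return a cached storage instance, creating it on first access."""
    key = (storage_class, id or name or 'default')
    if key not in _cache:
        # In the real implementation this would delegate to the
        # storage client (e.g. an open-from-directory helper).
        _cache[key] = storage_class(id=id or 'generated-id', name=name)
    return _cache[key]  # type: ignore[return-value]
```

Calling get_or_create(Dataset, name='x') twice would then return the same instance, which is the caching behavior the TS StorageManager provides and that this refactoring aims to bring over.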