Open immangat opened 2 months ago
Summary: Currently, Scrapy spiders are executed within a thread spawned by a Django endpoint. While this allows non-blocking execution, it ties the task's lifecycle to the Django process, leading to potential scalability and resource-management issues. To decouple the spider execution from Django, Celery should be used to manage Scrapy tasks in a dedicated worker process.
Tasks:
Research:
Implementation:
Testing and Validation:
Discussion with Team:
Rough draft of the flow
graph TD
    subgraph Django App
        A[HTTP Request] --> B[Celery Task Trigger]
    end
B --> C[Redis Message Broker]
C --> D[Celery Worker]
D --> E[Harvester]
E --> F[Airbnb Website]
E --> G[PostgreSQL Database]
F --> E
Research Celery to offload the Scrapy workload from Django.
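
As a starting point for the research, the flow in the draft could be sketched roughly as below. This is a minimal sketch under assumptions, not the project's actual code: the spider name `airbnb`, the task name `harvester.run_spider`, the helper names, and the Redis URL are all hypothetical. It launches each crawl as a `scrapy crawl` subprocess rather than in-process, because Twisted's reactor cannot be restarted inside a long-lived worker process.

```python
import subprocess
import sys


def build_crawl_command(spider_name, spider_args=None):
    """Build the `scrapy crawl` command line for one spider run."""
    command = [sys.executable, "-m", "scrapy", "crawl", spider_name]
    # Each spider argument is passed as `-a key=value`; sorted for determinism.
    for key, value in sorted((spider_args or {}).items()):
        command += ["-a", f"{key}={value}"]
    return command


def run_spider(spider_name, spider_args=None, project_dir=".", timeout=3600):
    """Run the spider in a child process and return its exit code.

    `project_dir` is assumed to be the Scrapy project root (where scrapy.cfg lives).
    """
    completed = subprocess.run(
        build_crawl_command(spider_name, spider_args),
        cwd=project_dir,
        timeout=timeout,
        check=False,
    )
    return completed.returncode


def create_celery_app(broker_url="redis://localhost:6379/0"):
    """Create a Celery app backed by Redis and register the crawl task.

    Celery is imported here so the helpers above stay usable without it.
    The broker URL is an assumption; adjust it to the real Redis instance.
    """
    from celery import Celery

    app = Celery("harvester", broker=broker_url)

    @app.task(name="harvester.run_spider")  # hypothetical task name
    def run_spider_task(spider_name, spider_args=None):
        return run_spider(spider_name, spider_args)

    return app, run_spider_task
```

With something like `app, run_spider_task = create_celery_app()` at module level, the Django endpoint would call `run_spider_task.delay("airbnb", {"city": "Vancouver"})` and return immediately, while a separately started Celery worker picks the message up from Redis and runs the crawl.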