Research Celery - GitHub Issues

immangat commented 2 months ago

Research Celery to offload the Scrapy workload from Django.

immangat commented 1 day ago

Summary: Currently, Scrapy spiders are executed within a thread spawned by a Django endpoint. While this allows non-blocking execution, it ties the task's lifecycle to the Django process, leading to potential scalability and resource-management issues. To decouple the spider execution from Django, Celery should be used to manage Scrapy tasks in a dedicated worker process.

Tasks:

Research:
- Analyze how to configure Celery with Django and Redis as the message broker.
- Investigate running Scrapy spiders asynchronously as Celery tasks.
- Consider how to handle Scrapy logs and results effectively when executed in Celery.
Implementation:
- Set up Celery and Redis in the project.
- Create a Celery task to run the Scrapy spider.
- Refactor the Django endpoint to trigger the Celery task instead of running Scrapy in a thread.
Testing and Validation:
- Verify that the spider runs independently of the Django request/response cycle.
- Test the Celery worker for concurrency and resource isolation.
- Ensure logging and data persistence work seamlessly with Celery.
Discussion with Team:
- Evaluate whether adding Celery and Redis as dependencies is worth the complexity it introduces to the project.
- Consider team familiarity with Celery, its long-term maintenance, and alternative approaches (e.g., a simpler task queue).
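
A minimal sketch of the "Celery task runs the Scrapy spider" step, under a few assumptions that are not part of this issue. The spider is launched with `scrapy crawl` in a child process, because Scrapy's Twisted reactor cannot be restarted inside a long-lived worker process, so a fresh process per run is a common workaround. The names `build_crawl_command`, `run_spider`, `register_tasks`, the task name `harvester.run_spider`, and the `listings` spider are all hypothetical. The Celery app itself is assumed to be created elsewhere with Redis as the broker.

```python
import subprocess
import sys
from typing import Any, Callable, Dict, List, Mapping, Optional


def build_crawl_command(
    spider_name: str, spider_args: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Build the `scrapy crawl` command line for one spider run."""
    command = [sys.executable, "-m", "scrapy", "crawl", spider_name]
    # Spider arguments are passed with repeated `-a NAME=VALUE` options.
    for key, value in sorted((spider_args or {}).items()):
        command += ["-a", f"{key}={value}"]
    return command


def run_spider(
    spider_name: str,
    spider_args: Optional[Mapping[str, str]] = None,
    project_dir: str = ".",
    runner: Callable[..., Any] = subprocess.run,
) -> Dict[str, Any]:
    """Run one spider in a child process and return a JSON-friendly summary."""
    completed = runner(
        build_crawl_command(spider_name, spider_args),
        cwd=project_dir,      # directory containing scrapy.cfg (assumed)
        capture_output=True,  # Scrapy writes its log to stderr by default
        text=True,
    )
    return {
        "spider": spider_name,
        "returncode": completed.returncode,
        # Keep only the end of the log so the task result stays small.
        "log_tail": (completed.stderr or "")[-2000:],
    }


def register_tasks(app: Any) -> Any:
    """Register run_spider as a task on an existing Celery app.

    Assumed setup elsewhere, for example:
        app = Celery("airbnb_regulation", broker="redis://localhost:6379/0")
    """
    return app.task(name="harvester.run_spider")(run_spider)
```

With something like this in place, the worker would be started separately from Django (for example `celery -A <module> worker`), and the returned dictionary would only be stored if a result backend is configured. Whether the log tail belongs in the task result or in a log file is one of the open questions from the Research list.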

immangat commented 1 day ago

Rough draft of the flow

graph TD
    subgraph Django App
        A[HTTP Request] --> B[Celery Task Trigger]
    end

    B --> C[Redis Message Broker]
    C --> D[Celery Worker]
    D --> E[Harvester]
    E --> F[Airbnb Website]
    E --> G[PostgreSQL Database]
    F --> E
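
The first hop of the flow above (HTTP request to Celery task trigger) could look roughly like the sketch below. It is framework-agnostic on purpose: `trigger_harvest` is a hypothetical helper, `task` is assumed to be a registered Celery task, and the 202-style payload shape is an assumption rather than anything decided in this issue.

```python
from typing import Any, Dict, Optional


def trigger_harvest(
    task: Any, spider_name: str, spider_args: Optional[Dict[str, str]] = None
) -> Dict[str, Any]:
    """Enqueue one spider run and return immediately with the task id."""
    # .delay() only publishes a message to the broker (Redis in the diagram);
    # the spider runs later in whichever Celery worker picks the message up.
    async_result = task.delay(spider_name, spider_args or {})
    return {
        "status": 202,  # accepted: work is queued, not finished
        "body": {"task_id": async_result.id, "spider": spider_name},
    }
```

Inside a Django view this would probably be wrapped as `JsonResponse(payload["body"], status=payload["status"])`, so the request returns as soon as the message is queued and the spider's lifecycle is no longer tied to the Django process.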

CodeForBc / airbnb-regulation

Research Celery #42