CodeForBc / airbnb-regulation

Research Celery #42

Open immangat opened 2 months ago

immangat commented 2 months ago

Research Celery as a way to offload the Scrapy workload from Django.

immangat commented 1 day ago

Summary: Currently, Scrapy spiders are executed within a thread spawned by a Django endpoint. While this keeps the request non-blocking, it ties the crawl's lifecycle to the Django process: a restart or crash of the web process ends any in-flight crawl, and crawls compete with request handling for the same resources. To decouple spider execution from Django, Celery should be used to run Scrapy tasks in a dedicated worker process.

Tasks:

  1. Research:

    • Analyze how to configure Celery with Django and Redis as the message broker.
    • Investigate running Scrapy spiders asynchronously as Celery tasks.
    • Consider how to handle Scrapy logs and results effectively when executed in Celery.
  2. Implementation:

    • Set up Celery and Redis in the project.
    • Create a Celery task to run the Scrapy spider.
    • Refactor the Django endpoint to trigger the Celery task instead of running Scrapy in a thread.
  3. Testing and Validation:

    • Verify that the spider runs independently of the Django request/response cycle.
    • Test the Celery worker for concurrency and resource isolation.
    • Ensure logging and data persistence work seamlessly with Celery.
  4. Discussion with Team:

    • Evaluate whether adding Celery and Redis as dependencies is worth the complexity it introduces to the project.
    • Consider team familiarity with Celery, its long-term maintenance, and alternative approaches (e.g., a simpler task queue).
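
One wrinkle worth checking during the research step: Scrapy runs on Twisted, whose reactor cannot be restarted within a process, so a long-lived worker that starts a crawl in-process can usually do so only once. A common workaround is to launch `scrapy crawl` as a child process from the task body. Below is a minimal sketch of such a task body, not a settled design. The spider name `listings`, the `city` spider argument, and the log file name are hypothetical, and the Celery wiring is left in comments so the sketch runs with only the standard library.

```python
import subprocess


def build_crawl_command(spider_name, spider_args=None, log_file=None):
    """Build the argv list for `scrapy crawl <spider_name>`."""
    cmd = ["scrapy", "crawl", spider_name]
    # Spider arguments are passed as repeated `-a NAME=VALUE` options.
    for key, value in sorted((spider_args or {}).items()):
        cmd += ["-a", f"{key}={value}"]
    # A per-crawl log file keeps Scrapy output separate from the worker log.
    if log_file is not None:
        cmd += ["--logfile", str(log_file)]
    return cmd


def run_spider(spider_name, project_dir, spider_args=None, log_file=None, timeout=None):
    """Run the spider in a child process and return its exit code.

    `project_dir` must be the directory containing scrapy.cfg.
    """
    cmd = build_crawl_command(spider_name, spider_args, log_file)
    completed = subprocess.run(cmd, cwd=project_dir, timeout=timeout, check=False)
    return completed.returncode


# Assumed Celery wiring (requires the celery package and a configured app):
#
#   from celery import shared_task
#
#   @shared_task
#   def crawl(spider_name, spider_args=None):
#       return run_spider(spider_name, project_dir="<path to Scrapy project>",
#                         spider_args=spider_args)

command = build_crawl_command("listings", {"city": "vancouver"}, "crawl.log")
print(command)
```

The Django endpoint would then call `crawl.delay("listings", {"city": "vancouver"})` and return without waiting for the crawl to finish. Whether a subprocess per crawl is acceptable, compared with alternatives such as forking inside the worker, is part of what the team discussion should settle.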

immangat commented 1 day ago

Rough draft of the flow

graph TD
    subgraph Django App
        A[HTTP Request] --> B[Celery Task Trigger]
    end

    B --> C[Redis Message Broker]
    C --> D[Celery Worker]
    D --> E[Harvester]
    E --> F[Airbnb Website]
    E --> G[PostgreSQL Database]
    F --> E
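
To make the flow concrete, here is a toy model of the diagram that runs without Redis or Celery. Everything in it is a stand-in: `queue.Queue` plays the broker, a thread plays the worker (in the proposed design it would be a separate process), `harvest` is a placeholder for the Harvester, and the `results` list stands in for PostgreSQL. The point is only the shape of the interaction: the endpoint enqueues a message and returns at once, and the worker picks the job up independently.

```python
import queue
import threading

broker = queue.Queue()  # stand-in for the Redis message broker
results = []            # stand-in for rows written to PostgreSQL


def harvest(city):
    """Placeholder Harvester; a real one would scrape and persist listings."""
    return {"city": city, "status": "harvested"}


def worker():
    """Stand-in for a Celery worker: consume messages until a None sentinel."""
    while True:
        message = broker.get()
        if message is None:
            break
        results.append(harvest(message["city"]))


def handle_request(city):
    """Stand-in for the Django endpoint: enqueue the job and return at once."""
    broker.put({"city": city})
    return {"status": 202, "detail": "crawl queued"}


worker_thread = threading.Thread(target=worker)
worker_thread.start()

response = handle_request("vancouver")  # returns before the harvest finishes
broker.put(None)                        # tell the worker to stop
worker_thread.join()

print(response)
print(results)
```

Returning a 202-style response with a job identifier, rather than the crawl result itself, is one way the refactored endpoint could report that work was accepted.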