API For Web Scraping / Processing

solaris007 commented 7 months ago

To integrate the PoC-style Content Scraper and Content Processor, we need an HTTP API that provides the following features:

Trigger an async scraping -> processing task: the content-scraper scrapes the content at the input URL, stores the results, and forwards the task to the content-processor.
Check the status of a triggered task and, once they are available, get the results of the processor stages/handlers.

Here is a proposal for amending the HTTP API spec:

openapi: 3.0.0
info:
  title: Web Scraping and Processing API
  version: 1.0.0
paths:
  /scrape:
    post:
      summary: Initiates a web scraping job.
      description: Triggers a new scraping job for the given URL and returns a task ID for status polling.
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              properties:
                url:
                  type: string
                  format: uri
                  description: The URL to be scraped.
              required:
                - url
            examples:
              example-1:
                value: { "url": "https://example.com" }
      responses:
        '202':
          description: Accepted. The scraping job is initiated, and a task ID is returned.
          content:
            application/json:
              schema:
                type: object
                properties:
                  taskId:
                    type: string
                    description: The unique identifier for the scraping task.
              examples:
                example-1:
                  value: { "taskId": "12345" }
        '400':
          description: Bad Request. The URL is invalid or missing.
        '429':
          description: Too Many Requests. Rate limit exceeded.
        '500':
          description: Internal Server Error.

  /scrape/{taskId}:
    get:
      summary: Polls the status and results of a scraping job.
      description: Retrieves the status and, if available, the results of a scraping job by task ID.
      parameters:
        - in: path
          name: taskId
          required: true
          schema:
            type: string
          description: The unique identifier for the scraping task.
      responses:
        '200':
          description: OK. Returns the status of the scraping job and results if completed.
          content:
            application/json:
              schema:
                type: object
                properties:
                  status:
                    type: string
                    description: The current status of the job ('pending', 'in_progress', 'completed', 'failed').
                  results:
                    type: object
                    properties:
                      translation:
                        type: string
                        description: URL or location of the translation result.
                      seoKeywords:
                        type: string
                        description: URL or location of the SEO keyword extraction result.
                      sentimentAnalysis:
                        type: string
                        description: URL or location of the sentiment analysis result.
              examples:
                pending:
                  value:
                    status: "pending"
                completed:
                  value:
                    status: "completed"
                    results:
                      translation: "https://results.example.com/translation/12345"
                      seoKeywords: "https://results.example.com/seo/12345"
                      sentimentAnalysis: "https://results.example.com/sentiment/12345"
        '404':
          description: Not Found. The task ID does not exist.
        '429':
          description: Too Many Requests. Rate limit exceeded.
        '500':
          description: Internal Server Error.

solaris007 commented 7 months ago

@iuliag @ekremney @dzehnder @AndreiAlexandruParaschiv @alinarublea please review / provide input

iuliag commented 7 months ago

For my understanding: is the url a page URL that has nothing to do with the sites we have in StarCatalogue?

The Location header in the POST response could contain the URL to poll for the status of the task.
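
As a rough sketch of the Location idea, the 202 response of POST /scrape could declare the header as below. The header description, the `uri-reference` format, and the example value are my assumptions, not part of the proposal; the rest is copied from the spec above.

```yaml
# Fragment of paths./scrape.post.responses; only the headers block is new.
'202':
  description: Accepted. The scraping job is initiated, and a task ID is returned.
  headers:
    Location:
      # Assumed to point at the existing polling endpoint, /scrape/{taskId}.
      description: URL to poll for the status of the task.
      schema:
        type: string
        format: uri-reference
      example: /scrape/12345
  content:
    application/json:
      schema:
        type: object
        properties:
          taskId:
            type: string
            description: The unique identifier for the scraping task.
```

With this in place a client could follow the header instead of building the polling URL from the taskId itself.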

> The current status of the job ('pending', 'in_progress', 'completed', 'failed').

If the possible status values are known, we should use an enum. What's the difference between 'pending' and 'in_progress'? When is the task completed: only after all the subtasks are completed? Would you show partial results as the subtasks complete, or just the final results? It would be good to have an example of the response body for 'failed' as well. Generally, I think you'd have a different schema for the response body depending on the status (or state), with different required properties.
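
To make the enum and per-status schema idea concrete, here is one possible shape for the 200 response body of GET /scrape/{taskId}. All schema names (`TaskPending`, `TaskCompleted`, `TaskFailed`, `TaskResponse`) and the `error` property are hypothetical, and grouping 'pending' with 'in_progress' is only a placeholder until the difference between the two is settled.

```yaml
# Hypothetical components for the status response; names are illustrative.
components:
  schemas:
    TaskPending:
      type: object
      required: [status]
      properties:
        status:
          type: string
          enum: [pending, in_progress]
    TaskCompleted:
      type: object
      required: [status, results]
      properties:
        status:
          type: string
          enum: [completed]
        results:
          type: object
          description: Same result locations as in the proposal above.
    TaskFailed:
      type: object
      required: [status, error]
      properties:
        status:
          type: string
          enum: [failed]
        error:
          # Assumed property; the proposal does not define a failure body.
          type: string
          description: Human-readable reason for the failure.
    TaskResponse:
      # The status enums are disjoint, so a body matches exactly one branch.
      oneOf:
        - $ref: '#/components/schemas/TaskPending'
        - $ref: '#/components/schemas/TaskCompleted'
        - $ref: '#/components/schemas/TaskFailed'
```

Under these assumptions a failed task would return something like { "status": "failed", "error": "..." }, and `results` would be required only once the status is 'completed'.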

The 429 response should include the Retry-After header.
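
A minimal sketch of what that could look like in the spec, assuming the service reports the delay in seconds (HTTP also allows Retry-After to carry an HTTP-date, which this fragment does not cover):

```yaml
# Fragment of a responses block; applies to both /scrape and /scrape/{taskId}.
'429':
  description: Too Many Requests. Rate limit exceeded.
  headers:
    Retry-After:
      # Assumption: delay-seconds form only.
      description: Number of seconds to wait before retrying.
      schema:
        type: integer
        minimum: 0
      example: 30
```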

adobe / spacecat-api-service

API For Web Scraping / Processing #182