Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.65k stars, 704 forks

Local API Error: `by_similarity` Chunking Strategy Not Recognized #3155

Closed: eduardolundgren closed this issue 3 months ago

eduardolundgren commented 3 months ago

Describe the bug

When running the Docker image quay.io/unstructured-io/unstructured-api:0.0.68 locally, a request that sets the chunking strategy to by_similarity is rejected with a validation error.

To Reproduce

Steps to reproduce the behavior:

  1. Start the Docker container with quay.io/unstructured-io/unstructured-api:0.0.68.
  2. Run the following curl command:
curl -X POST http://localhost:30000/general/v0/general \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H 'unstructured-api-key: <YOUR_API_KEY>' \
  -F 'files=@Sample.pdf' \
  -F 'strategy=auto' \
  -F 'chunking_strategy=by_similarity'

Expected behavior

The API should process the request using the by_similarity chunking strategy without errors.

Actual behavior

The API returns an error indicating that chunking_strategy should be by_title:

{
  "detail": [
    {
      "type": "literal_error",
      "loc": ["body", "chunking_strategy"],
      "msg": "Input should be 'by_title'",
      "input": "by_similarity",
      "ctx": {"expected": "'by_title'"}
    }
  ]
}
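For anyone scripting against the local API, the error body above can be inspected programmatically to see which field was rejected and which value the server expected. The following is a minimal sketch that assumes the response always has the shape shown in this issue; the helper name `literal_errors` is hypothetical and not part of the Unstructured API or client.

```python
import json


def literal_errors(response_text):
    """Return (field, rejected_input, expected) for each literal_error entry.

    Hypothetical helper: assumes the error body has the shape shown in this issue.
    """
    payload = json.loads(response_text)
    detail = payload.get("detail")
    # "detail" may be a plain string for other error kinds; only handle the list form.
    if not isinstance(detail, list):
        return []
    found = []
    for item in detail:
        if item.get("type") != "literal_error":
            continue
        # "loc" is a path such as ["body", "chunking_strategy"]; the last part is the field.
        field = item["loc"][-1]
        expected = item.get("ctx", {}).get("expected", "")
        found.append((field, item.get("input"), expected))
    return found


# The error body reported in this issue, copied verbatim.
sample = """
{
  "detail": [
    {
      "type": "literal_error",
      "loc": ["body", "chunking_strategy"],
      "msg": "Input should be 'by_title'",
      "input": "by_similarity",
      "ctx": {"expected": "'by_title'"}
    }
  ]
}
"""

for field, rejected, expected in literal_errors(sample):
    print(f"{field}: {rejected!r} rejected, server expects {expected}")
```

Reading the `expected` value this way shows which strategies a given image accepts without having to guess from the documentation.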

Environment

Docker image: quay.io/unstructured-io/unstructured-api:0.0.68
Local machine setup

Additional context

This issue occurs consistently with the latest image. The API does not appear to recognize the by_similarity chunking strategy, even though it is documented.

shreyanid commented 3 months ago

Hi @eduardolundgren! The by_similarity chunking strategy is only available through the Unstructured hosted APIs, not locally (this is mentioned at the top of that strategy's section in the documentation).
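A client that targets both the hosted API and a local container could choose the strategy before sending the request. The sketch below is only an illustration: `LOCAL_STRATEGIES` and `resolve_chunking_strategy` are hypothetical names, and the set of locally accepted strategies is an assumption taken from the validation error in this issue (the 0.0.68 image names only by_title); other image versions may accept more.

```python
# Assumption: the local 0.0.68 image accepts only the strategy named in its
# validation error. Adjust this set for other image versions.
LOCAL_STRATEGIES = {"by_title"}


def resolve_chunking_strategy(requested, hosted, fallback="by_title"):
    """Return the chunking strategy to send, given where the request is going.

    hosted=True  -> pass the requested strategy through unchanged.
    hosted=False -> fall back when the local image would reject the request.
    """
    if hosted or requested in LOCAL_STRATEGIES:
        return requested
    return fallback


# Local container: by_similarity is replaced with the fallback.
print(resolve_chunking_strategy("by_similarity", hosted=False))
# Hosted API: the requested strategy is kept.
print(resolve_chunking_strategy("by_similarity", hosted=True))
```

Falling back silently changes how the document is chunked, so a real client may prefer to log the substitution or raise an error instead.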