Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
9.21k stars 764 forks

bug/file timeout when partition #3727

Open AustinZzx opened 1 month ago

AustinZzx commented 1 month ago

Describe the bug

When partitioning the attached file, the request simply times out.

To Reproduce

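    // Import paths as documented for the unstructured-client package; adjust to your SDK version.
    import { UnstructuredClient } from "unstructured-client";
    import { PartitionResponse } from "unstructured-client/sdk/models/operations";
    import { ChunkingStrategy, Strategy } from "unstructured-client/sdk/models/shared";

    // `content` (the file bytes) and `fileName` are defined elsewhere by the caller.
    const unstructuredClient = new UnstructuredClient({
      security: {
        apiKeyAuth: process.env.UNSTRUCTURED_IO_API_KEY,
      },
    });

    const res: PartitionResponse = await unstructuredClient.general.partition(
      {
        partitionParameters: {
          files: {
            content: content,
            fileName: fileName,
          },
          splitPdfPage: true,
          splitPdfAllowFailed: true,
          splitPdfConcurrencyLevel: 15,
          strategy: Strategy.Auto,
          languages: ["eng"],
          chunkingStrategy: ChunkingStrategy.ByTitle,
          maxCharacters: 8000, // our embedding model max token is 8192
        },
      },
      {
        retries: {
          strategy: "backoff",
          backoff: {
            initialInterval: 500, // ms
            maxInterval: 60000, // ms
            exponent: 1.5,
            maxElapsedTime: 900000, // ms (15 mins)
          },
          retryConnectionErrors: true,
        },
        timeoutMs: 10 * 60 * 1000, // 10 mins
      },
    );

    if (res.statusCode !== 200) {
      throw res.rawResponse;
    }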
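For anyone triaging this, it may help to see how long the retry settings above can keep a request going. The sketch below lists the waits a plain exponential backoff would produce from those four values. `backoffDelays` and `BackoffConfig` are hypothetical names, not part of the SDK, and the sketch assumes no jitter and ignores the time each attempt itself takes, so the SDK's real schedule may differ. The snippet above does not show whether `timeoutMs` applies per attempt or to the whole call, so that is left open here.

```typescript
// Hypothetical helper, not part of unstructured-client.
// Assumes delay(n) = min(initialInterval * exponent^n, maxInterval), no jitter.
interface BackoffConfig {
  initialInterval: number; // ms before the first retry
  maxInterval: number;     // ms cap on any single wait
  exponent: number;        // growth factor between waits
  maxElapsedTime: number;  // ms budget for all waits combined
}

function backoffDelays(cfg: BackoffConfig): number[] {
  const delays: number[] = [];
  let elapsed = 0;
  for (let attempt = 0; ; attempt++) {
    const delay = Math.min(
      cfg.initialInterval * Math.pow(cfg.exponent, attempt),
      cfg.maxInterval,
    );
    // Stop once the next wait would exceed the total budget.
    if (elapsed + delay > cfg.maxElapsedTime) break;
    delays.push(delay);
    elapsed += delay;
  }
  return delays;
}

// Values copied from the reproduction above.
const delays = backoffDelays({
  initialInterval: 500,
  maxInterval: 60000,
  exponent: 1.5,
  maxElapsedTime: 900000,
});
const totalWaitMs = delays.reduce((sum, d) => sum + d, 0);
console.log(delays.length, "waits,", Math.round(totalWaitMs / 1000), "s of waiting in total");
```

Under these assumptions the waits reach the 60-second cap after about a dozen retries, so most of the 15-minute budget is spent in capped waits.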

Expected behavior

The partition request should succeed.


Environment Info

Running in Node.js, using Unstructured serverless.

Additional context

The file that triggers the error has been attached here: [Uploading error.xlsx…]()