Unstructured-IO / unstructured-js-client

A TypeScript client for the Unstructured hosted API
MIT License
40 stars, 12 forks

splitPdfConcurrencyLevel causes "Error in chunkDocument TypeError: Body is unusable" #96

Closed: hubert-rutkowski85 closed this issue 1 month ago

hubert-rutkowski85 commented 3 months ago

One user reported a problem: https://unstructuredw-kbe4326.slack.com/archives/C044N0YV08G/p1720447185282449

The problem occurs when splitPdfPage: true is set and splitPdfConcurrencyLevel is higher than 1, that is, when documents are being split.

It works reliably with splitPdfPage: false or splitPdfConcurrencyLevel: 1.

At first we suspected an older Node version, but the error persists after switching to Node 21. The user tried on Linux (Ubuntu) and macOS. Unstructured tried to reproduce it using the user's files but could not get consistent results, so the bug may be intermittent. Attaching the files and a code sample.

const partitions = await this.client.general.partition({
  partitionParameters: {
    languages: ['fr'],
    files: {
      content: buffer,
      fileName: filename,
    },
    chunkingStrategy: ChunkingStrategy.ByTitle,
    combineUnderNChars: 100,
    maxCharacters: 3000,
    newAfterNChars: 2500,

    // The error only appears with splitPdfPage: true and a concurrency level above 1.
    splitPdfPage: true,
    splitPdfConcurrencyLevel: 10,
    strategy: Strategy.Auto,
  },
});

554504_RC_00_M2023_013_AUDITS_ENERGETIQUES_RC (1).pdf 00_M2023_013_AUDITS_ENERGETIQUES_RC (1).pdf M2023_013_AUDITS_ENERGETIQUES_CCP.pdf
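
For context on the error message itself: with the fetch implementation built into Node, a TypeError with a "Body is unusable" message is what you get when a Request or Response body is read a second time after it has already been consumed. The sketch below only illustrates that general behavior. Whether the client reuses a request body when split pages are sent concurrently is an assumption, since the root cause is not confirmed here. The URL and the function names readTwice and readWithClone are hypothetical.

```typescript
// Illustration only, not code from unstructured-js-client.
// Requires a runtime with global fetch classes (Node 18+).

// Hypothetical endpoint; no network request is actually made.
const EXAMPLE_URL = "https://example.invalid/partition";

// Reads the same request body twice and reports what the second read does.
async function readTwice(): Promise<string> {
  const request = new Request(EXAMPLE_URL, {
    method: "POST",
    body: "page-1-bytes",
  });
  await request.text(); // first read consumes the body stream
  try {
    await request.text(); // second read is rejected
    return "no error";
  } catch (err) {
    return err instanceof TypeError ? "TypeError" : "other error";
  }
}

// Clones the request before reading, so each copy has its own body.
async function readWithClone(): Promise<string[]> {
  const request = new Request(EXAMPLE_URL, {
    method: "POST",
    body: "page-1-bytes",
  });
  const copy = request.clone(); // must happen before the body is consumed
  return [await request.text(), await copy.text()];
}
```

If a request object were shared or retried across concurrent calls, cloning it before each read is the usual way to avoid this class of error.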

awalker4 commented 1 month ago

Looks like #119 is the fix for this