Borewit / strtok3

Promise based streaming tokenizer
MIT License

Error: End-Of-Stream #1073

Open vecerek opened 8 months ago

vecerek commented 8 months ago

Hi 👋

My project has the following dependencies:

dependency      version
@tokenizer/s3   0.2.3
file-type       16.5.4

I recently tried upgrading the file-type library to 18.5.0 after migrating to ESM. Since then, I've noticed an elevated error rate when validating files, with the following error:

Error: End-Of-Stream
    at ReadStreamTokenizer.readBuffer (/app/node_modules/strtok3/lib/ReadStreamTokenizer.js:41:19)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async ReadStreamTokenizer.ignore (/app/node_modules/strtok3/lib/ReadStreamTokenizer.js:89:31)
    at async _fromTokenizer (/app/node_modules/file-type/core.js:589:3)
    at async taskEither.tryCatch._typename (file:///app/packages/file-type-detector/dist/index.js:16:17)

After reverting the upgrade, the error rate declined, but I still see occasional End-Of-Stream errors. This is how I create the tokenizer:

    import { makeTokenizer } from "@tokenizer/s3";

    // makeTokenizer returns a promise, so the result is awaited:
    const tokenizer = await makeTokenizer(s3, { Bucket, Key, VersionId }, { disableChunked: true });
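
For completeness, this is roughly how the tokenizer then gets handed to file-type (a sketch of our usage, not the exact code; `fileTypeFromTokenizer` is file-type's ESM API, which matches the `_fromTokenizer` frame in the stack trace above):

    import { fileTypeFromTokenizer } from "file-type";

    // `tokenizer` is the result of makeTokenizer() above.
    const fileType = await fileTypeFromTokenizer(tokenizer);
    console.log(fileType); // e.g. { ext: 'png', mime: 'image/png' }, or undefined if unrecognized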

Is this the correct repository to report the issue in? Some alternatives I considered were https://github.com/Borewit/tokenizer-s3 as well as https://github.com/sindresorhus/file-type.

Borewit commented 8 months ago

Hi @vecerek.

Can you provide code demonstrating this problem?

bqp-articulate commented 4 months ago

I was running into something similar to the OP; however, I think it was more a misunderstanding on my part of how peek-type operations actually work on streams in Node.js, which is different from how they work in a language like Python. Namely, peeks must change the state of the stream being read, and once all the bytes have been read, regardless of who did the reading, you can get an End-Of-Stream.

In my case, I was running into something like this when peeking very small files. The file-type library, in one particular place, needs to peek a 512-byte signature using strtok3, and this consumes the whole underlying stream if the stream is shorter than 512 bytes. While strtok3 and file-type give the correct output, their side effects need to be taken into account.

I made a quick test to show this behavior. I don't believe this to be a bug in strtok3 at all; it's more of a gotcha that one needs to be aware of. The second expect fails because the peeks cause the underlying stream to be consumed:

    it('peekStream issue', async () => {
      // Requires: import { Readable } from 'node:stream';
      const strtok3 = await import('strtok3');

      // A 500-byte source stream.
      const inputBuffer = Buffer.alloc(500, 1);
      const stream = Readable.from(inputBuffer, { objectMode: false });
      const tokenizer = await strtok3.fromStream(stream);

      expect(stream.readable).toBe(true);

      // Each peek asks for more bytes than the stream holds; mayBeLess
      // avoids an End-Of-Stream error, but the first peek still drains the
      // source stream into the tokenizer's internal buffer.
      const outputBuffer = Buffer.alloc(1000, 10);
      await tokenizer.peekBuffer(outputBuffer, { length: 1000, mayBeLess: true });
      await tokenizer.peekBuffer(outputBuffer, { length: 1000, mayBeLess: true });
      await tokenizer.peekBuffer(outputBuffer, { length: 1000, mayBeLess: true });

      // Fails: the underlying stream has been fully consumed by the peeks.
      expect(stream.readable).toBe(true);
    });

To avoid some of the issues this can cause, I've gotten into the habit of using separate pass-through streams when doing analysis alongside something else, such as storage; for example, sampling the content type of a file with a tokenizer prior to storing it. A sketch of that pattern follows.
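
A minimal sketch of the tee pattern, assuming `source` is any readable stream and `uploadSomewhere` is a hypothetical storage sink; file-type reads the analysis branch while the storage branch still receives every byte:

    import { PassThrough } from 'node:stream';
    import { fileTypeFromStream } from 'file-type';

    // `source` is any readable stream (e.g. an incoming upload);
    // `uploadSomewhere` is a hypothetical sink that consumes a stream and returns a promise.
    async function storeWithContentType(source, uploadSomewhere) {
      // Tee the source: one branch for analysis, one for storage.
      const analysis = new PassThrough();
      const storage = new PassThrough();
      source.pipe(analysis);
      source.pipe(storage);

      const [fileType] = await Promise.all([
        // file-type only needs the header of the analysis branch; resume()
        // drains whatever is left so backpressure on the analysis branch
        // cannot stall the storage branch.
        fileTypeFromStream(analysis).finally(() => analysis.resume()),
        uploadSomewhere(storage),
      ]);

      return fileType; // e.g. { ext: 'png', mime: 'image/png' }, or undefined
    }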