Borewit / music-metadata

Stream and file based music metadata parser for node. Supporting a wide range of audio and tag formats.
MIT License

Option to avoid parsing entire Matroska file? #2135

Closed hvianna closed 1 week ago

hvianna commented 5 months ago

Hello!

I'm having some issues when trying to retrieve the metadata of a large (15 GB) video file with parseBlob() - disk usage skyrockets and it takes about 1 minute and 20 seconds to resolve with the metadata, so it looks like it's parsing the entire file.

Sometimes the browser just crashes or I get an out-of-memory error (having the dev tools open seems to make things worse / slower).

I tried using skipPostHeaders: true and duration: false, but it seems parseBlob() doesn't take an options object.

I'd appreciate any advice.

Kind regards.

hvianna commented 5 months ago

Update:

fetchFromUrl( url, { skipPostHeaders: true } ) also doesn't seem to prevent it from reading the entire file until it returns the metadata. At least for this particular file, which is an .mkv with an AVC video track and two audio tracks (DTS and PCM).

Borewit commented 1 month ago

Does music-metadata v9.0.0 solve your issue?

The implementation of reading from Blobs has been changed from buffering to streaming.
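The buffering-vs-streaming distinction can be sketched with the stdlib Blob API alone (Node 18+, no music-metadata involved):

```javascript
// Buffering: the whole blob is materialized in memory at once.
// Streaming: chunks are consumed one at a time, so peak memory stays small.
const blob = new Blob([new Uint8Array(1 << 20)]); // stand-in for a media file

const whole = new Uint8Array(await blob.arrayBuffer()); // buffered read

let total = 0;
for await (const chunk of blob.stream()) { // streamed read
  total += chunk.length;
}
console.log(whole.length === total); // true
```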

hvianna commented 1 month ago

I'm not sure yet, music-metadata 9.0.0 gives me this error when trying to parse mkv and webm files:

[screenshot of the error]

Also, do I still need a buffer polyfill for the browser? If I remove it, I can only retrieve metadata from flac files, everything else gives me the error below:

[screenshot of the error]

I'm testing with the following code:

// for web files (URLs)
const response = await fetch( uri );
const metadata = await parseWebStream( response.body, response.headers.get('content-type'), { skipPostHeaders: true } );

// for FileSystem API files
const file = await handle.getFile();
const metadata = await parseBlob( file );

Thanks.

pcbowers commented 1 month ago

@Borewit Unless I'm missing something, it looks like parseWebStream is not being exported and thus cannot be used: https://github.com/Borewit/music-metadata/blob/v9.0.0/lib/index.ts#L11.

Furthermore, on use of this code:

const response = await fetch(`https://my/mp3/file`);
const metadata = await parseWebStream(response.body!, response.headers.get('content-type')!, {
  skipPostHeaders: true,
  includeChapters: true,
  skipCovers: true
});

I get this error:

TypeError [ERR_INVALID_ARG_VALUE]: The argument 'stream' must be a byte stream. Received ReadableStream { locked: false, state: 'readable', supportsBYOB: false }
    at new NodeError (node:internal/errors:405:5)
    at setupReadableStreamBYOBReader (node:internal/webstreams/readablestream:2155:11)
    at new ReadableStreamBYOBReader (node:internal/webstreams/readablestream:916:5)
    at ReadableStream.getReader (node:internal/webstreams/readablestream:352:12)
    at new WebStreamReader (file:///home/pcbowers/projects/hono/node_modules/.pnpm/peek-readable@5.1.1/node_modules/peek-readable/lib/WebStreamReader.js:12:30)
    at Module.fromWebStream (file:///home/pcbowers/projects/hono/node_modules/.pnpm/strtok3@7.1.0/node_modules/strtok3/lib/core.js:25:36)
    at Module.parseWebStream (file:///home/pcbowers/projects/hono/node_modules/.pnpm/music-metadata@9.0.0/node_modules/music-metadata/lib/core.js:29:39)
    at Array.eval (/home/pcbowers/projects/hono/src/index.ts:12:48)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async getRequestListener.overrideGlobalObjects (file:///home/pcbowers/projects/hono/node_modules/.pnpm/@hono+vite-dev-server@0.13.0_hono@4.4.13/node_modules/@hono/vite-dev-server/dist/dev-server.js:69:32) {
  code: 'ERR_INVALID_ARG_VALUE'
}

I wish I knew more about it, or else I would have debugged further! Leaving this here instead of opening a new issue, since I think fixing this would help solve "avoid parsing entire file".
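The error above comes from requesting a BYOB reader on a stream that is not a byte stream (which is what the strtok3/peek-readable layer does internally). A stdlib-only sketch of the distinction, runnable in Node 18+:

```javascript
// A default ReadableStream (what Node's fetch() returns for response.body)
// does not support BYOB readers, so getReader({ mode: 'byob' }) throws the
// ERR_INVALID_ARG_VALUE TypeError seen in the stack trace above.
const plain = new ReadableStream({
  start(controller) {
    controller.enqueue(new Uint8Array([1, 2, 3]));
    controller.close();
  },
});

let byobFailed = false;
try {
  plain.getReader({ mode: 'byob' }); // throws: not a byte stream
} catch (err) {
  byobFailed = true;
}

// A byte stream (type: 'bytes') does support BYOB readers:
const bytes = new ReadableStream({
  type: 'bytes',
  start(controller) {
    controller.enqueue(new Uint8Array([1, 2, 3]));
    controller.close();
  },
});
const reader = bytes.getReader({ mode: 'byob' });
console.log(byobFailed, reader.constructor.name); // true ReadableStreamBYOBReader
```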

Borewit commented 1 month ago

Please file https://github.com/Borewit/music-metadata/issues/2135#issuecomment-2224504220 as a new issue, @pcbowers; it is unrelated.

Moved https://github.com/Borewit/music-metadata/issues/2135#issuecomment-2224504220 to issue #2143

Borewit commented 1 month ago

> I'm not sure yet, music-metadata 9.0.0 gives me this error when trying to parse mkv and webm files:

That was bad. Do you mind giving it a try with v9.0.1, @hvianna?

hvianna commented 1 month ago

> > I'm not sure yet, music-metadata 9.0.0 gives me this error when trying to parse mkv and webm files:
>
> That was bad, do you mind giving it a try with v9.0.1 @hvianna ?

It works fine for flac and mp3, no more Buffer-related errors.

I'm still getting errors for webm and mkv, though.

using parseWebStream():

TypeError: Cannot read properties of undefined (reading 'docType')
    at MatroskaParser.parse (MatroskaParser.js:50:68)
    at async parse (ParserFactory.js:57:5)
    at async retrieveMetadata (index.js:3172:17)

using parseBlob():

Error: End-Of-Stream
    at ReadStreamTokenizer.readBuffer (ReadStreamTokenizer.js:44:19)
    at async MatroskaParser.readBuffer (MatroskaParser.js:221:9)
    at async MatroskaParser.parseContainer (MatroskaParser.js:151:39)
    at async MatroskaParser.parseContainer (MatroskaParser.js:139:33)
    at async MatroskaParser.parseContainer (MatroskaParser.js:139:33)
    at async MatroskaParser.parse (MatroskaParser.js:49:26)
    at async parse (ParserFactory.js:57:5)
    at async retrieveMetadata (index.js:3175:17)

Borewit commented 1 month ago

parseBlob() calls parseWebStream() internally, so it is strange that you get inconsistent results.

https://github.com/Borewit/music-metadata/blob/d6c275509df2567d23f5ff73fa08bf10cac30986/lib/core.ts#L23-L29

Do you experience the same issues here?: https://audio-tag-analyzer.netlify.app/

hvianna commented 1 month ago

> Do you experience the same issues here?: https://audio-tag-analyzer.netlify.app/

Yes, same error. I tried with a few video formats (webm, mkv, mp4).

[screenshot of the error]

File info for one of them:

General
Complete name                            : W:\DIY - Tips & Tricks - Tips in life.mp4
Format                                   : MPEG-4
Format profile                           : Base Media
Codec ID                                 : isom (isom/iso2/avc1/mp41)
File size                                : 24.9 MiB
Duration                                 : 4 min 11 s
Overall bit rate                         : 828 kb/s
Frame rate                               : 30.000 FPS
Writing application                      : Lavf58.29.100

Video
ID                                       : 1
Format                                   : AVC
Format/Info                              : Advanced Video Codec
Format profile                           : High@L3.1
Format settings                          : CABAC / 5 Ref Frames
Format settings, CABAC                   : Yes
Format settings, Reference frames        : 5 frames
Codec ID                                 : avc1
Codec ID/Info                            : Advanced Video Coding
Duration                                 : 4 min 11 s
Bit rate                                 : 692 kb/s
Width                                    : 576 pixels
Height                                   : 1 024 pixels
Display aspect ratio                     : 0.562
Frame rate mode                          : Constant
Frame rate                               : 30.000 FPS
Color space                              : YUV
Chroma subsampling                       : 4:2:0
Bit depth                                : 8 bits
Scan type                                : Progressive
Bits/(Pixel*Frame)                       : 0.039
Stream size                              : 20.8 MiB (84%)
Title                                    : Twitter-vork muxer
Writing library                          : x264 core 164 r3095 baee400
Encoding settings                        : cabac=1 / ref=5 / deblock=1:0:0 / analyse=0x3:0x113 / me=hex / subme=2 / psy=0 / mixed_ref=1 / me_range=16 / chroma_me=1 / trellis=1 / 8x8dct=1 / cqm=0 / deadzone=21,11 / fast_pskip=1 / chroma_qp_offset=0 / threads=4 / lookahead_threads=1 / sliced_threads=0 / nr=0 / decimate=1 / interlaced=0 / bluray_compat=0 / stitchable=1 / constrained_intra=0 / bframes=3 / b_pyramid=2 / b_adapt=1 / b_bias=0 / direct=1 / weightb=1 / open_gop=0 / weightp=2 / keyint=infinite / keyint_min=30 / scenecut=40 / intra_refresh=0 / rc_lookahead=40 / rc=crf / mbtree=1 / crf=28.0 / qcomp=0.60 / qpmin=10 / qpmax=69 / qpstep=4 / vbv_maxrate=2048 / vbv_bufsize=2048 / crf_max=0.0 / nal_hrd=none / filler=0 / ip_ratio=1.40 / aq=2:1.00
Codec configuration box                  : avcC

Audio
ID                                       : 2
Format                                   : AAC LC
Format/Info                              : Advanced Audio Codec Low Complexity
Codec ID                                 : mp4a-40-2
Duration                                 : 4 min 11 s
Bit rate mode                            : Constant
Bit rate                                 : 128 kb/s
Channel(s)                               : 2 channels
Channel layout                           : L R
Sampling rate                            : 44.1 kHz
Frame rate                               : 43.066 FPS (1024 SPF)
Compression mode                         : Lossy
Stream size                              : 3.84 MiB (15%)
Title                                    : Twitter-vork muxer
Default                                  : Yes
Alternate group                          : 1

Borewit commented 1 month ago

I managed to get an end-of-stream exception as well, parsing an MP4 file.

Issue may be caused by https://github.com/Borewit/peek-readable/blob/master/lib/WebStreamReader.ts

Not something I can resolve quickly.

hvianna commented 1 month ago

No problem, thanks for investigating this.

In the meantime, I'll keep testing it with more audio files. I love the fact that my bundle size has decreased by around 100 kB with the new music-metadata, compared to the latest music-metadata-browser. Awesome job!

hvianna commented 1 month ago

I did some testing with music-metadata v9.0.3 and this is what I got:

file size   container   audio streams   time to resolve
2.3 GB      mp4         aac             12 s
4.3 GB      mkv         ac3 + dts       24 s
15 GB       mkv         dts + pcm       80 s
17 GB       mkv         pcm             99 s

It still reads the entire file, even with { skipPostHeaders: true } in the options, or if I set fileInfo.size to a small value.

I'm not sure if this can be avoided at all, since I don't think you can skip to a random position in the stream (without reading all the data up to that point sequentially).

Borewit commented 1 month ago

The atom-based format parsers, MP4Parser and MatroskaParser, do not change their behavior based on any of those flags.

Changing the file size will affect how the container format is read. Whether that has an impact depends on the structure of the file; the lengths of the nested atoms will usually override the parent atom / container size.
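As an illustration of the atom structure being discussed (a sketch, not music-metadata's actual parser): each MP4 atom starts with a 32-bit big-endian size followed by a 4-byte type, and walking those headers is what lets a parser skip payloads it does not care about.

```javascript
// Walk top-level MP4 atom headers in a buffer. The special sizes 0 (atom
// extends to end of file) and 1 (64-bit size follows) are omitted for brevity.
function* atoms(buf) {
  let offset = 0;
  while (offset + 8 <= buf.length) {
    const size = buf.readUInt32BE(offset);
    const type = buf.toString('latin1', offset + 4, offset + 8);
    yield { offset, size, type };
    if (size < 8) break; // would be a 0- or 1-sized (special) atom
    offset += size;
  }
}

// Synthetic example: an 'ftyp' atom (16 bytes) followed by a 'moov' atom (8 bytes).
const buf = Buffer.alloc(24);
buf.writeUInt32BE(16, 0); buf.write('ftyp', 4);
buf.writeUInt32BE(8, 16); buf.write('moov', 20);
console.log([...atoms(buf)].map(a => `${a.type}@${a.offset}`).join(' ')); // ftyp@0 moov@16
```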

There are a few approaches possible to get your metadata result faster:

1: Read only a portion of the stream. Currently not implemented, but we could add an option to the parser to read as little as possible. The challenge with atom-based formats is that metadata atoms are not guaranteed to appear first. Neither is it straightforward to determine at which point in the stream (at which atom) we have all, or most, of the metadata.

> I don't think you can skip to a random position

No, that is not directly possible. But... the underlying tokenizer architecture (see the dependencies) is designed so that, if the underlying file access supports skipping to a random position, that capability can be utilized, which brings us to option:

2: Utilize the tokenizer. You cannot skip within a stream, but it is possible to read your file in smaller sub-streams using @tokenizer/http. This requires your web back-end to support HTTP(S) RFC 7233 range requests. The file format being read, plus the network delay, determine whether this method is more efficient, or even slower, than reading the file as a normal stream.
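A self-contained sketch of the range-request idea (roughly what @tokenizer/http relies on, not its actual code); a tiny local server stands in for a range-capable remote host:

```javascript
import http from 'node:http';

// Fetch only bytes [start, end] of a remote file via an RFC 7233 range request.
async function readRange(url, start, end) {
  const res = await fetch(url, { headers: { Range: `bytes=${start}-${end}` } });
  if (res.status !== 206) throw new Error('server ignored the Range header');
  return new Uint8Array(await res.arrayBuffer());
}

const data = Buffer.alloc(1024, 0xab); // stand-in for a media file
const server = http.createServer((req, res) => {
  const m = /^bytes=(\d+)-(\d+)$/.exec(req.headers.range ?? '');
  if (!m) { res.writeHead(200, { Connection: 'close' }).end(data); return; }
  const [start, end] = [Number(m[1]), Number(m[2])];
  res.writeHead(206, {
    'Content-Range': `bytes ${start}-${end}/${data.length}`,
    Connection: 'close',
  });
  res.end(data.subarray(start, end + 1)); // only the requested slice is sent
});
await new Promise(resolve => server.listen(0, resolve));

const { port } = server.address();
const head = await readRange(`http://127.0.0.1:${port}/media.mkv`, 0, 63);
console.log(head.length); // 64: only 64 bytes crossed the wire
server.close();
```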

3: Get early access to the metadata. With the observer option in the parse options, you can receive a notification each time the metadata is updated. Strictly speaking this makes parsing the file even slightly slower, but it allows you to have results as soon as the metadata has been read.
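The observer idea can be illustrated with a stdlib-only sketch (a hypothetical tag parser, not the actual music-metadata API): the parser notifies a callback after each field it decodes, so results are available mid-parse.

```javascript
// Hypothetical line-based tag parser: notifies the observer after every
// decoded field, long before the full input has been consumed.
function parseTags(lines, observer) {
  const metadata = {};
  for (const line of lines) {
    const [key, value] = line.split('=');
    metadata[key] = value;
    observer({ key, metadata }); // early, incremental notification
  }
  return metadata; // final, complete result
}

const seen = [];
const result = parseTags(['title=Demo', 'artist=Anon'], update => seen.push(update.key));
console.log(seen.join(','), result.artist); // title,artist Anon
```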

Borewit commented 1 week ago

In PR #2213 I am working towards asynchronous parsing of Matroska, instead of extracting metadata from the full tree. I hope to be able to parse fewer elements, to speed up the overall process.

hvianna commented 1 week ago

@Borewit Thanks for the update, much appreciated!

Borewit commented 1 week ago

It is very tricky; it looks like not all metadata is necessarily at the beginning of the file.

For a 1 GB remote video file (on AWS S3 cloud), I could bring the parsing time down from 45 seconds to 500 ms by quitting after receiving the first segment/cluster element.

With that hack, other Matroska files fail, as they have metadata further on in the file.

With partial-read support, there are possible optimizations to be made. There are certainly elements I parsed which are not even used. I flagged a bunch of them to be ignored, but that does not work magic. The elements I am interested in are sometimes on the same level as (many) elements I am not interested in, so it is hard to seek efficiently in the file.

Borewit commented 1 week ago

I managed to skip multiple segment/cluster elements at once, using the SeekHead index (ref). Implementation in: #2219

I was able to parse a 1 GB remote video file (on AWS S3 cloud) in 600 ms. I do not expect any improvement on a flat stream; you need a seekable medium.
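For context on what seeking via SeekHead involves (a sketch, not the project's implementation): Matroska is an EBML format, where element IDs and sizes are variable-length integers whose leading bits encode their own length. Decoding these is what lets a parser jump straight to the byte positions listed in the SeekHead index.

```javascript
// Decode an EBML variable-length integer (data-size form): the number of
// leading zero bits in the first byte gives the total length in bytes.
function readVint(buf, offset = 0) {
  const first = buf[offset];
  let length = 1;
  while (length <= 8 && !(first & (0x100 >> length))) length++;
  let value = first & ((0x80 >> (length - 1)) - 1); // strip the length marker
  for (let i = 1; i < length; i++) value = value * 256 + buf[offset + i];
  return { value, length };
}

console.log(readVint(Buffer.from([0x81]))); // { value: 1, length: 1 }
console.log(readVint(Buffer.from([0x40, 0x02]))); // { value: 2, length: 2 }
```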