QBobWatson / python-ebml

Pure python Matroska / EBML parser
GNU General Public License v3.0

Implementation readiness #1

Open pannal opened 7 years ago

pannal commented 7 years ago

Hey,

I'm the author of the Sub-Zero subtitles plugin for the Plex media server environment.

For version 2 I'd like to add a feature for extracting embedded SRT/ASS/SSA subtitles, and I came across your repository.

How implementation-ready is your library? I've only just found it and haven't read the source in depth yet. Is it already capable of extracting text tracks?

Thank you in advance

QBobWatson commented 7 years ago

Hi pannal,

The library works great for what it does -- it can add and reorganize all the auxiliary data (chapters, attachments, tags, seek entries, etc) without moving or touching the stream data. Unfortunately, subtitles are part of the stream data, and it would be incredibly slow to extract them using pure Python.

For subtitle extraction, I just use mkvextract. It will save your subtitles in .sup, .sub/.idx, .srt, or whatever format they're encoded in. Beware that it does take a few minutes to extract .sup tracks from a BluRay, for instance. I did write some Python/C code for decoding and displaying the extracted subtitles, if you're interested in that.
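
For illustration, a minimal sketch of driving mkvextract from Python. The helper names here are hypothetical, and the track ID is assumed to be the one reported by `mkvmerge --identify`. The command uses the `tracks <source> <id>:<output>` argument order that mkvextract has long accepted; check it against the version you have installed.

```python
import subprocess


def build_mkvextract_command(source, track_id, output, mkvextract="mkvextract"):
    """Return the argv list for extracting a single track (hypothetical helper)."""
    # Form: mkvextract tracks <source> <track-id>:<output-file>
    return [mkvextract, "tracks", str(source), "{}:{}".format(track_id, output)]


def extract_track(source, track_id, output, mkvextract="mkvextract"):
    """Run mkvextract and return its exit code; the caller decides what counts as failure."""
    command = build_mkvextract_command(source, track_id, output, mkvextract)
    completed = subprocess.run(command)
    return completed.returncode
```

Calling `extract_track("movie.mkv", 2, "subs.srt")` would write track 2 to `subs.srt`, assuming that track really is an SRT subtitle track.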

pannal commented 7 years ago

Hey, thank you for your answer.

Unfortunately I'm somewhat bound to Python-only implementations, as the Plex plugin can't rely on binary dependencies (or I'd have to bundle binaries for every available system architecture out there).

Also, text-based subtitles are my only goal. Are those streams really that hard to extract? I always thought of them as some kind of attachment, but obviously I was wrong about that.

I was under the impression that an index of some kind held the offsets to those non-audio/video tracks, which, in my narrow view, should mean a simple partial binary read of a very big file.

Thank you again

QBobWatson commented 7 years ago

IIUC, text-only subtitles are multiplexed like any other kind. See this link.

You could try using python-ebml to extract them. Parsing of Blocks, Clusters, and other stream Elements is rudimentary, but should be easy enough to extend. Generally though, pure Python code is about an order of magnitude slower than C in tight loops, and it takes mkvextract a good 5 minutes to process a 30GB BluRay. So I suspect what you end up with will be too slow to be usable.
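
To make the cost concrete, here is a rough sketch of the low-level walk such an extractor would need. This is not python-ebml's API; the function names are made up for illustration, and unknown-size elements are not handled. It reads EBML element headers (a variable-length ID followed by a variable-length size) and seeks over each payload. Because subtitle Blocks are interleaved inside Clusters, a reader still has to visit every Cluster header in the file, even if it skips most payload bytes.

```python
import io

CLUSTER_ID = 0x1F43B675  # Matroska Cluster element ID


def read_vint(stream, keep_marker):
    """Read one EBML variable-length integer; return (value, byte_length)."""
    first = stream.read(1)
    if not first:
        raise EOFError("unexpected end of stream")
    byte = first[0]
    # The position of the first set bit gives the total length in bytes.
    length, mask = 1, 0x80
    while length <= 8 and not (byte & mask):
        mask >>= 1
        length += 1
    if length > 8:
        raise ValueError("invalid variable-length integer prefix")
    # Element IDs keep the length-marker bit; sizes drop it.
    value = byte if keep_marker else byte & (mask - 1)
    rest = stream.read(length - 1)
    if len(rest) != length - 1:
        raise EOFError("truncated variable-length integer")
    for b in rest:
        value = (value << 8) | b
    return value, length


def iter_elements(stream, end):
    """Yield (element_id, payload_offset, payload_size) for sibling elements."""
    while stream.tell() < end:
        element_id, _ = read_vint(stream, keep_marker=True)
        size, _ = read_vint(stream, keep_marker=False)
        offset = stream.tell()
        yield element_id, offset, size
        stream.seek(offset + size)  # skip the payload without reading it


# Tiny hand-built buffer: a Void element (0xEC) followed by a Cluster.
data = b"\xec\x83\x00\x00\x00" + b"\x1f\x43\xb6\x75\x82\xaa\xbb"
elements = list(iter_elements(io.BytesIO(data), len(data)))
```

On a real file you would descend into each Cluster the same way and check each Block's track number before deciding whether to read its payload.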

pannal commented 7 years ago

Hmm. Sub-Zero has a background task system - that could work.

I'm honestly thinking about a configurable path to mkvextract right now, though. Perhaps in addition to a pure python implementation.
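
A rough sketch of what that configurable path could look like. `resolve_mkvextract` and its parameters are hypothetical names, not existing Sub-Zero code. The assumption is that an explicit setting wins, and the system `PATH` is searched only when nothing is configured.

```python
import os
import shutil


def resolve_mkvextract(configured_path=None, name="mkvextract"):
    """Return a usable path to the mkvextract binary, or None if there is none."""
    if configured_path:
        # An explicit setting is used only if it points at an executable file.
        if os.path.isfile(configured_path) and os.access(configured_path, os.X_OK):
            return configured_path
        return None
    # No setting given: fall back to searching the PATH.
    return shutil.which(name)
```

When this returns `None`, the plugin could fall back to a pure Python implementation or skip extraction.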

Your library still looks quite nicely implemented to me. SZ is currently using enzyme to get MKV file metadata. Your approach may be more complete than enzyme, though. I'll have to sort that out.

Thank you for your work! May I ask where you're using this implementation?

pannal commented 7 years ago

Enzyme: https://github.com/Diaoul/enzyme

QBobWatson commented 7 years ago

Thanks! I wrote the library for extracting and editing tags, track information, attachments, chapters, and all that. I only use it in my (obsessively over-coded) scripts for managing metadata in my own movie collection. If you need examples, I can send you some of that code.

seitzbg commented 7 years ago

Examples of how you use this library would be very cool!