Error thrown for unsupported segment type

sssanwar commented 1 month ago

I ran the lib against the attached sample file and it threw an error. Is it possible to skip the invalid segments or display sets with errors and continue on? I tested with bdsup2sub, and it could handle the errors and continue on to the end. Last but not least, thanks for the great work. sample.sup.zip

sssanwar commented 1 month ago

Just a note on the sample file. It came from 3D source in Over Under format, so the subtitles would be displayed at the top and the bottom - which is to be expected. Just in case you wonder why.

Arcus92 commented 1 month ago

Hi @sssanwar, Thanks for reporting this file. May I ask you what application generated this file? I've tested it with SubtitleEdit and bdsup2subpp and debugged the binary data. It looks like this PGS file is indeed corrupted.

bdsup2subpp report:

ODS ofs:0x0002ac19, size:0xfff0, img size: 994*674
WARNING: <unknown> 0x50 ofs: 0x0003ac16

libpgs-js reports:

Unsupported segment type 80

0x50 in hexadecimal is equal to 80 in decimal. This occurs after the 10th display set is read. SubtitleEdit just stops after the 10th subtitle.

The size of the ODS segment in the 11th set is set to 0xFFF0. This is not the correct segment size but somehow an random magic number? Unfortunately, I cannot skip this broken segment, because the segment size itself is corrupted. To skip a segment I need to know the length and jump to a position after the segment. With an invalid size I skip into a random part of the file and reading gibberish. In this case a random 0x50 which isn't a valid segment type to read.

Best I can do is adding a way to ignore critical errors and abort reading. But you won't be able to read past the error (10th display set).

sssanwar commented 1 month ago

Not sure what app was used to generate that. Got it from the internet. :) My guess is they just chopped the file without paying attention to the display set boundaries. With bdsup2sub, I was able to retrieve 3 frames without crashing. I think discarding the rest with errors is OK. At least it's not crashing, just doing best effort. frames_sample.zip

sssanwar commented 1 month ago

Just FYI, I made small changes to the 0.4.0 code and I was able to retrieve 17 cues from the same sample file. I haven't tested it with the latest PR#11.

Rather than throwing error in displaySet, I set it to continue.

if (magicNumber != 0x5047) continue

which advances the reader position by 1 while skipping the error part.

Then I put a try-catch block at the internal renderer:

    while (reader.position < reader.length) {
      const displaySet = new DisplaySet()
      try {
        displaySet.read(reader, true)
        this.displaySets.push(displaySet)
        this.updateTimestamps.push(displaySet.presentationTimestamp)
      } catch (e: any) {
        errorMessages.push(e.message)
      }
    }

Perhaps not the most elegant way, but it might give you some idea.

Arcus92 commented 1 month ago

The suggested fix just brute-forces through the stream until it finds a valid magic number. Also it skips two bytes for every magic number read, so you could accidentally skip PGS data if you are offset by one byte.

Since the output is the same as in SubtitleEdit, FFmpeg and bdsup2subpp I guess we can say that the file itself is broken and non-valid PGS data. The library shouldn't be dealing with fixing corrupted subtitle files. Accounting every possible error costs a lot of operations and a lot of guess work at runtime. I try to make it as performant as possible for low end devices. I agree on having some useful error messages and not abruptly crashing the worker thread, so I'm adding some error handling there - only adds a small overhead. But I think this lib is allowed to fail if the data seams broken. You should only load valid PGS data into it.

You could create a tool that scans for the PGS magic number and tries to restores the segments and write a valid PGS file afterwards. But this shouldn't happen every time at runtime. As a challenge, I might look into that file and try to fix it. Maybe I can discover what went wrong when creating it.

sssanwar commented 1 month ago

The suggested fix just brute-forces through the stream until it finds a valid magic number. Also it skips two bytes for every magic number read, so you could accidentally skip PGS data if you are offset by one byte.

That's true. As mentioned before, it's best effort with no guarantee of recovering from errors.

But I think this lib is allowed to fail if the data seams broken. You should only load valid PGS data into it.

I think it is a difference between a fault-tolerant system vs a system with little fault tolerance. It boils down to our own system design philosophy. I'd probably make the same argument as yours if I were you. :)

Accounting every possible error costs a lot of operations and a lot of guess work at runtime.

Of course this is not possible. The method I used was actually very simple, and doesn't require extra overhead. The lib is still required to read the file byte by byte regardless of whether there is corrupted data or not.

Thanks again for making this great library, and making it open source. I can modify it according to my needs.

Arcus92 / libpgs-js

Error thrown for unsupported segment type #10