ffmpeginteropx / FFmpegInteropX

FFmpeg decoding library for Windows 10 UWP and WinUI 3 Apps
Apache License 2.0

`AutoCorrectAnsiSubtitles` causes incorrect handling of files without UTF8-BOM #367

Closed · softworkz closed this 7 months ago

softworkz commented 10 months ago

The name `AutoCorrectAnsiSubtitles` is a bit misleading, because it doesn't do much "automatically". It rather forces ffmpeg to treat all files without a UTF-8 BOM as ANSI text with a certain code page, even when they aren't. (Btw, 'ANSI' is not quite right either: plain ASCII text is already valid UTF-8; it's the extended ASCII chars which need special treatment.)

It might make sense to check whether the content is UTF-8 before forcing a code page and setting sub_charenc. Here's how ffmpeg does it: https://github.com/FFmpeg/FFmpeg/blob/2532e832d2773d9d09574434f84decebcf6f81a1/libavcodec/decode.c#L950

lukasf commented 10 months ago

Let's clarify upfront that this only affects external subtitle files.

We have tested a lot of subtitle files, and we think that the current behavior is what works best for most files. Most subtitle files which are UTF8 encoded do have the BOM. If there is no BOM, then most of the time the file is ANSI/CP, and applying the current CP instead of wrongly interpreting it as UTF8 will help in the majority of cases.

You are free to set the config flag to false in your app, if you think that this is better for your use case.
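For reference, a minimal sketch of what that looks like. Only the flag name itself comes from this issue; the `MediaSourceConfig`/`FFmpegMediaSource` names around it are assumptions about the surrounding API:

    // Sketch only: MediaSourceConfig and FFmpegMediaSource are assumed API names;
    // AutoCorrectAnsiSubtitles is the flag discussed in this issue.
    var config = new MediaSourceConfig();
    config.AutoCorrectAnsiSubtitles = false; // don't force a code page on BOM-less files

    var ffmpegMss = await FFmpegMediaSource.CreateFromStreamAsync(ras, config);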

It is not possible to reliably detect if a file is UTF8 or ANSI/CP, unless there is a BOM. The code you linked checks for invalid UTF8 code points. If such invalid code points are found, we know that the file is not UTF8. But if no invalid code points are found, it could still be either, UTF8 or ANSI/CP. Adding this would require us to parse the complete file upfront (the parameter needs to be set before we start decoding), but we'd still not have a guaranteed result. So I don't see much improvement there. The current BOM check is done when the file is opened (the first part is always read anyway), so it does not involve any additional IO.
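For context, the BOM check mentioned here boils down to inspecting the first three bytes; a minimal sketch (not the library's actual code):

    // A UTF-8 BOM is the byte sequence EF BB BF at the very start of the data.
    static bool HasUtf8Bom(byte[] data, int length)
    {
        return length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
    }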

Still, if you want to add this parsing as a configurable option, feel free to create a PR.

Or maybe a different option would be having an overload for AddExternalSubtitle, where the code page could be explicitly set?

softworkz commented 10 months ago

The code you linked checks for invalid UTF8 code points. If such invalid code points are found, we know that the file is not UTF8. But if no invalid code points are found, it could still be either,

It is not possible to reliably detect if a file is UTF8 or ANSI/CP, unless there is a BOM

It is in fact possible to detect this very reliably. UTF-8 is designed in a way that typical use of extended ASCII chars produces byte sequences that are invalid in UTF-8, which means that detecting invalid code points, alongside a few additional checks, is a successful strategy that is frequently used. Notepad++ for example does this, and I don't think I've ever seen a wrong detection. Also see here: https://github.com/AutoItConsulting/text-encoding-detect#utf-8-detection

We're using portions of this library for charset detection - only at the server though, not in any player yet. There it would be nice if it were built into the playback mechanism, to avoid having to retrieve all (external) subtitle streams twice.
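The strategy described here can be sketched in a few lines (an illustration of the invalid-sequence check, not the text-encoding-detect library's actual code; a production version would also reject overlong encodings and weigh how many multi-byte sequences were actually seen):

    // Returns false as soon as a byte sequence is found that cannot be valid UTF-8.
    // Extended chars from code pages like CP1252 almost always trip this check.
    static bool LooksLikeUtf8(byte[] data, int length)
    {
        int i = 0;
        while (i < length)
        {
            byte b = data[i];
            int continuation;
            if (b <= 0x7F) { i++; continue; }               // 7-bit ASCII, always valid
            else if ((b & 0xE0) == 0xC0) continuation = 1;  // lead byte 110xxxxx
            else if ((b & 0xF0) == 0xE0) continuation = 2;  // lead byte 1110xxxx
            else if ((b & 0xF8) == 0xF0) continuation = 3;  // lead byte 11110xxx
            else return false;                              // invalid lead byte
            if (i + continuation >= length) break;          // truncated at buffer end: tolerate
            for (int j = 1; j <= continuation; j++)
                if ((data[i + j] & 0xC0) != 0x80)           // continuation must be 10xxxxxx
                    return false;
            i += continuation + 1;
        }
        return true; // no invalid sequences found
    }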

Still, if you want to add this parsing as a configurable option, feel free to create a PR.

If I find the time I'd probably submit it to ffmpeg directly, which would cover more of our cases.

I'm new to FFmpegInteropX (not to ffmpeg, though), so it was primarily annoying/confusing in combination with the option named AutoCorrectAnsiSubtitles :-)

It happened with the very first test I made with FFmpegInteropX and external subtitles, and the first two characters were already wrong. I had to follow all the way through the source code to understand what's going on, and that AutoCorrectAnsiSubtitles does somewhat the opposite of what its name promises.

Or maybe a different option would be having an overload for AddExternalSubtitle, where the code page could be explicitly set?

This requires knowing it up-front. When you don't know it, you need to load the stream beforehand - i.e. retrieve it twice. That might be OK for a single external subtitle track, but users often have 5, 10 or even more, and at that point it becomes very inefficient. It's not the processing of the data - that's nothing (e.g. in relation to video) - but the many request/response round trips between client and server that are the killer.


An ideal approach IMO would be a choice between the following modes:

- UTF-8
- ASCII
- Automatic
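Expressed as a hypothetical config option (purely illustrative; none of these names exist in FFmpegInteropX):

    // Illustrative only - a tri-state replacement for the current boolean flag.
    enum SubtitleEncodingMode
    {
        Utf8,      // always treat BOM-less files as UTF-8 (never set sub_charenc)
        Ansi,      // always force the configured code page (current behavior when the flag is true)
        Automatic  // BOM check first, then content probing, fall back to the code page
    }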

lukasf commented 10 months ago

Interesting, thank you @softworkz

I am not an expert in text encodings. Maybe a better detection would be possible.

Still, there is the issue that we'd need to do this character detection before opening the subtitle decoder in ffmpeg. Doing the BOM check is easy because the first bytes of the file are always read, so we do it inline with the first read call from ffmpeg. But if we want to read a bigger amount of data beforehand, we cannot do it like that. Plus, in streaming scenarios, we do not even have access to the underlying stream, so we could not do any character detection there. This only works for local files. (I am a bit confused tbh, because you often speak about streaming, but this API only affects local playback. And if you do streaming, why not convert all subs to UTF8-BOM on the server?)

Indeed, the best thing would be bringing this into ffmpeg itself, since it is the only really good solution, and it would work for local files as well as for streaming. I have tried multiple times to bring different things into ffmpeg, but never succeeded. Their way of working (mailing lists) is not very inviting if you are not into that. Still, it could be worth a try if this is really important for you.

softworkz commented 10 months ago

Let's start with simple bits:

And if you do streaming, why not convert all subs to UTF8-BOM on the server?

Because "we" (Emby btw...) don't modify our users' media files.

this only works for local files. (I am a bit confused tbh, because you often speak about streaming, but this API only affects local playback.

It is for external subtitles, and in the case I'm referring to, those external subtitles are located on the server next to the main media file. The server serves the main media file in some way and at the same time provides URLs for the external subtitles. From those URLs, I create a RandomAccessStream and call AddExternalSubtitleAsync to add it. Those files are small, which means you can get one pretty quickly, so you can have the whole subtitle text even before playback starts. The UTF-8 check shouldn't take more than a handful of milliseconds. I'm not familiar with RandomAccessStreams, but the name suggests that you might be able to read it before passing it to ffmpeg?

Indeed the best thing would be bringing this into ffmpeg itself, since it is the only really good solution, and it would work for local files as well as streaming

The cases vary. For actual streaming, we are not using this technique. There we do HLS with VTT subtitle segments or in-stream subtitles, where we also mux external subtitles into the stream segments. Or we do burn-in at the server when the client doesn't support subtitles. For streaming we wouldn't need this external-subtitle approach. But many users want DirectPlay, which is more a kind of remote file access (via HTTP range requests). They want it because they want their files untouched and original (or they don't have a strong enough server for transcoding, among other reasons). And in this DirectPlay case, the only way to serve external subtitles is to serve them separately. DirectPlay is also one of the reasons for looking into FFmpegInteropX, because with transcoding, we can deliver everything in a way that the basic Windows media element can play.

I have tried multiple times to bring different things into ffmpeg, but never succeeded.

It's horrible. I really hate it. But I got a number of submissions that made it. I also developed the D3D11 HW acceleration for QuickSync (long story - I allowed Intel to submit it, so you won't find me on those commits - but on others).

Their way of working (mailing lists) is not very inviting, if you are not into that.

I have set up a repository where you can create ffmpeg pull requests, which get automatically posted to the mailing list:

https://github.com/ffstaging/FFmpeg/

things into ffmpeg, but never succeeded

This is a huge patchset which brings substantial new capabilities for subtitle handling into ffmpeg: https://github.com/ffstaging/FFmpeg/pull/18

But it failed (so far) due to some ridiculous nitpicking. Just disgusting..

lukasf commented 10 months ago

May I ask how you create the RandomAccessStream from the URI? The thing about RandomAccessStreams is that they are seekable, while web response streams are not, so you cannot create one directly from a web response stream. It would be possible to roll a custom IRandomAccessStream implementation which implements seeking by issuing a new web request (with range headers) for every seek.

It would of course be possible to just download the whole text sub into a memory stream and then use that as input. They are small enough that it does not really matter.
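A minimal sketch of that variant, reusing `url`, `ffmpegMss` and `stream.DeliveryUrl` from the snippet further down (`HttpClient` and `AsRandomAccessStream` are the standard .NET/UWP helpers, from System.Net.Http and System.IO respectively):

    // Download the subtitle completely, then wrap the bytes in a seekable
    // in-memory IRandomAccessStream before handing it to the library.
    var http = new HttpClient();
    byte[] bytes = await http.GetByteArrayAsync(url);
    var ras = new MemoryStream(bytes).AsRandomAccessStream();
    await ffmpegMss.AddExternalSubtitleAsync(ras, stream.DeliveryUrl);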

I'd like to add APIs using IInputStream instead of IRandomAccessStream, which would allow non-seekable streams to be used with our lib, but that is not there yet. In that scenario, doing character encoding detection would be even more difficult, since we cannot just seek back after doing the detection.

Oh and good job about the subtitle PR. Really seems like a major improvement of the whole subtitle support in ffmpeg.

softworkz commented 10 months ago

Oh and good job about the subtitle PR. Really seems like a major improvement of the whole subtitle support in ffmpeg.

❤️

May I ask how you create the RandomAccessStream from the URI?

      var uri = new Uri(url);

      var streamRef = RandomAccessStreamReference.CreateFromUri(uri);
      var ras = await streamRef.OpenReadAsync();

      await ffmpegMss.AddExternalSubtitleAsync(ras, stream.DeliveryUrl);

I think nothing more is needed for this case...

lukasf commented 10 months ago

May I ask how you create the RandomAccessStream from the URI?

      var uri = new Uri(url);

      var streamRef = RandomAccessStreamReference.CreateFromUri(uri);
      var ras = await streamRef.OpenReadAsync();

      await ffmpegMss.AddExternalSubtitleAsync(ras, stream.DeliveryUrl);

I think nothing more is needed for this case...

Have you checked if this actually uses range requests on seek? For smaller files such as text subs it would be okay to just download the stream sequentially and keep it in memory for seeking support. But when using this for bigger files or complete videos, using range requests would be pretty important.

softworkz commented 10 months ago

I haven't checked - a colleague did that part - but as you said, for subtitles it doesn't matter. Even if it did, I suppose a 100 kB subtitle file would still be below the range size, as it would be inefficient to use such small chunks. As far as I've seen, FFmpegInteropX creates an IStream from the RandomAccessStream, and it should be possible to read it once (for UTF-8 checking) and then set the position back to 0 before handing it over to ffmpeg, right?
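The read-then-rewind idea would look roughly like this (a sketch only; it assumes the stream really supports seeking, which the next comments call into question, and it reuses the `LooksLikeUtf8` sketch from above; `ToArray()` comes from System.Runtime.InteropServices.WindowsRuntime):

    // Probe the first few KB of the IRandomAccessStream, then rewind
    // before ffmpeg ever sees the stream.
    var probe = new Windows.Storage.Streams.Buffer(4096);
    var result = await ras.ReadAsync(probe, probe.Capacity, InputStreamOptions.None);
    bool utf8 = LooksLikeUtf8(result.ToArray(), (int)result.Length);
    ras.Seek(0); // reset the position before handing the stream over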

lukasf commented 10 months ago

The lib you linked recommends probing at least 4 KB of data. We currently read files with a fixed chunk size of 16 KB (configurable, but 16 KB is the default). I wonder if it would be enough to check these first 16 KB of data. Then we could do it inline with the first read call coming from ffmpeg. The obvious upside is that we would not have to do any seeking at all. We'd just have to make sure that we really fully read that first chunk (the stream might return less than what is requested if it comes from the web).

The second upside is this: we don't really know what kind of stream is passed to our function. It is very well possible to pass the stream of a full movie here; it would get parsed and all subtitle streams would be added. We surely do not want to read the full movie into memory just to do a character encoding check. So we'd definitely need to restrict the amount of data to check.
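Fully reading that first chunk despite short reads could look roughly like this (a sketch, not the library's code; `inputStream` is assumed to be the IInputStream the data comes from, and `CopyTo` is the WindowsRuntimeBufferExtensions helper):

    // Web-backed streams may return fewer bytes than requested per call,
    // so loop until the probe chunk is full or the stream ends.
    var chunk = new byte[16 * 1024];
    int filled = 0;
    var temp = new Windows.Storage.Streams.Buffer((uint)chunk.Length);
    while (filled < chunk.Length)
    {
        var read = await inputStream.ReadAsync(temp, (uint)(chunk.Length - filled), InputStreamOptions.None);
        if (read.Length == 0) break;                     // end of stream
        read.CopyTo(0, chunk, filled, (int)read.Length); // append to the probe chunk
        filled += (int)read.Length;
    }
    // Probe chunk[0..filled] for a BOM / UTF-8 validity, then pass it on to ffmpeg.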

brabebhin commented 10 months ago

That way of getting an IRandomAccessStream does not guarantee that seeking is supported.

softworkz commented 10 months ago

The lib you linked recommends probing at least 4 KB of data. We currently read files with a fixed chunk size of 16 KB (configurable, but 16 KB is the default). I wonder if it would be enough to check these first 16 KB of data.

I would say even less than 4 kB is fine, for the reason that we are dealing with spoken language - and spoken language usually involves above-average use of diacritics and extended ASCII chars (unlike, for example, a shipping address database of a US company, where maybe just one out of a thousand records carries an international shipping address including extended ASCII chars).

It is very well possible to pass the stream of a full movie here; it would get parsed and all subtitle streams would be added. We surely do not want to read the full movie into memory just to do a character encoding check. So we'd definitely need to restrict the amount of data to check.

A limit makes sense - definitely. But what might come as a surprise is that subtitles are rarely muxed and interleaved into media files the way video or audio is. In the case of MKV containers, for example, text subtitles like SRT or ASS are included as a contiguous section somewhere at the start or end of the container. Similar to the seek index, the muxer either reserves some space right at the start of the file, or otherwise it gets appended to the end when finalizing. But of course other cases exist - like DVB graphical subtitles, which are included in the mux at presentation time (typically 6-8 s ahead of time). But those wouldn't be relevant here anyway.

It is very well possible to pass the stream of a full movie here. It would get parsed and all subtitle streams would be added.

Hm... these are quite different cases. In one case, you are adding an additional "input file", which will be demuxed and can have multiple streams of whatever kind. When adding a stream directly instead, there's no container demuxing, and the processing needs to rely on the attributes of the stream (or defaults) to determine its type and configuration.

softworkz commented 10 months ago

That way of getting IRandomAccessStream does not guarantee seek is supported.

Not even if you read the stream once (fully or partially), then reset the stream position, and only afterwards do the conversion to a COM stream (IStream)?

brabebhin commented 10 months ago

I don't think seeking back is necessary. We just read whatever amount we want, perform the check in place, then pass the data on to ffmpeg. Whether we read the whole stream or just a chunk shouldn't make much of a difference in implementation.

softworkz commented 10 months ago

Ah, we don't give the stream to ffmpeg, but a buffer instead? At that point I had stopped reading the code. My last thought was that when you explicitly wrap the ras into an IStream, it's that IStream which goes into ffmpeg - no?

brabebhin commented 10 months ago

Ffmpeg can't take the IStream directly; there's a read method that we use to take data from the stream and hand it over to ffmpeg, and that method is under our control. We could technically make up data from any source, not just an IStream.

lukasf commented 10 months ago

We cannot easily read more than the chunk size, because in the read method we get a preallocated buffer from ffmpeg where we put the data in. If we needed to read more than fits into the buffer, we would have to store the additional data somewhere and pass it during the next read call(s). That would be quite messy, and I'd really like to avoid it. We should only read the first chunk, do the probing there, and pass the chunk on to ffmpeg.
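Put together, the "probe inline with the first read" idea could be sketched like this (C# for the control flow only; the real callback is native code filling ffmpeg's preallocated buffer, and `ReadFully`, `OnRead` and `forceCodePage` are made-up names, reusing the `HasUtf8Bom`/`LooksLikeUtf8` sketches above):

    bool firstRead = true;
    bool forceCodePage = false;

    int OnRead(byte[] ffmpegBuffer) // buffer size == configured chunk size (16 KB default)
    {
        int bytesRead = ReadFully(ffmpegBuffer); // hypothetical helper, see the loop sketch above
        if (firstRead)
        {
            firstRead = false;
            // Only force the ANSI code page when the data is neither BOM-marked nor valid UTF-8.
            forceCodePage = !HasUtf8Bom(ffmpegBuffer, bytesRead)
                         && !LooksLikeUtf8(ffmpegBuffer, bytesRead);
        }
        return bytesRead; // hand the same chunk on to ffmpeg unchanged
    }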

lukasf commented 10 months ago

I created a PR where the probing is done on the first chunk. I have not done any real testing yet. It would be great if you could test the branch with the files you had, @softworkz, to get a first indication whether this solves the issue.

We should probably create a few test subs with different encodings, and add a unit test where we check the resulting subtitle stream.

softworkz commented 10 months ago

Thanks a lot for the PR, will apply and report back in a moment!

softworkz commented 10 months ago

Thanks again for the quick turnaround, I've added a review to the PR (https://github.com/ffmpeginteropx/FFmpegInteropX/pull/369#pullrequestreview-1711723012).

I can confirm that it is working fine. I tried with three subtitle files: TestSubs.zip

- DA is UTF-8 with BOM
- EN is UTF-8 without BOM
- DE is 8-bit with codepage 1252 (default Windows Western Europe)