SRT lines should not be limited to 32 characters

benjaminbellamy commented 1 year ago

In https://github.com/Podcastindex-org/podcast-namespace/blob/main/transcripts/transcripts.md#srt it says Max characters per line: 32. This should be a best practice, not a requirement.

tomrossi7 commented 1 year ago

Are you suggesting that we don't have a max characters per line? Would that make it hard for players to know what to expect?

ryan-lp commented 1 year ago

32 max seems like quite a reasonable hard limit for narrow mobile phone screens, especially since podcast apps are typically viewed in portrait mode. Additional characters over 32 very quickly require noticeably smaller font sizes. 32 is also the hard character limit for Teletext on NTSC (40 characters for PAL), and these numbers are also found in various SRT guidelines, with the wider ones generally being no problem for television which uses a wider aspect ratio. Related issue: #370

Also, consider that the JSON transcript format suffers because it did not specify a limit for segment size. Players really don't know what to expect, and that makes it a less useful format since players don't know how to render it. Related issue: #366

benjaminbellamy commented 1 year ago

I am suggesting that the 32 character limit should be optional. It is less relevant for a podcast than for a video since you have more space on the screen. 32 is reallllllly low. As soon as you get @daveajones and Adam talking together you get a stalk overflow. ;-) Also keep in mind that fewer characters per line mean a shorter display time. If you talk fast it's going to flicker. As said in #370 the actual width depends on the characters. 32 should be a goal, not mandatory.

ryan-lp commented 1 year ago

I have just finished listening to PC2.0 108, and it appears that you are just having trouble figuring out how to split up the lines to fit inside a specified character limit. If that is the case, I can easily help provide guidance on how to write the code. There is no need to change a good standard because of that, and usability guidelines should not be changed because a programmer can't figure out how to output subtitles in the specified format. If you are doing what I think you're doing, and you simply don't want to change the default code, then your lines would be on average 64 characters wide, and that would be disastrous for the people who actually rely on the captions. In theory, the tool you are using has a worst case of upwards of 500 characters per line (in the case where the segment extends to the upper limit of 30 seconds, and the speaker is speaking at 200 wpm). If hypothetically your proposal were the standard, lazy programmers would flock to this approach and the subtitles would be unreadable for the people who genuinely need them. 32 characters is not "really low", unless you have really good eyesight, but these standards are defined for a general population which includes people who rely on subtitles without necessarily having as good eyesight as yourself. Although as I said, I suspect you are justifying the change more because of the programming challenge than for the benefit of the people who need the captions.
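
A rough back-of-the-envelope check of that worst-case figure can be sketched as follows. The 30 second segment cap and the 200 wpm speaking rate are the numbers quoted above; the average word length of 5 characters is an assumption for English text, not something taken from the tool.

```python
# Rough worst-case line length for a transcription segment that is never split.
SEGMENT_SECONDS = 30    # upper segment length quoted in the comment above
WORDS_PER_MINUTE = 200  # speaking rate quoted in the comment above
AVG_WORD_CHARS = 5      # ASSUMPTION: average English word length, excluding spaces


def worst_case_chars(seconds=SEGMENT_SECONDS, wpm=WORDS_PER_MINUTE,
                     avg_word=AVG_WORD_CHARS):
    """Estimate the characters in one unsplit segment, including spaces."""
    words = wpm * seconds / 60
    # Every word except the last is followed by one space.
    return int(words * avg_word + (words - 1))


print(worst_case_chars())  # 599
```

Under these assumptions a single unsplit segment lands well above 500 characters, which is consistent with the "upwards of 500" estimate.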

As someone who actually reads the podcast subtitles, I am affirming that the character limit is extremely important. I have experimented with increasing the number of characters, and it very quickly does become unreadable on a narrow mobile phone screen. I know this from putting in the effort to write code rendering subtitles of different character limits, and 32 is very reasonable and in line with various other pre-existing standards. For Chinese and Japanese, the standard is typically to use exactly half of that. #370 does not by any means support your case; it actually stresses the importance of character limits even more, because in those languages the character limit needs to be 16 characters. If you want to put a 64 character line or a 500 character line into a subtitle for a 16 character limit language, the readability impact would be much worse. These things also do not scale linearly, so it's not twice as bad, it is exponentially worse (the smaller your font gets, the bigger the readability impact of each further step down in font size).

ryan-lp commented 1 year ago

The photo below illustrates typical line lengths we can expect if we permit arbitrary line lengths (although they can get much worse than even this):

If I hold the phone up to my nose like this, I can just make out the 2 lines of the SRT block. However, I have to strain. Some people have poor stereo vision (lazy eye) that makes it difficult to align both eyes on smaller text, other people may have trouble simply focusing on small things, and others may have astigmatism. There are various different issues with eyesight but they are all made worse when the text is small. And of course that tool is capable of much wider lines in the worst case, resulting in the text being impossible for anyone to read.

The tool you are planning to use only has an SRT option as a convenience for researchers. It does not conform to any actual standard for SRT in terms of line length as would be necessary on the consumer end, and so it should not be implemented directly in production as is. You need to take the code and actually modify it to meet the requirements for its use. Even if you were to argue for a wider limit, there still needs to be "some" limit to avoid 500 character lines, and thus it is unavoidable that you will have to write some code to make it conform to reasonable limits.

An SRT block is defined to be two lines maximum and the geometry is intended to fit within a certain sized display. The goal here is to fit inside a narrow mobile phone display. If we rotated the screen sideways into landscape mode, that would allow for wider lines, but podcast apps are typically designed for the portrait orientation.

daveajones commented 1 year ago

Going to close this since it seems to be settled.

daveajones commented 1 year ago

Re-opening this. Sorry @benjaminbellamy. There is also linkage to #370

ryan-lp commented 1 year ago

I think the question of whether we should have a character limit at all is settled. We do need one so that apps know what to expect, and what we have is in line with other SRT publishing standards. We can have a discussion about what is the most appropriate character limit for a narrow mobile phone screen, but having "no limit" is not a viable answer.

@benjaminbellamy my offer still stands to provide coding assistance. Changing the spec according to the limitations of a particular transcription tool is probably not wise, particularly when it is possible to fix it through code.

Note that the linked issue #370 was never intended to support the idea that we could "abandon" having a character limit, we definitely need a limit. Rather, the purpose of #370 was to acknowledge that the character limit should be different for languages that use double-width characters. I propose a simple change to the spec in which each double-width character simply counts as 2 normal characters (for character counting purposes). That way, in a multilingual podcast where a single line contains both English and Chinese words, the character counting will still arrive at the correct line width, adjusted for any double-width characters that the line contains.
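
A minimal sketch of that counting rule, assuming "double-width" is taken to mean the Unicode East Asian Width classes W (wide) and F (fullwidth). That mapping is an assumption made for illustration; the proposal above does not name a specific Unicode property, and the function names are hypothetical.

```python
import unicodedata


def display_width(line: str) -> int:
    """Count characters, with each double-width character counting as 2.

    ASSUMPTION: "double-width" means East Asian Width class W or F.
    """
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in line
    )


def fits(line: str, limit: int = 32) -> bool:
    """True if the line is within the limit under the adjusted count."""
    return display_width(line) <= limit
```

With this rule a line of 16 Chinese characters counts as 32, and a mixed line such as `Hello 你好` counts as 10 (6 single-width characters plus 2 double-width ones), so one limit can serve multilingual lines.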

francosolerio commented 1 year ago

The way I see the issue is our apps run on so many different devices and screen sizes that it's really difficult to tell what's best in advance. Any app can check the available space at runtime, tweak font size and let automatic word wrap do the job.

ryan-lp commented 1 year ago

(I accidentally hit the "send" button before I had finished drafting this comment, so please disregard that if you received the notification.) Anyway - My response:

You need to be careful with that assumption.

First, the main issue with the current transcript spec is that it was a bit too English-centric or European-language-centric, initially by not recognising that some languages have double-width characters and hence need an adjusted line limit. But when it comes to "automatic word wrapping", we also need to try to do this in a way that considers all languages. For example, Japanese and Chinese don't use any spaces to separate words, and the TextView widget in Android does not actually do "automatic" line wrapping in Japanese and Chinese for that reason - there are no spaces that indicate where exactly the line should be wrapped. If your app would rather not concern itself with the extra processing required to parse Japanese and Chinese text so that it can implement its own line wrapping correctly for every language, you can fortunately just render the SRT block the way the SRT publisher rendered it for you, as they will have used appropriate line splitting guidelines for the language in which they're publishing.

Second, the idea that adjusting the font size and line wrapping will always solve the problem in "any" app is not right. Remember that the proposal above was to remove the character limit altogether because the transcript tool @benjaminbellamy is using can produce extremely long lines (100s of characters long). That's just not acceptable for a mobile app, which is the most common device people use for podcast consumption. Even if your app doesn't mind wrapping a 500 character long line all over the entire user interface obscuring whatever is behind it, I think that some apps (e.g. mine) want to reserve two lines at the bottom for the captions, without obscuring the album and chapter art square at the top, without obscuring the episode title, and without obscuring the playback controls.

The other thing is that it is easier to start with a standard that works for the mobile phone and then use adaptive layout techniques to make it work on a tablet or desktop than it is to go in the reverse direction. For example, if for some reason you really think that on a tablet or desktop users want to see twice as much text at a time, then you could simply show two SRT blocks at a time, concatenating them together. But if you were to change the standard itself so that it were first optimised for that kind of target, you can't reverse the process. That is, you can't split an SRT block and fabricate a timestamp at the split point. And the alternative of just freely allowing long lines with no limit is really just going to cause a text overflow or a line wrap that obscures other important parts of the mobile screen in what is a very limited screen real estate.
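
The "show two SRT blocks at a time" direction can be sketched roughly as below. The `Cue` type and `merge_pairs` function are hypothetical names for illustration, and the cues are assumed to be already parsed and in playback order.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float      # seconds
    end: float        # seconds
    lines: list[str]  # text lines of one SRT block


def merge_pairs(cues: list[Cue]) -> list[Cue]:
    """Combine each pair of consecutive cues into one larger cue.

    The merged cue spans from the first cue's start to the last cue's end,
    so no timestamp has to be invented.
    """
    merged = []
    for i in range(0, len(cues), 2):
        pair = cues[i:i + 2]  # the final group may hold a single cue
        merged.append(Cue(
            start=pair[0].start,
            end=pair[-1].end,
            lines=[line for cue in pair for line in cue.lines],
        ))
    return merged
```

Merging only reuses timestamps that already exist in the file. Going the other way would need a time for the split point, which the SRT block does not contain.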

I would maintain that the industry standard of somewhere in the ballpark of 32 to 47 characters per line is also appropriate here. I happen to think 32 is more appropriate given that we're mostly dealing with mobile phones in portrait orientation, and that a feature primarily designed for accessibility reasons should not force the font size to be any smaller for those who are not only hearing impaired but also vision impaired.

Edit: Two more points to add as an afterthought.

First, some limit is clearly necessary. Otherwise people could just put the entire transcript in a single 100000 character line, and if you think nobody will do it, people are already doing that with the JSON transcript format because the standard permits it, which is why the JSON format has turned out to be useless in practice. But even once we recognise that some limit is necessary for the standard to actually be useful, it would still be problematic to now change what has already been standardised, as it will break apps.

From memory, I have already tested this with Podcast Addict, but if we take subtitles rendered from an SRT block that is conformant to the standard of 32 character limits, we get something like this:

Hello and welcome to this week's
  episode of Footalk. Today we

where the font size has been chosen based on the expected character limits. If we produce a non-conformant SRT with marginally longer lines and ask the app to render it, we get this:

Hello and welcome to this week's
episode
of Footalk. Today we have as a
guest on

Either standard would be fine, but changing the standard now would break apps and fragment the SRT data.

Secondly, we shouldn't treat font shrinking and line wrapping as a magic solution to the problem, not only because it affects readability and accessibility, but also since UX/UI designs need to reserve a particular amount of space for different components of the screen. Many would say it looks horrible and inconsistent to have different successive SRT blocks with different font sizes, and/or with the occasional SRT block causing a text overflow or an excessive text wrap. Maybe some apps will find it to be an acceptable design tradeoff, but many will have put careful thought into exactly what typeface and what font size will work best for the consistent display of all subtitles.

adamc199 commented 1 year ago

Allow me to jump in to show you how people are actually using transcripts/subtitles/closed captions in the real world:

https://www.youtube.com/watch?v=ny7i2Yo9CoU

There are MANY ways to display this kind of information and we all need to go out and take a good look at what is happening in the real world.

ryan-lp commented 1 year ago

We can support all the different use cases, but we just need to specify each different transcript file format well enough so that apps can rely on the right file format for the right use case. See #484 for a discussion of the issues and proposal.

But in short, the JSON format "could have" been perfect for the type of use case you linked to, except that it lacks the kind of specificity that SRT enjoys. The specificity of the format is what allows it to be useful: SRT for reading the transcript by line, and JSON (if we were to tighten the spec) for reading by word.

jamescridland commented 1 year ago

How's this?

"SRT files are for use in simple closed-captions or can be parsed for display on a website. SRT lines should be limited to 32 visible characters, to ensure they are visible on a wide variety of screen sizes without alteration. SRT files are more widely supported than other formats, and are highly recommended."

"JSON files are for use in more advanced circumstances, and contain timing for each word's start and end."

JSON files would be used by Adam's fancy examples. They could also be used by, for example, Hindenburg Journalist Pro to edit audio further (though I doubt that's a use-case we want to promote).

I'd personally like to ensure that SRT is the bare minimum and is supported by all.

ryan-lp commented 1 year ago

I would mostly agree subject to the international considerations in #370 .

tomrossi7 commented 1 year ago

Personally, I think the JSON format is the one we should be investing time in improving.

"JSON files are for use in more advanced circumstances, and contain timing for each word's start and end."

The beauty of this format is that it can provide as much or as little fidelity as possible. The highest fidelity is 1 word per timestamp, but sometimes it's a phrase or a complete sentence, depending on the source of the transcript.

Once you have a transcript in a JSON format, you could transform it into any SRT with any length you want as long as the fidelity is good enough.
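
A minimal sketch of such a transformation, assuming word-level segments that carry `startTime`, `endTime` and `body` fields (field names assumed here, so check them against the JSON transcript spec) and a language that separates words with spaces. The function names are hypothetical, and this is a plain greedy word wrap, not a definitive converter.

```python
def fmt(t: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(t * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"


def segments_to_srt(segments, max_width=32, max_lines=2):
    """Greedily pack word segments into SRT cues of limited size.

    A single word longer than max_width is left unsplit on its own line.
    """
    cues = []  # each cue is a dict with "start", "end" and "lines"
    for seg in segments:
        word = seg["body"].strip()
        if not word:
            continue
        cue = cues[-1] if cues else None
        if cue and len(cue["lines"][-1]) + 1 + len(word) <= max_width:
            cue["lines"][-1] += " " + word   # fits on the current line
        elif cue and len(cue["lines"]) < max_lines:
            cue["lines"].append(word)        # start the cue's next line
        else:
            cue = {"start": seg["startTime"], "lines": [word]}
            cues.append(cue)                 # start a new cue
        cue["end"] = seg["endTime"]
    blocks = [
        f"{n}\n{fmt(c['start'])} --> {fmt(c['end'])}\n" + "\n".join(c["lines"])
        for n, c in enumerate(cues, start=1)
    ]
    return "\n\n".join(blocks) + "\n"
```

This only works when the JSON really has word-level (or at least short) segments. A segment containing a whole sentence cannot be split without inventing timestamps, and a space-based wrap does not apply to languages written without spaces.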

ryan-lp commented 1 year ago

I suggest we keep this discussion focused on the particular SRT issue mentioned in the description. We can have different discussions about SRT and JSON happen in parallel, but we're not going to get anywhere fast if we cover lots of different topics in the same thread. Here are other more targeted discussions that touch on the specific points you bring up.

Regarding "The beauty of this format (JSON) is that it can provide as much or as little fidelity as possible", the contrary view is taken in #484 , and your comment would add usefully there. I argue that we actually don't want an unspecified fidelity, and that is what makes the format difficult to use for apps that are based on word timestamps, because it's hard to know whether they can rely on that file "actually" containing word-level fidelity when it is allowed to be arbitrary. I've seen someone put an entire episode into a single segment. I've also seen someone start with what appeared to be word-level timestamps, and then suddenly somewhere in the middle of the transcript there was a segment with 70 words in it. That's not going to go well for a karaoke app that assumes word timestamps. One "benefit" of this format is that it makes it easy for the developer of transcript editing software to allow you to edit a section of words without having to do the hard work of re-syncing the new words with their new timestamps. But that's only a short-sighted benefit for the producers that causes problems down the line for consumers of that data.
Regarding "you could transform (JSON) into any SRT", this comment would add usefully in the context of #519 because it gives a background on what use cases are enabled by JSON and not SRT, and vice versa. It is tempting, of course, to think that you can easily derive the equivalent of an SRT from a JSON with a simple word-wrap algorithm; however, SRT allows the publisher to indicate appropriate places for line breaks, which is especially important in music lyrics. Also, when a lot of people think of ideas that at first seem beautiful for their simplicity, they might come from an English speaking background or some other specific language where the convenient properties the idea depends on happen to be present in the particular language THEY'RE familiar with. But then when you consider other languages that have rather different rules on where you are allowed to break lines, you may realise that the software you need to implement the line splitting algorithms for various difficult languages is something more appropriate for a backend server with more RAM.

I know I've said various things above that people might want to respond to, but those are just summaries of what I said in the other issues, and so those discussions could be continued there.

ryan-lp commented 1 year ago

@benjaminbellamy I have since added official support into Whisper for character limits and line limits. To generate a transcript that meets the podcast namespace standard, use the options below:

whisper --output_format srt,vtt --word_timestamps True --max_line_width 32 --max_line_count 2 episode.mp3

Note that I have not added support for the unique rules of Japanese and some other languages because the Whisper code base is not intended to have such complexity in it. If you are building your own server side transcription service (as I am doing), you will want to add more complicated code to handle some of these more complex languages. However, the way it currently works should still meet the 32 character limit and the 2 line count limit, it's just that it may split lines in places that Japanese people might not expect lines to be split. The official releases still lag behind git, so for now you'll need to download the git version of Whisper.

Given that we don't want to have arbitrary fidelity in SRT files (e.g. we don't want 10,000 character lines to be allowed), we need some limit. Whatever that limit is, even if you were to want a different limit, I believe that you now have the command line option that you were originally seeking, and this hopefully resolves the issue for you.
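
Whatever tool produces the file, the two limits can be checked after the fact. Below is a minimal sketch of such a check. The `check_srt` function is a hypothetical helper, it assumes a plain SRT layout (index line, timing line, then text lines, with blocks separated by blank lines), and it counts every character as 1, so it does not apply the double-width adjustment discussed in #370.

```python
import re


def check_srt(text: str, max_width: int = 32, max_lines: int = 2) -> list[str]:
    """Return a list of problems found; an empty list means no violations."""
    problems = []
    text = text.replace("\r\n", "\n").strip()
    if not text:
        return problems
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n(?:[ \t]*\n)+", text):
        rows = block.split("\n")
        if len(rows) < 3 or "-->" not in rows[1]:
            problems.append(f"cue {rows[0]!r}: malformed block")
            continue
        cue_id, lines = rows[0], rows[2:]
        if len(lines) > max_lines:
            problems.append(
                f"cue {cue_id}: {len(lines)} lines (max {max_lines})")
        for line in lines:
            if len(line) > max_width:
                problems.append(
                    f"cue {cue_id}: line of {len(line)} chars (max {max_width})")
    return problems
```

Running this over a generated file before publishing gives a quick signal on whether any block exceeds the 32 character or 2 line limits.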

ryan-lp commented 8 months ago

I think this issue can be closed.

Podcastindex-org / podcast-namespace

SRT lines should not be limited to 32 characters #407