ercanserteli / condenser

Condenser lets you extract speech audio from video files based on subtitle timings. By omitting the audio outside of speech, it increases the amount of spoken language you are exposed to per second.
https://ercanserteli.com/condenser
GNU General Public License v3.0
33 stars, 7 forks

invalid start byte / subtitle encoding (...) text to text or bitmap to bitmap #9

Open mavhamm opened 1 year ago

mavhamm commented 1 year ago

Hi, great program!

Recently I haven't been able to make it work on any new downloads. I'm on Windows 10, using v1.3.1.

Here's the info on this mkv: https://prnt.sc/pfuBqRiW-Ngp
Here is the srt: https://www.mediafire.com/folder/bg354za9xfn7q/condenser_issue_2023-01

Log when it uses the internal subtitle file:

Could not extract subtitle with ffmpeg: b'Subtitle encoding currently only possible from text to text or bitmap to bitmap\r\n'
Traceback:
Traceback (most recent call last):
  File "condenser.py", line 353, in main
  File "condenser.py", line 184, in get_srt
  File "condenser.py", line 161, in extract_srt
Exception: Could not extract subtitle with ffmpeg: b'Subtitle encoding currently only possible from text to text or bitmap to bitmap\r\n'

Log when it uses the external srt subtitle:

'utf-8' codec can't decode byte 0x92 in position 0: invalid start byte
Traceback:
Traceback (most recent call last):
  File "condenser.py", line 357, in main
  File "condenser.py", line 211, in condense
  File "condenser.py", line 72, in extract_periods
  File "lib\site-packages\pysrt\srtfile.py", line 153, in open
  File "lib\site-packages\pysrt\srtfile.py", line 181, in read
  File "C:\Users\xerca\AppData\Local\Programs\Python\Python37\lib\collections\__init__.py", line 1127, in extend
  File "lib\site-packages\pysrt\srtfile.py", line 204, in stream
  File "C:\Users\xerca\AppData\Local\Programs\Python\Python37\lib\codecs.py", line 714, in __next__
  File "C:\Users\xerca\AppData\Local\Programs\Python\Python37\lib\codecs.py", line 645, in __next__
  File "C:\Users\xerca\AppData\Local\Programs\Python\Python37\lib\codecs.py", line 558, in readline
  File "C:\Users\xerca\AppData\Local\Programs\Python\Python37\lib\codecs.py", line 504, in read
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 0: invalid start byte

ercanserteli commented 1 year ago

It seems that your mkv's subtitle stream is PGS (Presentation Graphic Stream). Since PGS is a bitmap (image) format, ffmpeg can't convert it to a text-based srt file that Condenser can read. This is something I can't support for now, but I welcome a pull request from anyone who would like to implement a solution.
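For anyone who wants to check this before running Condenser, here is a minimal sketch that asks ffprobe for the subtitle streams of a file and labels each one as text or bitmap. It is not part of Condenser: the function names and the two codec sets are my own assumptions, and the sets are not exhaustive. It also assumes `ffprobe` is on the PATH.

```python
import json
import subprocess

# Assumed, non-exhaustive codec lists based on ffprobe's codec_name values.
BITMAP_CODECS = {"hdmv_pgs_subtitle", "dvd_subtitle", "dvb_subtitle", "xsub"}
TEXT_CODECS = {"subrip", "ass", "ssa", "mov_text", "webvtt"}


def classify_subtitle_streams(ffprobe_json):
    """Turn ffprobe's JSON output into (index, codec_name, kind) tuples."""
    data = json.loads(ffprobe_json)
    result = []
    for stream in data.get("streams", []):
        codec = stream.get("codec_name", "unknown")
        if codec in BITMAP_CODECS:
            kind = "bitmap"  # would need OCR; cannot be written out as srt
        elif codec in TEXT_CODECS:
            kind = "text"  # can be converted to srt
        else:
            kind = "unknown"
        result.append((stream.get("index"), codec, kind))
    return result


def probe_subtitle_streams(video_path):
    """Run ffprobe on video_path and classify its subtitle streams."""
    cmd = [
        "ffprobe", "-v", "error",
        "-select_streams", "s",
        "-show_entries", "stream=index,codec_name",
        "-of", "json",
        video_path,
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return classify_subtitle_streams(out)
```

Calling `probe_subtitle_streams("video.mkv")` (a placeholder file name) on a file whose only subtitle stream is PGS should report that stream as `bitmap`, which would explain the "text to text or bitmap to bitmap" error above.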

The second error, with the external srt file, must be a separate problem, but it seems the file on mediafire got deleted. If you are still interested, I would be happy if you could re-upload the problematic srt file.
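In the meantime, one thing worth trying: byte 0x92 is never valid as the first byte of a UTF-8 character, but in Windows-1252 it is the curly apostrophe (’), so the srt may simply be saved in a non-UTF-8 encoding. Without the file this is only a guess. The sketch below tries a short list of encodings in order and returns the first that decodes cleanly. The function names and the candidate list are assumptions for illustration, not Condenser's code.

```python
from pathlib import Path

# Assumed candidate order; adjust it for the subtitle's language.
# "utf-8-sig" also handles plain UTF-8 and strips a leading BOM if present.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1252")


def decode_subtitle_bytes(raw, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes raw."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this last resort never fails,
    # although the resulting text may contain wrong characters.
    return raw.decode("latin-1"), "latin-1"


def read_subtitle_text(path):
    """Read a subtitle file from disk and decode it with the fallback chain."""
    return decode_subtitle_bytes(Path(path).read_bytes())
```

If a fallback encoding is reported, re-saving the srt as UTF-8 in a text editor should let Condenser read it. `pysrt.open` also accepts an `encoding` argument, so the detected name could be passed there instead.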