OpenGenus / vidsum

Generate summary of any video :tv: anywhere and anytime
GNU General Public License v3.0

Srt file encoding related errors #26

Closed. shriakhilc closed this issue 7 years ago

shriakhilc commented 7 years ago

I tried running the program as described in the README with a sample .avi file and a .srt file (both English). The first error I hit was a UnicodeDecodeError; the full traceback is as follows:

Traceback (most recent call last):
  File "sum.py", line 155, in <module>
    get_summary(args.video_file, args.subtitles_file)
  File "sum.py", line 110, in get_summary
    regions = find_summary_regions(subtitles, 60, "english")
  File "sum.py", line 72, in find_summary_regions
    srt_file = pysrt.open(srt_filename)
  File "C:\Python34\lib\site-packages\pysrt\srtfile.py", line 153, in open
    new_file.read(source_file, error_handling=error_handling)
  File "C:\Python34\lib\site-packages\pysrt\srtfile.py", line 181, in read
    self.extend(self.stream(source_file, error_handling=error_handling))
  File "C:\Python34\lib\collections\__init__.py", line 1016, in extend
    self.data.extend(other)
  File "C:\Python34\lib\site-packages\pysrt\srtfile.py", line 204, in stream
    for index, line in enumerate(chain(source_file, '\n')):
  File "C:\Python34\lib\codecs.py", line 707, in __next__
    return next(self.reader)
  File "C:\Python34\lib\codecs.py", line 638, in __next__
    line = self.readline()
  File "C:\Python34\lib\codecs.py", line 551, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Python34\lib\codecs.py", line 497, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 39: invalid start byte

After checking the pysrt README, I realized the cause could be an encoding mismatch: the traceback shows the file being decoded as UTF-8, and byte 0xc1 is not valid there. Opening the file with pysrt.open(srt_filename, encoding='iso-8859-1') fixed this error. One easy general fix is to first detect the encoding with the chardet module, and then pass the result to pysrt.
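A minimal sketch of that detection step, under a few assumptions: chardet may or may not be installed, so there is a plain fallback that simply tries a short list of candidate encodings. The helper names guess_encoding and first_working_encoding and the candidate list are hypothetical and not part of vidsum or pysrt.

```python
def first_working_encoding(raw, candidates=("utf-8", "iso-8859-1")):
    """Return the first candidate encoding that decodes `raw` without error.

    Fallback only: it does not detect anything, it just tries each codec in order.
    """
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


def guess_encoding(raw):
    """Guess the encoding of `raw` bytes, preferring chardet when available."""
    try:
        import chardet  # third-party; optional in this sketch
    except ImportError:
        return first_working_encoding(raw)
    # chardet.detect returns a dict; 'encoding' can be None for undecidable input.
    guess = chardet.detect(raw)["encoding"]
    return guess or first_working_encoding(raw)


# Intended use (not executed here, since it needs pysrt and a real file):
#   with open(srt_filename, "rb") as f:
#       enc = guess_encoding(f.read())
#   srt_file = pysrt.open(srt_filename, encoding=enc)
```

Reading the file as bytes first keeps the guess independent of pysrt; the result is then passed through the same encoding argument that fixed the error above.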

Next, I hit a LookupError because the NLTK tokenizer could not find punkt. I recommend listing this as a requirement, along with the youtube_dl library, which is needed when the -u flag is used. Are you developing primarily on Linux? That might explain why you didn't notice the need, since these may already be installed there.
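As a rough sketch of how the punkt lookup could be guarded at startup: the helper ensure_punkt is hypothetical, and it takes the lookup and download functions as parameters so it can be exercised without NLTK installed. In real use these would be nltk.data.find and nltk.download, and the resource names below assume the NLTK version current when this issue was filed.

```python
def ensure_punkt(find, download):
    """Download the punkt tokenizer data if it is missing.

    `find` and `download` are expected to behave like nltk.data.find and
    nltk.download. Returns True if a download was triggered, False otherwise.
    """
    try:
        find("tokenizers/punkt")  # raises LookupError when the data is absent
        return False
    except LookupError:
        download("punkt")
        return True


# Intended use (not executed here, since it needs NLTK and network access):
#   import nltk
#   ensure_punkt(nltk.data.find, nltk.download)
```

Calling this once before tokenizing would turn the LookupError into a one-time download instead of a crash.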

I can make the necessary changes and send a pull request in a couple of hours. Will that be fine?

shriakhilc commented 7 years ago

Also, could someone expand on the purpose of the program? "Generate a summary of any video through its subtitles." is quite short and vague.

From my tests, when the -i and -s parameters are used on a video of around 30 minutes, the program compresses it to a video of around 1 minute by omitting a lot of scenes. Documenting how the scenes in the final video are selected would let developers confirm whether the output is as expected, without having to read through the code that does the compression. (Though the code is quite small in this case, so reading it manually is possible.)

On the other hand, the -u flag takes a YouTube video URL and downloads that video, plus subtitles in either srt or vtt format, but it doesn't actually generate a summary video. And since the subtitles must be in srt format for the -s flag, I don't see why it downloads the vtt file at all.

Thanks in advance for explaining. It'll help me make PRs and changes more efficiently, and in line with your idea of how the program should be used rather than my vague interpretation of it.

shriakhilc commented 7 years ago

Fixed in PR #27