coursera-dl / edx-dl

A simple tool to download video lectures from edx.org (and other openedx sites)
GNU Lesser General Public License v3.0

Fix up HTML character references in subtitles #612

Open JohnVeness opened 4 years ago

JohnVeness commented 4 years ago


Subject of the issue

Downloaded .srt files contain HTML character references such as &#39; (an apostrophe), which video players display literally instead of as the intended character.

Your environment

Steps to reproduce

  1. edx-dl -s -u <censored> https://courses.edx.org/courses/course-v1:MITx+6.002.1x+2T2019/course/ --filter-section 2
  2. Wait for it to download all the videos and subtitles
  3. Play back video 02-MIT6002XT214-V060600_DTH.mp4 with English subtitles, e.g. in VLC
  4. Skip to 00m17s

Expected behaviour

On screen subtitle "So, for example, you will learn what's behind this."

Actual behaviour

On screen subtitle "So, for example, you will learn what&#39;s behind this."

Observations

The edx video player on the website displays the subtitles correctly, so their server code must be doing things correctly when showing the video in the browser.

However, I notice that if you use the website feature to download the .srt files, generated by their server code, they also contain the same unescaped character references and look wrong in a video player! This means that my suggestion in #610 to allow the edx website to generate the .srt for you would not fix this issue.
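
Since the server-generated files have the same problem, one possible workaround for .srt files that are already on disk is to decode the references after the fact. The following is only a sketch: the helper name fix_srt_file and the UTF-8 default are assumptions for this example and are not part of edx-dl.

```python
import html
from pathlib import Path


def fix_srt_file(path, encoding="utf-8"):
    """Decode HTML character references in an existing .srt file, in place.

    Hypothetical helper, not part of edx-dl. Returns True if the file changed.
    """
    srt_path = Path(path)
    original = srt_path.read_text(encoding=encoding)
    # html.unescape turns references such as &#39; or &amp; into the
    # characters they stand for; text without references is left unchanged.
    fixed = html.unescape(original)
    if fixed == original:
        return False
    srt_path.write_text(fixed, encoding=encoding)
    return True
```

Calling fix_srt_file on each downloaded subtitle file would rewrite only the files that actually contain references. One caveat: a reference such as &lt; decodes to a literal <, which some players may then treat as the start of a formatting tag.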

JohnVeness commented 4 years ago

I'm no Python expert, but changing the line output.append(t + "\n\n") to output.append(html_parser.unescape(t) + "\n\n") in the edx_json2srt function in parsing.py seems to fix this. I'm not sure if this is the best method or whether BeautifulSoup or something else should be used instead.
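
To illustrate the idea, here is a minimal Python 3 sketch of unescaping each subtitle line while the SRT output is built. It is not the project's actual edx_json2srt: the names cues_to_srt and format_timestamp, and the assumption that start and end times arrive as milliseconds, are made up for this example. It uses html.unescape from the standard library, because the HTMLParser.unescape method was removed in Python 3.9.

```python
import html


def format_timestamp(ms):
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(int(ms), 3600000)
    minutes, rest = divmod(rest, 60000)
    seconds, millis = divmod(rest, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)


def cues_to_srt(starts, ends, texts):
    """Build SRT text from parallel lists of start times, end times and lines.

    Hypothetical stand-in for a JSON-to-SRT converter; empty lines are skipped.
    """
    output = []
    index = 1
    for start, end, text in zip(starts, ends, texts):
        if not text:
            continue
        output.append("%d\n" % index)
        output.append("%s --> %s\n" % (format_timestamp(start),
                                       format_timestamp(end)))
        # The actual fix: decode references such as &#39; before writing.
        output.append(html.unescape(text) + "\n\n")
        index += 1
    return "".join(output)


srt = cues_to_srt(
    [17000],
    [19500],
    ["So, for example, you will learn what&#39;s behind this."],
)
print(srt)
```

For plain subtitle text the standard library call should be enough, since only character references need decoding and no HTML has to be parsed.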