Open JohnVeness opened 4 years ago
I'm no Python expert, but changing the line output.append(t + "\n\n")
to output.append(html_parser.unescape(t) + "\n\n")
in the edx_json2srt function in parsing.py seems to fix this. I'm not sure whether this is the best method, or whether BeautifulSoup or something else should be used instead.
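For illustration, here is a minimal sketch of that change using only the standard library. It assumes Python 3.4 or later, where html.unescape is available, and build_srt_text is a hypothetical stand-in for the loop in question, not the actual edx_json2srt code, which also deals with timing data.

```python
import html


def build_srt_text(texts):
    """Join subtitle text entries, decoding HTML character references.

    Hypothetical helper for illustration only; it mirrors the
    output.append(...) line discussed above, nothing more.
    """
    output = []
    for t in texts:
        # html.unescape turns references such as &#39; or &amp; into
        # the characters they stand for.
        output.append(html.unescape(t) + "\n\n")
    return "".join(output)


if __name__ == "__main__":
    print(build_srt_text(["So, for example, you will learn what&#39;s behind this."]))
```

Text that contains no character references passes through unchanged, so applying this to every subtitle entry should be harmless.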
Subject of the issue
Downloaded .srt files contain HTML character references such as &#39; which don't look correct when watching the video.
Your environment
Steps to reproduce
edx-dl -s -u <censored> https://courses.edx.org/courses/course-v1:MITx+6.002.1x+2T2019/course/ --filter-section 2
Expected behaviour
On screen subtitle "So, for example, you will learn what's behind this."
Actual behaviour
On screen subtitle "So, for example, you will learn what&#39;s behind this."
Observations
The edX video player on the website displays the subtitles correctly, so their server code must be decoding the character references when showing the video in the browser.
However, I notice that if you use the website feature to download the .srt files, generated by their server code, they also include the undecoded character references and look wrong in a video player! This means that my suggestion in #610 to allow the edX website to generate the .srt for you would not fix this issue.
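As a stopgap for .srt files that are already on disk, whether downloaded by edx-dl or from the website, something like the following could be run over them. This is only a sketch: unescape_srt_file is a hypothetical helper, not part of edx-dl, and it assumes the files are UTF-8 encoded.

```python
import html
from pathlib import Path


def unescape_srt_file(path, encoding="utf-8"):
    """Decode HTML character references in an .srt file, in place.

    Hypothetical helper, not part of edx-dl. Returns True if the file
    was changed and False if it contained nothing to decode.
    """
    srt_path = Path(path)
    original = srt_path.read_text(encoding=encoding)
    # Timing lines such as "00:00:01,000 --> 00:00:04,000" contain no
    # character references, so they are left as they are.
    cleaned = html.unescape(original)
    if cleaned == original:
        return False
    srt_path.write_text(cleaned, encoding=encoding)
    return True
```

Calling unescape_srt_file("lecture.srt") on an affected file (the file name here is just an example) rewrites it with the apostrophes restored; a second call returns False because nothing is left to decode.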