allejok96 / jw-scripts

Index or download videos and sound recordings from jw.org.
GNU General Public License v3.0

download experiences audio #23

Closed jongkok closed 4 years ago

jongkok commented 4 years ago

Dear brother,

This is not an issue... more of a feature request ^_^ Perhaps there's a way we can download all audio recordings from the experiences section?

Thank you

allejok96 commented 4 years ago

This one is a bit more tricky... There's an API for JW broadcasting, and there's an API for downloading publications. But I haven't seen any API for articles and pages on the website, and I wouldn't think there is any either, because that would be overkill.

That would mean we'd need a web page scraper. And that would mean it could break whenever there's an update to the layout etc. of the webpage.

I know there's interest in scraping jw.org, not only for downloading a bunch of audio, but also for things like a jw.org news client for Kodi etc... It would be nice, but it's a bit of a project of its own.
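To illustrate why scraping is fragile: everything hinges on one HTML attribute that the site happens to use today. A minimal sketch of the ID-extraction step (the HTML below is a made-up, abbreviated sample, not real jw.org markup):

```python
import re

# Hypothetical sample of article HTML. Only the data-page-id attribute
# matters here -- if a site redesign renames or drops it, the scraper breaks.
html = '''
<div data-page-id="mid502019123">first article</div>
<div data-page-id="mid502019123">same article linked twice</div>
<div data-page-id="mid502019124">second article</div>
'''

# Grab the numeric document IDs; set() removes duplicates
matches = re.finditer(r'data-page-id="mid([^"]*)"', html)
ids = sorted(set(m.group(1) for m in matches))
print(ids)  # ['502019123', '502019124']
```

This is the same pattern the quick-fix script further down relies on, which is exactly the kind of dependency that can silently stop matching after a layout update.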

I'll take a look at how the audio recordings are handled, but chances are all solutions are too fragile.

allejok96 commented 4 years ago

May I ask why you need this, and how Python-savvy you are?

allejok96 commented 4 years ago

Yeah, if you can get hold of the document ID there is an API to download the MP3s... But the catch is getting the ID... I'm giving you an unorthodox quick fix here, and it only works for web articles. Tweak it to suit your needs.

#!/usr/bin/env python3
# Run the program with a jw.org URL as an argument to
# download all recordings referenced in that page
import sys, re, urllib.request, urllib.error, json

lang = 'E'
api_url = 'https://apps.jw.org/GETPUBMEDIALINKS?output=json&alllangs=0&fileformat=MP3&langwritten=' + lang + '&txtCMSLang=' + lang + '&docid='
data = urllib.request.urlopen(sys.argv[1]).read().decode('utf-8')
matches = re.finditer('data-page-id="mid([^"]*)"', data)
ids = set(x.group(1) for x in matches)  # set() removes duplicates

for i in ids:
    try:
        print('requesting data about', i)
        response = urllib.request.urlopen(api_url + i)
    except urllib.error.URLError:
        # Skip IDs whose metadata can't be fetched (no recording, network error etc.)
        continue

    tree = json.loads(response.read().decode('utf-8'))
    file_url = tree['files'][lang]['MP3'][0]['file']['url']  # Assuming there's only one MP3
    file_title = tree['files'][lang]['MP3'][0]['title']
    file_name = re.sub(r'[<>:"|?*/\\\0]', '', file_title) + '.mp3'  # Strip characters NTFS forbids (incl. backslash)
    print('downloading', file_title)
    urllib.request.urlretrieve(file_url, filename=file_name)
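For reference, this is the response shape the script assumes when it digs out the MP3 URL. The JSON below is a mocked, heavily abbreviated example (field values are made up; the real payload carries many more fields), shown only so the lookup path is clear without hitting the network:

```python
import json

# Mocked, abbreviated GETPUBMEDIALINKS-style response. The script only
# walks files -> language -> MP3 -> first entry -> file/url and title.
mock = json.loads('''
{
  "files": {
    "E": {
      "MP3": [
        {"title": "Sample Recording",
         "file": {"url": "https://example.org/sample.mp3"}}
      ]
    }
  }
}
''')

lang = 'E'
url = mock['files'][lang]['MP3'][0]['file']['url']
title = mock['files'][lang]['MP3'][0]['title']
print(title, url)  # Sample Recording https://example.org/sample.mp3
```

If any key in that chain is missing for a given document (e.g. no MP3 published), the lookup raises `KeyError`, which is one of the ways this quick fix can fall over.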