bbolli / tumblr-utils

Utilities for dealing with Tumblr blogs, Tumblr backup
GNU General Public License v3.0

Feature: Download notes (replies, likes, etc.) #169

Open humanitiesclinic opened 5 years ago

humanitiesclinic commented 5 years ago

I'm referring to issue #98. What is the status of this? Are notes/comments now downloadable by the backup script? I tried, but I don't see any notes/comments downloaded as of now.

Is this what the --likes option is for?

cebtenzzre commented 5 years ago

If you add notes_info to the API parameters, you can get a ~~full~~ short list of notes for every post downloaded. It'll only do anything useful with -j currently, which dumps the API response for each post into a JSON file. (You'll have to read the JSON to see this information right now.)

The following patch will add this functionality:

diff --git a/tumblr_backup.py b/tumblr_backup.py
index 338c7d3..220dd25 100755
--- a/tumblr_backup.py
+++ b/tumblr_backup.py
@@ -195,7 +195,7 @@ def set_period():

 def apiparse(base, count, start=0):
-    params = {'api_key': API_KEY, 'limit': count, 'reblog_info': 'true'}
+    params = {'api_key': API_KEY, 'limit': count, 'reblog_info': 'true', 'notes_info': 'true'}
     if start > 0:
         params['offset'] = start
     url = base + '?' + urllib.urlencode(params)
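
To illustrate what that buys you, here's a minimal sketch of reading the notes back out of one of the per-post JSON files that -j writes (the file path is hypothetical, and the note fields used here, type, blog_name and reply_text, follow the v2 API docs, so treat them as assumptions):

import json

# Load one of the per-post API dumps written by -j (path is hypothetical).
with open('json/61181809095.json') as f:
    post = json.load(f)

# With notes_info=true each post dict gains a 'notes' list; entries
# carry at least 'type' (like/reblog/reply) and 'blog_name'.
for note in post.get('notes', []):
    line = '%s by %s' % (note['type'], note['blog_name'])
    if note['type'] == 'reply':
        line += ': ' + note.get('reply_text', '')
    print(line)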
wertercatt commented 5 years ago

@Cebtenzzre This patch does not download all the notes, because the Tumblr API returns only 50 of them. https://stackoverflow.com/a/14428010 should help, but you'll need to scrape the /notes/ URL out of the rendered post HTML, and then scrape the paginated URLs out of the /notes/ pages to get each next page of notes.

cebtenzzre commented 5 years ago

import re

import dryscrape
from bs4 import BeautifulSoup

def get_more_link(sess, base, url):
    # Render the page and look for the "Show more notes" link.
    sess.visit(url)
    soup = BeautifulSoup(sess.body(), 'lxml')
    element = soup.find('a', class_='more_notes_link')
    if not element:
        return None
    # The next page's URL is buried in the link's onclick handler,
    # inside a tumblrReq.open('GET', ...) call.
    return base + re.search(r";tumblrReq\.open\('GET','([^']+)'",
                            element['onclick']).group(1)

base = 'https://uri-hyukkie.tumblr.com'
url = base + '/post/61181809095'
session = dryscrape.Session()

# Follow the chain of "more notes" pages until none are left.
while True:
    url = get_more_link(session, base, url)
    if not url:
        break
    print(url)
    session.visit(url)
    soup = BeautifulSoup(session.body(), 'lxml')
    # The last <li> is the "Show more notes" widget itself, not a note.
    notes = soup.find('ol', class_='notes').find_all('li')[:-1]
    for n in notes:
        print(n.prettify())

That's a proof-of-concept script for scraping the notes from a post, based on one linked in another StackOverflow answer by unor. Any remarks before I try to integrate it into tumblr-utils? (I'm technically still learning this language...)

EDIT: Yes, I realize that there are minor issues here, and that I'm doing duplicate work. I'm fixing that in the version I'm working on.
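
For the eventual integration, each raw note <li> would presumably be reduced to a structured record rather than printed. Here's a minimal sketch of that step, reusing the notes list from the script above; the class names (like, reblog, reply) and the a.tumblelog element holding the blog name are assumptions about Tumblr's rendered notes markup, not a documented format:

# Hypothetical reduction of one notes <li> to a dict. The CSS classes
# and the a.tumblelog link are assumptions from inspecting rendered pages.
def parse_note(li):
    classes = li.get('class', [])
    kind = next((c for c in ('like', 'reblog', 'reply') if c in classes), 'other')
    blog = li.find('a', class_='tumblelog')
    note = {'type': kind,
            'blog_name': blog.get_text(strip=True) if blog else None}
    if kind == 'reply':
        quote = li.find('blockquote')  # reply text, if rendered as a blockquote
        if quote:
            note['reply_text'] = quote.get_text(strip=True)
    return note

for n in notes:
    print(parse_note(n))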

cebtenzzre commented 5 years ago

I've made a PR for this (#189).