ArmindoFlores / ao3_api

An unofficial archiveofourown.org (AO3) API for python
MIT License

Get comments #14

Closed · darthnithin closed 3 years ago

darthnithin commented 3 years ago

No matter what I do, I can't get work.get_comments() to work.

Code:

from time import time
import bs4
import requests
import AO3

work = AO3.Work(24560008)
work.load_chapters()
start = time()
comments = work.get_comments(1, 5)
print(f"Loaded {len(comments)} comment threads in {round(time()-start, 1)} seconds\n")
for comment in comments:
    print(f"Comment ID: {comment.comment_id}\nReplies: {len(comment.get_thread())}")

Error:

Traceback (most recent call last):
  File ".\ao3.py", line 8, in <module>
    comments = work.get_comments(1, 5)
  File "C:\Users\nithi\AppData\Local\Programs\Python\Python38-32\lib\site-packages\AO3\works.py", line 272, in get_comments
    ol = div.find("ol", {"class": "pagination actions"})
AttributeError: 'NoneType' object has no attribute 'find'
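The traceback is the classic BeautifulSoup failure mode: `soup.find(...)` returned `None` because no `comments_placeholder` div was present in the fetched page, and the subsequent `.find` call on that `None` then raises. A minimal standalone reproduction (plain `bs4`, no AO3 involved, with a made-up HTML snippet):

```python
from bs4 import BeautifulSoup

# A page *without* the comments_placeholder div, as AO3 apparently
# serves in some cases (e.g. no comments, or changed markup).
html = "<html><body><p>no comments here</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", {"id": "comments_placeholder"})
print(div)  # None -- find() returns None when nothing matches

try:
    # This is the exact call from works.py line 272
    div.find("ol", {"class": "pagination actions"})
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'find'
```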

darthnithin commented 3 years ago

print(work.comments) works, though; it returns the number of comments.

Wrench-wench commented 3 years ago

Could you better format your pasted code for readability please?

darthnithin commented 3 years ago

The code is the same as the example here, but it doesn't work even with a different work id.

from time import time
import bs4
import requests
import AO3
url = "https://archiveofourown.org/works/20125552/chapters/47677465"
workid = AO3.utils.workid_from_url(url)
work = AO3.Work(workid)
work.load_chapters()
start = time()
comments = work.get_comments(1, 5)
print(f"Loaded {len(comments)} comment threads in {round(time()-start, 1)} seconds\n")
for comment in comments:
    print(f"Comment ID: {comment.comment_id}\nReplies: {len(comment.get_thread())}")

Wrench-wench commented 3 years ago

Thank you. I'm getting the same error, with the same message, on my machine too.

This probably requires @ArmindoFlores to take a look.

darthnithin commented 3 years ago

I'm pretty sure it's erroring here:

        string = "work_id" if self.oneshot else "chapter_id" 
        url = f"https://archiveofourown.org/comments/show_comments?page=%d&{string}={chapter_id}"
        soup = self.request(url%1)

        pages = 0
        div = soup.find("div", {"id": "comments_placeholder"})
        ol = div.find("ol", {"class": "pagination actions"})
        if ol is None:
            pages = 1
        else:
            for li in ol.findAll("li"):
                if li.getText().isdigit():
                    pages = int(li.getText())   

        comments = []
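A defensive variant of the page-count logic above (just a sketch of a possible guard, not the library's actual fix) would check `div` before dereferencing it, so a missing placeholder degrades gracefully instead of raising:

```python
from bs4 import BeautifulSoup

def count_comment_pages(soup):
    """Return the number of comment pages, or 0 if the comments
    placeholder div is missing from the fetched page."""
    div = soup.find("div", {"id": "comments_placeholder"})
    if div is None:  # the line the traceback points at
        return 0
    ol = div.find("ol", {"class": "pagination actions"})
    if ol is None:   # no pagination list -> a single page of comments
        return 1
    pages = 1
    for li in ol.findAll("li"):
        if li.getText().isdigit():
            pages = int(li.getText())
    return pages

# Quick check against a hand-written snippet with 3 pages
html = """
<div id="comments_placeholder">
  <ol class="pagination actions">
    <li>1</li><li>2</li><li>3</li><li>Next</li>
  </ol>
</div>
"""
print(count_comment_pages(BeautifulSoup(html, "html.parser")))  # 3
```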

ArmindoFlores commented 3 years ago

I have confirmed this is an issue. I suspect something might have changed on AO3, but I'll look through this project's commit history in case I accidentally changed anything that broke this.

darthnithin commented 3 years ago

Well, I just made an AO3 fic comment scraper myself... I learned Python for this, lol. I'm also getting the same error; I think it errors when a page doesn't have comments, meaning the div wouldn't exist. It's also possible that I'm being rate limited.

from bs4 import BeautifulSoup
import requests
import re
import json

data = {}
data['comments'] = []
workid = 20125552
comarr = []
chapterids = []

# Collect the chapter ids from the work's navigation page
nav = f'https://archiveofourown.org/works/{workid}/navigate'
htmldoc = requests.get(nav).text
soup = BeautifulSoup(htmldoc, 'html.parser')
reg = re.compile(r'/works/\d+/chapters/(\d+)')
for link in soup.find_all('a'):
    hrefi = link.get('href')
    if hrefi:
        cid = re.findall(reg, hrefi)
        if cid:
            chapterids.append(int(cid[0]))

# Walk every comment page of every chapter
for x in chapterids:
    page = True       # reset for each chapter
    pagenumber = 1
    while page:
        url = f'https://archiveofourown.org/chapters/{x}?page={pagenumber}&show_comments=true&view_adult=true#comments'
        htmldoc = requests.get(url).text
        soup = BeautifulSoup(htmldoc, 'html.parser')
        div = soup.find("div", {"id": "comments_placeholder"})
        print(url)
        ol = None     # avoid a NameError when the div is missing
        if div:
            ol = div.find("ol", {"class": "pagination actions"})
            comment = div.find_all("blockquote", {"class": "userstuff"})
            for each in comment:
                ab = str(each.get_text())
                comarr.append(ab)
                data['comments'].append({
                    'chapterid': x,
                    'commenttext': ab
                })
        else:
            print("DIV is nonetype")

        if ol:
            page = ol.find('a', {'rel': 'next'})
            if page:
                pagenumber += 1
                continue
        else:
            break
    print(pagenumber)

with open('data.txt', 'w') as outfile:
    json.dump(data, outfile)
for y in comarr:
    print(y)

darthnithin commented 3 years ago

Yup, I'm being rate limited. [screenshot]

ArmindoFlores commented 3 years ago

AO3 will throw an HTTPError if it gets rate-limited. Also, these changes are implemented only in the new 2.0.4 version.
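Since AO3 rate-limits aggressive scraping, a common workaround is to retry with exponential backoff when a request fails. A generic sketch with an injectable fetch function (the names `fetch_with_backoff`, `fetch`, `max_tries`, and `base_delay` are illustrative, not part of ao3_api):

```python
import time

def fetch_with_backoff(fetch, max_tries=5, base_delay=1.0):
    """Call fetch() until it succeeds, sleeping base_delay * 2**attempt
    between failed attempts. fetch should raise on a rate-limit hit
    (e.g. the HTTPError the wrapper throws)."""
    for attempt in range(max_tries):
        try:
            return fetch()
        except Exception:
            if attempt == max_tries - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)

# Example: a fake fetcher that fails twice and then succeeds
calls = {"n": 0}
def fake_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "page html"

print(fetch_with_backoff(fake_fetch, base_delay=0.01))  # page html
```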

darthnithin commented 3 years ago

Indeed it did. I was just ignoring it, lol...