Closed aguerradelgado closed 1 year ago
Greenwell's response: From Grace:
We scrape Goodreads for the genre, author, etc. I have noticed that it only renders about 50% of the time. It's not grabbing the author, for instance, 50% of the time.
The issue is that the code scrapes each piece of information separately, in different functions. Instead, just make one dictionary of all this information (not every book has every field, so some won't always show up). From my old functional code, it looks like you need to get the author link and then split the author out of that. Note that this is more than you need: most of the dictionary keys reference return values from functions I created. You can still use your own functions, but they should parse the text you already have and return what you want rather than making new calls:
import time
from urllib.request import urlopen

import bs4


def scrape_book(book_id):
    url = 'https://www.goodreads.com/book/show/' + book_id
    source = urlopen(url)  # I was using the old urllib.request library
    soup = bs4.BeautifulSoup(source, 'html.parser')

    time.sleep(2)  # pause for two seconds after each request

    # The get_* helpers are functions I created; each one parses the same soup.
    return {'book_id_title': book_id,
            'book_id': get_id(book_id),
            'book_title': ' '.join(soup.find('h1', {'id': 'bookTitle'}).text.split()),
            'book_series': get_series_name(soup),
            'book_series_uri': get_series_uri(soup),
            'isbn': get_isbn(soup),
            'isbn13': get_isbn13(soup),
            'year_first_published': get_year_first_published(soup),
            'authorlink': soup.find('a', {'class': 'authorName'})['href'],
            'author': ' '.join(soup.find('span', {'itemprop': 'name'}).text.split()),
            'num_pages': get_num_pages(soup),
            'genres': get_genres(soup),
            'shelves': get_shelves(soup),
            'lists': get_all_lists(soup),
            'num_ratings': soup.find('meta', {'itemprop': 'ratingCount'})['content'].strip(),
            'num_reviews': soup.find('meta', {'itemprop': 'reviewCount'})['content'].strip(),
            'average_rating': soup.find('span', {'itemprop': 'ratingValue'}).text.strip(),
            'rating_distribution': get_rating_distribution(soup)}
In the above I didn't need a function for the author because it was in the link under a reliable identifier (authorName). I don't know if that is still true, but start there.
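If the author name has to be split out of the author link, something like the following might work. This is only a sketch: the function name author_from_link is hypothetical, and it assumes the link ends in a segment shaped like `<numeric id>.<Name_With_Underscores>`, which may not match what Goodreads serves today. The example URL is made up for illustration.

```python
from typing import Optional
from urllib.parse import urlparse


def author_from_link(author_link: str) -> Optional[str]:
    """Guess an author name from an author URL (hypothetical helper).

    ASSUMPTION: the last path segment looks like '<id>.<Name_With_Underscores>'.
    Returns None when the link does not have that shape.
    """
    # Take the last path segment, ignoring any query string or trailing slash.
    slug = urlparse(author_link).path.rstrip('/').rsplit('/', 1)[-1]
    # Split the numeric id from the name at the first dot.
    _, sep, name = slug.partition('.')
    if not sep or not name:
        return None
    return name.replace('_', ' ')


if __name__ == '__main__':
    # Illustrative URL only, not a real author page.
    print(author_from_link('https://www.goodreads.com/author/show/12345.Jane_Doe'))
```

The name recovered this way is approximate, since punctuation such as the periods in initials is not preserved in the link, so the text of the author tag is probably still the better source when it is present.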
closed by grace
Current code causes race conditions, so it does not always scrape the information.
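Because not every book has every field, a lookup that returns None instead of raising may make it easier to tell "field absent on this page" from "scrape failed". A minimal sketch, assuming the soup object is the one already built in scrape_book; the helper names text_or_none and attr_or_none are hypothetical and not part of the original code.

```python
from typing import Any, Optional


def text_or_none(soup: Any, name: str, attrs: dict) -> Optional[str]:
    """Return the whitespace-normalized text of the first matching tag, or None."""
    tag = soup.find(name, attrs)
    if tag is None:
        return None
    return ' '.join(tag.text.split())


def attr_or_none(soup: Any, name: str, attrs: dict, key: str) -> Optional[Any]:
    """Return one attribute of the first matching tag, or None if either is missing."""
    tag = soup.find(name, attrs)
    if tag is None:
        return None
    value = tag.get(key)
    # String attributes (href, content) are stripped; anything else is returned as-is.
    return value.strip() if isinstance(value, str) else value
```

In the dictionary above, the author entry could then be written as `text_or_none(soup, 'span', {'itemprop': 'name'})` and the link as `attr_or_none(soup, 'a', {'class': 'authorName'}, 'href')`, so a missing tag yields None for that key rather than an exception for the whole book.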