aguerradelgado / goodreads_webscraper_2


fix 50% functionality issue #16

Closed aguerradelgado closed 1 year ago

aguerradelgado commented 1 year ago

The current code causes a race condition: it does not always scrape the information.

aguerradelgado commented 1 year ago

Greenwell's response: From Grace:

We scrape Goodreads for the genre, author, etc. I have noticed that it only renders about 50% of the time. It's not grabbing the author, for instance, 50% of the time.

The issue is that you are trying to scrape all this information one piece at a time in different functions. Instead, just make a single dictionary of all this information (not every book has every field, so some won't always show up). From my old functional code, it looks like you need to get the author link and then split the author out of that. I'll note this is more than you need, as most of the dictionary keys reference returns from functions I created. You can still use your own functions, but they just need to parse the text that was already fetched and return what you want, rather than making new calls:

import time
from urllib.request import urlopen

import bs4

# get_id, get_series_name, get_isbn, etc. are helper functions defined elsewhere;
# each one only parses the soup that was already fetched.

def scrape_book(book_id):
    url = 'https://www.goodreads.com/book/show/' + book_id
    source = urlopen(url)  # using the old urllib.request library I was
    soup = bs4.BeautifulSoup(source, 'html.parser')

    time.sleep(2)  # pause between requests

    # using functional programming with an anonymous dictionary being used
    # as a dispatch table and return value
    # i.e. it returns a dictionary where 'isbn' is the key for whatever
    # get_isbn(soup) returned
    return {'book_id_title':        book_id,
            'book_id':              get_id(book_id),
            'book_title':           ' '.join(soup.find('h1', {'id': 'bookTitle'}).text.split()),
            'book_series':          get_series_name(soup),
            'book_series_uri':      get_series_uri(soup),
            'isbn':                 get_isbn(soup),
            'isbn13':               get_isbn13(soup),
            'year_first_published': get_year_first_published(soup),
            'authorlink':           soup.find('a', {'class': 'authorName'})['href'],
            'author':               ' '.join(soup.find('span', {'itemprop': 'name'}).text.split()),
            'num_pages':            get_num_pages(soup),
            'genres':               get_genres(soup),
            'shelves':              get_shelves(soup),
            'lists':                get_all_lists(soup),
            'num_ratings':          soup.find('meta', {'itemprop': 'ratingCount'})['content'].strip(),
            'num_reviews':          soup.find('meta', {'itemprop': 'reviewCount'})['content'].strip(),
            'average_rating':       soup.find('span', {'itemprop': 'ratingValue'}).text.strip(),
            'rating_distribution':  get_rating_distribution(soup)}

In the above I didn't need a function for author because the author link had a stable identifier (the authorName class). I don't know if that is still true, but start there.
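
A minimal sketch of the two points above: build the whole dictionary from one parsed page while tolerating missing fields, and split the author out of the author link. The names `safe`, `build_record`, and `author_from_link` are hypothetical, and the author URL shape (`/author/show/<id>.<Name_With_Underscores>`) is an assumption that should be checked against the live site. A plain dict stands in for the parsed page so this runs without a network call. With BeautifulSoup, `soup.find(...)` returns `None` when an element is absent, so `.text` raises `AttributeError` and `['href']` raises `TypeError`, which are the errors the helper catches.

```python
from urllib.parse import urlparse


def safe(extract, page, default=None):
    """Run one extractor; return `default` if the field is not on the page."""
    try:
        return extract(page)
    except (AttributeError, TypeError, KeyError, IndexError):
        return default


def author_from_link(href):
    """Split an author name out of an author link.

    Assumes (not verified) a path ending in '<id>.<Name_With_Underscores>'.
    """
    slug = urlparse(href).path.rstrip('/').split('/')[-1]
    _, _, name = slug.partition('.')
    return name.replace('_', ' ') or None


def build_record(page, extractors):
    """Apply every extractor to the same already-parsed page."""
    return {key: safe(fn, page) for key, fn in extractors.items()}


# Stand-in for a parsed page; the values here are made up for illustration.
page = {
    'title': '  Some   Title ',
    'authorlink': 'https://www.goodreads.com/author/show/12345.Jane_Doe',
}

extractors = {
    'book_title': lambda p: ' '.join(p['title'].split()),
    'author':     lambda p: author_from_link(p['authorlink']),
    'isbn':       lambda p: p['isbn'].strip(),  # absent on this page
}

record = build_record(page, extractors)
print(record)
```

Here `isbn` comes back as `None` instead of raising, so one missing field does not take down the rest of the record. In the real scraper each lambda would take the `soup` object and call `soup.find(...)` or one of the existing `get_*` functions.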

aguerradelgado commented 1 year ago

Closed by Grace.