achaudhury / shmoop-corpus

The Shmoop Corpus
MIT License

get_summaries returns nothing? #2

Open gregdurrett opened 4 years ago

gregdurrett commented 4 years ago

Hi,

Today I tried to run get_summaries.py. The end of the output looks like this:

"""
  1. Where Angels Fear to Tread <<<
  2. The White Devil <<<
  3. White Fang <<<
  4. The Wings of the Dove <<<
  5. The Winters Tale <<<
  6. The Woman in White <<<
  7. The Wonderful Wizard of Oz <<<
  8. Wuthering Heights <<<
  9. The Jungle Book <<<
"""

and all of the directories under summaries/ are empty after the script is run.

The problem seems to occur when it fetches this address: 'https://www.shmoop.com/20000-leagues-under-the-sea/summary.html'

this code:

soup = BeautifulSoup(urllib.request.urlopen(html_address), "html.parser")
sections = soup.findAll("li", {"data-class" : "SHEvent"})

returns an empty list of sections. There's a redirect in there, so maybe the script doesn't handle that correctly?
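A rough way to separate the two possibilities (a redirect versus markup that no longer matches) is sketched below. `urllib.request.urlopen` follows HTTP-level redirects on its own, and `geturl()` on the response reports the URL that was finally retrieved, so comparing it with the requested address shows whether a redirect happened. The names `AttrCounter`, `count_matches`, and `diagnose` are hypothetical and not part of get_summaries.py, and the sketch uses the standard-library `html.parser` rather than BeautifulSoup only so that it runs without extra installs.

```python
from html.parser import HTMLParser
import urllib.request


class AttrCounter(HTMLParser):
    """Count start tags of a given name that carry a given attribute value."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.tag, self.attr, self.value = tag, attr, value
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names
        # arrive lower-cased, attribute values keep their case.
        if tag == self.tag and (self.attr, self.value) in attrs:
            self.count += 1


def count_matches(html, tag="li", attr="data-class", value="SHEvent"):
    """Return how many <tag attr="value"> elements appear in the HTML text."""
    counter = AttrCounter(tag, attr, value)
    counter.feed(html)
    return counter.count


def diagnose(html_address):
    """Fetch a page and report whether it redirected and how many items match.

    Hypothetical helper: it performs a real network request when called.
    """
    with urllib.request.urlopen(html_address) as response:
        final_url = response.geturl()  # URL after any HTTP redirects
        html = response.read().decode("utf-8", errors="replace")
    return {
        "redirected": final_url != html_address,
        "final_url": final_url,
        "matches": count_matches(html),
    }
```

If `diagnose(html_address)` reports a page that was fetched successfully but has zero matches, the selector rather than the redirect is the more likely culprit. A redirect performed by JavaScript or a meta tag would not show up in `geturl()`.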

Please advise, thanks!

Greg

makarandtapaswi commented 4 years ago

Hi Greg, from what I remember, the HTML looks a bit different from what it used to be, so my guess is that our parser can no longer find the correct class item. I'll need a few days to look at it, though; I hope that's OK. If you manage to fix it in the meantime, please send a pull request. Thanks!
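One way to see what the page uses now, assuming the new layout still marks its sections with some `data-class` attribute (which is only a guess), is to tally every value of that attribute in a saved copy of the page. The names `AttrValueTally` and `tally_attr_values` are hypothetical, and the sketch again relies on the standard-library `html.parser` instead of BeautifulSoup.

```python
from collections import Counter
from html.parser import HTMLParser


class AttrValueTally(HTMLParser):
    """Tally (tag, value) pairs for every occurrence of one attribute."""

    def __init__(self, attr):
        super().__init__()
        self.attr = attr
        self.tally = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Skip valueless attributes, which the parser reports as None.
            if name == self.attr and value is not None:
                self.tally[(tag, value)] += 1


def tally_attr_values(html, attr="data-class"):
    """Return a Counter mapping (tag, attribute value) to its frequency."""
    parser = AttrValueTally(attr)
    parser.feed(html)
    return parser.tally
```

Calling `tally_attr_values(html).most_common()` on the fetched page lists the candidates by frequency. If nothing comes back, the sections are probably identified some other way, for example by a plain `class`, which the `attr` parameter can be pointed at instead.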

gregdurrett commented 4 years ago

No huge rush, we just wanted to explore this for a possible project in the fall semester. Thanks!

amith-ananthram commented 3 years ago

hi @achaudhury / @makarandtapaswi, i've fixed the parser to handle the new layout on Shmoop.com. just tried pushing the branch to open a PR with the changeset but received:

(base) amith@Amiths-MBP shmoop-corpus % git push origin fix-parsing
remote: Permission to achaudhury/shmoop-corpus.git denied to amith-ananthram.
fatal: unable to access 'https://github.com/achaudhury/shmoop-corpus.git/': The requested URL returned error: 403

would you mind opening up the permissions for me to share the PR? thanks!

amith-ananthram commented 3 years ago

nvm @achaudhury / @makarandtapaswi, was able to do it via forking; here it is: https://github.com/achaudhury/shmoop-corpus/pull/3. for some reason it's not letting me add you as reviewers but presumably you can still review it!

makarandtapaswi commented 3 years ago

Added the PR, thanks a lot @amith-ananthram! Currently running it to check whether there are major changes from our version. Will ping back here and close the issue.