gregdurrett opened this issue 4 years ago (status: Open)
Hi Greg, from what I remember, the HTML looks a bit different from what it used to be. So my guess is that our parser can no longer find the correct class item. I'll need a few days to look at it though, hope that's ok. If you manage to fix it in the meantime, please send a pull request. Thanks!
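To illustrate the suspected failure mode, here is a minimal sketch (not the actual shmoop-corpus parser) of how a class-based lookup silently returns nothing once the site renames its markup. The class names `summary-section` and `content-block` and the helper `find_sections` are hypothetical, chosen only for the example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassCollector(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0       # > 0 while we are inside a matching element
        self.sections = []   # one text entry per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.wanted_class in classes:
            self.depth = 1
            self.sections.append("")

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.sections[-1] += data


def find_sections(html, wanted_class):
    """Return the stripped text of each element with class `wanted_class`."""
    parser = ClassCollector(wanted_class)
    parser.feed(html)
    return [text.strip() for text in parser.sections]


# Hypothetical markup before and after a site redesign.
old_html = '<div class="summary-section"><p>Chapter 1</p></div>'
new_html = '<div class="content-block"><p>Chapter 1</p></div>'

print(find_sections(old_html, "summary-section"))
print(find_sections(new_html, "summary-section"))
```

The second call finds no sections and raises no error, which would match the empty `summaries/` directories: the script runs to completion but has nothing to write.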
No huge rush, we just wanted to explore this for a possible project in the fall semester. Thanks!
hi @achaudhury / @makarandtapaswi, i've fixed the parser to handle the new layout on Shmoop.com. just tried pushing the branch to open a PR with the changeset but received:
(base) amith@Amiths-MBP shmoop-corpus % git push origin fix-parsing
remote: Permission to achaudhury/shmoop-corpus.git denied to amith-ananthram.
fatal: unable to access 'https://github.com/achaudhury/shmoop-corpus.git/': The requested URL returned error: 403
would you mind opening up the permissions for me to share the PR? thanks!
nvm @achaudhury / @makarandtapaswi, was able to do it via forking; here it is: https://github.com/achaudhury/shmoop-corpus/pull/3. for some reason it's not letting me add you as reviewers but presumably you can still review it!
Added the PR, thanks a lot @amith-ananthram! I'm currently running it to check whether there are major changes from our version. Will ping back here and close the issue.
Hi,
Today I tried to run get_summaries.py. The end of the output looks like this:
"""
and all of the directories under summaries/ are empty after the script is run.
The problem seems to be when it fetches from this address: 'https://www.shmoop.com/20000-leagues-under-the-sea/summary.html'
this code:
returns an empty list of sections. There's a redirect in there, so maybe the script doesn't handle that correctly?
Please advise, thanks!
Greg
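One way to check the redirect hypothesis is to compare the URL that was requested with the URL that was actually served. The sketch below uses only the standard library and is not taken from get_summaries.py; the helper names `was_redirected` and `fetch` are hypothetical, and it assumes the site answers a plain `urlopen` request.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen


def was_redirected(requested_url, final_url):
    """Return True if the final URL differs from the request in host or path.

    Trailing slashes and host case are ignored; scheme changes are not counted.
    """
    a, b = urlsplit(requested_url), urlsplit(final_url)
    return (a.netloc.lower(), a.path.rstrip("/")) != (
        b.netloc.lower(),
        b.path.rstrip("/"),
    )


def fetch(url, timeout=10):
    """Fetch a page and return (final_url, body).

    urlopen follows HTTP redirects on its own, so final_url may differ from url.
    """
    with urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.geturl(), body
```

Calling `fetch` on the address above and passing both URLs to `was_redirected` would show whether the page is served from a different location. If it is, the parser may be looking at a page with a different layout than the one it was written for.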