Closed by rwoodpecker 8 years ago
You should file this with wpull; I have no idea what goes on in wpull's phantomjs support. grab-site's --wpull-args=--phantomjs is deliberately unsupported and undocumented.
If there's any way to archive it without phantomjs, I would recommend that. Reddit itself doesn't need JS execution, AFAIK. If you're using phantomjs for offsite links, perhaps mass-submitting URLs to archive.is would work better.
I've done a little testing, and without phantomjs I can't get the previous and next page buttons to function, so you can't really browse reddit. However, it does still seem to grab all the posts.
What do you mean? I see a 'next ›' button at the bottom of https://www.reddit.com/r/subreddit/ that works without running any JavaScript in Firefox. If you mean another subreddit, maybe they're just hiding the buttons with CSS?
I should have clarified. It seems that without phantomjs (regardless of the subreddit) the next button doesn't 'work' because the crawl hasn't grabbed the URL with the ?count= query string that gets appended when the 'next' button is clicked. So in the WARC the next page cannot be displayed, and I can't browse to it because it was never grabbed. For example:
https://www.reddit.com/r/subreddit/?count=25&after=t3_3u0g6r
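To make the pagination parameters concrete, here is a minimal Python sketch that pulls `count` and `after` out of a listing URL like the one above. The helper name `listing_params` is made up for illustration; it is not part of grab-site or wpull.

```python
from urllib.parse import urlsplit, parse_qs


def listing_params(url):
    """Return the count/after pagination parameters of a listing URL.

    Hypothetical helper for illustration only; not part of grab-site or wpull.
    """
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each key to a list of values; take the first, or None if absent.
    return {key: query.get(key, [None])[0] for key in ("count", "after")}


print(listing_params("https://www.reddit.com/r/subreddit/?count=25&after=t3_3u0g6r"))
# prints {'count': '25', 'after': 't3_3u0g6r'}
```

A URL with no query string yields `None` for both keys, which is one quick way to tell a first page from a 'next' page when looking through a list of crawled URLs.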
I see my crawls grabbing after= pages, but I can think of two reasons why you might see broken Next links.
1) If you're sorting the subreddit first by 'new' or 'top' or something else, the Next links aren't crawled because this ignore skips over them: https://github.com/ludios/grab-site/blob/master/libgrabsite/ignore_sets/reddit#L9
2) Perhaps a redirect lands you on a page for a second time, so whichever page webarchiveplayer (or similar) picks up has the wrong after= link?
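To tell these cases apart, it may help to compare the 'next' links found in the archived pages against the URLs that were actually captured. The sketch below assumes you already have both lists as plain strings (for example, exported from your WARC tooling by whatever means you prefer); the function names and the sample data are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def has_after(url):
    """True if the URL carries an after= pagination parameter."""
    return "after" in parse_qs(urlsplit(url).query)


def missing_next_pages(captured_urls, next_links):
    """Return the after= links that appear in pages but were never captured.

    Both arguments are assumed to be iterables of absolute URL strings.
    """
    captured = set(captured_urls)
    return sorted(
        link for link in set(next_links)
        if has_after(link) and link not in captured
    )


# Hypothetical sample data, not taken from a real crawl.
captured = [
    "https://www.reddit.com/r/subreddit/",
    "https://www.reddit.com/r/subreddit/?count=25&after=t3_3u0g6r",
]
links = [
    "https://www.reddit.com/r/subreddit/?count=25&after=t3_3u0g6r",
    "https://www.reddit.com/r/subreddit/new/?count=25&after=t3_3u0g6r",
]
print(missing_next_pages(captured, links))
```

If the missing links are all under a sorted view such as /new/ or /top/, that points to the ignore set; if an after= link is missing for the default listing, the redirect explanation is more likely.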
I've been having this issue when using phantomjs (1.9.8) on reddit.
My command looks like:
Things seem to start up fine... But then.
Nothing happens from this point and I have to forcefully terminate the grab. Usually I have to run the grab and kill it 2-3 times before it works as it should. I've noticed that if I terminate gs-server and restart it, I don't always receive the error above and things normally work, but not always. Sorry, I can't really be more precise about this behaviour.