ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

[SSL: UNSUPPORTED_PROTOCOL] with phantomjs #68

Closed (rwoodpecker closed this 8 years ago)

rwoodpecker commented 8 years ago

I've been having this issue when using phantomjs (1.9.8) on reddit.

My command looks like:

grab-site https://www.reddit.com/r/subreddit/ --no-offsite-links --concurrency 3 --wpull-args="--phantomjs --phantomjs-scroll 0 --phantomjs-exe=$HOME/phantomjs/bin/phantomjs" --igsets=reddit --delay=100-250

Things seem to start up fine, but then:

^https?://{primary_netloc}/search(/label/[^\?]+|)\?updated-(min|max)=\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d.*&max-results=\d+
https://www.reddit.com/r/subreddit/ ...
ERROR Proxy error
Traceback (most recent call last):
  File "/home/gs/.local/lib/python3.4/site-packages/wpull/proxy/server.py", line 57, in __call__
    yield From(session())
  File "/home/rgs/.local/lib/python3.4/site-packages/trollius/tasks.py", line 250, in _step
    result = coro.throw(exc)
  File "/home/gs/.local/lib/python3.4/site-packages/wpull/proxy/server.py", line 111, in __call__
    yield From(self._start_tls())
  File "/home/gs/.local/lib/python3.4/site-packages/trollius/tasks.py", line 252, in _step
    result = coro.send(value)
  File "/home/gs/.local/lib/python3.4/site-packages/wpull/proxy/server.py", line 207, in _start_tls
    ssl_socket.do_handshake()
  File "/usr/lib/python3.4/ssl.py", line 804, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL: UNSUPPORTED_PROTOCOL] unsupported protocol (_ssl.c:600)

Nothing happens from this point and I have to forcefully terminate the grab. Usually I have to run the grab and kill it 2-3 times before it works as it should. I've noticed that if I terminate gs-server and restart it, I don't always get the error above and things usually work, but not always. Sorry, I can't really be more precise about this behaviour.

ivan commented 8 years ago

You should file this with wpull; I have no idea what goes on in wpull's phantomjs support. grab-site's --wpull-args=--phantomjs is deliberately unsupported and undocumented.

If there's any way to archive it without phantomjs, I would recommend that. Reddit itself doesn't need JS execution, AFAIK. If you're using phantomjs for offsite links, perhaps mass-submitting URLs to archive.is would work better.

rwoodpecker commented 8 years ago

I've done a little testing. Without phantomjs I can't get the previous and next page buttons to work, so you can't really browse reddit in the archive. It does still seem to grab all the posts, though.

ivan commented 8 years ago

What do you mean? I see a 'next ›' button at the bottom of https://www.reddit.com/r/subreddit/ that works without running any JavaScript in Firefox. If you mean another subreddit, maybe they're just hiding the buttons with CSS?

rwoodpecker commented 8 years ago

I should have clarified. It seems that without phantomjs (regardless of the subreddit) the next button doesn't 'work' because the crawl hasn't grabbed the URL with the ?count=...&after=... query string that the 'next' button points to. So in the WARC the next page can't be displayed, and I can't browse because that page was never grabbed. For example:

https://www.reddit.com/r/subreddit/?count=25&after=t3_3u0g6r
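
For illustration, the pagination parameters in a URL like the one above can be pulled apart with Python's standard library, which makes it easier to check whether a given page is the one the 'next' button refers to. This is only a sketch; the helper name `listing_params` is made up for this example and is not part of grab-site or wpull.

```python
from urllib.parse import urlsplit, parse_qs

def listing_params(url):
    """Return the query parameters of a listing URL as a flat dict.

    Hypothetical helper for illustration; not a grab-site or wpull function.
    """
    query = urlsplit(url).query
    # parse_qs maps each key to a list of values; keep the first value of each.
    return {key: values[0] for key, values in parse_qs(query).items()}

next_url = "https://www.reddit.com/r/subreddit/?count=25&after=t3_3u0g6r"
params = listing_params(next_url)
print(params)
```

If the `after` value from a page's 'next' link never shows up among the crawled URLs, that would suggest the page was not grabbed.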

ivan commented 8 years ago

I see my crawls grabbing after= pages, but I can think of two reasons why you might see broken Next links.

1) If you're sorting the subreddit first by 'new' or 'top' or something else, the Next links aren't crawled because this ignore skips over them: https://github.com/ludios/grab-site/blob/master/libgrabsite/ignore_sets/reddit#L9

2) Perhaps a redirect lands you on a page for a second time, so whichever page webarchiveplayer (or similar) picks up has the wrong after= link?
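
To make the first possibility concrete, here is a rough sketch of how a regex ignore with a `{primary_netloc}` placeholder could skip paginated 'new'/'top' listings while leaving the default listing alone. The pattern below is hypothetical and written only for this example; it is not copied from the reddit ignore set linked above, and the substitution and matching logic is an assumption about how such ignores might be applied, not grab-site's actual code.

```python
import re
from urllib.parse import urlsplit

# HYPOTHETICAL pattern for illustration only; not the real line from the
# reddit ignore set.
HYPOTHETICAL_IGNORE = r"^https?://{primary_netloc}/r/[^/]+/(new|top)/\?.*\bafter="

def compile_ignore(pattern, start_url):
    """Fill in the {primary_netloc} placeholder from the start URL and compile."""
    netloc = urlsplit(start_url).netloc
    return re.compile(pattern.replace("{primary_netloc}", re.escape(netloc)))

def is_ignored(url, compiled):
    """True if the URL matches the compiled ignore pattern."""
    return compiled.search(url) is not None

ignore = compile_ignore(HYPOTHETICAL_IGNORE, "https://www.reddit.com/r/subreddit/")
default_next = "https://www.reddit.com/r/subreddit/?count=25&after=t3_3u0g6r"
sorted_next = "https://www.reddit.com/r/subreddit/new/?count=25&after=t3_3u0g6r"
print(is_ignored(default_next, ignore), is_ignored(sorted_next, ignore))
```

Under these assumptions the default listing's Next link would be crawled and the sorted listing's Next link would be skipped, which would match the behaviour described in 1).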