ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

QS hangs indefinitely #81

Closed: tarrow closed this issue 7 years ago

tarrow commented 8 years ago

@petermr commented on Thu Jun 02 2016

This URL has been retried and has been failing for ca. 10 mins...

http://dx.doi.org/10.4172/2157-7471.1000s4-003

@blahah commented on Thu Jun 02 2016

Should this be in the quickscrape repo?


@tarrow commented on Thu Jun 02 2016

Yes, I will move it :)

petermr commented 8 years ago

Here are some DOI roots that hang; it's possible that some of these are due to paywalls. A quick probe sketch follows the list.

http://dx.doi.org/10.1017/s0030605316000028
http://dx.doi.org/10.5376/pgt.2015.06.0009
http://dx.doi.org/10.5586/asbp.2006.008
http://dx.doi.org/10.4172/2157-7471.1000s4-003
http://dx.doi.org/10.2903/j.efsa.2013.3069
http://dx.doi.org/10.11623/frj.2013.21.3.25
http://dx.doi.org/10.1094/pdis-02-11-0078-sr.testissue
http://dx.doi.org/10.1017/s0021859613000543
http://dx.doi.org/10.1007/s40858-015-0043-7.
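
One quick way to separate the URLs that genuinely hang from those that just fail fast (paywall pages, hard errors) is to give each request a hard timeout and print either the final status code or the error code. This is only a sketch using the request package, not part of quickscrape; the timeout value is arbitrary.

var request = require('request');

var dois = [
  'http://dx.doi.org/10.1017/s0030605316000028',
  'http://dx.doi.org/10.5376/pgt.2015.06.0009',
  'http://dx.doi.org/10.4172/2157-7471.1000s4-003'
  // ... remaining URLs from the list above
];

dois.forEach(function (uri) {
  request({ uri: uri, timeout: 30000 }, function (err, response) {
    if (err) {
      // ETIMEDOUT / ESOCKETTIMEDOUT suggest a hang; other codes are hard failures
      console.log(uri, '->', err.code || err.message);
    } else {
      // a 2xx here means the URL resolves; a 4xx/5xx is a paywall or server error
      console.log(uri, '->', response.statusCode);
    }
  });
});
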
tarrow commented 8 years ago

For the first URL we get a 520 error from Cloudflare. I'm not sure why we get this error page even when we bypass thresher and use curl instead. The error looks like this:

{ request: 
   { debugId: 1,
     uri: 'http://journals.cambridge.org//abstract_S0030605316000028',
     method: 'GET',
     headers: 
      { referer: 'http://journals.cambridge.org//abstract_S0030605316000028',
        host: 'journals.cambridge.org' } } }
{ response: 
   { debugId: 1,
     headers: 
      { date: 'Tue, 14 Jun 2016 08:48:17 GMT',
        'content-type': 'text/html; charset=UTF-8',
        'transfer-encoding': 'chunked',
        connection: 'close',
        'set-cookie': [Object],
        pragma: 'no-cache',
        'x-frame-options': 'SAMEORIGIN',
        server: 'cloudflare-nginx',
        'cf-ray': '2b2c85b94b9f0a6c-LHR' },
     statusCode: 520,
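
For reference, this can be reproduced outside thresher with the same request package that produced the debug output above. A minimal sketch (not quickscrape's actual code) that follows the DOI redirect, applies a timeout so nothing hangs, and reports the final status code and Cloudflare's server header:

var request = require('request');

request({
  uri: 'http://dx.doi.org/10.1017/s0030605316000028',
  method: 'GET',
  followRedirect: true,   // follow dx.doi.org -> journals.cambridge.org
  timeout: 30000          // fail fast instead of hanging
}, function (err, response) {
  if (err) {
    console.log('request failed:', err.code || err.message);
    return;
  }
  console.log('final URL:  ', response.request.uri.href);    // URL after the DOI redirect
  console.log('status code:', response.statusCode);          // 520 in the log above
  console.log('server:     ', response.headers.server);      // 'cloudflare-nginx'
});

With the first URL this should end at journals.cambridge.org with a 520, matching the debug output, rather than hanging.
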
petermr commented 8 years ago

Is the following useful? https://support.cloudflare.com/hc/en-us/articles/200171936-Error-520-Web-server-is-returning-an-unknown-error

tarrow commented 8 years ago

Unfortunately, since we don't know the final origin server, there isn't much we can do. I'm now working on making thresher simply move on from these errors to the next URL.
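
A minimal sketch of that "move on" behaviour (the scrapeUrl callback here is hypothetical, not thresher's real API): each URL gets a deadline, and a timeout or error is logged before continuing with the next URL instead of stalling the whole batch.

function scrapeAll(urls, scrapeUrl, perUrlTimeoutMs) {
  var next = function (i) {
    if (i >= urls.length) return;
    var done = false;

    // deadline: if the scrape neither succeeds nor errors in time, skip it
    var timer = setTimeout(function () {
      if (done) return;
      done = true;
      console.log('timed out, skipping:', urls[i]);
      next(i + 1);
    }, perUrlTimeoutMs);

    scrapeUrl(urls[i], function (err) {
      if (done) return;
      done = true;
      clearTimeout(timer);
      if (err) console.log('failed, skipping:', urls[i], err.message);
      next(i + 1);
    });
  };
  next(0);
}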

petermr commented 8 years ago

Thanks - this is a useful first step. We need enough output that we can log this and - perhaps - create blacklists. E.g. if a publisher fails consistently - say 100/100 - then it's a waste to continue. If they fail 50/100, it will depend on the cost of timeouts. If it's 1/100, then it's worthwhile.
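
One possible shape for that blacklist bookkeeping (a sketch, not existing quickscrape code), keyed on the DOI prefix as a rough stand-in for the publisher:

var stats = {};  // prefix -> { tried: n, failed: n }

function prefixOf(doiUrl) {
  // '10.1017' from 'http://dx.doi.org/10.1017/s0030605316000028'
  return doiUrl.replace('http://dx.doi.org/', '').split('/')[0];
}

function record(doiUrl, failed) {
  var p = prefixOf(doiUrl);
  stats[p] = stats[p] || { tried: 0, failed: 0 };
  stats[p].tried += 1;
  if (failed) stats[p].failed += 1;
}

function isBlacklisted(doiUrl, minTries, maxFailRate) {
  var s = stats[prefixOf(doiUrl)];
  if (!s || s.tried < minTries) return false;       // not enough evidence yet
  return (s.failed / s.tried) >= maxFailRate;       // e.g. 100/100 -> skip
}

With minTries = 100 and maxFailRate = 1.0 this would only skip publishers that fail 100/100; lowering maxFailRate trades retries against the cost of timeouts, as described above.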

tarrow commented 7 years ago

This is almost certainly a duplicate of #62