daneads / pypatent

Search for and retrieve US Patent and Trademark Office Patent Data
GNU General Public License v3.0

ConnectionError: ('Connection aborted.', BadStatusLine('Error #2000\n',)) #5

Open · katelynstenger opened this issue 5 years ago

katelynstenger commented 5 years ago

My script iterates through a list of patents I want to collect information on. I initially received this error: `Exception is: ('Connection aborted.', error(10054, ''))`. Introducing a `time.sleep(2)` between calls to `pypatent.Search` remediated that error.

On the 5th iteration of `pypatent.Search()`, I received this error: `ConnectionError: ('Connection aborted.', BadStatusLine('Error #2000\n',))`

Any suggestions on remediating this error? Thank you for your help in advance!
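
For context, the loop looks roughly like this (a simplified sketch of my script with a retry idea folded in; the patent numbers here are placeholders, and my real numbers live in a DataFrame as in the traceback below):

```python
import time
import pandas as pd
import pypatent as pyp
from requests.exceptions import ConnectionError

# Placeholder patent numbers; the real df is loaded from a file.
df = pd.DataFrame([['9000000', '9000001', '9000002']])

def patent_info(patent_number, retries=3, wait=2.0):
    # Retry on connection errors, sleeping a little longer each attempt,
    # so a transient failure does not kill the whole run.
    for attempt in range(retries):
        try:
            return pyp.Search(patent_number).as_dataframe()
        except ConnectionError:
            time.sleep(wait * (attempt + 1))
    return None  # give up on this patent after `retries` failed attempts

for j in range(df.shape[1]):
    results = patent_info(df.iloc[0, j])
    time.sleep(2)  # the pause that fixed the earlier (10054) error
```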

katelynstenger commented 5 years ago

Here is the full error message:


```
BadStatusLine                             Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\urllib3\connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    600                 body=body, headers=headers,
--> 601                 chunked=chunked)
    602

~\AppData\Local\Continuum\anaconda3\lib\site-packages\urllib3\connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    386             # otherwise it looks like a programming error was the cause.
--> 387             six.raise_from(e, None)
    388         except (SocketTimeout, BaseSSLError, SocketError) as e:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\urllib3\packages\six.py in raise_from(value, from_value)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\urllib3\connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    382         try:
--> 383             httplib_response = conn.getresponse()
    384         except Exception as e:

~\AppData\Local\Continuum\anaconda3\lib\http\client.py in getresponse(self)
   1330             try:
-> 1331                 response.begin()
   1332             except ConnectionError:

~\AppData\Local\Continuum\anaconda3\lib\http\client.py in begin(self)
    296         while True:
--> 297             version, status, reason = self._read_status()
    298             if status != CONTINUE:

~\AppData\Local\Continuum\anaconda3\lib\http\client.py in _read_status(self)
    278             self._close_conn()
--> 279             raise BadStatusLine(line)
    280

BadStatusLine: Error #2000

During handling of the above exception, another exception occurred:

ProtocolError                             Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\requests\adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    439                 retries=self.max_retries,
--> 440                 timeout=timeout
    441             )

~\AppData\Local\Continuum\anaconda3\lib\site-packages\urllib3\connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    638             retries = retries.increment(method, url, error=e, _pool=self,
--> 639                                         _stacktrace=sys.exc_info()[2])
    640             retries.sleep()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\urllib3\util\retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    356             if read is False or not self._is_method_retryable(method):
--> 357                 raise six.reraise(type(error), error, _stacktrace)
    358             elif read is not None:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\urllib3\packages\six.py in reraise(tp, value, tb)
    684             if value.__traceback__ is not tb:
--> 685                 raise value.with_traceback(tb)
    686             raise value

~\AppData\Local\Continuum\anaconda3\lib\site-packages\urllib3\connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    600                 body=body, headers=headers,
--> 601                 chunked=chunked)
    602

~\AppData\Local\Continuum\anaconda3\lib\site-packages\urllib3\connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    386             # otherwise it looks like a programming error was the cause.
--> 387             six.raise_from(e, None)
    388         except (SocketTimeout, BaseSSLError, SocketError) as e:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\urllib3\packages\six.py in raise_from(value, from_value)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\urllib3\connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    382         try:
--> 383             httplib_response = conn.getresponse()
    384         except Exception as e:

~\AppData\Local\Continuum\anaconda3\lib\http\client.py in getresponse(self)
   1330             try:
-> 1331                 response.begin()
   1332             except ConnectionError:

~\AppData\Local\Continuum\anaconda3\lib\http\client.py in begin(self)
    296         while True:
--> 297             version, status, reason = self._read_status()
    298             if status != CONTINUE:

~\AppData\Local\Continuum\anaconda3\lib\http\client.py in _read_status(self)
    278             self._close_conn()
--> 279             raise BadStatusLine(line)
    280

ProtocolError: ('Connection aborted.', BadStatusLine('Error #2000\n',))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
<ipython-input> in <module>()
     13 for j in range(df.shape[1]):
     14     for i in range(1):
---> 15         Patent_info(df.iloc[i, j])
     16         time.sleep(2)
     17

<ipython-input> in Patent_info(patent_number)
      5
      6     try:
----> 7         results = pyp.Search(patent_number).as_dataframe()
      8
      9     # reindex

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pypatent\__init__.py in __init__(self, string, results_limit, get_patent_details, pn, isd, ttl, abst, aclm, spec, ccl, cpc, cpcl, icl, apn, apd, apt, govt, fmid, parn, rlap, rlfd, prir, prad, pct, ptad, pt3d, pppd, reis, rpaf, afff, afft, in_, ic, is_, icn, aanm, aaci, aast, aaco, aaat, lrep, an, ac, as_, acn, exp, exa, ref, fref, oref, cofc, reex, ptab, sec, ilrn, ilrd, ilpd, ilfd)
    260         while (num_results_fetched < total_results) and (num_results_fetched < results_limit):
    261             this_url = url_pre + str(list_num) + url_post
--> 262             thispatents = self.get_patents_from_results_url(this_url)
    263             patents.extend(thispatents)
    264

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pypatent\__init__.py in get_patents_from_results_url(self, url, limit)
    273
    274     def get_patents_from_results_url(self, url: str, limit: int = None) -> list:
--> 275         r = requests.get(url, headers=Constants.request_header).text
    276         s = BeautifulSoup(r, 'html.parser')
    277         patents_raw = s.find_all('a', href=re.compile('netacgi'))

~\AppData\Local\Continuum\anaconda3\lib\site-packages\requests\api.py in get(url, params, **kwargs)
     70
     71     kwargs.setdefault('allow_redirects', True)
---> 72     return request('get', url, params=params, **kwargs)
     73
     74

~\AppData\Local\Continuum\anaconda3\lib\site-packages\requests\api.py in request(method, url, **kwargs)
     56     # cases, and look like a memory leak in others.
     57     with sessions.Session() as session:
---> 58         return session.request(method=method, url=url, **kwargs)
     59
     60

~\AppData\Local\Continuum\anaconda3\lib\site-packages\requests\sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    506         }
    507         send_kwargs.update(settings)
--> 508         resp = self.send(prep, **send_kwargs)
    509
    510         return resp

~\AppData\Local\Continuum\anaconda3\lib\site-packages\requests\sessions.py in send(self, request, **kwargs)
    616
    617         # Send the request
--> 618         r = adapter.send(request, **kwargs)
    619
    620         # Total elapsed time of the request (approximately)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\requests\adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    488
    489         except (ProtocolError, socket.error) as err:
--> 490             raise ConnectionError(err, request=request)
    491
    492         except MaxRetryError as e:

ConnectionError: ('Connection aborted.', BadStatusLine('Error #2000\n',))
```
daneads commented 5 years ago

@katelynstenger Thanks for submitting this issue; I get connection errors too. My guess is they've introduced rate limiting on the site. I'll take a look, introduce time.sleep(), and troubleshoot from there.

jhc154 commented 4 years ago

@daneads @katelynstenger I am wondering whether time.sleep() was ever introduced as part of pypatent. When I got started with this library, I encountered issues when attempting to retrieve large amounts of data. I did not dig too deep, but I suspect rate limiting may still be an issue.

I first noticed that using selenium really helped; then I found this page and thought the time.sleep() idea was worth trying.
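
For reference, the selenium route goes through pypatent's web_connection argument, roughly like this (a sketch based on the library's README as I remember it; it assumes chromedriver is installed locally):

```python
import pypatent
from selenium import webdriver

# A selenium-driven browser fetches each page like a normal user, which
# seems to trip the server's limits less often than bare requests calls.
driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
conn = pypatent.WebConnection(use_selenium=True, selenium_driver=driver)

results = pypatent.Search('crispr', results_limit=5,
                          get_patent_details=True,
                          web_connection=conn).as_dataframe()
```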

I tested introducing sleep(0.5) on line 328, right after patents.append(p) inside get_patents_from_results_url, and adding from time import sleep on line 8. The results seem promising with sleep() added; however, I'm not sure this is the best place to call it. There is an obvious time tradeoff (the run takes longer), but the search keeps working, presumably because it is easier on the server. A non-invasive variant of the same idea is sketched below.
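
For anyone who would rather not edit the installed package, the same throttling can be approximated by wrapping the method from the outside (a sketch, not tested against every pypatent version; note it sleeps once per results-page fetch rather than once per appended patent, so it is a coarser version of the edit described above):

```python
import time
import pypatent

# Keep a reference to the original method, then install a throttled wrapper.
# The method name is taken from the traceback earlier in this thread.
_original_fetch = pypatent.Search.get_patents_from_results_url

def _throttled_fetch(self, url, limit=None):
    time.sleep(0.5)  # pause before each results-page request to go easy on the server
    return _original_fetch(self, url, limit)

pypatent.Search.get_patents_from_results_url = _throttled_fetch
```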

Testing time.sleep() performance (the timing sketch after this list shows roughly how the numbers were collected):

  1. Without time.sleep(), run pypatent.Search('crispr', results_limit=test, get_patent_details=True, web_connection=conn) at varying results_limit values (test = 500, 200, and 5).

    • results_limit = 500: failed with server errors ("error 2000 ... process terminated abnormally ... document may be truncated"); the browser did not appear to recover, so I interrupted the kernel.
  2. With the edits introducing time.sleep(0.5), run the same searches.

    • results_limit = 500: CPU times: user 31.3 s, sys: 318 ms, total: 31.7 s; Wall time: 16min 41s. Some of the same error 2000 messages appeared, but the search was able to keep running. (Also observed some empty pages.)
  3. Environment: Mac, Chrome, Jupyter Notebook, Python 3.7.3.
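
The wall-time figures in item 2 came from timing the whole search call, roughly like this (a sketch: the WebConnection/Chrome setup mirrors item 3 and follows the README as I recall it; in Jupyter, the %time magic prints the CPU/Wall lines quoted above):

```python
import time
import pypatent
from selenium import webdriver

# Selenium-backed connection, matching the Mac/Chrome setup in item 3.
conn = pypatent.WebConnection(use_selenium=True,
                              selenium_driver=webdriver.Chrome())

for test in (5, 200, 500):
    start = time.perf_counter()
    pypatent.Search('crispr', results_limit=test,
                    get_patent_details=True,
                    web_connection=conn)
    elapsed = time.perf_counter() - start
    print('results_limit=%d: %.0f s wall time' % (test, elapsed))
```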

btw, thank you so much for this library!