daijro / hrequests

🚀 Web scraping for humans
https://daijro.gitbook.io/hrequests/
MIT License

hrequests


Hrequests (human requests) is a simple, configurable, feature-rich replacement for the Python requests library.

✨ Features

💻 Browser crawling

⚡ More


Installation

Install via pip:

pip install -U hrequests[all]
python -m hrequests install
Or, install without headless browsing support by omitting the `[all]` option:

pip install -U hrequests

Documentation

For the latest stable hrequests documentation, check the Gitbook page.

  1. Simple Usage
  2. Sessions
  3. Concurrent & Lazy Requests
  4. HTML Parsing
  5. Browser Automation

Simple Usage

Here is an example of a simple get request:

>>> import hrequests
>>> resp = hrequests.get('https://www.google.com/')

Requests are sent through bogdanfinn's tls-client to spoof the TLS client fingerprint. This is done automatically, and is completely transparent to the user.

Other request methods include post, put, delete, head, options, and patch.
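For instance, a POST request with a JSON body follows the same pattern as get; the httpbin endpoint below is only an illustrative target:

>>> resp = hrequests.post('https://httpbin.org/post', json={'foo': 'bar'})
>>> resp.ok
True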

The Response object is a near 1:1 replica of the requests.Response object, with some additional attributes.

Parameters:

    url (Union[str, Iterable[str]]): URL or list of URLs to request.
    data (Union[str, bytes, bytearray, dict], optional): Data to send with the request. Defaults to None.
    files (Dict[str, Union[BufferedReader, tuple]], optional): Files to send with the request. Defaults to None.
    headers (dict, optional): Dictionary of HTTP headers to send with the request. Defaults to None.
    params (dict, optional): Dictionary of URL parameters to append to the URL. Defaults to None.
    cookies (Union[RequestsCookieJar, dict, list], optional): Dict or CookieJar to send. Defaults to None.
    json (dict, optional): JSON to send in the request body. Defaults to None.
    allow_redirects (bool, optional): Allow the request to redirect. Defaults to True.
    history (bool, optional): Remember request history. Defaults to False.
    verify (bool, optional): Verify the server's TLS certificate. Defaults to True.
    timeout (float, optional): Timeout in seconds. Defaults to 30.
    proxy (str, optional): Proxy URL. Defaults to None.
    nohup (bool, optional): Run the request in the background. Defaults to False.

Returns:

    hrequests.response.Response: Response object

Properties

Get the response url:

>>> resp.url: str
'https://www.google.com/'

Check if the request was successful:

>>> resp.status_code: int
200
>>> resp.reason: str
'OK'
>>> resp.ok: bool
True
>>> bool(resp)
True

Getting the response body:

>>> resp.text: str
'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta charset="UTF-8"><meta content="origin" name="referrer"><m...'
>>> resp.content: bytes
b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta charset="UTF-8"><meta content="origin" name="referrer"><m...'
>>> resp.encoding: str
'UTF-8'

Parse the response body as JSON:

>>> resp.json(): Union[dict, list]
{'somedata': True}

Get the elapsed time of the request:

>>> resp.elapsed: datetime.timedelta
datetime.timedelta(microseconds=77768)

Get the response cookies:

>>> resp.cookies: RequestsCookieJar
<RequestsCookieJar[Cookie(version=0, name='1P_JAR', value='2023-07-05-20', port=None, port_specified=False, domain='.google.com', domain_specified=True...

Get the response headers:

>>> resp.headers: CaseInsensitiveDict
{'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000', 'Cache-Control': 'private, max-age=0', 'Content-Encoding': 'br', 'Content-Length': '51288', 'Content-Security-Policy-Report-Only': "object-src 'none';base-uri 'se

Sessions

Creating a new Chrome Session object:

>>> session = hrequests.Session()  # version randomized by default
>>> session = hrequests.Session('chrome', version=120)
Parameters:

    browser (Literal['firefox', 'chrome'], optional): Browser to use. Default is 'chrome'.
    version (int, optional): Version of the browser to use. Browser must be specified. Default is randomized.
    os (Literal['win', 'mac', 'lin'], optional): OS to use in header. Default is randomized.
    headers (dict, optional): Dictionary of HTTP headers to send with the request. Default is generated from `browser` and `os`.
    verify (bool, optional): Verify the server's TLS certificate. Defaults to True.
    timeout (float, optional): Default timeout in seconds. Defaults to 30.
    proxy (str, optional): Proxy URL. Defaults to None.
    cookies (Union[RequestsCookieJar, dict, list], optional): Cookie Jar, or cookie list/dict to send. Defaults to None.
    certificate_pinning (Dict[str, List[str]], optional): Certificate pinning. Defaults to None.
    disable_ipv6 (bool, optional): Disable IPv6. Defaults to False.
    detect_encoding (bool, optional): Detect encoding. Defaults to True.
    ja3_string (str, optional): JA3 string. Defaults to None.
    h2_settings (dict, optional): HTTP/2 settings. Defaults to None.
    additional_decode (str, optional): Decode response body with "gzip" or "br". Defaults to None.
    pseudo_header_order (list, optional): Pseudo header order. Defaults to None.
    priority_frames (list, optional): Priority frames. Defaults to None.
    header_order (list, optional): Header order. Defaults to None.
    force_http1 (bool, optional): Force HTTP/1. Defaults to False.
    catch_panics (bool, optional): Catch panics. Defaults to False.
    debug (bool, optional): Debug mode. Defaults to False.

Browsers can also be created through the firefox and chrome shortcuts:

>>> session = hrequests.firefox.Session()
>>> session = hrequests.chrome.Session()
Parameters:

    version (int, optional): Version of the browser to use. Default is randomized.
    os (Literal['win', 'mac', 'lin'], optional): OS to use in header. Default is randomized.
    headers (dict, optional): Dictionary of HTTP headers to send with the request. Default is generated from `browser` and `os`.
    verify (bool, optional): Verify the server's TLS certificate. Defaults to True.
    timeout (float, optional): Default timeout in seconds. Defaults to 30.
    proxy (str, optional): Proxy URL. Defaults to None.
    cookies (Union[RequestsCookieJar, dict, list], optional): Cookie Jar, or cookie list/dict to send. Defaults to None.
    certificate_pinning (Dict[str, List[str]], optional): Certificate pinning. Defaults to None.
    disable_ipv6 (bool, optional): Disable IPv6. Defaults to False.
    detect_encoding (bool, optional): Detect encoding. Defaults to True.
    ja3_string (str, optional): JA3 string. Defaults to None.
    h2_settings (dict, optional): HTTP/2 settings. Defaults to None.
    additional_decode (str, optional): Decode response body with "gzip" or "br". Defaults to None.
    pseudo_header_order (list, optional): Pseudo header order. Defaults to None.
    priority_frames (list, optional): Priority frames. Defaults to None.
    header_order (list, optional): Header order. Defaults to None.
    force_http1 (bool, optional): Force HTTP/1. Defaults to False.
    catch_panics (bool, optional): Catch panics. Defaults to False.
    debug (bool, optional): Debug mode. Defaults to False.

os can be 'win', 'mac', or 'lin'. Default is randomized.

>>> session = hrequests.chrome.Session(os='mac')

This will automatically generate headers based on the browser name and OS:

>>> session.headers
{'Accept': '*/*', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4; rv:60.2.2) Gecko/20100101 Firefox/60.2.2', 'Accept-Encoding': 'gzip, deflate, br', 'Pragma': 'no-cache'}
Why is the browser version in the header different from the TLS browser version? Website bot detection systems typically do not correlate the TLS fingerprint browser version with the browser header. By adding more randomization to our headers, we can make our requests appear to come from a larger number of clients, which makes it harder for websites to identify and block our requests based on a consistent browser version.

Properties

Here is a simple get request. This is a wrapper around hrequests.get. The only difference is that the session cookies are updated with each request. Creating a session is recommended when making multiple requests to the same domain.

>>> resp = session.get('https://www.google.com/')

Session cookies update with each request:

>>> session.cookies: RequestsCookieJar
<RequestsCookieJar[Cookie(version=0, name='1P_JAR', value='2023-07-05-20', port=None, port_specified=False, domain='.google.com', domain_specified=True...

Regenerate headers for a different OS:

>>> session.os = 'win'
>>> session.headers: CaseInsensitiveDict
{'Accept': '*/*', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0.3) Gecko/20100101 Firefox/66.0.3', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'en-US;q=0.5,en;q=0.3', 'Cache-Control': 'max-age=0', 'DNT': '1', 'Upgrade-Insecure-Requests': '1', 'Pragma': 'no-cache'}

Closing Sessions

Sessions can also be closed to free memory:

>>> session.close()

Alternatively, sessions can be used as context managers:

with hrequests.Session() as session:
    resp = session.get('https://www.google.com/')
    print(resp)

Concurrent & Lazy Requests

Nohup Requests

Similar to Unix's nohup command, nohup requests are sent in the background.

Adding the nohup=True keyword argument will return a LazyTLSRequest object. The request is sent immediately, but the response is not waited on until an attribute of the response is accessed.

resp1 = hrequests.get('https://www.google.com/', nohup=True)
resp2 = hrequests.get('https://www.google.com/', nohup=True)

resp1 and resp2 are sent concurrently. They will never pause the current thread, unless an attribute of the response is accessed:

print('Resp 1:', resp1.reason)  # will wait for resp1 to finish, if it hasn't already
print('Resp 2:', resp2.reason)  # will wait for resp2 to finish, if it hasn't already

This is useful for sending requests in the background that aren't needed until later.

Note: In nohup, a new thread is created for each request. For larger scale concurrency, please consider the following:

Easy Concurrency

You can pass an array/iterator of links to the request methods to send them concurrently. This wraps around hrequests.map:

>>> hrequests.get(['https://google.com/', 'https://github.com/'])
(<Response [200]>, <Response [200]>)

This also works with nohup:

>>> resps = hrequests.get(['https://google.com/', 'https://github.com/'], nohup=True)
>>> resps
(<LazyResponse[Pending]>, <LazyResponse[Pending]>)
>>> # Sometime later...
>>> resps
(<Response [200]>, <Response [200]>)

Grequests-style Concurrency

The methods async_get, async_post, etc. will create an unsent request. This leverages gevent, making it blazing fast.

Parameters:

    url (str): URL to send the request to.
    data (Union[str, bytes, bytearray, dict], optional): Data to send with the request. Defaults to None.
    files (Dict[str, Union[BufferedReader, tuple]], optional): Files to send with the request. Defaults to None.
    headers (dict, optional): Dictionary of HTTP headers to send with the request. Defaults to None.
    params (dict, optional): Dictionary of URL parameters to append to the URL. Defaults to None.
    cookies (Union[RequestsCookieJar, dict, list], optional): Dict or CookieJar to send. Defaults to None.
    json (dict, optional): JSON to send in the request body. Defaults to None.
    allow_redirects (bool, optional): Allow the request to redirect. Defaults to True.
    history (bool, optional): Remember request history. Defaults to False.
    verify (bool, optional): Verify the server's TLS certificate. Defaults to True.
    timeout (float, optional): Timeout in seconds. Defaults to 30.
    proxy (str, optional): Proxy URL. Defaults to None.

Returns:

    hrequests.response.Response: Response object

Async requests are evaluated on hrequests.map, hrequests.imap, or hrequests.imap_enum.

This functionality is similar to grequests. Unlike grequests, monkey patching is not required because hrequests does not rely on the standard Python SSL library.

Create a set of unsent Requests:

>>> reqs = [
...     hrequests.async_get('https://www.google.com/', browser='firefox'),
...     hrequests.async_get('https://www.duckduckgo.com/'),
...     hrequests.async_get('https://www.yahoo.com/')
... ]

map

Send them all at the same time using map:

>>> hrequests.map(reqs, size=3)
[<Response [200]>, <Response [200]>, <Response [200]>]
Concurrently converts a list of Requests to Responses.

Parameters:

    requests - a collection of Request objects.
    size - Specifies the number of requests to make at a time. If None, no throttling occurs.
    exception_handler - Callback function, called when an exception occurs. Params: Request, Exception.
    timeout - Gevent joinall timeout in seconds. (Note: unrelated to the requests timeout.)

Returns:

    A list of Response objects.

imap

imap returns a generator that yields responses as they come in:

>>> for resp in hrequests.imap(reqs, size=3):
...    print(resp)
<Response [200]>
<Response [200]>
<Response [200]>
Concurrently converts a generator object of Requests to a generator of Responses.

Parameters:

    requests - a generator or sequence of Request objects.
    size - Specifies the number of requests to make at a time. Default is 2.
    exception_handler - Callback function, called when an exception occurs. Params: Request, Exception.

Yields:

    Response objects.

imap_enum

imap_enum returns a generator that yields a tuple of (index, response) as they come in. The index is the index of the request in the original list:

>>> for index, resp in hrequests.imap_enum(reqs, size=3):
...     print(index, resp)
(1, <Response [200]>)
(0, <Response [200]>)
(2, <Response [200]>)
Like imap, but yields tuples of the original request index and the response object.

Unlike imap, failed results and responses from exception handlers that return None are not ignored. Instead, a tuple of (index, None) is yielded. Responses are still in arbitrary order.

Parameters:

    requests - a sequence of Request objects.
    size - Specifies the number of requests to make at a time. Default is 2.
    exception_handler - Callback function, called when an exception occurs. Params: Request, Exception.

Yields:

    (index, Response) tuples.

Exception Handling

To handle timeouts or any other exception raised while a request is being made, you can add an optional exception handler that will be called with the request and the exception inside the main thread.

>>> def exception_handler(request, exception):
...    return f'Response failed: {exception}'

>>> bad_reqs = [
...     hrequests.async_get('http://httpbin.org/delay/5', timeout=1),
...     hrequests.async_get('http://fakedomain/'),
...     hrequests.async_get('http://example.com/'),
... ]
>>> hrequests.map(bad_reqs, size=3, exception_handler=exception_handler)
['Response failed: Connection error', 'Response failed: Connection error', <Response [200]>]

The value returned by the exception handler will be used in place of the response in the result list.

If an exception handler isn't specified, the default yield type is hrequests.FailedResponse.
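
A minimal sketch of separating those failures out when no handler is given, assuming hrequests.map returns hrequests.FailedResponse objects in place of failed responses:

>>> resps = hrequests.map(bad_reqs, size=3)  # no exception_handler passed
>>> ok = [r for r in resps if not isinstance(r, hrequests.FailedResponse)]
>>> failed = [r for r in resps if isinstance(r, hrequests.FailedResponse)]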


HTML Parsing

HTML scraping is based on selectolax, which is over 25x faster than bs4. This functionality is inspired by requests-html.

| Library        | Time (1e5 trials) |
| -------------- | ----------------- |
| BeautifulSoup4 | 52.6              |
| PyQuery        | 7.5               |
| selectolax     | 1.9               |

The HTML parser can be accessed through the html attribute of the response object:

>>> resp = session.get('https://python.org/')
>>> resp.html
<HTML url='https://www.python.org/'>

Parsing page

Grab a list of all links on the page, as-is (anchors excluded):

>>> resp.html.links
{'//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/',...

Grab a list of all links on the page, in absolute form (anchors excluded):

>>> resp.html.absolute_links
{'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/', 'http://feedproxy.g...

Search for text on the page:

>>> resp.html.search('Python is a {} language')[0]
programming

Selecting elements

Select an element using a CSS Selector:

>>> about = resp.html.find('#about')
Given a CSS Selector, returns a list of Element objects or a single one.

Parameters:

    selector: CSS Selector to use.
    clean: Whether or not to sanitize the found HTML of ...
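
The returned Element can then be inspected further. A minimal sketch, assuming the Element API mirrors requests-html (the text, attrs, and nested find members below are assumptions, not confirmed by this README):

>>> about.text       # visible text of the element (assumed attribute)
>>> about.attrs      # element attributes as a dict (assumed attribute)
>>> about.find('a')  # nested CSS selector search within the element (assumed method)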