hrequests
Hrequests (human requests) is a simple, configurable, feature-rich replacement for the Python requests library.
✨ Features
- Seamless transition between HTTP and headless browsing 💻
- Integrated fast HTML parser 🚀
- High performance network concurrency with goroutines & gevent 🚀
- Replication of browser TLS fingerprints 🚀
- JavaScript rendering 🚀
- Supports HTTP/2 🚀
- Realistic browser header generation 🚀
- JSON serializing up to 10x faster than the standard library 🚀
💻 Browser crawling
- Simple & uncomplicated browser automation
- Human-like cursor movement and typing
- Chrome and Firefox extension support
- Full page screenshots
- Proxy support
- Headless and headful support
- No CORS restrictions
- Anti-detect browsing based on Vinyzu's Botright
⚡ More
- High performance ✨
- Minimal dependence on the Python standard library
- HTTP backend written in Go
- Automatic gzip & brotli decode
- Written with type safety
- 100% threadsafe ❤️
Installation
Install via pip:
pip install -U hrequests[all]
python -m hrequests install
Or, to install without headless browsing support, omit the `[all]` option:
```bash
pip install -U hrequests
```
Documentation
For the latest stable hrequests documentation, check the Gitbook page.
- Simple Usage
- Sessions
- Concurrent & Lazy Requests
- HTML Parsing
- Browser Automation
Simple Usage
Here is an example of a simple `get` request:
>>> resp = hrequests.get('https://www.google.com/')
Requests are sent through bogdanfinn's tls-client to spoof the TLS client fingerprint. This is done automatically, and is completely transparent to the user.
Other request methods include `post`, `put`, `delete`, `head`, `options`, and `patch`.
The `Response` object is a near 1:1 replica of the `requests.Response` object, with some additional attributes.
Parameters
```
Parameters:
url (Union[str, Iterable[str]]): URL or list of URLs to request.
data (Union[str, bytes, bytearray, dict], optional): Data to send to request. Defaults to None.
files (Dict[str, Union[BufferedReader, tuple]], optional): Files to send with the request. Defaults to None.
headers (dict, optional): Dictionary of HTTP headers to send with the request. Defaults to None.
params (dict, optional): Dictionary of URL parameters to append to the URL. Defaults to None.
cookies (Union[RequestsCookieJar, dict, list], optional): Dict or CookieJar to send. Defaults to None.
json (dict, optional): Json to send in the request body. Defaults to None.
allow_redirects (bool, optional): Allow request to redirect. Defaults to True.
history (bool, optional): Remember request history. Defaults to False.
verify (bool, optional): Verify the server's TLS certificate. Defaults to True.
timeout (float, optional): Timeout in seconds. Defaults to 30.
proxy (str, optional): Proxy URL. Defaults to None.
nohup (bool, optional): Run the request in the background. Defaults to False.
Returns:
hrequests.response.Response: Response object
```
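As a minimal sketch of how these keyword arguments combine, the helper below wraps a GET call with `params` and `timeout` and returns the parsed JSON only when `resp.ok` is true. The name `fetch_json` and the `client` argument are assumptions made for illustration: `client` is meant to be the `hrequests` module itself, a session, or any object with a compatible `get` method.

```python
def fetch_json(client, url, query=None, timeout=10.0):
    """GET `url` and return the decoded JSON body, or None on a non-OK status.

    `client` is a hypothetical parameter: pass the `hrequests` module,
    a Session, or any object exposing a compatible `get` method.
    """
    # `params` and `timeout` are keyword arguments listed in the table above.
    resp = client.get(url, params=query, timeout=timeout)
    if not resp.ok:
        return None
    return resp.json()
```

With hrequests installed, a call such as `fetch_json(hrequests, 'https://httpbin.org/get', {'q': 'test'})` should append `q=test` to the URL before sending.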
Properties
Get the response url:
>>> resp.url: str
'https://www.google.com/'
Check if the request was successful:
>>> resp.status_code: int
200
>>> resp.reason: str
'OK'
>>> resp.ok: bool
True
>>> bool(resp)
True
Getting the response body:
>>> resp.text: str
'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta charset="UTF-8"><meta content="origin" name="referrer"><m...'
>>> resp.content: bytes
b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta charset="UTF-8"><meta content="origin" name="referrer"><m...'
>>> resp.encoding: str
'UTF-8'
Parse the response body as JSON:
>>> resp.json(): Union[dict, list]
{'somedata': True}
Get the elapsed time of the request:
>>> resp.elapsed: datetime.timedelta
datetime.timedelta(microseconds=77768)
Get the response cookies:
>>> resp.cookies: RequestsCookieJar
<RequestsCookieJar[Cookie(version=0, name='1P_JAR', value='2023-07-05-20', port=None, port_specified=False, domain='.google.com', domain_specified=True...
Get the response headers:
>>> resp.headers: CaseInsensitiveDict
{'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000', 'Cache-Control': 'private, max-age=0', 'Content-Encoding': 'br', 'Content-Length': '51288', 'Content-Security-Policy-Report-Only': "object-src 'none';base-uri 'se
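The properties above can be gathered into a single record, for example for logging. The following is only a sketch: `summarize_response` is a hypothetical helper, and it reads nothing beyond the attributes shown in this section.

```python
def summarize_response(resp):
    """Collect a few of the documented Response properties into a plain dict."""
    return {
        'url': resp.url,
        'status': resp.status_code,
        'reason': resp.reason,
        'ok': resp.ok,
        # `elapsed` is a datetime.timedelta; convert it to milliseconds.
        'elapsed_ms': resp.elapsed.total_seconds() * 1000,
        # .get() returns None instead of raising if the header is absent.
        'content_type': resp.headers.get('Content-Type'),
    }
```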
Sessions
Creating a new Chrome Session object:
>>> session = hrequests.Session() # version randomized by default
>>> session = hrequests.Session('chrome', version=120)
Parameters
```
Parameters:
browser (Literal['firefox', 'chrome'], optional): Browser to use. Default is 'chrome'.
version (int, optional): Version of the browser to use. Browser must be specified. Default is randomized.
os (Literal['win', 'mac', 'lin'], optional): OS to use in header. Default is randomized.
headers (dict, optional): Dictionary of HTTP headers to send with the request. Default is generated from `browser` and `os`.
verify (bool, optional): Verify the server's TLS certificate. Defaults to True.
timeout (float, optional): Default timeout in seconds. Defaults to 30.
proxy (str, optional): Proxy URL. Defaults to None.
cookies (Union[RequestsCookieJar, dict, list], optional): Cookie Jar, or cookie list/dict to send. Defaults to None.
certificate_pinning (Dict[str, List[str]], optional): Certificate pinning. Defaults to None.
disable_ipv6 (bool, optional): Disable IPv6. Defaults to False.
detect_encoding (bool, optional): Detect encoding. Defaults to True.
ja3_string (str, optional): JA3 string. Defaults to None.
h2_settings (dict, optional): HTTP/2 settings. Defaults to None.
additional_decode (str, optional): Decode response body with "gzip" or "br". Defaults to None.
pseudo_header_order (list, optional): Pseudo header order. Defaults to None.
priority_frames (list, optional): Priority frames. Defaults to None.
header_order (list, optional): Header order. Defaults to None.
force_http1 (bool, optional): Force HTTP/1. Defaults to False.
catch_panics (bool, optional): Catch panics. Defaults to False.
debug (bool, optional): Debug mode. Defaults to False.
```
Browsers can also be created through the `firefox` and `chrome` shortcuts:
>>> session = hrequests.firefox.Session()
>>> session = hrequests.chrome.Session()
Parameters
```
Parameters:
version (int, optional): Version of the browser to use. Browser must be specified. Default is randomized.
os (Literal['win', 'mac', 'lin'], optional): OS to use in header. Default is randomized.
headers (dict, optional): Dictionary of HTTP headers to send with the request. Default is generated from `browser` and `os`.
verify (bool, optional): Verify the server's TLS certificate. Defaults to True.
timeout (float, optional): Default timeout in seconds. Defaults to 30.
proxy (str, optional): Proxy URL. Defaults to None.
cookies (Union[RequestsCookieJar, dict, list], optional): Cookie Jar, or cookie list/dict to send. Defaults to None.
certificate_pinning (Dict[str, List[str]], optional): Certificate pinning. Defaults to None.
disable_ipv6 (bool, optional): Disable IPv6. Defaults to False.
detect_encoding (bool, optional): Detect encoding. Defaults to True.
ja3_string (str, optional): JA3 string. Defaults to None.
h2_settings (dict, optional): HTTP/2 settings. Defaults to None.
additional_decode (str, optional): Decode response body with "gzip" or "br". Defaults to None.
pseudo_header_order (list, optional): Pseudo header order. Defaults to None.
priority_frames (list, optional): Priority frames. Defaults to None.
header_order (list, optional): Header order. Defaults to None.
force_http1 (bool, optional): Force HTTP/1. Defaults to False.
catch_panics (bool, optional): Catch panics. Defaults to False.
debug (bool, optional): Debug mode. Defaults to False.
```
`os` can be `'win'`, `'mac'`, or `'lin'`. Default is randomized.
>>> session = hrequests.chrome.Session(os='mac')
This will automatically generate headers based on the browser name and OS:
>>> session.headers
{'Accept': '*/*', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4; rv:60.2.2) Gecko/20100101 Firefox/60.2.2', 'Accept-Encoding': 'gzip, deflate, br', 'Pragma': 'no-cache'}
Why is the browser version in the header different than the TLS browser version?
Website bot detection systems typically do not correlate the TLS fingerprint browser version with the browser header.
By adding more randomization to our headers, we can make our requests appear to be coming from a larger number of clients. This makes it harder for websites to identify and block our requests based on a consistent browser version.
Properties
Here is a simple get request. This is a wrapper around `hrequests.get`. The only difference is that the session cookies are updated with each request. Creating a session is recommended when making multiple requests to the same domain.
>>> resp = session.get('https://www.google.com/')
Session cookies update with each request:
>>> session.cookies: RequestsCookieJar
<RequestsCookieJar[Cookie(version=0, name='1P_JAR', value='2023-07-05-20', port=None, port_specified=False, domain='.google.com', domain_specified=True...
Regenerate headers for a different OS:
>>> session.os = 'win'
>>> session.headers: CaseInsensitiveDict
{'Accept': '*/*', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0.3) Gecko/20100101 Firefox/66.0.3', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'en-US;q=0.5,en;q=0.3', 'Cache-Control': 'max-age=0', 'DNT': '1', 'Upgrade-Insecure-Requests': '1', 'Pragma': 'no-cache'}
Closing Sessions
Sessions can also be closed to free memory:
>>> session.close()
Alternatively, sessions can be used as context managers:
with hrequests.Session() as session:
    resp = session.get('https://www.google.com/')
    print(resp)
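The context-manager form is convenient when several requests share one session. The sketch below assumes a hypothetical helper, `fetch_statuses`, whose `session_factory` argument stands in for `hrequests.Session` (or any callable that returns a context manager with a `get` method).

```python
def fetch_statuses(session_factory, urls):
    """Open one session, request each URL in turn, and return {url: status_code}.

    `session_factory` is a hypothetical parameter; `hrequests.Session` is the
    intended value.
    """
    statuses = {}
    with session_factory() as session:
        for url in urls:
            # Cookies set by earlier responses are reused by later requests.
            statuses[url] = session.get(url).status_code
    # The session has been released by the time the `with` block exits.
    return statuses
```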
Concurrent & Lazy Requests
Nohup Requests
Similar to Unix's nohup command, `nohup` requests are sent in the background.
Adding the `nohup=True` keyword argument will return a `LazyTLSRequest` object. This will send the request immediately, but doesn't wait for the response to be ready until an attribute of the response is accessed.
resp1 = hrequests.get('https://www.google.com/', nohup=True)
resp2 = hrequests.get('https://www.google.com/', nohup=True)
`resp1` and `resp2` are sent concurrently. They will never pause the current thread, unless an attribute of the response is accessed:
print('Resp 1:', resp1.reason) # will wait for resp1 to finish, if it hasn't already
print('Resp 2:', resp2.reason) # will wait for resp2 to finish, if it hasn't already
This is useful for sending requests in the background that aren't needed until later.
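To make the "send now, block on first access" behaviour concrete, here is a standard-library illustration of the pattern. It is not hrequests' implementation; `LazyResult` is a hypothetical class that runs any callable in a background thread and waits only when an attribute of the result is read.

```python
import threading


class LazyResult:
    """Start `func` in a background thread; block only on attribute access."""

    def __init__(self, func, *args, **kwargs):
        self._done = threading.Event()
        self._value = None
        self._error = None
        self._thread = threading.Thread(
            target=self._run, args=(func, args, kwargs), daemon=True
        )
        self._thread.start()  # the work begins immediately

    def _run(self, func, args, kwargs):
        try:
            self._value = func(*args, **kwargs)
        except Exception as exc:  # keep the error and re-raise it on access
            self._error = exc
        finally:
            self._done.set()

    def __getattr__(self, name):
        # Only reached for names not found on the wrapper itself,
        # i.e. attributes that belong to the eventual result.
        if name.startswith('_'):
            raise AttributeError(name)
        self._done.wait()  # pause the caller until the result is ready
        if self._error is not None:
            raise self._error
        return getattr(self._value, name)
```

Creating two `LazyResult` objects back to back starts both jobs at once; reading an attribute such as `.reason` on either one waits only for that job.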
Note: In `nohup`, a new thread is created for each request. For larger scale concurrency, please consider the following:
Easy Concurrency
You can pass an array/iterator of links to the request methods to send them concurrently. This wraps around `hrequests.map`:
>>> hrequests.get(['https://google.com/', 'https://github.com/'])
(<Response [200]>, <Response [200]>)
This also works with `nohup`:
>>> resps = hrequests.get(['https://google.com/', 'https://github.com/'], nohup=True)
>>> resps
(<LazyResponse[Pending]>, <LazyResponse[Pending]>)
>>> # Sometime later...
>>> resps
(<Response [200]>, <Response [200]>)
Grequests-style Concurrency
The methods `async_get`, `async_post`, etc. will create an unsent request. This leverages gevent, making it blazing fast.
Parameters
```
Parameters:
url (str): URL to send request to
data (Union[str, bytes, bytearray, dict], optional): Data to send to request. Defaults to None.
files (Dict[str, Union[BufferedReader, tuple]], optional): Files to send with the request. Defaults to None.
headers (dict, optional): Dictionary of HTTP headers to send with the request. Defaults to None.
params (dict, optional): Dictionary of URL parameters to append to the URL. Defaults to None.
cookies (Union[RequestsCookieJar, dict, list], optional): Dict or CookieJar to send. Defaults to None.
json (dict, optional): Json to send in the request body. Defaults to None.
allow_redirects (bool, optional): Allow request to redirect. Defaults to True.
history (bool, optional): Remember request history. Defaults to False.
verify (bool, optional): Verify the server's TLS certificate. Defaults to True.
timeout (float, optional): Timeout in seconds. Defaults to 30.
proxy (str, optional): Proxy URL. Defaults to None.
Returns:
hrequests.response.Response: Response object
```
Async requests are evaluated on `hrequests.map`, `hrequests.imap`, or `hrequests.imap_enum`.
This functionality is similar to grequests. Unlike grequests, monkey patching is not required because this does not rely on the standard Python SSL library.
Create a set of unsent Requests:
>>> reqs = [
... hrequests.async_get('https://www.google.com/', browser='firefox'),
... hrequests.async_get('https://www.duckduckgo.com/'),
... hrequests.async_get('https://www.yahoo.com/')
... ]
map
Send them all at the same time using map:
>>> hrequests.map(reqs, size=3)
[<Response [200]>, <Response [200]>, <Response [200]>]
Parameters
```
Concurrently converts a list of Requests to Responses.
Parameters:
requests - a collection of Request objects.
size - Specifies the number of requests to make at a time. If None, no throttling occurs.
exception_handler - Callback function, called when exception occurred. Params: Request, Exception
timeout - Gevent joinall timeout in seconds. (Note: unrelated to requests timeout)
Returns:
A list of Response objects.
```
imap
`imap` returns a generator that yields responses as they come in:
>>> for resp in hrequests.imap(reqs, size=3):
... print(resp)
<Response [200]>
<Response [200]>
<Response [200]>
Parameters
```
Concurrently converts a generator object of Requests to a generator of Responses.
Parameters:
requests - a generator or sequence of Request objects.
size - Specifies the number of requests to make at a time. default is 2
exception_handler - Callback function, called when exception occurred. Params: Request, Exception
Yields:
Response objects.
```
imap_enum
`imap_enum` returns a generator that yields a tuple of `(index, response)` as they come in. The `index` is the index of the request in the original list:
>>> for index, resp in hrequests.imap_enum(reqs, size=3):
... print(index, resp)
1 <Response [200]>
0 <Response [200]>
2 <Response [200]>
Parameters
```
Like imap, but yields tuple of original request index and response object
Unlike imap, failed results and responses from exception handlers that return None are not ignored. Instead, a
tuple of (index, None) is yielded.
Responses are still in arbitrary order.
Parameters:
requests - a sequence of Request objects.
size - Specifies the number of requests to make at a time. default is 2
exception_handler - Callback function, called when exception occurred. Params: Request, Exception
Yields:
(index, Response) tuples.
```
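Because `imap_enum` yields results in arrival order, the index is what lets a caller restore the original request order. A minimal sketch, assuming a hypothetical helper named `collect_in_order`:

```python
def collect_in_order(indexed_results, total):
    """Rebuild request order from (index, response) pairs arriving in any order.

    Slots whose pair carries None, or that never receive a pair, stay None.
    """
    ordered = [None] * total
    for index, resp in indexed_results:
        ordered[index] = resp
    return ordered
```

With the `reqs` list from above, `collect_in_order(hrequests.imap_enum(reqs, size=3), len(reqs))` would return the responses in the same order as `reqs`.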
Exception Handling
To handle timeouts or any other exception during the connection of the request, you can add an optional exception handler that will be called with the request and exception inside the main thread.
>>> def exception_handler(request, exception):
... return f'Response failed: {exception}'
>>> bad_reqs = [
... hrequests.async_get('http://httpbin.org/delay/5', timeout=1),
... hrequests.async_get('http://fakedomain/'),
... hrequests.async_get('http://example.com/'),
... ]
>>> hrequests.map(bad_reqs, size=3, exception_handler=exception_handler)
['Response failed: Connection error', 'Response failed: Connection error', <Response [200]>]
The value returned by the exception handler will be used in place of the response in the result list.
If an exception handler isn't specified, the default yield type is `hrequests.FailedResponse`.
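The substitution rule described above can be illustrated with the standard library alone. This sketch uses a thread pool rather than gevent and is not how hrequests is implemented; `map_with_handler` is a hypothetical function, and it stores `None` for unhandled failures as a simplification where hrequests would yield a `FailedResponse`.

```python
from concurrent.futures import ThreadPoolExecutor


def map_with_handler(tasks, size=2, exception_handler=None):
    """Run zero-argument callables concurrently; return results in input order.

    When a task raises, the handler's return value takes its place in the list.
    """
    tasks = list(tasks)

    def run(task):
        # Capture the outcome so the worker itself never raises.
        try:
            return True, task()
        except Exception as exc:
            return False, exc

    # `size` caps how many tasks run at the same time.
    with ThreadPoolExecutor(max_workers=size) as pool:
        outcomes = list(pool.map(run, tasks))

    results = []
    for task, (ok, value) in zip(tasks, outcomes):
        if ok:
            results.append(value)
        elif exception_handler is not None:
            # The handler is called here, in the calling thread.
            results.append(exception_handler(task, value))
        else:
            results.append(None)
    return results
```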
HTML Parsing
HTML scraping is based on selectolax, which is over 25x faster than bs4. This functionality is inspired by requests-html.
| Library | Time (1e5 trials) |
|---|---|
| BeautifulSoup4 | 52.6 |
| PyQuery | 7.5 |
| selectolax | 1.9 |
The HTML parser can be accessed through the `html` attribute of the response object:
>>> resp = session.get('https://python.org/')
>>> resp.html
<HTML url='https://www.python.org/'>
Parsing page
Grab a list of all links on the page, as-is (anchors excluded):
>>> resp.html.links
{'//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/',...
Grab a list of all links on the page, in absolute form (anchors excluded):
>>> resp.html.absolute_links
{'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/', 'http://feedproxy.g...
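The difference between the two sets is URL resolution against the page address. The sketch below shows that step with the standard library's `urljoin`; it illustrates the idea and makes no claim about how hrequests computes `absolute_links` internally. `absolutize` is a hypothetical name.

```python
from urllib.parse import urljoin


def absolutize(base_url, links):
    """Resolve each link against `base_url` and return a set of absolute URLs."""
    # Scheme-relative ('//host/path') and root-relative ('/path') links are
    # completed from the base; links that are already absolute pass through.
    return {urljoin(base_url, link) for link in links}
```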
Search for text on the page:
>>> resp.html.search('Python is a {} language')[0]
programming
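The `{}` in the search string acts as a placeholder for the text to capture. As a rough standard-library illustration of that idea (not the parser hrequests uses), the hypothetical `template_search` below turns each `{}` into a regular-expression group:

```python
import re


def template_search(template, text):
    """Return a tuple of the text matched by each '{}' in `template`, or None."""
    # Escape the literal pieces so characters like '.' are matched verbatim,
    # then join them with a non-greedy capture group per placeholder.
    literal_parts = [re.escape(part) for part in template.split('{}')]
    pattern = '(.+?)'.join(literal_parts)
    match = re.search(pattern, text)
    return match.groups() if match else None
```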
Selecting elements
Select an element using a CSS Selector:
>>> about = resp.html.find('#about')
Parameters
```
Given a CSS Selector, returns a list of
:class:`Element` objects or a single one.
Parameters:
selector: CSS Selector to use.
clean: Whether or not to sanitize the found HTML of ``<script>`` and ``<style>`` tags.