abhinavsingh / proxy.py

💫 Ngrok FRP Alternative • ⚡ Fast • 🪶 Lightweight • 0️⃣ Dependency • 🔌 Pluggable • 😈 TLS interception • 🔒 DNS-over-HTTPS • 🔥 Poor Man's VPN • ⏪ Reverse & ⏩ Forward • 👮🏿 "Proxy Server" framework • 🌐 "Web Server" framework • ➵ ➶ ➷ ➠ "PubSub" framework • 👷 "Work" acceptor & executor framework
https://abhinavsingh.com/proxy-py-a-lightweight-single-file-http-proxy-server-in-python/
BSD 3-Clause "New" or "Revised" License

Inconsistent HttpParser fields #1437

Open JJ-Author opened 2 months ago

JJ-Author commented 2 months ago

Describe the bug

The field values of the request object (for a GET request via http and https) differ between plain HTTP and intercepted HTTPS in an inconsistent and potentially incorrect way, which is hindering correct implementation of a custom archive-redirection plugin.

To Reproduce

The issue can be reproduced with:

  1. running poetry install, then poetry shell, and then python proxy/request_proxy.py in the root dir of https://github.com/kuefmz/https-interception-proxypy/
  2. curl -x http://0.0.0.0:8899/ --cacert ca-cert.pem https://www.w3id.org/simulation/ontology/
  3. curl -x http://0.0.0.0:8899/ --cacert ca-cert.pem http://www.w3id.org/simulation/ontology/

Expected behavior

Version information

Additional context

Log from the custom proxy, showing the parsed request object and a selection of its fields for each request (a simplified sketch of such a plugin follows after the log):

2024-08-02 15:51:26,940 - pid:209673 [I] plugins.load:89 - Loaded plugin proxy.http.proxy.HttpProxyPlugin
2024-08-02 15:51:26,940 - pid:209673 [I] plugins.load:89 - Loaded plugin __main__.RequestPlugin
Request _url: www.w3id.org:443
Request.method: b'CONNECT'
Request protocol: None
Request host: b'www.w3id.org'
Request path: None
Request properties: {'state': 6, 'type': 1, 'protocol': None, 'host': b'www.w3id.org', 'port': 443, 'path': None, 'method': b'CONNECT', 'code': None, 'reason': None, 'version': b'HTTP/1.1', 'total_size': 116, 'buffer': None, 'headers': {b'host': (b'Host', b'www.w3id.org:443'), b'user-agent': (b'User-Agent', b'curl/7.81.0'), b'proxy-connection': (b'Proxy-Connection', b'Keep-Alive')}, 'body': None, 'chunk': None, '_url': , '_is_chunked_encoded': False, '_content_expected': False, '_is_https_tunnel': True}

Request _url: /simulation/ontology/
Request.method: b'GET'
Request protocol: None
Request host: None
Request path: b'/simulation/ontology/'
Request properties: {'state': 6, 'type': 1, 'protocol': None, 'host': None, 'port': 80, 'path': b'/simulation/ontology/', 'method': b'GET', 'code': None, 'reason': None, 'version': b'HTTP/1.1', 'total_size': 96, 'buffer': None, 'headers': {b'host': (b'Host', b'www.w3id.org'), b'user-agent': (b'User-Agent', b'curl/7.81.0'), b'accept': (b'Accept', b'*/*')}, 'body': None, 'chunk': None, '_url': , '_is_chunked_encoded': False, '_content_expected': False, '_is_https_tunnel': False}
2024-08-02 15:51:35,421 - pid:209683 [I] server.access_log:388 - 127.0.0.1:59292 - CONNECT www.w3id.org:443 - 683 bytes - 680.55ms

Request _url: http://www.w3id.org/simulation/ontology/
Request.method: b'GET'
Request protocol: None
Request host: b'www.w3id.org'
Request path: b'/simulation/ontology/'
Request properties: {'state': 6, 'type': 1, 'protocol': None, 'host': b'www.w3id.org', 'port': 80, 'path': b'/simulation/ontology/', 'method': b'GET', 'code': None, 'reason': None, 'version': b'HTTP/1.1', 'total_size': 145, 'buffer': None, 'headers': {b'host': (b'Host', b'www.w3id.org'), b'user-agent': (b'User-Agent', b'curl/7.81.0'), b'accept': (b'Accept', b'*/*'), b'proxy-connection': (b'Proxy-Connection', b'Keep-Alive')}, 'body': None, 'chunk': None, '_url': , '_is_chunked_encoded': False, '_content_expected': False, '_is_https_tunnel': False}
2024-08-02 15:51:44,094 - pid:209688 [I] server.access_log:388 - 127.0.0.1:49428 - GET www.w3id.org:80/simulation/ontology/ - 301 Moved Permanently - 573 bytes - 479.09ms
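
For reference, the field dump above is produced by a plugin along these lines (a simplified, hypothetical sketch, not the exact RequestPlugin from the linked repository; it assumes proxy.py's HttpProxyBasePlugin and its handle_client_request hook):

```python
# Simplified, hypothetical sketch of a field-dumping plugin; not the actual
# RequestPlugin from the linked repository.
from typing import Optional

from proxy.http.parser import HttpParser
from proxy.http.proxy import HttpProxyBasePlugin


class DumpRequestFieldsPlugin(HttpProxyBasePlugin):

    def handle_client_request(self, request: HttpParser) -> Optional[HttpParser]:
        # Invoked for CONNECT requests, plain proxy requests and (with TLS
        # interception enabled) the decrypted in-tunnel requests.
        print('Request _url:', request._url)
        print('Request.method:', request.method)
        print('Request protocol:', request.protocol)
        print('Request host:', request.host)
        print('Request path:', request.path)
        print('Request properties:', request.__dict__)
        return request
```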

abhinavsingh commented 2 months ago

@JJ-Author Unfortunately that's how it is currently. TLS interception was an afterthought, added on community request; it didn't exist in the original open source version. As a result, its support was essentially monkey-patched on top of the existing request object. In its defence, because the code runs within a context capable of serving HTTP, HTTPS, proxy and intercepted content, a single request object fails to encapsulate everything in a consistent manner. Maybe in future releases we can carve out more specific request objects to provide a better interface.

abhinavsingh commented 2 months ago

IIRC, these issues are a side-effect of how the HTTP parser works. To understand this, remember that proxy.py originally was just a proxy server. Hence, its internal parser takes the first request line and parses it. E.g. for a typical proxy request, the first line will look like CONNECT httpbin.org:443 HTTP/1.1.
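
To make the distinction concrete, here is a rough standalone sketch (deliberately not proxy.py's actual parser code) of why the three request-line forms seen in the log above carry different amounts of information:

```python
# Rough standalone illustration of the three request-line forms; this is not
# proxy.py's parser, just a model of why host/port/path differ per form.
from urllib.parse import urlsplit


def split_request_line(line: bytes) -> dict:
    method, target, version = line.split(b' ')
    if method == b'CONNECT':
        # authority-form: "host:port" only, no scheme and no path
        host, _, port = target.partition(b':')
        return {'method': method, 'host': host, 'port': int(port or b'443'), 'path': None}
    url = urlsplit(target)
    if url.scheme:
        # absolute-form (plain HTTP request sent to a proxy): scheme, host and path
        return {'method': method, 'host': url.hostname, 'port': url.port or 80, 'path': url.path}
    # origin-form (what the client sends inside an intercepted TLS tunnel): only
    # a path; host and real port have to come from the Host header or the earlier
    # CONNECT (proxy.py currently reports host=None and defaults port to 80 here).
    return {'method': method, 'host': None, 'port': None, 'path': target}


print(split_request_line(b'CONNECT www.w3id.org:443 HTTP/1.1'))
print(split_request_line(b'GET http://www.w3id.org/simulation/ontology/ HTTP/1.1'))
print(split_request_line(b'GET /simulation/ontology/ HTTP/1.1'))
```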

We may need to dig further into this to provide a consistent interface. For backwards compatibility we cannot change the existing fields, but we can certainly add a helper method/property/attribute on the parser object to provide better information for the current context (web, proxy, intercept).

JJ-Author commented 2 months ago

Thank you for your explanations. I understand and feel the pain when stuff gets hacked in later (especially if the misbehavior was introduced by a community PR and wasn't noticed until now) and a major redesign is needed that contradicts API backward compatibility. But to be honest, I can't follow some of your statements. To me it seemed that having the host None would only make sense for an HTTP/1.0 request with a missing Host header, so this should have nothing to do with HTTPS interception. Given your example with the CONNECT request, I now see that the field does not carry information from the Host header, which was surprising given its name "host" (though I now understand the rationale behind it according to the spec). Still, it seems to me that the problem with the empty host was already introduced when the CONNECT functionality was implemented, by not exposing information about the tunnel to this abstracted request object and instead folding it into the abstracted host field.

I also don't understand why there is a protocol field that is None, and an _is_https_tunnel field that is only true for the CONNECT request that establishes the tunnel, as opposed to its intuitive meaning of marking requests that travel through the HTTPS tunnel. And since the protocol field seems unused for HTTP and HTTPS requests, I do not understand why setting it to http or https respectively would affect backward compatibility, because there should be no existing code trying to interpret an unused field anyway.

Nevertheless, the port situation is a bug for which we have not found a clean workaround: there seems to be no way to distinguish requests to https://example.org:443/ from requests to https://example.org:444/ and perform the appropriate redirection for these different request targets.
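
The closest we have come to a workaround is to remember the CONNECT target port inside the plugin and reuse it for the intercepted requests that follow on the same connection. A rough, untested sketch; it assumes plugin instances are per client connection and that handle_client_request sees both the CONNECT and the decrypted in-tunnel requests:

```python
# Untested workaround sketch: remember the CONNECT target port so that later
# intercepted requests on the same connection can be disambiguated. Assumes a
# per-connection plugin instance and the handle_client_request hook.
from typing import Optional

from proxy.http.parser import HttpParser
from proxy.http.proxy import HttpProxyBasePlugin


class RememberConnectPortPlugin(HttpProxyBasePlugin):

    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.connect_port: Optional[int] = None

    def handle_client_request(self, request: HttpParser) -> Optional[HttpParser]:
        if request.method == b'CONNECT':
            # The tunnel target port (443, 444, ...) is only visible here.
            self.connect_port = request.port
        else:
            # For intercepted requests request.port is not reliable (80 in the
            # log above although the tunnel targeted 443), so fall back to the
            # port remembered from the CONNECT.
            effective_port = self.connect_port or request.port
            # ... archive-redirection logic based on
            # (request.host, effective_port, request.path) would go here.
        return request
```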

As constructive feedback from our side (we value all the effort that has gone into this project so far): at the moment it seems hardly possible to write a plugin from the documentation or the API hook definitions alone. One needs to go through all the example plugins to piece together how they can be used, and then debug or "reverse-engineer" the values of the request object fields for the different request cases, because with simple trial and error you get stuck since the behavior is unreasonable from the outside (we know that implementation limitations are part of the reason, but that can't and shouldn't be visible to plugin authors). Besides clear documentation of what an implementer can expect from the request object, a general request lifecycle/flow diagram that shows when each hook is triggered would be very helpful.

JJ-Author commented 2 months ago

Moving forward, based on your message, I think an entirely new request wrapper object, linked from the current request object as a new field, could be a good tradeoff. Well-defined/typed getter (and, if needed, setter) methods with good documentation of each function and its return values (e.g. when values are optional) should allow people who are not HTTP(S) experts like us to get started much more easily, and also let experts write code faster and more robustly. One option (probably more Java than Python style) would be to base this request object on a class hierarchy where each class represents a different request type: base, http, httpsPassthrough, httpsIntercepted, etc. That way one could get rid of optional/None values, and it would be very clear what you get without much documentation (and probably with better IDE support).
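
Roughly something along these lines, purely illustrative (all class and field names below are made up, not part of the proxy.py API):

```python
# Illustrative sketch of the proposed typed request hierarchy; all names are
# hypothetical and not part of the proxy.py API.
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class BaseRequest:
    method: bytes
    version: bytes
    headers: Dict[bytes, Tuple[bytes, bytes]]


@dataclass
class HttpRequest(BaseRequest):
    host: bytes   # always present (from the absolute-form URL or the Host header)
    port: int     # always present, defaulted from the scheme
    path: bytes


@dataclass
class HttpsPassthroughRequest(BaseRequest):
    host: bytes   # CONNECT authority target
    port: int     # CONNECT authority port


@dataclass
class HttpsInterceptedRequest(BaseRequest):
    host: bytes   # carried over from the originating CONNECT
    port: int     # carried over from the originating CONNECT, e.g. 443 or 444
    path: bytes
```

A plugin could then dispatch on the concrete type instead of probing optional fields for None.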

JJ-Author commented 2 months ago

How do we proceed on this? The "question" tag you assigned seems inappropriate, and unfortunately none of my questions (marked with a question mark) were actually answered.

split into 3 new issues?

abhinavsingh commented 2 months ago

split into 3 new issues?

  • [ ] lifecycle documentation as "enhancement" and "docu"

I honestly think this Python notebook already helps explain what to expect from the HttpParser class. I have sent out a PR to add more clarity, but more or less, this notebook already seems to document all scenarios.

We can further cover scenarios for: 1) proxy requests under TLS interception, 2) reverse proxy requests.

  • [ ] port issue as "bug"

Can you please reproduce this bug as either a test case (see test_http_parser.py) or maybe via an example in the notebook? A rough starting sketch for such a test is at the end of this comment.

  • [ ] consistent wrapper object as "enhancement"

Once we have enough clarity, we can propose an interface for the various types of request objects, extending the base HttpParser. Wdyt?
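
For the port issue, something along these lines could be a starting point (a hypothetical sketch, not yet in test_http_parser.py; it assumes the HttpParser(httpParserTypes.REQUEST_PARSER) / parse() API used by the existing tests):

```python
# Hypothetical test sketch for the port discrepancy; assumes the HttpParser API
# (httpParserTypes.REQUEST_PARSER + parse()) used by the existing test suite.
import unittest

from proxy.http.parser import HttpParser, httpParserTypes


class TestInterceptedRequestPort(unittest.TestCase):

    def test_origin_form_request_port_is_not_assumed_to_be_80(self) -> None:
        parser = HttpParser(httpParserTypes.REQUEST_PARSER)
        parser.parse(
            b'GET /simulation/ontology/ HTTP/1.1\r\n'
            b'Host: www.w3id.org\r\n'
            b'\r\n',
        )
        # Today the proxy reports port 80 for this request even when the bytes
        # arrived through a tunnel established by "CONNECT www.w3id.org:444";
        # the assertion below encodes the behavior we still need to agree on.
        self.assertNotEqual(parser.port, 80)
```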