Inconsistent HttpParser fields

JJ-Author commented 2 months ago

Describe the bug the field values of the request object (for a GET request via http and https) vary between http and https (intercepted) in an inconsistent and potentially incorrect manner that is hindering correct implementation of a custom archive redirection plugin

protocol is None always -> is this supposed to carry scheme information? how to distinguish whether the request is https or not???
_url is incomplete (missing scheme+host) for https requests
host is None for https request
port is incorrectly reported as 80 for https request (that is actually sent to 443)

To Reproduce the issue can be reproduced with

running poetry install and then poetry shell and then python proxy/request_proxy.py in the root dir of https://github.com/kuefmz/https-interception-proxypy/
curl -x http://0.0.0.0:8899/ --cacert ca-cert.pem https://www.w3id.org/simulation/ontology/
curl -x http://0.0.0.0:8899/ --cacert ca-cert.pem http://www.w3id.org/simulation/ontology/

Expected behavior

protocol should not be None but http or https?
_url should be full request url (including scheme+host) for https requests that are intercepted
host should be equivalent to the FQDN for https request as they are for http request
port should be correctly reported

Version information

OS: [ubuntu 20.04]
Browser [curl]
Device: [amd64]
proxy.py Version [2.4.4]

Additional context

Log from the custom proxy with output of request object and selection of fields

2024-08-02 15:51:26,940 - pid:209673 [I] plugins.load:89 - Loaded plugin proxy.http.proxy.HttpProxyPlugin
2024-08-02 15:51:26,940 - pid:209673 [I] plugins.load:89 - Loaded plugin __main__.RequestPlugin
Request _url: www.w3id.org:443
Request.method: b'CONNECT'
Request protocol: None
Request host: b'www.w3id.org'
Request path: None
Request properties: {'state': 6, 'type': 1, 'protocol': None, 'host': b'www.w3id.org', 'port': 443, 'path': None, 'method': b'CONNECT', 'code': None, 'reason': None, 'version': b'HTTP/1.1', 'total_size': 116, 'buffer': None, 'headers': {b'host': (b'Host', b'www.w3id.org:443'), b'user-agent': (b'User-Agent', b'curl/7.81.0'), b'proxy-connection': (b'Proxy-Connection', b'Keep-Alive')}, 'body': None, 'chunk': None, '_url': , '_is_chunked_encoded': False, '_content_expected': False, '_is_https_tunnel': True}

Request _url: /simulation/ontology/
Request.method: b'GET'
Request protocol: None
Request host: None
Request path: b'/simulation/ontology/'
Request properties: {'state': 6, 'type': 1, 'protocol': None, 'host': None, 'port': 80, 'path': b'/simulation/ontology/', 'method': b'GET', 'code': None, 'reason': None, 'version': b'HTTP/1.1', 'total_size': 96, 'buffer': None, 'headers': {b'host': (b'Host', b'www.w3id.org'), b'user-agent': (b'User-Agent', b'curl/7.81.0'), b'accept': (b'Accept', b'*/*')}, 'body': None, 'chunk': None, '_url': , '_is_chunked_encoded': False, '_content_expected': False, '_is_https_tunnel': False}
2024-08-02 15:51:35,421 - pid:209683 [I] server.access_log:388 - 127.0.0.1:59292 - CONNECT www.w3id.org:443 - 683 bytes - 680.55ms

Request _url: http://www.w3id.org/simulation/ontology/
Request.method: b'GET'
Request protocol: None
Request host: b'www.w3id.org'
Request path: b'/simulation/ontology/'
Request properties: {'state': 6, 'type': 1, 'protocol': None, 'host': b'www.w3id.org', 'port': 80, 'path': b'/simulation/ontology/', 'method': b'GET', 'code': None, 'reason': None, 'version': b'HTTP/1.1', 'total_size': 145, 'buffer': None, 'headers': {b'host': (b'Host', b'www.w3id.org'), b'user-agent': (b'User-Agent', b'curl/7.81.0'), b'accept': (b'Accept', b'*/*'), b'proxy-connection': (b'Proxy-Connection', b'Keep-Alive')}, 'body': None, 'chunk': None, '_url': , '_is_chunked_encoded': False, '_content_expected': False, '_is_https_tunnel': False}
2024-08-02 15:51:44,094 - pid:209688 [I] server.access_log:388 - 127.0.0.1:49428 - GET www.w3id.org:80/simulation/ontology/ - 301 Moved Permanently - 573 bytes - 479.09ms

abhinavsingh commented 2 months ago

@JJ-Author Unfortunately that's how it is currently. TLS interception was an after-thought and added on community request. It didn't exist in original open source version. As a result , its support was just monkey patched on top of existing request object. In its defence, I can say that, since code runs within a context capable of serving HTTP, HTTPS, PROXY, INTERCEPTED content, a single request fails to encapsulates everything in a consistent manner. May be in future releases we can carve out more specific request objects to provide better interface.

abhinavsingh commented 2 months ago

IIRC, these issues are side-effect of how http parser works. To understand this, let's first remember proxy.py originally was just a proxy server. Hence, its internal parser takes the 1st request line and parses it. E.g. for a typical proxy request, 1st line will look like CONNECT httpbin.org:443 HTTP/1.1

We may need to dig further into this to provide a consistent interface. Due to backwards compatibility we cannot change the existing, but we can certainly add a helper method / property / attribute within parser object to provide better information for the current context (web, proxy, intercept)

JJ-Author commented 2 months ago

Thank you for your explanations. I understand and feel the pain when stuff was hacked in later (especially if the misbehavior was introduced by a community PR and wasn't noticed so far) and now a major redesign is needed but this contradicts API backward compatibility… But to be honest I can't follow some of your statements. To me it seemed that having the host “None” would only make sense when it is an http 1.0 request and the Host header is missing so this should have nothing to do with https interception. But given your example with the CONNECT request I figured that it is not representing information from the host header which was kind of surprising given its field name "host" (but I now understand the rationale behind it according to the spec). But still to me it seems that the problem with the "empty host" was already introduced when implementing the CONNECT functionality (by not exposing information of the tunnel to this abstracted request object and embedding that into this abstracted host field). Yet, I don't understand why there would be a protocol field which is None and an _is_https_tunnell field which is only true for the CONNECT request that is supposed to create the tunnel opposed to its intuitive meaning indicating requests going through the https tunnel. . Since the protocol field seems not useful in the context of http and https requests I also do not understand why e.g. setting this to http or https respectively would affect backward compatibility because there should be no existing code trying to interpret an unused field anyhow.

Nevertheless the port situation is a bug where there seems no workaround for - there seems no way to distinguish connections between https://example.org:443 and https://example.org:444/ and perform appropriate redirection for these different request targets.

As a constructive feedback from our side (we value all effort that has been put so far in this project). ATM It seems hardly possible to write a plugin from the documentation or API hook definition. One needs to go through all the example plugins to grasp pieces of information how they can be used and then debug or "reverse-engineer" the values of the request object fields for the different request cases since with simple try and error you will get stuck because the behavior is unreasonable from the outside (we know that implementation limitations provide some reason but this can’t and should not be seen by plugin authors). Besides a clear documentation what an implementer can expect from the request object also a general request lifeycle/flow image that shows when which hook is triggered would be very helpful.

JJ-Author commented 2 months ago

Moving forward - based on your message - I think an entirely new request wrapper object that is linked from the current request object as a new field could be a good tradeoff. Given well-defined/typed getter (and if needed setter methods) with good documentation of each function and their return values (e.g. when values are optional) should allow people that are not http(s) experts like us to get started much easier, but also experts to write code faster and more robust. One option would be (probably this is more java then python style) that this request object could be based on a class hierarchy where each different class represents the different request types base, http, httpsPassthrough, httpsIntercepted, etc. As such one could get rid of optional/None values and it is very clear what you will get without too much documentation (and probably better IDE support).

JJ-Author commented 2 months ago

how do we proceed about this? the tag "question" you assigned seems inappropriate and unfortunately none of my questions (marked with a question mark) were actually answered.

split into 3 new issues?

[ ] port issue as "bug"
[ ] consistent wrapper object as "enhancement"
[ ] lifecycle documentation as "enhancement" and "docu"

abhinavsingh commented 2 months ago

split into 3 new issues?

[ ] lifecycle documentation as "enhancement" and "docu"

I honestly think this Python Notebook already helps explain what to expect from HttpParser class. I have sent out a PR to add more clarity, but more of less, this notebook already seem to document all scenarios.

We can further cover scenarios for: 1) Proxy requests under TLS interception 2) Reverse proxy requests

[ ] port issue as "bug"

Can you please reproduce this bug as either a test case (see test_http_parser.py) or may be via an example in the notebook?

[ ] consistent wrapper object as "enhancement"

Once we have enough clarity, we can propose an interface for various types of request objects, extending base HttpParser. Wdyt?

abhinavsingh / proxy.py

Inconsistent HttpParser fields #1437