Closed kaklakariada closed 2 months ago
Are you sure it should be solved on driver level instead of changing proxy / DNS configuration?
I suspect it might cause HTTP transport to fail.
Maybe you could simply connect to proxy first and resolve DNS request using proxy? I guess it should be possible.
@littleK0i thank you for your response!
I am not sure if resolving hostnames through the proxy would help. The driver would still try to connect using the IP address through the proxy, and the proxy will still reject the request.
As far as I understand we need to pass the hostname instead of ipaddr when creating the websocket connection:
websocket.create_connection(f'{ws_prefix}{ipaddr}:{port}', **ws_options)
To be honest I was very surprised when I learned that pyexasol resolves host names itself. I understand that the goal is to do load balancing by shuffling the IP addresses, but I would have expected that the network stack would already randomly pick one of the resolved IP addresses.
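For readers following along, here is a minimal sketch of the "resolve, then shuffle" idea discussed above. It is not pyexasol's actual code: the function name resolve_and_shuffle and the injectable resolver/rng parameters are assumptions made for illustration, and only the standard library calls socket.gethostbyname_ex and random.shuffle are used.

```python
import random
import socket


def resolve_and_shuffle(hostname, resolver=socket.gethostbyname_ex, rng=random):
    """Return every IPv4 address behind `hostname`, in random order.

    `resolver` and `rng` are injectable (hypothetical parameters) so the
    sketch can be exercised without touching a real DNS server.
    """
    # gethostbyname_ex returns (canonical_name, alias_list, ip_address_list)
    _name, _aliases, ipaddrs = resolver(hostname)
    shuffled = list(ipaddrs)
    # Shuffling spreads clients across cluster nodes (simple load balancing);
    # the remaining addresses can serve as fallbacks if the first one fails.
    rng.shuffle(shuffled)
    return shuffled


def fake_resolver(hostname):
    # Stand-in for DNS: pretend the name points at three cluster nodes.
    return (hostname, [], ["10.0.0.1", "10.0.0.2", "10.0.0.3"])


print(resolve_and_shuffle("myexasol.mlan", resolver=fake_resolver))
```

Because the client sees the full address list up front, it knows whether the name points to one node or many, which is the redundancy argument made in the reply below.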
Resolving everything helps to address a few problems:
myexasol.mlan instead of having separate hostnames myexasol1..64.mlan, when number constantly changes. In this case redundancy will not work properly if hostname is not resolved beforehand, since we do not know if it points to one address or to multiple addresses.
Maybe it would be easier to reconsider proxy configuration or use a different approach. I suspect only one client has this problem now after 6+ years.
Hi @littleK0i,
As usual, thanks a lot for the insightful feedback. Regarding the HTTP transport, do you have a specific scenario in mind which isn't covered by the examples/tests?
I have been talking to @kaklakariada directly to get a better understanding of the overall issue.
From what I understand, this is most commonly an issue with SaaS where whitelisting a DNS name rather than IPs provides more flexibility. Also, I think it's not guaranteed that the client code author always has full control over the proxy and/or the network stack. Therefore, not having this functionality puts some folks out of luck.
While I am a huge fan of Pyexasol's performance mindset, I also understand the need for the Exasol standard Python driver to be flexible in its usage. I generally approach decisions with the mindset that the default should be performant, but other scenarios can be supported if needed.
TL;DR:
I am fine with adding such functionality if it does not break existing code:
Additionally, I will add a "Design Doc" where I will document important design decisions, including their background, as in this case.
Hope that's reasonable to everyone. If not, feel free to get back to me.
Best,
Nico
I think it might be helpful to implement something similar to Snowflake NETWORK POLICIES (https://docs.snowflake.com/en/user-guide/network-policies). Only clients using specific IP ranges are allowed to open a connection.
In this case customers won't need to maintain extra rules on proxy, and the security will be much better overall. Also, this problem will be fixed automatically.
I would not be surprised if Exasol SaaS already has this feature in some shape or form.
I've looked through the code. I suspect HTTP transport (export_*, import_* functions) does not work with http_proxy at this moment.
If you're going to recommend more customers to use proxy instead of VPN or private networks, it might be worthwhile to extend proxy support to HTTP transport. Not sure about performance & overhead for big data transfers.
Thanks @littleK0i for the additional hints and clarifications.
From what I understand, the NETWORK POLICIES are part of the issue. In the case of Exasol SaaS, they can be created based on DNS names and/or IP addresses. Regarding DNS names, it's not working due to the forced resolution of the DNS name as far as I understand. This seems due to the fact that the validation also considers the connection string or so, if I understood @kaklakariada correctly.
I'll be offline for the rest of the week, but I would like to discuss this further with @kaklakariada given the new information you've provided. Additionally, I think we can also give @kaklakariada the opportunity to "prove" that a viable patch for this issue is feasible, if he can provide an appropriate PR:
@kaklakariada, it might also be a good idea to consider marking/documenting the parameter (configuration point) as not being part of the stable API yet.
@littleK0i, we should also keep in mind that this project isn't static. If we add this feature and it does more harm than good, we are willing to remove or rework it.
Best, Nico
I should have provided some background. This feature is not meant for a customer but for an internal use case at our company. We need to implement a service that sits behind a proxy and accesses an Exasol SaaS instance. For security reasons this proxy allows connections only to certain host names and blocks access via IP addresses.
I will add integration tests that verify that resolve_hostnames=True works with import, export and other features.
Aha! This changes quite a lot. In theory, it should be possible to affect design of internal services.
How about using a basic VPC instead, mainly for performance reasons? Proxy would naturally add extra 20-100ms to every request, which makes user experience a bit worse.
Also, proxy is a natural bottleneck in terms of network throughput. Even if the whole cluster can send more data in parallel, it will be limited by proxy, unless it is a multi-node proxy.
I totally agree that a proxy is bad for performance. But in our use case we will have only a very small amount of data we query/insert. The goal of this feature is working around network restrictions and by no means a recommendation to users. I will clarify this in the documentation.
Summary
Pyexasol resolves hostnames to IP addresses before establishing a connection, see connection.py#L796. In case a host name resolves to multiple IP addresses of individual cluster nodes (e.g. for Exasol SaaS abc123.clusters.exasol.com) this explicitly implements load balancing.
Details
There is one problem with this approach: when connecting through a proxy with an allow-list of host names, the proxy only sees the IP address, which is not on the allow-list, and blocks the request. The connection will fail with error message
Could not connect to Exasol: failed CONNECT via proxy status: 403
Background & Context
To solve this problem I propose to allow the user to deactivate host name resolution by adding a constructor argument resolve_hostnames to ExaConnection with default value True.
Task(s)
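A minimal sketch of how the proposed resolve_hostnames switch could decide what is passed to the websocket connection. This is an illustration under assumptions, not the actual ExaConnection implementation: the helper name websocket_urls, the injectable resolver parameter, the "wss://" prefix and the port 8563 in the usage example are all hypothetical choices made for the example.

```python
import socket


def websocket_urls(hostname, port, resolve_hostnames=True,
                   ws_prefix="wss://", resolver=socket.gethostbyname_ex):
    """Build candidate websocket URLs for one host (hypothetical helper).

    resolve_hostnames=True  -> one URL per resolved IP address (current
                               behaviour as described in this issue).
    resolve_hostnames=False -> a single URL with the unchanged host name,
                               so a proxy with a host name allow-list sees
                               the name rather than an IP address.
    """
    if not resolve_hostnames:
        return [f"{ws_prefix}{hostname}:{port}"]
    # gethostbyname_ex returns (canonical_name, alias_list, ip_address_list)
    _name, _aliases, ipaddrs = resolver(hostname)
    return [f"{ws_prefix}{ipaddr}:{port}" for ipaddr in ipaddrs]


def fake_resolver(hostname):
    # Stand-in for DNS so the example runs offline.
    return (hostname, [], ["10.0.0.1", "10.0.0.2"])


host = "abc123.clusters.exasol.com"
print(websocket_urls(host, 8563, resolve_hostnames=True, resolver=fake_resolver))
print(websocket_urls(host, 8563, resolve_hostnames=False, resolver=fake_resolver))
```

Keeping True as the default leaves existing behaviour untouched, while False trades client-side load balancing for compatibility with host-name-based proxy rules.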