kaklakariada commented 2 months ago

Summary

Pyexasol resolves hostnames to IP addresses before establishing a connection (see connection.py#L796). If a host name resolves to multiple IP addresses of individual cluster nodes (e.g. abc123.clusters.exasol.com for Exasol SaaS), pyexasol chooses among them itself, which explicitly implements client-side load balancing.
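For illustration, here is a minimal sketch of this resolve-then-shuffle behavior. It is not pyexasol's actual code: the function name `resolve_and_shuffle` and the injectable `resolver` parameter are hypothetical, and the sketch only covers IPv4 via the standard library.

```python
import random
import socket


def resolve_and_shuffle(hostname, resolver=socket.gethostbyname_ex):
    """Resolve a hostname to all of its IPv4 addresses and shuffle them.

    Hypothetical helper for illustration only; `resolver` can be swapped
    out so the function is testable without network access.
    """
    # gethostbyname_ex returns (canonical_name, alias_list, ip_address_list)
    _, _, ip_addresses = resolver(hostname)
    candidates = list(ip_addresses)
    # Shuffling spreads new connections across the cluster nodes
    random.shuffle(candidates)
    return candidates
```

A client would then try the returned addresses in order until one connection attempt succeeds.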

Details

This approach causes a problem when connecting through a proxy with an allow-list of host names: the proxy only sees the IP address, which is not on the allow-list, and blocks the request. The connection fails with the error message Could not connect to Exasol: failed CONNECT via proxy status: 403.
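To make the failure mode concrete, here is a toy model of such a proxy check. Everything in it is an assumption for illustration: `proxy_allows` is a hypothetical function, real proxies apply their own matching rules, and the port 8563 and the address 203.0.113.10 (a documentation-only range) are placeholders.

```python
def proxy_allows(connect_target, allowed_hosts):
    """Return True if the host part of a CONNECT target is on the allow-list.

    Toy model only: a real proxy may also match wildcards, ports, etc.
    """
    # Split "host:port" at the last colon and keep the host part
    host, _, _port = connect_target.rpartition(":")
    return host.lower() in {h.lower() for h in allowed_hosts}


allow_list = ["abc123.clusters.exasol.com"]

# The client passes the hostname through: the proxy recognizes it.
by_name = proxy_allows("abc123.clusters.exasol.com:8563", allow_list)

# The client resolved the name first: the proxy only sees an unknown IP
# and would answer the CONNECT request with 403.
by_ip = proxy_allows("203.0.113.10:8563", allow_list)
```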

Background & Context

To solve this problem I propose to allow the user to deactivate host name resolution by adding a constructor argument resolve_hostnames to ExaConnection with default value True.
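A rough sketch of how such a switch could behave is shown below. This is not a patch against pyexasol: `connection_targets` and its `resolver` parameter are hypothetical names, and only the flag name `resolve_hostnames` and its default value True come from the proposal above.

```python
import socket


def connection_targets(hostname, resolve_hostnames=True,
                       resolver=socket.gethostbyname_ex):
    """Return the addresses a client would try to connect to.

    Hypothetical helper: with resolve_hostnames=True (the proposed default,
    matching current behavior) the hostname is expanded to its IP addresses;
    with resolve_hostnames=False it is passed through unchanged, so a proxy
    sees the original name.
    """
    if not resolve_hostnames:
        return [hostname]
    # gethostbyname_ex returns (canonical_name, alias_list, ip_address_list)
    _, _, ip_addresses = resolver(hostname)
    return list(ip_addresses)
```

With the flag disabled, the client gives up its own load balancing and deduplication of addresses, which is why the default would stay True.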

Task(s)

[ ] Add option
[ ] Add unit tests
[ ] Add description to user guide

littleK0i commented 2 months ago

Are you sure it should be solved on driver level instead of changing proxy / DNS configuration?

I suspect it might cause HTTP transport to fail.

littleK0i commented 2 months ago

Maybe you could simply connect to proxy first and resolve DNS request using proxy? I guess it should be possible.

kaklakariada commented 2 months ago

@littleK0i thank you for your response! I am not sure if resolving hostnames through the proxy would help. The driver would still try to connect using the IP address through the proxy, and the proxy will still reject the request. As far as I understand we need to pass the hostname instead of ipaddr when creating the websocket connection:

websocket.create_connection(f'{ws_prefix}{ipaddr}:{port}', **ws_options)

To be honest, I was very surprised to learn that pyexasol resolves host names itself instead of leaving this to the network stack. I understand that the goal is to do load balancing by shuffling the IP addresses, but I would have expected the network stack to already pick one of the resolved IP addresses at random.

littleK0i commented 2 months ago

Resolving everything helps to address a few problems:

If at least one hostname is not available, you always get an exception. Otherwise you will get an exception only when "random" chooses a broken hostname, which leads to random errors in production.
When you have a really large cluster with an ever-growing number of nodes, it makes sense to put all nodes behind one hostname, like myexasol.mlan, instead of maintaining separate hostnames myexasol1..64.mlan whose number constantly changes. In this case redundancy will not work properly if the hostname is not resolved beforehand, since we do not know whether it points to one address or to multiple addresses.
For redundancy we do not want to try the same IP address twice. Afaik, it cannot be guaranteed if we do not connect by IP.
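The three points above can be sketched roughly as follows. This is a simplified illustration, not the driver's actual code: the function name `resolve_all`, the injectable `resolver`, and the error message text are all made up for the example.

```python
import socket


def resolve_all(hostnames, port, resolver=socket.gethostbyname_ex):
    """Resolve every hostname up front and return unique (ip, port) pairs.

    Hypothetical helper for illustration only.
    """
    seen = set()
    targets = []
    for hostname in hostnames:
        try:
            # (canonical_name, alias_list, ip_address_list)
            _, _, ip_addresses = resolver(hostname)
        except OSError as e:
            # Fail immediately if any hostname is broken, rather than only
            # when a random choice happens to land on it.
            raise ConnectionError(f"Could not resolve host [{hostname}]") from e
        for ip in ip_addresses:
            # One hostname may expand to many nodes, and several hostnames
            # may overlap; keep each address once so it is never retried.
            if (ip, port) not in seen:
                seen.add((ip, port))
                targets.append((ip, port))
    return targets
```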

Maybe it would be easier to reconsider proxy configuration or use a different approach. I suspect only one client has this problem now after 6+ years.

Nicoretti commented 2 months ago

Hi @littleK0i,

As usual, thanks a lot for the insightful feedback. Regarding the HTTP transport, do you have a specific scenario in mind which isn't covered by the examples/tests?

I have been talking to @kaklakariada directly to get a better understanding of the overall issue.

From what I understand, this is most commonly an issue with SaaS where whitelisting a DNS name rather than IPs provides more flexibility. Also, I think it's not guaranteed that the client code author always has full control over the proxy and/or the network stack. Therefore, not having this functionality puts some folks out of luck.

While I am a huge fan of Pyexasol's performance mindset, I also understand the need for the Exasol standard Python driver to be flexible in its usage. I generally approach decisions with the mindset that the default should be performant, but other scenarios can be supported if needed.

TL;DR:

I am fine with adding such functionality if it does not break existing code:

No adjustments to existing client code calls needed
The default behavior remains unchanged
Does not degrade the performance of the existing behavior
...

Additionally, I will add a "Design Doc" where I will document important design decisions, including their background, as in this case.

Hope that's reasonable to everyone. If not, feel free to get back to me.

Best,
Nico

littleK0i commented 2 months ago

I think it might be helpful to implement something similar to Snowflake NETWORK POLICIES (https://docs.snowflake.com/en/user-guide/network-policies). Only clients using specific IP ranges are allowed to open connection.

In this case customers won't need to maintain extra rules on proxy, and the security will be much better overall. Also, this problem will be fixed automatically.

I would not be surprised if Exasol SaaS already has this feature in some shape or form.

littleK0i commented 2 months ago

I've looked through the code. I suspect HTTP transport (export_*, import_* functions) does not work with http_proxy at this moment.

If you're going to recommend that more customers use a proxy instead of a VPN or private networks, it might be worthwhile to extend proxy support to HTTP transport. I am not sure about the performance and overhead for big data transfers.

Nicoretti commented 2 months ago

Thanks @littleK0i for the additional hints and clarifications.

From what I understand, the NETWORK POLICIES are part of the issue. In the case of Exasol SaaS, they can be created based on DNS names and/or IP addresses. As far as I understand, policies based on DNS names do not work because of the forced resolution of the DNS name. This seems to be because the validation also considers the connection string, if I understood @kaklakariada correctly.

I'll be offline for the rest of the week, but I would like to discuss this further with @kaklakariada given the new information you've provided. Additionally, I think we can also give @kaklakariada the opportunity to "prove" that a viable patch for this issue is feasible, if he can provide an appropriate PR:

Pass all the tests & examples
No major refactoring (the change should not be too invasive)
...

@kaklakariada, it might also be a good idea to consider marking/documenting the parameter (configuration point) as not being part of the stable API yet.

@littleK0i, we should also keep in mind that this project isn't static. If we add this feature and it does more harm than good, we are willing to remove or rework it.

Best, Nico

kaklakariada commented 2 months ago

I should have provided some background. This feature is not meant for a customer but for an internal use case at our company. We need to implement a service that sits behind a proxy and accesses an Exasol SaaS instance. For security reasons this proxy allows connections only to certain host names and blocks access via IP addresses.

I will add integration tests that verify that resolve_hostnames=True works with import, export and other features.

littleK0i commented 2 months ago

Aha! This changes quite a lot. In theory, it should be possible to affect design of internal services.

How about using a basic VPC instead, mainly for performance reasons? A proxy would naturally add an extra 20-100 ms to every request, which makes the user experience a bit worse.

Also, proxy is a natural bottleneck in terms of network throughput. Even if the whole cluster can send more data in parallel, it will be limited by proxy, unless it is a multi-node proxy.

kaklakariada commented 2 months ago

I totally agree that a proxy is bad for performance. But in our use case we will have only a very small amount of data we query/insert. The goal of this feature is working around network restrictions and by no means a recommendation to users. I will clarify this in the documentation.

exasol / pyexasol

✨Allow configuring hostname resolution #151
