apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0
36.19k stars 14.05k forks source link

HTTP connection parse path as schema but use as protocol #10913

Open dungdm93 opened 3 years ago

dungdm93 commented 3 years ago

Apache Airflow version: 1.10.11

Environment:

What happened: I store my connection into Airflow using this command

$ airflow connections --add --conn_id=demo --conn_uri="http://localhost:8080/path/foo/bar"
Successfully added `conn_id`=demo : http://localhost:8080/path/foo/bar

And when It's used by hooks, I realize that HTTP connection parse path as schema but use as protocol:

from airflow.hooks.http_hook import HttpHook

c = HttpHook.get_connection("demo")
print(f"type={c.conn_type} schema={c.schema}, user={c.login}, pass={c.password}, host={c.host}, port={c.port}, extra={c.extra}")
# type=http schema=path/foo/bar, user=user, pass=pass, host=localhost, port=8080, extra=None

h = HttpHook(http_conn_id="demo")
h.get_conn()
print(h.base_url)
# 'path/foo/bar://localhost:8080' 

For now, I could workaround by store hold URI in host field like

$ airflow connections --add --conn_id=demo2 --conn_type=http --conn_host=http://localhost:8080/path/foo/bar
Successfully added `conn_id`=demo2 : http://:******@http://localhost:8080/path/foo/bar:

But you could see, it's absolutely bad solution.

So my suggestion is the connection should be sanitize follow the standard URL syntax: image

This mean that:

boring-cyborg[bot] commented 3 years ago

Thanks for opening your first issue here! Be sure to follow the issue template!

potiuk commented 3 years ago

Yeah @dungdm93 - this is a known problem, It's not a bug per-se, more of a naming problem but it's hard to solve in a backward-compatible way. Would you like to attempt to solve it this way that the previous usages will be not affected ? It might be temporary (with some deprecation notice) but we have to make sure that at least when people migrate to 2.0, their code still works and at most they receive a deprecation warning.

potiuk commented 3 years ago

And it would be great to solve it BTW. it's pretty anoying.

mik-laj commented 3 years ago

Related: https://github.com/apache/airflow/pull/10256

kubatyszko commented 3 years ago

Btw, this also affects some other protocols, so far I've identified HTTP/HTPS and SPARK:

Sample URI's that I have in my Secrets Manager :

http://https%3A%2F%2Fexample.com

spark://spark%3A%2F%2Fspark-master-0.spark-master.spark.svc.cluster.local:7077

I've also just added a piece of documentation that explains the strange URI's required. We should keep it until this current Issue taken care of... (but I recall from my conversation in the past that it was not easy to fix this).

vashchukmaksim commented 3 years ago

The above doesn't work for me, so I came up with the following that works in my case for HTTP:

AIRFLOW_CONN_API_AUTH=http://app_auth:8000/http

So I added http at the end and later use relative path in SimpleHttpOperator's endpoint argument.

curlup commented 1 year ago

I know it's not a bug per-se, but maybe a path forward would be to introduce new http and https hook and name it differently (RestHook or httpApiHook?) and then deprecate the current one, not breaking anyones code/pipelines behavior?

potiuk commented 1 year ago

Why not. Maybe you would like to attempt to do it? PRs are welcome :)

Bisk1 commented 9 months ago

I would like to revive this thread because recently I've spend a lot of time when I was trying to set up an HTTPS connection because of how confusing connection URI is. Currently if I want to do it then the URI has to be either http://example.com/https OR http://https://example.com but the most intuitive one: https://example.com doesn't work.

I also noticed that there was more work done on supporting the second pattern: https://github.com/apache/airflow/pull/31465 which makes it seem like it is now the recommended solution rather than a workaround.

I feel like there is a design decision needed here. Either Airflow committers agree that the current URI syntax: https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html#uri-format is the final one and in that case this issue should be closed.

Alternatively (which is in my opinion better) we could try to aim at changing the URI syntax so that only the 3rd pattern is valid and others are deprecated. E.g. schema (actually 'scheme') and protocol could be considered as one and the same thing in the connection model (in future, now it would be a breaking change). And connection type can be derived from the scheme in code (e.g. if the scheme is HTTPS then conn_type is HTTP, in almost all other cases conn_type = scheme). Then the path part is not needed and hostname should not contain protocol because it will be available in scheme.

I'm willing to propose some solution and deprecation path (so that the old patterns are only supported until Airflow 3) if the community sees value in this idea.

potiuk commented 9 months ago

I do not think it is final. I think we would be entirely open to get more "intutitive" way of defining and storing the connection as long as it does not introduce backwards compatiblity - then it can even be made "preferred" and the old one "deprecated" - I have completley no problem with that.

Bisk1 commented 9 months ago

Ok, after looking into this issue again I see that parts of my comment were mistaken - I realised that 'schema' in connection is not 'scheme' with a typo but a database schema. This is because these terms were mixed in HttpHook implementation. I think my PR (https://github.com/apache/airflow/pull/35712) should fix this issue. But encoding protocol in the hostname is a separate thing so I will leave it for now.