dask-contrib / dask-sql

Distributed SQL Engine in Python using Dask
https://dask-sql.readthedocs.io/
MIT License
383 stars 71 forks

Connect to Apache Atlas #99

Open ibnubay opened 3 years ago

ibnubay commented 3 years ago

I would love it if these two projects could merge their great features to become an even more powerful tool 🙂

ibnubay commented 3 years ago

Maybe it would also be great to connect to Apache Atlas, to enable connections to other applications.

nils-braun commented 3 years ago

That is a very nice feature request! I think that should already work "out of the box" if you are using Python syntax (intake can export dask dataframes, which can then be used in dask-sql after registering them as tables). It would be super cool if we could even do this from SQL, e.g.

CREATE TABLE "test" WITH (format = "intake", location = "intake-catalog-location")

or something like this. What do you think?

For Apache Atlas I have not much experience - do you know if it is possible to connect from dask to Apache Atlas?

ibnubay commented 3 years ago

Apache Atlas can be accessed from Python, and maybe the code currently used to connect to Hive could be reused.
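For illustration, connecting to Apache Atlas from Python can be done against its REST API. The sketch below uses Atlas's v2 basic-search endpoint via `requests`; the host, port, and credentials are placeholders, and the helper name is hypothetical:

```python
import requests

def atlas_search(base_url, type_name, user="admin", password="admin"):
    """Minimal sketch: query Apache Atlas's v2 basic-search REST endpoint.

    base_url is e.g. "http://atlas-host:21000" (illustrative); the default
    credentials here are placeholders, not a recommendation.
    """
    resp = requests.get(
        f"{base_url}/api/atlas/v2/search/basic",
        params={"typeName": type_name},
        auth=(user, password),
    )
    resp.raise_for_status()
    # Atlas returns a JSON body whose "entities" key lists matching entities.
    return resp.json().get("entities", [])
```

A call like `atlas_search("http://atlas-host:21000", "hive_table")` would then return the Hive tables Atlas knows about, which is the kind of metadata a dask-sql integration could consume.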

nils-braun commented 3 years ago

That would be great! Could you share a typical workflow with Apache Atlas and the Python client? As I said, I do not have much experience with Apache Atlas (sorry...), and to me Atlas is mostly a tool for audit safety and compliance (not so much a metadata catalog in itself).

martindurant commented 3 years ago
CREATE TABLE "test" WITH (format = "intake", location = "intake-catalog-location")

What do you need from Intake to make this happen? I assume we require the named data source to be of "dataframe" type, and will use it by calling to_dask().

nils-braun commented 3 years ago

Without having worked with intake before (but fortunately the docs are good ;-)), I assume all dask-sql would need to do on a call like

CREATE TABLE "test" WITH (format = "intake", location = "intake-catalog-location", table_name = "my-table")
-- table_name could be non-mandatory and would use "test" otherwise

would be

import intake

cat = intake.open_catalog("intake-catalog-location")
df = cat["my-table"].to_dask()

Is this correct? If this is the case, that would already be everything :-)

martindurant commented 3 years ago

That is correct, although you could optionally allow extra arguments to be passed when opening the catalog or when accessing the specific dataset.
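The extra-arguments idea above might be sketched as a small loader that forwards keyword arguments to both steps. The function name and the two kwargs dicts are hypothetical, not part of either library's API; intake catalog entries do accept user parameters by calling the entry before materializing it:

```python
def load_intake_table(location, table_name, catalog_kwargs=None, source_kwargs=None):
    """Open an intake catalog and return the named source as a dask dataframe.

    catalog_kwargs / source_kwargs are hypothetical pass-throughs for the
    extra WITH (...) arguments a CREATE TABLE statement might carry.
    """
    import intake  # imported lazily so the sketch only needs intake when called

    cat = intake.open_catalog(location, **(catalog_kwargs or {}))
    source = cat[table_name]
    if source_kwargs:
        # Calling a catalog entry with keyword arguments parameterizes it.
        source = source(**source_kwargs)
    return source.to_dask()
```

dask-sql could then pass any unrecognized `WITH` options through to one of these two dicts, without having to know each driver's parameters itself.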

nils-braun commented 3 years ago

@ibnbay99 - the intake part is included now. Do you have a code sample for Apache Atlas?