Support additional `from` datasources within Zed

brimdata / super

An analytics database that puts JSON and relational tables on equal footing

BSD 3-Clause "New" or "Revised" License

As a noob to the Zed ecosystem, I've been playing with the zed and zq CLIs and was looking for a way to integrate other data sources within a Zed pipeline. I know this can currently be done using other tools like curl, find, psql, etc. But I would like to be able to integrate these sources directly within Zed, for the following benefits:

- Ability to build a more efficient and streamlined dataflow pipeline, since the data source integrations can leverage the full power of Go's capabilities.
- Simplify the interface into Zed for new users (albeit at the cost of additional complexity in the supported syntax). I think many users will lean towards the tools they know, using shell pipelines to feed data into Zed, but I feel this would severely limit the power of the dataflow pipelines that Zed can create (though my experience with Zed, and with Go in general, is quite new).
- Allow data throughput to be controlled, so that you can build data sources that pull from potentially infinite streams. Here I'm thinking of activities like paging through API endpoints, SQL query results, or files on a file system.
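To make the third point concrete, here is a minimal Go sketch of what a pull-based, paged source could look like. Nothing here is Zed's actual API: `PagedSource`, `Head`, `Record`, and `fakeAPI` are hypothetical names invented for illustration, and `fakeAPI` only simulates a paginated endpoint in memory. The idea is that the consumer decides when to fetch the next page, so even an unbounded source is only read as far as the query needs.

```go
package main

import (
	"errors"
	"fmt"
)

// Record is a stand-in for whatever value type a real source would emit.
type Record struct{ ID int }

// ErrDone signals that a source has no more pages.
var ErrDone = errors.New("source exhausted")

// PagedSource is a hypothetical pull-based source (not a Zed interface):
// the consumer asks for the next page only when it is ready for more data.
type PagedSource interface {
	NextPage() ([]Record, error)
}

// fakeAPI simulates a paginated endpoint. A negative total means the
// stream never ends.
type fakeAPI struct {
	total, pageSize, offset int
	fetches                 int // number of pages actually requested
}

func (f *fakeAPI) NextPage() ([]Record, error) {
	if f.total >= 0 && f.offset >= f.total {
		return nil, ErrDone
	}
	f.fetches++
	end := f.offset + f.pageSize
	if f.total >= 0 && end > f.total {
		end = f.total
	}
	page := make([]Record, 0, end-f.offset)
	for id := f.offset; id < end; id++ {
		page = append(page, Record{ID: id})
	}
	f.offset = end
	return page, nil
}

// Head pulls pages until it has n records or the source is exhausted.
// Because it stops requesting pages once n is reached, it terminates
// even when the source is infinite.
func Head(src PagedSource, n int) ([]Record, error) {
	if n <= 0 {
		return nil, nil
	}
	out := make([]Record, 0, n)
	for len(out) < n {
		page, err := src.NextPage()
		if errors.Is(err, ErrDone) {
			break
		}
		if err != nil {
			return nil, err
		}
		for _, r := range page {
			if len(out) == n {
				break
			}
			out = append(out, r)
		}
	}
	return out, nil
}

func main() {
	api := &fakeAPI{total: -1, pageSize: 3} // unbounded stream
	recs, err := Head(api, 7)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(recs), "records from", api.fetches, "page fetches")
}
```

With a shell pipeline, the upstream tool typically keeps producing until the pipe closes. With a source like this living inside the engine, the downstream operator would be the one driving how many pages ever get requested.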

Is this something that Zed already supports?

If not, is this use case something that aligns with the vision and goals for Zed?

I believe there is a similar ticket that is focused on extending the `from` / `get` operator to support HTTP requests. Should this go there?

A somewhat related tool in this space is steampipe, which has a plugin capability to enable new data sources. I've been a huge fan of their product, but PostgreSQL syntax seems a bit heavy at times. I think Zed's data lake concept plus the power of its query tools would be a really attractive alternative.

As those of us involved with the Zed project often say, "the architecture supports it". 😉 These concepts are indeed in line with the direction the project is headed. That said, since dev resources are limited today, the bulk of the effort lately and in the near future is likely to go toward the core of the tech, e.g., ensuring solid performance and ease of management for data once it's in the system.

Making it easy for users to get data into the system is still important. However, the way we'll likely enable this in the short term is by publishing best practices for tools that already integrate with a diverse set of inputs, then showing how those tools can easily push their data onward into Zed. Two recent prototyping efforts along these lines have been Logstash (#3151) and Fluentd (#4271), and pretty soon I expect we'll publish more formal docs that turn the findings in those issues into "best practices".

As noted above, we also recognize that the existing `get <uri>` variation of `from` should probably be extended to cover other HTTP methods and parameters/payloads, which would allow for hitting a wider set of REST APIs (#4225).

We'll keep this issue open as a record of our intent to make ongoing investment in this area. Thanks for your interest in Zed!

Support additional `from` datasources within Zed #4337