elixir-explorer / fss

FSS - File system specs is a small abstraction to describe how to access files on local or remote file systems.
Apache License 2.0

[Question] Is this a good place for FSSpec style APIs? #6

Closed pcapel closed 6 months ago

pcapel commented 6 months ago

I'm interested in integrating Elixir with a data layer built on a Data Lakehouse architecture. Currently, I'm looking at Apache Iceberg. The specification is open, which makes it attractive. While looking into how I might implement it in Elixir, I determined that I would be well served by a library like fsspec, as that's what the PyIceberg project uses to abstract the filesystem operations the spec requires.

So my first question is, would this project be amenable to PRs to build out those features?

I'm also curious whether anyone has insight into any related development in the general Elixir community. For example, this seems like it would work well with Broadway. I know that Broadway has a Kafka connector, but I'm unable to find anything related to an Iceberg connector. I'm fairly new to the "data engineering" side of things, so this may be an issue of naïveté. If there are other ways of dealing with this in Elixir, I'd appreciate having them pointed out.

I'd be happy to work on creating a road map of the necessary features, provided that work isn't already ongoing in some other project.

josevalim commented 6 months ago

While we would be open to extensions, it is worth noting that this library is different from the Python one. This library only keeps the specification/credentials. It does not access anything, and therefore it does not provide operations such as reading and writing files. This is because certain libraries, such as Livebook and Explorer, want to have control over how files are accessed. For example, in Explorer's case, the specification/credentials need to be sent all the way down to Rust.

My suggestion is to work on the bindings directly and use them as necessary in your data stack. You probably won't need a shared data abstraction unless you run into very specific cases, which we could then discuss. :)
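
To make the "specification only" idea above concrete, here is a minimal sketch in Elixir. The module name `SpecOnly.S3Entry`, its fields, and its `parse/2` function are hypothetical names made up for this illustration; they are not FSS's actual API. What matters is what is absent: there are no read or write functions, so the consumer (for example, a native library) decides how to access the file.

```elixir
defmodule SpecOnly.S3Entry do
  @moduledoc """
  Hypothetical illustration (not the real FSS API): a struct that only
  describes where a file lives and which credentials to use.
  It deliberately performs no I/O.
  """

  @enforce_keys [:bucket, :key]
  defstruct [:bucket, :key, :region, :access_key_id, :secret_access_key]

  # Options that are copied into the struct; anything else is ignored.
  @credential_keys [:region, :access_key_id, :secret_access_key]

  @doc "Parses an `s3://bucket/key` URI into a spec struct without touching the network."
  def parse(uri, opts \\ []) when is_binary(uri) do
    with "s3://" <> rest <- uri,
         [bucket, key] when bucket != "" and key != "" <-
           String.split(rest, "/", parts: 2) do
      fields = [bucket: bucket, key: key] ++ Keyword.take(opts, @credential_keys)
      {:ok, struct!(__MODULE__, fields)}
    else
      # Wrong scheme, missing bucket, or missing key.
      _ -> {:error, :invalid_uri}
    end
  end
end

# Usage: the resulting struct is plain data that can be handed to whatever
# component actually opens the file.
{:ok, entry} =
  SpecOnly.S3Entry.parse("s3://my-bucket/data/events.parquet", region: "us-east-1")
```

Because the result is plain data, it can be serialized and passed across a boundary (such as into native code) without the receiving side ever calling back into Elixir.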

josevalim commented 6 months ago

I would also add that, because of Erlang's requirements for resiliency and process scheduling, it is not as easy to call Erlang from C/Rust as it is to call Python. This means that, if you want to integrate Iceberg into Explorer, you most likely need to do it in the underlying Polars library, because Polars cannot simply call Elixir to perform a file system operation. We did discuss having a process-based abstraction, which would make this possible, but it is not implemented yet.

pcapel commented 6 months ago

Ah! That's a really good point! In that case, it seems I should focus on helping with the features I want in the Rust Iceberg project and then create bindings for Elixir. I'm actually surprised I didn't think of that first. I got a little tunnel vision I guess.

Thanks for the clear thinking here. That's really helpful!