Does JuliaDB support distributed filesystems like HDFS or AWS S3?

schlichtanders commented 4 years ago

Thanks for this awesome package!

It looks like JuliaDB can be put to work across a distributed cluster of compute nodes and is not limited to a single-machine setup.

Having such a system, it would be natural to also load data in a distributed setting, e.g.

from all the different local filesystems
from HDFS directly
or from AWS S3, for instance

Take S3, for example. It would be really nice if we could load data from an S3 folder containing thousands of large parquet files by downloading them onto separate processors, so that the data is spread over the cluster and we end up with a single JuliaDB table that encompasses all of it.

With HDFS, one can imagine running JuliaDB on top of an AWS EMR cluster, where HDFS is already installed, and wanting to load an HDFS folder with thousands of large parquet files. It would be awesome if JuliaDB could communicate with HDFS (e.g. via Elly.jl) and load the data from it in a distributed manner.

I guess the S3 example is the most flexible one: an intelligent distributed download straight into a distributed JuliaDB table.
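
To make the request concrete, here is a rough sketch of the pattern, using only Julia's Distributed standard library. It is not JuliaDB or AWSS3.jl code: `fetch_and_parse` and the object keys are hypothetical stand-ins, and the "download" just fabricates a few rows so that the sketch runs offline. The point is only that each file is handled by whichever worker is free, so no single machine has to pull everything.

```julia
using Distributed

# Start two local workers to stand in for cluster nodes. (Assumption: on a
# real cluster the workers would be attached through a cluster manager.)
nprocs() == 1 && addprocs(2)

# Make myid() available by name on every process.
@everywhere using Distributed

@everywhere begin
    # Hypothetical stand-in for "download one object and parse it".
    # A real version would fetch `key` from S3 or HDFS and parse the file;
    # this one fabricates three rows and records which worker handled it.
    function fetch_and_parse(key::String)
        return [(file = key, row = i, worker = myid()) for i in 1:3]
    end
end

# Hypothetical object keys; in practice they would come from listing the bucket.
keys_to_load = ["data/part-$(i).parquet" for i in 1:6]

# pmap hands each key to the next free worker, spreading the downloads out.
chunks = pmap(fetch_and_parse, keys_to_load)

# For illustration only: gather all rows on the master process.
# A distributed table would keep each chunk on the worker that loaded it.
rows = reduce(vcat, chunks)
println(length(rows), " rows loaded from ", length(chunks), " files")
```

What is missing for the feature asked about here is the last step: instead of gathering the chunks on one process, they would have to stay on their workers and be wrapped into one distributed table.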

Is something like this already possible with JuliaDB or on the roadmap? Thanks a lot

schlichtanders commented 3 years ago

Is there any plan to support distributed file systems like HDFS, or cloud storage (e.g. S3)?

Maybe this goes hand in hand with supporting a more flexible loadtable function. Could this already be it?

jpsamaroo commented 3 years ago

It'd be a welcome addition, but I don't think anyone is actively looking at adding such a feature to JuliaDB, so you'll probably need to implement it yourself.

schlichtanders commented 3 years ago

Thank you so much for your reply. For me, too, it is a question of time. I have put it on my list of possible next Julia projects.

Do you know whether there is a roadmap for JuliaDB? Like what is going to be implemented next?

jpsamaroo commented 3 years ago

JuliaDB is basically in maintenance mode right now. If there's going to be a future for this package, it will be because the community decides to make it so. The original developers appear to have moved on to other things and are unlikely to commit to large refactorings and feature additions, but they would probably help with PR review.

mahiki commented 2 years ago

I think it should be possible to use AWSS3.jl to provide abstract file paths into S3 for loading the data into the DB.

Though I haven't used JuliaDB yet, it seems like a great tool as a big data analytics engine. The serialized output format is the main limitation right now. Until JuliaDB can write to a standard format like Arrow or Parquet, the DB artifacts should be considered a temporary workspace that gets regenerated from static files.

In this way JuliaDB can be used as an (extremely inexpensive) MPP DB engine like Redshift Spectrum or Spark. The "DB" is the stored files, which so far cannot be written by JuliaDB. Yeah, now that I have written this down, this is a huge missing feature.

JuliaData / JuliaDB.jl, issue #328