apache / datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine
https://datafusion.apache.org/ballista
Apache License 2.0

Ballista context should get file metadata from scheduler, not from local disk #22

Open andygrove opened 3 years ago

andygrove commented 3 years ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I have a Ballista cluster running, and each scheduler and executor has access to TPC-H data locally. I am running the benchmark client on my desktop, and I do not have access to the data locally. Query planning fails with "file not found" because BallistaContext::read_parquet is looking for the file on the local file system when it should be getting the file metadata from a scheduler in the cluster.

Describe the solution you'd like

The context should send a gRPC request to the scheduler to get the necessary metadata.

Describe alternatives you've considered

None

Additional context

None

rdettai commented 3 years ago

@andygrove since the client only handles the logical plan, I think it does not need to know the list of files or the statistics; it only needs the schema.

Since Flight already has an endpoint for querying the schema, reusing it would avoid creating and maintaining a new one 😃
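
To illustrate the point that planning only needs a schema, here is a minimal sketch using the standard library only. Everything in it is hypothetical and is not Ballista's API: `SchemaSource`, `FakeScheduler`, `ClientContext`, `plan_scan`, and the sample path are made up for illustration, and the schema is modelled as plain (column name, type name) pairs rather than an Arrow schema. A real client would replace the in-memory lookup with a remote call to the scheduler.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Arrow schema: (column name, type name) pairs.
type Schema = Vec<(String, String)>;

/// Hypothetical abstraction over where the client obtains a schema.
trait SchemaSource {
    fn schema_for(&self, path: &str) -> Result<Schema, String>;
}

/// Fake scheduler that answers schema requests from an in-memory map.
/// A real client would issue a remote request here instead.
struct FakeScheduler {
    tables: HashMap<String, Schema>,
}

impl SchemaSource for FakeScheduler {
    fn schema_for(&self, path: &str) -> Result<Schema, String> {
        self.tables
            .get(path)
            .cloned()
            .ok_or_else(|| format!("file not found on cluster: {}", path))
    }
}

/// Hypothetical client-side context: it never touches the local disk.
struct ClientContext<S: SchemaSource> {
    source: S,
}

impl<S: SchemaSource> ClientContext<S> {
    /// Returns the column names a scan of `path` would expose to the planner.
    /// No file list or statistics are requested, only the schema.
    fn plan_scan(&self, path: &str) -> Result<Vec<String>, String> {
        let schema = self.source.schema_for(path)?;
        Ok(schema.into_iter().map(|(name, _)| name).collect())
    }
}

/// Builds a context backed by a fake scheduler that knows one sample table.
fn demo_context() -> ClientContext<FakeScheduler> {
    let mut tables = HashMap::new();
    tables.insert(
        "/data/tpch/nation.parquet".to_string(), // hypothetical path
        vec![
            ("n_nationkey".to_string(), "Int64".to_string()),
            ("n_name".to_string(), "Utf8".to_string()),
        ],
    );
    ClientContext {
        source: FakeScheduler { tables },
    }
}
```

Calling `demo_context().plan_scan("/data/tpch/nation.parquet")` yields the two column names, while an unknown path returns an error from the (fake) cluster rather than from the local file system.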

yahoNanJing commented 2 years ago

Hi @andygrove, we have integrated Ballista with HDFS support. Our workaround is to make file paths self-describing. For example, a local file path should be file://tmp/..., and an HDFS file path should be hdfs://localhost:xxx:/tmp/...

To make it work, we also changed the object store API a bit. Later I'll create a PR for this.
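
The self-describing path idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the change described in the comment above: `StoreKind` and `resolve_store` are hypothetical names, and treating a path without a scheme as local is an assumption made for the example.

```rust
/// Hypothetical set of storage backends a path can select.
#[derive(Debug, PartialEq)]
enum StoreKind {
    LocalFile,
    Hdfs,
}

/// Picks a backend from the URI scheme and returns it together with the
/// remainder of the path (everything after "://").
fn resolve_store(uri: &str) -> Result<(StoreKind, &str), String> {
    match uri.split_once("://") {
        Some(("file", rest)) => Ok((StoreKind::LocalFile, rest)),
        Some(("hdfs", rest)) => Ok((StoreKind::Hdfs, rest)),
        Some((other, _)) => Err(format!("unsupported scheme: {}", other)),
        // Assumption: a bare path with no scheme is treated as a local file.
        None => Ok((StoreKind::LocalFile, uri)),
    }
}
```

With this shape, the scheme alone decides which object store handles the path, so the client, scheduler, and executors can all interpret the same string consistently.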

avantgardnerio commented 2 years ago

> Later I'll create a PR for this.

@yahoNanJing this intersects with work I'm currently doing, so anything you could share would be helpful!