apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Implement `DynamicTableProvider` in DataFusion Core #10986

Closed (goldmedal closed this 2 weeks ago)

goldmedal commented 3 months ago

Is your feature request related to a problem or challenge?

I had some discussions with @alamb about supporting a dynamic file data source (select ... from 'data.parquet', like #4805) in the core, as mentioned in https://github.com/apache/datafusion/issues/4850#issuecomment-2142190951. However, after #10745 we concluded that moving so many dependencies (e.g., the S3-related ones) into the core crate is not a good idea.

Describe the solution you'd like

As @alamb proposed in https://github.com/apache/datafusion/pull/10745#issuecomment-2175817937, we can focus first on the logic that interprets table names as potential object store locations. To do that, implement a struct DynamicTableProvider and a trait called UrlLookup that returns the matching ObjectStore at runtime.

struct DynamicTableProvider {
  // ...
  /// A callback used to look up the object store for a given URL
  url_lookup: Arc<dyn UrlLookup>
}

/// Trait for looking up the correct object store instance based on URL
pub trait UrlLookup {
  fn lookup(&self, url: &Url) -> Result<Arc<dyn ObjectStore>>;
}

By default, DynamicTableProvider only supports querying local file paths such as file:///.... The dynamic file queries in datafusion-cli could also be built on DynamicTableProvider, but would load the common object storage dependencies by default.
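To make the default behavior concrete, here is a minimal sketch of a lookup that resolves only file:// URLs. It is not the implementation from the PR. To keep it dependency-free, ObjectStore and LocalFileSystem are stand-ins for the object_store types, the URL is a plain &str instead of url::Url, the error type is String, and LocalOnlyLookup is a hypothetical name.

```rust
use std::fmt::Debug;
use std::sync::Arc;

/// Stand-in for `object_store::ObjectStore` (assumption: reduced to one method).
pub trait ObjectStore: Debug + Send + Sync {
    /// Short label identifying the backing store.
    fn name(&self) -> &str;
}

/// Stand-in for a local filesystem store.
#[derive(Debug)]
pub struct LocalFileSystem;

impl ObjectStore for LocalFileSystem {
    fn name(&self) -> &str {
        "local"
    }
}

/// The proposed trait, simplified: `&str` replaces `url::Url`
/// and `String` replaces the real error type.
pub trait UrlLookup {
    fn lookup(&self, url: &str) -> Result<Arc<dyn ObjectStore>, String>;
}

/// Hypothetical default lookup: resolves `file://` URLs only.
pub struct LocalOnlyLookup;

impl UrlLookup for LocalOnlyLookup {
    fn lookup(&self, url: &str) -> Result<Arc<dyn ObjectStore>, String> {
        // Split "scheme://rest" and dispatch on the scheme.
        match url.split_once("://") {
            Some(("file", _)) => {
                let store: Arc<dyn ObjectStore> = Arc::new(LocalFileSystem);
                Ok(store)
            }
            Some((scheme, _)) => Err(format!("unsupported scheme: {scheme}")),
            None => Err(format!("not a URL: {url}")),
        }
    }
}
```

With this shape, datafusion-cli could supply its own UrlLookup implementation that also maps remote schemes to object stores, while the core crate stays free of those dependencies.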

Describe alternatives you've considered

No response

Additional context

No response

goldmedal commented 3 months ago

take

alamb commented 3 months ago

Thank you @goldmedal

goldmedal commented 3 months ago

Hi @alamb,

I created a draft PR for this issue in #11035. After some experiments, I think passing only an ObjectStore isn't enough to create a TableProvider at runtime: building the schema requires a full SessionState.

There are still many issues to fix, but could you take a look at the PR when you're available and check whether the idea makes sense?

Thanks.

goldmedal commented 3 months ago

I have finished the PR, but I think two follow-up issues need to be filed: