apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

No suitable object store found for *** #2136

Open ZhangqyTJ opened 2 years ago

ZhangqyTJ commented 2 years ago

I found this issue while trying to add an s3-objectstore: I use the ctx.runtime_env().register_object_store() method to register the objectstore with the SessionContext, but then I can't retrieve the objectstore through get_by_uri() in the Scheduler. Exception: No suitable object store found for ***
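To make the failure mode concrete, here is a minimal, self-contained Rust model of a scheme-keyed registry in the style of ObjectStoreRegistry. This is an illustration, not DataFusion's actual code; the key point is that each process owns its own registry, so registering a store in the client does not make it visible to a scheduler running in another process.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Stand-in for the ObjectStore trait; the real one has read/list methods.
trait ObjectStore: Send + Sync {}

struct S3FileSystem;
impl ObjectStore for S3FileSystem {}

// A minimal model of an ObjectStoreRegistry: stores are keyed by URI scheme.
#[derive(Default)]
struct ObjectStoreRegistry {
    object_stores: RwLock<HashMap<String, Arc<dyn ObjectStore>>>,
}

impl ObjectStoreRegistry {
    fn register_object_store(&self, scheme: &str, store: Arc<dyn ObjectStore>) {
        self.object_stores
            .write()
            .unwrap()
            .insert(scheme.to_string(), store);
    }

    // get_by_uri extracts the scheme and looks it up; an unknown scheme
    // yields the "No suitable object store found" error from this issue.
    fn get_by_uri(&self, uri: &str) -> Result<Arc<dyn ObjectStore>, String> {
        let scheme = uri.split("://").next().unwrap_or("file");
        self.object_stores
            .read()
            .unwrap()
            .get(scheme)
            .cloned()
            .ok_or_else(|| format!("No suitable object store found for {scheme}"))
    }
}

fn main() {
    let registry = ObjectStoreRegistry::default();
    registry.register_object_store("s3", Arc::new(S3FileSystem));
    // Succeeds in this process...
    assert!(registry.get_by_uri("s3://bucket/key").is_ok());
    // ...but a second registry in another process (scheduler/executor)
    // knows nothing about this registration.
    let scheduler_registry = ObjectStoreRegistry::default();
    assert!(scheduler_registry.get_by_uri("s3://bucket/key").is_err());
}
```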

This exception can also be reproduced with LocalFileSystem. Reproduction: make ObjectStoreRegistry not register LocalFileSystem when it is created, and instead register LocalFileSystem into ObjectStoreRegistry.object_stores through the register_object_store() method; then run the tpch benchmark for ballista with a path parameter of the form file://***/***

Steps to reproduce:

  1. Delete line 58 of object_store_registry.rs: map.insert(LOCAL_SCHEME.to_string(), Arc::new(LocalFileSystem));
  2. Delete lines 243 and 244 of ballista/rust/client/src/context.rs and add ctx.runtime_env().register_object_store("file", Arc::new(LocalFileSystem{})); at line 245.
  3. Following the "Ballista Benchmarks" section of benchmarks/README.md, start the Scheduler and Executor and run the benchmark command cargo run --bin tpch --release benchmark ballista ** --path file://**. The following exception is thrown (see the note after this list on the scheme named at the end): [2022-04-02T05:45:26Z ERROR ballista_scheduler::scheduler_server::grpc] Could not parse logical plan protobuf: Not implemented: No object store is registered for path file://E:/datafusion/lineitem.tbl: Internal ("No suitable object store found for file")
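Note how the exception in step 3 ends with the bare scheme, file: the registry lookup keys on the URI scheme alone. A tiny illustration (plain string splitting here; the real parsing may differ):

```rust
fn main() {
    let path = "file://E:/datafusion/lineitem.tbl";
    // The lookup key is just the URI scheme, which is why the error says
    // "No suitable object store found for file".
    let scheme = path.split("://").next().unwrap();
    assert_eq!(scheme, "file");
}
```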
mingmwang commented 2 years ago

Did you modify the Ballista source code to try to register the s3-objectstore? You need to register the objectstore with the runtime env in multiple components: the Ballista Client, the Ballista Scheduler, and the Ballista Executor.

ZhangqyTJ commented 2 years ago

> Did you modify the Ballista source code to try to register the s3-objectstore? You need to register the objectstore with the runtime env in multiple components: the Ballista Client, the Ballista Scheduler, and the Ballista Executor.

I think the correct logic is: use sessionContext.runtime_env().register_object_store("s3", Arc::new(S3FileSystem{})) to register the objectstore once, and have the Scheduler and Executor obtain the serialized s3-objectstore over the network, instead of writing s3-objectstore registration code in both the Scheduler and the Executor. If you register in each component, the objectstore's parameters still need to be passed over the network anyway. @mingmwang
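What "serialized over the network" could look like in practice: the store instance itself (a trait object holding clients and credentials) is not serializable, but a plain config describing how to rebuild it is. A hedged sketch, with illustrative field names and serde/serde_json assumed as dependencies:

```rust
use serde::{Deserialize, Serialize};

// Hypothetical: a plain, serializable description of an object store.
// Field names are illustrative only, not an agreed schema.
#[derive(Serialize, Deserialize)]
struct ObjectStoreConfig {
    scheme: String, // e.g. "s3"
    endpoint: String,
    access_key_id: String,
    secret_access_key: String,
}

fn main() -> Result<(), serde_json::Error> {
    let cfg = ObjectStoreConfig {
        scheme: "s3".into(),
        endpoint: "http://localhost:9000".into(),
        access_key_id: "user".into(),
        secret_access_key: "secret".into(),
    };
    // The client could ship this alongside the plan; scheduler and
    // executor would deserialize it and construct their own store.
    let wire = serde_json::to_string(&cfg)?;
    let _rebuilt: ObjectStoreConfig = serde_json::from_str(&wire)?;
    Ok(())
}
```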

mingmwang commented 2 years ago
> > Did you modify the Ballista source code to try to register the s3-objectstore? You need to register the objectstore with the runtime env in multiple components: the Ballista Client, the Ballista Scheduler, and the Ballista Executor.
>
> I think the correct logic is: use sessionContext.runtime_env().register_object_store("s3", Arc::new(S3FileSystem{})) to register the objectstore once, and have the Scheduler and Executor obtain the serialized s3-objectstore over the network, instead of writing s3-objectstore registration code in both the Scheduler and the Executor. If you register in each component, the objectstore's parameters still need to be passed over the network anyway. @mingmwang

Yes, ideally the objectstore would only need to be registered once. But currently there is no general serialization API for the objectstore; unlike in JVM languages, the store instance itself cannot simply be serialized and shipped.

ZhangqyTJ commented 2 years ago
> > > Did you modify the Ballista source code to try to register the s3-objectstore? You need to register the objectstore with the runtime env in multiple components: the Ballista Client, the Ballista Scheduler, and the Ballista Executor.
> >
> > I think the correct logic is: use sessionContext.runtime_env().register_object_store("s3", Arc::new(S3FileSystem{})) to register the objectstore once, and have the Scheduler and Executor obtain the serialized s3-objectstore over the network, instead of writing s3-objectstore registration code in both the Scheduler and the Executor. If you register in each component, the objectstore's parameters still need to be passed over the network anyway. @mingmwang
>
> Yes, ideally the objectstore would only need to be registered once. But currently there is no general serialization API for the objectstore; unlike in JVM languages, the store instance itself cannot simply be serialized and shipped.

What is the correct way, then? Even if the objectstore registration code is written into the Scheduler and the Executor, how should I pass parameters to them, for example the endpoint, username, and password? Will the author team improve this part of the code? @mingmwang
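For completeness, one pragmatic way to get such parameters into separately deployed scheduler and executor processes is environment variables read at startup. A hedged sketch: S3_ENDPOINT is an illustrative name, not one Ballista defines, while the AWS_* names follow the usual AWS convention.

```rust
// Credentials a custom scheduler/executor entrypoint could read before
// registering its own s3 store.
struct S3Credentials {
    endpoint: String,
    access_key_id: String,
    secret_access_key: String,
}

fn s3_credentials_from_env() -> Option<S3Credentials> {
    Some(S3Credentials {
        endpoint: std::env::var("S3_ENDPOINT").ok()?,
        access_key_id: std::env::var("AWS_ACCESS_KEY_ID").ok()?,
        secret_access_key: std::env::var("AWS_SECRET_ACCESS_KEY").ok()?,
    })
}

fn main() {
    match s3_credentials_from_env() {
        Some(c) => println!("would register s3 store against {}", c.endpoint),
        None => eprintln!("missing S3 credentials in environment"),
    }
}
```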

yahoNanJing commented 2 years ago

Hi @ZhangqyTJ, my perspective on the object store for a standalone system is that we should make the ObjectStoreRegistry a system-level one rather than the session-level one currently implemented. And we should not register object stores manually; instead, they should be registered by default. Therefore, we proposed #2111 to introduce object stores as optional features in the datafusion core. Taking HDFS as an example, the detailed implementation will load the related configuration from configuration files under specific locations, which may also work for s3, e.g. putting and loading your configuration from ~/.datafusion/object_stores/s3.json. @matthewmturner
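A sketch of what such a startup-time loader might look like, assuming the proposed per-scheme layout. The path convention comes from the comment above, but the loading code and any schema are hypothetical:

```rust
use std::path::PathBuf;

// Hypothetical loader for the proposed per-scheme config files, e.g.
// ~/.datafusion/object_stores/s3.json.
fn object_store_config_path(scheme: &str) -> PathBuf {
    let home = std::env::var("HOME").unwrap_or_else(|_| ".".into());
    PathBuf::from(home)
        .join(".datafusion")
        .join("object_stores")
        .join(format!("{scheme}.json"))
}

fn main() {
    let path = object_store_config_path("s3");
    match std::fs::read_to_string(&path) {
        Ok(raw) => println!("loaded config: {raw}"),
        Err(_) => println!("no config at {}", path.display()),
    }
}
```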

In the future, we should refine the current implementation in the following respects:

  • Extract the ObjectStoreRegistry as a system-level property of both the Scheduler and the Executor.
  • Send the object-store-related configurations from the Scheduler to the Executors during executor registration, so that we don't need to put configuration files on every executor node.

thinkharderdev commented 2 years ago

Hi @ZhangqyTJ, as @yahoNanJing mentioned, the current implementation assumes that the ObjectStore will be registered on both the scheduler and the executor, and the only way to do that currently is to perform the registration manually by creating your own entrypoint for those services. That is not really ideal, and there is an active discussion in #2111 about the best way to proceed. We can serialize parameters in the protobuf message, but for the most part an extension ObjectStore will not be entirely serializable; at a minimum it requires a dependency that is either included in the compiled binary or loaded through a plugin mechanism (similar to what is being done for UDF/UDAF in #1881).
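A sketch of the shape such a plugin mechanism might take: parameters travel as plain serializable data, and each component holds compiled-in (or dynamically loaded) factories that rebuild the store locally. All names here are hypothetical, not an API from #2111 or #1881:

```rust
use std::collections::HashMap;
use std::sync::Arc;

trait ObjectStore: Send + Sync {}

struct S3FileSystem;
impl ObjectStore for S3FileSystem {}

// Hypothetical provider: a factory that can rebuild a store for its
// scheme from parameters carried in the plan protobuf.
trait ObjectStoreProvider: Send + Sync {
    fn scheme(&self) -> &str;
    fn build(&self, params: &HashMap<String, String>) -> Arc<dyn ObjectStore>;
}

struct S3Provider;

impl ObjectStoreProvider for S3Provider {
    fn scheme(&self) -> &str {
        "s3"
    }
    fn build(&self, _params: &HashMap<String, String>) -> Arc<dyn ObjectStore> {
        // A real provider would use endpoint/credentials from `_params`.
        Arc::new(S3FileSystem)
    }
}

fn main() {
    // Each component (scheduler, executor) registers the providers it was
    // compiled with; serialized params select and configure one at runtime.
    let providers: Vec<Box<dyn ObjectStoreProvider>> = vec![Box::new(S3Provider)];
    let params =
        HashMap::from([("endpoint".to_string(), "http://localhost:9000".to_string())]);
    for p in &providers {
        let _store = p.build(&params);
        println!("built store for scheme {}", p.scheme());
    }
}
```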

ZhangqyTJ commented 2 years ago

> Hi @ZhangqyTJ, my perspective on the object store for a standalone system is that we should make the ObjectStoreRegistry a system-level one rather than the session-level one currently implemented. And we should not register object stores manually; instead, they should be registered by default. Therefore, we proposed #2111 to introduce object stores as optional features in the datafusion core. Taking HDFS as an example, the detailed implementation will load the related configuration from configuration files under specific locations, which may also work for s3, e.g. putting and loading your configuration from ~/.datafusion/object_stores/s3.json. @matthewmturner
>
> In the future, we should refine the current implementation in the following respects:
>
> • Extract the ObjectStoreRegistry as a system-level property of both the Scheduler and the Executor.
> • Send the object-store-related configurations from the Scheduler to the Executors during executor registration, so that we don't need to put configuration files on every executor node.

ok