Open ZhangqyTJ opened 2 years ago
Do you modify the Ballista source code and try to register s3-objectstore ? You need to register the objectstore to runtime env in multiple components: Ballista Client, Ballista Scheduler and Ballista Executor.
2. register_object_store
Do you modify the Ballista source code and try to register s3-objectstore ? You need to register the objectstore to runtime env in multiple components: Ballista Client, Ballista Scheduler and Ballista Executor.
I think the correct logic is: use the code "sessionContext.runtime_env().register_object_store("s3",Arc::new(S3FileSystem{}));" to register the objectstore, Scheduler and Executor to obtain the serialized s3-objectstore through the network , instead of writing the code for registering s3-objectstore in both Scheduler and Executor. If you do that, the parameters of the objectstore also need to be passed over the network. @mingmwang
- register_object_store
Do you modify the Ballista source code and try to register s3-objectstore ? You need to register the objectstore to runtime env in multiple components: Ballista Client, Ballista Scheduler and Ballista Executor.
I think the correct logic is: use the code "sessionContext.runtime_env().register_object_store("s3",Arc::new(S3FileSystem{}));" to register the objectstore, Scheduler and Executor to obtain the serialized s3-objectstore through the network , instead of writing the code for registering s3-objectstore in both Scheduler and Executor. If you do that, the parameters of the objectstore also need to be passed over the network. @mingmwang
Yes, ideally the objectstore just need to be registered once. But currently since there is no general serialization API for the objectstore, this is unlike JVM languages.
- register_object_store
Do you modify the Ballista source code and try to register s3-objectstore ? You need to register the objectstore to runtime env in multiple components: Ballista Client, Ballista Scheduler and Ballista Executor.
I think the correct logic is: use the code "sessionContext.runtime_env().register_object_store("s3",Arc::new(S3FileSystem{}));" to register the objectstore, Scheduler and Executor to obtain the serialized s3-objectstore through the network , instead of writing the code for registering s3-objectstore in both Scheduler and Executor. If you do that, the parameters of the objectstore also need to be passed over the network. @mingmwang
Yes, ideally the objectstore just need to be registered once. But currently since there is no general serialization API for the objectstore, this is unlike JVM languages.
What is the correct way? Even if the method of registering objectstore is written in Scheduler and Executor, how should I pass parameters to Scheduler and Executor? For example: endpoint, username, password.Will the author team improve this part of the code? @mingmwang
Hi @ZhangqyTJ, my perspective for the object store for a standalone system is that we should leverage the ObjectStoreRegistry as a system-level one rather than the session-level one currently implemented. And we should not do the manually registration for object stores. Instead, we should register them by default. Therefore, we proposed #2111 to introduce object stores as optional features in the datafusion core. For HDFS as an example, the detailed implementation will load related configurations from configuration files under specific locations, which may also work for s3. For example, to put and load your configurations from ~/.datafusion/object_stores/s3.json. @matthewmturner
In the future, we should refine the current implementation from the following aspects:
Hi @ZhangqyTJ as @yahoNanJing mentioned, the current implementation assumes that the ObjectStore
will be registered on both the scheduler and executor and the only way to do that currently is to manually do the registration by creating your own entrypoint for those services. But that is not really ideal and there is an active discussion in #2111 on what the best way to proceed on that is. We can serialize parameters in the protobuf message but for the most part an extension ObjectStore
will not be entirely serializable and at minimum require a dependency that is either included in the compiled binary or loaded through a plugin mechanism (similar to what is being done for UDF/UDAF in #1881)
Hi @ZhangqyTJ, my perspective for the object store for a standalone system is that we should leverage the ObjectStoreRegistry as a system-level one rather than the session-level one currently implemented. And we should not do the manually registration for object stores. Instead, we should register them by default. Therefore, we proposed #2111 to introduce object stores as optional features in the datafusion core. For HDFS as an example, the detailed implementation will load related configurations from configuration files under specific locations, which may also work for s3. For example, to put and load your configurations from ~/.datafusion/object_stores/s3.json. @matthewmturner
In the future, we should refine the current implementation from the following aspects:
- Extract the ObjectStoreRegistry as a system-level property for both of the Scheduler and the Executor.
- Send the object store related configurations from the Scheduler to the Executors when doing the executor registration so that we don't need to put configuration files to every executor node.
ok
I found this issue when trying to add s3-objectstore: I use the
ctx.runtime_env().register_object_store()
method to register the objectstore into the SessionContext, but I can't get the objectstore throughget_by_uri()
in the Scheduler. Exception: No suitable object store found for ***This exception can be reproduced using LocalFileSystem. Reproduced method: ObjectStoreRegistry does not store LocalFileSystem when it is created, and registers LocalFileSystem in ObjectStoreRegistry.object_stores through the register_object_store() method. Run tpch-benchmark-ballista, the path parameter format is
file://***/***
step
map.insert(LOCAL_SCHEME.to_string(), Arc::new(LocalFileSystem));
"ctx.runtime_env().register_object_store("file",Arc::new(LocalFileSystem{}));
" in line 245 under