data-engineering-collective / minimalkv

A minimal key-value store interface for binary data (maintained fork of simplekv).
https://minimalkv.readthedocs.io/en/latest/index.html
Other
17 stars 9 forks source link

Support registering new backends #45

Open janjagusch opened 2 years ago

janjagusch commented 2 years ago

As a user of the minimalkv framework, I want to create and register new backends and use them through the get_store_from_url function without changing the minimalkv library. This is currently not possible, as the extract_params function hard-codes the known storage types:

https://github.com/data-engineering-collective/minimalkv/blob/main/minimalkv/_urls.py#L70-L122

What I'm imagining is a registration function that makes minimalkv aware of this new storage type, similarly to how fsspec does it.

janjagusch commented 2 years ago

Having looked at the code more thoroughly, I think we should:

  1. Add integration tests for get_store_from_url
  2. Add a classmethod from_url(cls, url: str) -> ""KeyValueStore" to the definition of KeyValueStore. Alternatively, we could add a classmethod from_parsed_url(cls, scheme, host, port, path, query, userinfo) -> "KeyValueStore", similar to extract_params.
  3. get_store_from_url then only looks up the schema and delegates the rest of the logic to from_(parsed)_url
  4. Finally, we mirror what fsspec does and allow registering additional stores. Registration would happen through an entry point in the setup.py.

Apart from now being able to register additional stores, this should also clean up _urls.py, _get_store.py, _store_creation.py, and _store_decoration.py.

FYI: @SimonBohnenQC

simonbohnen commented 1 year ago

This concept sounds much better than the current structure 👍🏼

As authentication was an issue in #51, I would suggest testing get_store_from_url against live AWS / GCS. Can we provide credentials as GitHub secrets to do this @janjagusch?

simonbohnen commented 1 year ago

We would probably not be using create_store, url2dict, and get_store internally anymore. We should probably still keep them available, right? Deprecating them seems like the best choice.

xhochy commented 1 year ago

I would love to keep create_store as this gives a way to pass parameters where there can also be more schema validation upfront. A typical use case is to have the store configuration in a JSON/YAML and then use create_store(json.load(…)).

simonbohnen commented 1 year ago

I would love to keep create_store as this gives a way to pass parameters where there can also be more schema validation upfront. A typical use case is to have the store configuration in a JSON/YAML and then use create_store(json.load(…)).

I see the point of creating stores only from str or int parameters. I would also prefer if store objects could be created without relying on an internet connection or successful authentication. Thus, I'd like to move e.g. the create_store_azure code to a constructor for the AzureBlockBlobStore class.

What do you mean by "schema validation" in this context?

xhochy commented 1 year ago

With "schema validation", I mean that you can specify a type schema for JSON/YAML files and validate that they have the correct values before creating the stores themselves. This has been useful in the past in using the stores in some settings to check whether the configuration is valid before deploying it to production.