audienceproject / spark-dynamodb

Plug-and-play implementation of an Apache Spark custom data source for AWS DynamoDB.
Apache License 2.0

Query Functionality #62

Open · fogrid opened this issue 4 years ago

fogrid commented 4 years ago

In the article published about the project, it was said that "future improvements could see the query operation be given a part to play as well".

Is adding query functionality to this project on the roadmap for the near future? Thanks

jacobfi commented 4 years ago

Hi fogrid. Unfortunately, we are not actively working on any improvements at the moment. The idea behind this improvement would be to reduce consumed read throughput whenever filters present in the Spark query can be translated into conditions on the hash and/or range key. Do you currently have such a use case? Thank you for your interest in the project :)
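For illustration, this is the kind of read such a pushdown would target: an equality filter on the table's hash key that could in principle be translated into a DynamoDB query condition instead of being applied on top of a full scan. The table and column names below are made up; the read syntax follows the project README.

```scala
import com.audienceproject.spark.dynamodb.implicits._
import spark.implicits._

// Hypothetical table "Events" whose hash key is "userId".
val events = spark.read.dynamodb("Events")

// Without query pushdown, this filter is evaluated on top of a full table
// Scan; with it, the predicate could become a Query on the hash key.
val userEvents = events.filter($"userId" === "user-123")
```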

fogrid commented 4 years ago

Yeah, I want to get a small subset of keys from a very big table. What would adding this functionality entail?

jacobfi commented 4 years ago

Hi fogrid. I imagine it would entail implementing a QueryPartition class similar to the existing ScanPartition, with the DynamoDB query operation implemented alongside the other API calls in TableConnector. This QueryPartition would then be used in place of ScanPartition in the planInputPartitions method of DynamoDataSourceReader, based on some analysis of the Spark query schema and the Dynamo table's key schema to determine whether a query would be applicable (and expedient) to use.

You are most welcome to contribute :) However, if you need only a small subset of data from your table, it would probably be less work for you to query Dynamo directly and put the data into a Spark DataFrame manually (and not use the library).

Thanks, Jacob
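For concreteness, here is a minimal sketch of what such a QueryPartition could look like, written against the Spark 2.4 DataSourceV2 interfaces that DynamoDataSourceReader was built on and the AWS SDK v1 document API. Apart from the class and method names mentioned above, everything here is an assumption, and the Item-to-row conversion is simplified to string columns only.

```scala
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.document.{DynamoDB, Item}
import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.reader.{InputPartition, InputPartitionReader}
import org.apache.spark.sql.types.StructType
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical: one partition per hash-key value extracted from the
// pushed-down Spark filters by planInputPartitions.
class QueryPartition(tableName: String,
                     hashKeyName: String,
                     hashKeyValue: String,
                     schema: StructType)
  extends InputPartition[InternalRow] {

  override def createPartitionReader(): InputPartitionReader[InternalRow] =
    new InputPartitionReader[InternalRow] {
      private val table =
        new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient()).getTable(tableName)

      // Issue a Query rather than a Scan: only items under this hash key are
      // read, so consumed read capacity scales with the result size.
      private val items =
        table.query(new QuerySpec().withHashKey(hashKeyName, hashKeyValue)).iterator()

      override def next(): Boolean = items.hasNext
      override def get(): InternalRow = toInternalRow(items.next())
      override def close(): Unit = ()

      // Simplified conversion that assumes all columns are strings; the real
      // ScanPartition performs full type mapping against the Spark schema.
      private def toInternalRow(item: Item): InternalRow =
        InternalRow.fromSeq(schema.fields.map(f => UTF8String.fromString(item.getString(f.name))))
    }
}
```

When the Spark filters also constrain the range key, a condition could presumably be attached to the QuerySpec in the same way via its range-key condition setter.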

jacobfi commented 4 years ago

I started working on this feature here: ee7c0f6. Can't promise anything, but at least now there is a place to track progress 😀

ff-parasp commented 4 years ago

Hi jacobfi,

Are you still actively working on adding support for query operation?

jacobfi commented 4 years ago

Hi ff-parasp. No, sadly I am not actively working on this library at the moment.

amrnablus commented 3 years ago

Hi, I can work on this. @jacobfi, mind if I continue from your feature branch?

jacobfi commented 3 years ago

Hi amrnablus. You are most welcome. However, be aware that the branch needs to be synced up with master, which underwent a lot of changes when the project was migrated to Spark 3. A rebase is probably the way to go: it isolates the query-related changes on the feature branch and replays them on top of the new master.

amrnablus commented 3 years ago

Ah, good point! I'll just start a new feature branch from master and copy over your changes manually; that should be easier. Thanks @jacobfi

talgos1 commented 3 years ago

Can I use the connector as the static DataFrame in a stream-static join (based on the key)? Would it be efficient?
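For reference, a minimal sketch of the pattern being asked about, with Spark's built-in rate source standing in for a real stream. The table and column names are made up, and the read syntax again follows the project README.

```scala
import org.apache.spark.sql.SparkSession
import com.audienceproject.spark.dynamodb.implicits._

val spark = SparkSession.builder().appName("stream-static-join").getOrCreate()
import spark.implicits._

// Static side: a hypothetical "Users" table loaded through this connector.
val users = spark.read.dynamodb("Users")

// Streaming side: the built-in rate source, mapped onto a made-up key column.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()
  .withColumn("userId", ($"value" % 100).cast("string"))

// Stream-static inner join on the key column.
val joined = events.join(users, Seq("userId"))

joined.writeStream.format("console").start().awaitTermination()
```

Note that the static side here is still read through the connector's Scan, which is part of why the query pushdown discussed in this issue would matter for such a pattern.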