A data marketplace is a collection of data sets for a specific vertical, e.g. genomics data.
In a single submission, a user could upload multiple flat files. This collection of files would constitute a single data set.
The metadata of a data set will be stored separately from the actual data, potentially in an RDBMS. Examples of metadata might include who submitted the data set, which flat files constitute the data set, which data marketplace the data set is associated with, etc.
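As a minimal sketch of what such a metadata record might contain, the fields described above could be modeled as follows. All field names here (dataset_id, marketplace, submitted_by, file_paths) are illustrative assumptions, not a settled schema.

```python
# Hypothetical sketch of a data-set metadata record, as it might map onto a
# row in an RDBMS. Field names and types are assumptions for illustration.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class DataSetMetadata:
    dataset_id: str            # primary key for the data set
    marketplace: str           # which data marketplace the data set belongs to
    submitted_by: str          # owner; only the owner may remove the data set
    submitted_at: datetime     # time of the original submission
    file_paths: List[str] = field(default_factory=list)  # flat files in the set


# Example record: one submission containing two flat files.
meta = DataSetMetadata(
    dataset_id="ds-001",
    marketplace="genomics",
    submitted_by="user-42",
    submitted_at=datetime(2024, 1, 1),
    file_paths=["samples.csv", "annotations.csv"],
)
print(meta.marketplace)
```

A relational table would normalize file_paths into a child table keyed by dataset_id; the flat dataclass is just the simplest shape of the same information.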
Once a data set has been added to a data marketplace, the owner of this data set can opt to remove it from the data marketplace at any time.
Questions to answer
[ ] What does a Spark solution look like for a single data marketplace, where all data sets have the same schema?
[ ] How does Spark handle multiple schemas?
[ ] How would Spark handle queries across multiple data marketplaces, where data marketplaces may be sharded across separate data stores?
[ ] What does a Spark query look like?
[ ] Can we persist references to subsets of data in a data marketplace, e.g. queries, pointers?
[ ] How does Spark handle sharding or cloning of a data store?
[ ] How does encryption fit into our solution? The specific encryption implementation doesn’t matter much at this point, but we should understand where it fits in and how we would begin implementing it.
[ ] After gaining a better understanding of our platform/solution, what are the pros/cons of Spark? What are some alternative solutions? How does Spark compare to alternative solutions?
[ ] Based on Tim’s experience with these technologies, what stands out as the largest concern?
[ ] Which pieces of tech or potential use cases stand out as the most interesting to Tim?
Data Storage
Context
Questions to answer