Develop a task that:
reads a MySQL dump file obtained from Minio. Note that we store compressed data on Minio, but it should be possible to directly read the compressed file in Spark (see the sketch after this list);
extracts some data from it by specifying the table and the columns we want to extract;
[Depends on #6] transforms the data to a custom format used by BenchFlow (Custom format reference). This reference uses a MySQL database; we should port the format to Cassandra, taking care of saving the data related to a benchmark_id in the same place. Some hints can be obtained from http://blog.rackspace.com/cassandra-by-example/;
loads the data in a Cassandra database.
Give a look at Spark SQL, which provides a powerful tool for performing ETL on data. A nice example can be found at the following link: http://chapeau.freevariable.com/2014/10/fedmsg-and-spark.html
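A minimal end-to-end sketch of these steps in PySpark follows. It assumes Spark is built with the Hadoop S3A module and has the spark-cassandra-connector on the classpath; the Minio endpoint, credentials, bucket, table, column, and keyspace names are all hypothetical placeholders.

```python
import re

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.sql.functions import lit

conf = SparkConf().setAppName("mysql-dump-to-cassandra")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Minio speaks the S3 protocol, so we point the Hadoop S3A client at it.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", "http://minio:9000")
hadoop_conf.set("fs.s3a.access.key", "MINIO_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "MINIO_SECRET_KEY")
hadoop_conf.set("fs.s3a.path.style.access", "true")

# Spark picks the gzip codec from the file extension, so the compressed
# dump can be read directly, without unpacking it first.
dump = sc.textFile("s3a://benchflow/dumps/mysql_dump.sql.gz")

# Keep only the INSERT statements of the table we want to extract.
inserts = dump.filter(lambda line: line.startswith("INSERT INTO `process`"))

def parse_insert(line):
    # Very naive value parsing, enough for single-row INSERT statements;
    # a real implementation needs a proper SQL-literal parser.
    values = re.search(r"VALUES \((.*)\);", line).group(1).split(",")
    return Row(process_id=values[0].strip(" '"), duration=int(values[1]))

df = sqlContext.createDataFrame(inserts.map(parse_insert))

# Tag every row with the benchmark_id, so that all the data related to
# one benchmark run ends up in the same place in Cassandra.
df = df.withColumn("benchmark_id", lit("hypothetical-benchmark-id"))

df.write.format("org.apache.spark.sql.cassandra") \
    .mode("append") \
    .options(table="process", keyspace="benchflow") \
    .save()
```

Packaging this as a library function that receives the table name, the columns, and the benchmark_id would already cover most of the pipeline, which is the simplification for users mentioned below.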
This task is a starting point to understand what we need to do in order to simplify this process for the users, by providing an already implemented library that does most of the work.
Something to evaluate:
Since Spark and Spark SQL have native support for csv files (as this nice example shows: http://stackoverflow.com/a/25366955), we can also think about performing the dumps in csv (a functionality that is largely supported by database dump tools) and working with them instead of sql dumps (see the sketch below). To do so, I would use the same collector that does the mysqldump.
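With csv dumps the extraction step becomes trivial, in the spirit of the linked answer. A short sketch, reusing the SparkContext, SQLContext, and Row import from the sketch above; the file path and columns are again hypothetical:

```python
csv = sc.textFile("s3a://benchflow/dumps/process.csv.gz")
header = csv.first()

rows = (csv.filter(lambda line: line != header)    # drop the header row
           .map(lambda line: line.split(","))      # naive split: no quoted fields
           .map(lambda v: Row(process_id=v[0], duration=int(v[1]))))

df = sqlContext.createDataFrame(rows)
df.registerTempTable("process")
sqlContext.sql("SELECT process_id, duration FROM process").show()
```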
Pay attention to:
reuse the code whenever possible, to simplify the implementation of similar transformations for different database schemas.
Structure the script so that it takes a JSON configuration file, and reads it to know the source tables and columns, and how to transform and map the data to a Cassandra table. Make the configuration file as simple as possible for the end user (a possible shape is sketched after this list).
Provide predefined functions to use for transforming the data (example: converting seconds to milliseconds). Also allow passing custom Python functions.
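To make the last two points concrete, here is a minimal sketch of a possible configuration file and transformation registry. Every name in it (keys, file name, function names) is an assumption for illustration, not an existing BenchFlow format:

```python
# etl_config.json -- a possible shape for the user-facing configuration
# (all keys and names are hypothetical):
#
# {
#   "source":      {"table": "process", "columns": ["process_id", "duration"]},
#   "destination": {"keyspace": "benchflow", "table": "process"},
#   "transformations": [
#     {"column": "duration", "function": "seconds_to_millis"}
#   ]
# }
import json

# Predefined, reusable transformations; users can also register
# their own Python callables under a custom name.
PREDEFINED = {
    "seconds_to_millis": lambda seconds: int(seconds * 1000),
    "lowercase": lambda value: value.lower(),
}

def resolve_transformation(name, custom_functions=None):
    """Look up a transformation by name, preferring user-supplied callables."""
    custom_functions = custom_functions or {}
    if name in custom_functions:
        return custom_functions[name]
    return PREDEFINED[name]

with open("etl_config.json") as f:
    config = json.load(f)

transforms = {t["column"]: resolve_transformation(t["function"])
              for t in config["transformations"]}

# Example: apply the configured transformations to one extracted row.
row = {"process_id": "p-1", "duration": 1.5}
for column, fn in transforms.items():
    row[column] = fn(row[column])   # {"process_id": "p-1", "duration": 1500}
```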