FINRAOS / MegaSparkDiff

A Spark-based tool for comparing data at scale. It lets software engineers compare many pairings of supported data sources. Multiple execution modes across multiple environments let the user generate a diff report as a Java/Scala-friendly DataFrame or as a file for later use. Ships with SparkFactory and SparkCompare tools out of the box.
https://finraos.github.io/MegaSparkDiff/
Apache License 2.0

Scale MSD to add connectivity for DynamoDB and validations. #52

GoswamiSH closed this issue 2 years ago

GoswamiSH commented 5 years ago

This issue tracks adding DynamoDB connectivity and data validation support to MegaSparkDiff (MSD).

mmlinford commented 5 years ago

Good idea. I imagine this would just mean creating a parallelizeDynamoDb() method in SparkFactory. To that end, we might be able to take advantage of audienceproject/spark-dynamodb.

@GoswamiSH is there any other feature you'd need to have to consider this complete?

GoswamiSH commented 5 years ago

@mmlinford Based on the discussion with @aosama and @matthewgillett about the approach and scope of the enhancements, I don't think any other feature is a prerequisite for completing this issue.

Will look into audienceproject/spark-dynamodb and get back to you. Let's discuss this with @matthewgillett and @aosama as well.

matthewgillett commented 5 years ago

@GoswamiSH Please see pull request #55 for DynamoDB support (as well as support for JSON format files). I used the https://github.com/awslabs/emr-dynamodb-connector dependency to read the data from DynamoDB into Spark.
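
For readers unfamiliar with the connector, here is a rough sketch of the kind of Hadoop job settings a read through emr-dynamodb-connector generally relies on. It is not taken from pull request #55. The class name DynamoDbReadConfSketch, the method dynamoDbReadConf, the table name "my_table", and the 0.5 read percentage are all hypothetical. The property keys are the ones commonly shown in the connector's usage examples and should be checked against the connector version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of MegaSparkDiff: collects the settings a
// connector-based DynamoDB read is assumed to need, as plain key/value pairs.
public class DynamoDbReadConfSketch {

    public static Map<String, String> dynamoDbReadConf(String tableName, String region) {
        if (tableName == null || tableName.isEmpty()) {
            throw new IllegalArgumentException("tableName must not be empty");
        }
        if (region == null || region.isEmpty()) {
            throw new IllegalArgumentException("region must not be empty");
        }
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("dynamodb.servicename", "dynamodb");
        // Table to scan.
        conf.put("dynamodb.input.tableName", tableName);
        conf.put("dynamodb.regionid", region);
        // Assumes a standard AWS partition endpoint; other partitions differ.
        conf.put("dynamodb.endpoint", "dynamodb." + region + ".amazonaws.com");
        // Illustrative value: use about half of the table's read capacity.
        conf.put("dynamodb.throughput.read.percent", "0.5");
        // Input format class provided by the connector.
        conf.put("mapred.input.format.class",
                "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat");
        return conf;
    }

    public static void main(String[] args) {
        // "my_table" is a placeholder table name.
        dynamoDbReadConf("my_table", "us-east-1")
                .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

In a Spark job, each entry would typically be copied onto a Hadoop JobConf, which is then passed to hadoopRDD together with the connector's DynamoDBInputFormat, Text, and DynamoDBItemWritable classes. The resulting RDD could then be converted to a DataFrame for comparison.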