While running the validation job with cassandra-data-migrator, we've noticed that if writes arrive through zdm-proxy while the job is running, the job's autocorrect step may overwrite those newer writes, leaving the target data stale.
For instance:
Time t1: the validation job reads both clusters and finds a difference for a row
Time t2: a new write for that row arrives through zdm-proxy
Time t3: the validation job autocorrects the row using the value it read at t1, overwriting the newer write
Currently, our solution is to run the validation job multiple times until there are no differences. However, this approach is time-consuming as the validation process starts from scratch each time. The CSV file, which could potentially help us save time by specifying the range to validate, is only written when there's an error. To optimize this, we propose to output the CSV file whenever there are differences (not just errors). This way, we can feed the CSV file into subsequent validation runs, focusing on the problematic ranges, thereby reducing the overall execution time.
We propose adding an option, such as spark.cdm.tokenrange.partitionFile.appendOnDiff. When spark.cdm.tokenrange.partitionFile.appendOnDiff=true, the token range would be written to the CSV file whenever any difference is found. This change is backward compatible, as it only affects behavior when the new option is explicitly set to true.
Additionally, we would like the input and output CSV files to be specified separately, so a run can read one file and write another. Thus, we suggest adding two more options, spark.cdm.tokenrange.partitionFile.input and spark.cdm.tokenrange.partitionFile.output, to specify the input and output CSV files respectively. These changes are also backward compatible, as they only change behavior when the new options are used.
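To illustrate the intended workflow, a sketch of how the proposed options could be used across two validation passes (the job class, jar name, file paths, and properties file are illustrative; only the three partitionFile options are the actual proposal):

```shell
# First pass: validate everything; with the proposed appendOnDiff=true,
# every token range that shows a difference is appended to the output CSV.
spark-submit \
  --properties-file cdm.properties \
  --conf spark.cdm.tokenrange.partitionFile.appendOnDiff=true \
  --conf spark.cdm.tokenrange.partitionFile.output=./diff_ranges_1.csv \
  --class com.datastax.cdm.job.DiffData cassandra-data-migrator.jar

# Second pass: re-validate only the ranges that differed last time,
# writing any still-differing ranges to a new file for a possible third pass.
spark-submit \
  --properties-file cdm.properties \
  --conf spark.cdm.tokenrange.partitionFile.appendOnDiff=true \
  --conf spark.cdm.tokenrange.partitionFile.input=./diff_ranges_1.csv \
  --conf spark.cdm.tokenrange.partitionFile.output=./diff_ranges_2.csv \
  --class com.datastax.cdm.job.DiffData cassandra-data-migrator.jar
```

Keeping input and output as separate files means each pass shrinks the range set without clobbering the file it is reading from.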
We have already implemented these features in our fork of the project. If these changes align with the project's direction, we would be more than happy to create a pull request. This would allow the community to review the changes and potentially integrate them into the main project. We believe these enhancements would greatly improve the efficiency of the validation process.