amphi-ai / amphi-etl

Visual Data Transformation with Python Code Generation. Low-Code Python-based ETL.
https://amphi.ai

Comparison set of tools #166

Open simonaubertbd opened 3 weeks ago

simonaubertbd commented 3 weeks ago

Hello,

Something we do frequently is comparing two tables (for example, one in dev and one in prod, where you want to be sure there isn't any regression).

Objectives:

- accurate and actionable
- fast
- works for any comparison: on the same database, but also with files or across separate databases
- ability to deal with data quality issues: duplicates, last-decimal differences, etc.

Use cases

1/ I'm developing a pipeline in TEST and I want to compare it to PROD. There may be some expected differences, but globally the results should be the same.

2/ I'm migrating my server. I want to compare my entire old repository database to the new one.

3/ I have some specs with field names and types, and I want to be sure I didn't forget any fields and that they are correctly typed.

- two modes: dataframes versus batch

dataframes: two inputs, two dataframes. batch: one input with several rows, each row holding one pair to compare (detailed below).

- two backends: in-memory versus in-database

===========> probably 4 different tools, but they can share some common code

- two main algorithms:

*IUGF (Identify/Union/Group-by/Filter)

Step 1: union of the tables/dataframes with identification of their origin (the origin field name must be changeable). We use an integer/byte with a 0/1 value. (image attached)

Step 2: group by on every field (except the origin field), with min/max/count aggregation on the origin field (count is optional)

Step 3: filter where min = max (the row then exists in only one source) and, to catch duplicates, where count > 2

We then get all the anomalies. (image attached)
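A minimal pandas sketch of these three steps (function and column names are illustrative, not the tool's actual API):

```python
import pandas as pd

def iugf_compare(df_1: pd.DataFrame, df_2: pd.DataFrame,
                 origin_col: str = "origin") -> pd.DataFrame:
    # Step 1: union of both inputs, tagging each row with its origin (0/1)
    union = pd.concat(
        [df_1.assign(**{origin_col: 0}), df_2.assign(**{origin_col: 1})],
        ignore_index=True,
    )
    # Step 2: group by every field except the origin field,
    # with min/max/count aggregation on the origin field
    fields = [c for c in union.columns if c != origin_col]
    grouped = (union.groupby(fields, dropna=False)[origin_col]
                    .agg(["min", "max", "count"])
                    .reset_index())
    # Step 3: keep rows present in only one input (min == max)
    # and, to catch duplicates, rows seen more than twice (count > 2)
    return grouped[(grouped["min"] == grouped["max"]) | (grouped["count"] > 2)]
```

An empty result means the two inputs match at the row level (up to the optional duplicate check).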

*a mix of hashing each row (with a warning label about hash collisions) plus IUGF on the hashes
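A hedged sketch of that variant using pandas' built-in row hashing; the 64-bit hash makes collisions unlikely but not impossible, hence the warning label:

```python
import pandas as pd

def to_row_hashes(df: pd.DataFrame) -> pd.DataFrame:
    # One 64-bit hash per row; collisions are improbable but not impossible
    return pd.util.hash_pandas_object(df, index=False).rename("row_hash").to_frame()

# The IUGF steps then run on a single hash column instead of every field,
# which keeps the group-by cheap for wide tables:
# anomalies = iugf_compare(to_row_hashes(df_1), to_row_hashes(df_2))
```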

"join" algo is excluded for different reasons : • Requires a key for join, and you don't always have a key • bad detection of duplicate • not easy to spot the difference • risk of doing a cartesian product if not controlled.

====> This algorithm can be applied at most levels.

- ability to optionally provide keys

Providing keys allows some deeper analysis. For batch: keys are given in the batch format. For dataframes: on the tool.

- several levels of comparison

The user can choose several. Since comparisons take time and you don't always need the detail, it's a good way to adapt.

- table/query/view or file level (batch only): in batch mode, it checks that the tables or the files are the same
- partition level (batch only, in-db only)
- metadata level: field names, optionally types
- table/file aggregation level: checks that some statistics are the same (min/max/count/count distinct of each field)
- row level (example above)
- field level (if keys are specified): by pivoting on the keys and casting all fields to string/text/varchar. Two outputs: one for the detail and one with an aggregation (a count of rows and a count distinct of the keys concerned); see the sketch after this list

==> choice on the tool
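A minimal sketch of the field level, assuming pandas dataframes and user-provided key fields (all names are illustrative):

```python
import pandas as pd

def field_level_diff(df_1: pd.DataFrame, df_2: pd.DataFrame, keys: list[str]):
    # Unpivot on the keys and cast every value to string for a uniform compare
    def unpivot(df: pd.DataFrame, value_col: str) -> pd.DataFrame:
        m = df.melt(id_vars=keys, var_name="field", value_name=value_col)
        m[value_col] = m[value_col].astype(str)
        return m

    merged = unpivot(df_1, "value_1").merge(
        unpivot(df_2, "value_2"), on=keys + ["field"], how="outer")
    detail = merged[merged["value_1"] != merged["value_2"]].copy()

    # Aggregated output: count of rows and count distinct of keys per differing field
    detail["_key"] = detail[keys].astype(str).agg("|".join, axis=1)
    summary = (detail.groupby("field")
                     .agg(row_count=("_key", "size"),
                          distinct_keys=("_key", "nunique"))
                     .reset_index())
    return detail.drop(columns="_key"), summary
```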

- ability to select AND exclude fields

For batch: in the batch format. For dataframes: on the tool.
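A minimal sketch of that selection logic (hypothetical helper, not the actual tool configuration):

```python
import pandas as pd

def apply_field_selection(df: pd.DataFrame, select="*", exclude=()) -> pd.DataFrame:
    # '*' keeps every field; excluded fields are dropped afterwards
    selected = list(df.columns) if select == "*" else list(select)
    return df[[c for c in selected if c not in set(exclude)]]
```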

- ability to force a number of decimals for float and double

For batch: in the batch format. For dataframes: on the tool.
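A short sketch of forcing decimals on float columns before comparing (helper name is illustrative):

```python
import pandas as pd

def round_floats(df: pd.DataFrame, decimals: int = 6) -> pd.DataFrame:
    # Round every float column so last-decimal noise does not flag false differences
    float_cols = df.select_dtypes(include="float").columns
    return df.assign(**{c: df[c].round(decimals) for c in float_cols})
```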

- ability to force some formulas

(especially in batch mode; less useful in dataframe mode, where it can be done before the comparison)
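A hedged sketch of applying such formulas, assuming they arrive as a field-to-expression mapping and relying on pandas' eval; the mapping shape is an assumption, since the format below only says JSON/XML:

```python
import pandas as pd

def apply_formulas(df: pd.DataFrame, formulas: dict[str, str]) -> pd.DataFrame:
    # formulas maps an output field to a pandas-eval expression over the columns,
    # e.g. {"amount": "amount * 100"} -- this mapping shape is an assumption
    for field, expression in formulas.items():
        df = df.assign(**{field: df.eval(expression)})
    return df
```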

batch format (to be completed)

in-memory

- path (a file or a folder) or db connection + table data 1
- where clause 1
- path (a file or a folder) or db connection + table data 2
- where clause 2
- fields to select (* accepted)
- fields to exclude in 1
- fields to exclude in 2
- rounding decimals (optional)
- some formulas for 1 (in JSON/XML...)
- some formulas for 2
- key fields (optional)

in-database

- db connection
- table data 1
- where clause 1
- table data 2
- where clause 2
- fields to select (* accepted)
- fields to exclude in 1
- fields to exclude in 2
- rounding decimals (optional)
- some formulas for 1
- some formulas for 2
- maybe the enclosing character and the format for casting at field level
- key fields (optional)
- path for result

On the batch tools, the user can map their "batch dataframe" to the expected format.
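A sketch of how a batch runner could consume that format, one comparison per row; the column names mirror the in-memory spec above but remain illustrative, and loading/comparing are injected since they depend on the backend:

```python
import pandas as pd

# Column names mirror the in-memory batch format above (illustrative names)
BATCH_COLUMNS = [
    "source_1", "where_1", "source_2", "where_2",
    "fields_to_select", "fields_to_exclude_1", "fields_to_exclude_2",
    "rounding_decimals", "formulas_1", "formulas_2", "key_fields",
]

def run_batch(batch_df: pd.DataFrame, load_source, compare_pair) -> None:
    # One comparison per batch row; load_source and compare_pair are
    # injected callables (hypothetical), since loading depends on the backend
    for row in batch_df[BATCH_COLUMNS].to_dict("records"):
        df_1 = load_source(row["source_1"], row["where_1"])
        df_2 = load_source(row["source_2"], row["where_2"])
        compare_pair(df_1, df_2, row)
```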

output:

There are several outputs, one per level of comparison.

In dataframe mode => one dataframe per output.

In batch mode => it takes the "path for result" and writes the detailed results there. It can be either a folder path (we would write one file per level of comparison) or, for in-database, a schema in the database (but it can also be a path in this case). Plus a dataframe that takes the input batch dataframe and adds one column per control with a status: OK, KO, or N/A (if not computed).
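A tiny sketch of that per-control status logic:

```python
from typing import Optional

import pandas as pd

def control_status(diff: Optional[pd.DataFrame]) -> str:
    # "N/A" if the level was not computed, "OK" if no difference was found
    if diff is None:
        return "N/A"
    return "OK" if diff.empty else "KO"
```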

Some examples of output from an Alteryx draft in dataframe mode:

row differences (image attached)

metadata differences (note: the origin flag should be 0, not true) (image attached)

field-level detailed differences (the colA field is the key) (image attached)

field-level aggregated differences (there should also be a count of rows) (image attached)

Some other examples from a Qlik Sense draft in batch mode (image attached)

Best regards

Simon

simonaubertbd commented 3 weeks ago

One of the purposes is to replace the famous SAS PROC COMPARE https://www.listendata.com/2023/07/sas-proc-compare-with-examples.html