amphi-ai / amphi-etl

Visual Data Transformation with Python Code Generation. Low-Code Python-based ETL.
https://amphi.ai

Comparison set of tools #166

Open simonaubertbd opened 3 weeks ago

simonaubertbd commented 3 weeks ago

Hello,

Something we do frequently is comparing two tables (for example, one in dev and one in prod, where you want to be sure there isn't any regression).

Objectives:

- accurate and actionable
- fast
- works for any comparison: on the same database, but also with files or across separate databases
- ability to deal with data quality issues: duplicates, last-decimal differences, etc.

Use cases

1/ I'm developing a pipeline in TEST and I want to compare it to PROD. There may be some expected differences, but globally the results should be the same.

2/ I'm migrating my server. I want to compare my entire old repository database to the new one.

3/ I have some specs with field names and types, and I want to be sure I didn't forget any fields and that they are correctly typed.

- two modes: dataframes versus batch

dataframes: two inputs, two dataframes. batch: one input with several rows, each row holding one pair to compare (detailed below).

- two backends: in-memory versus in-database

===========> probably 4 different tools, but they can share some common code

- two main algorithms:

*IUGF (Identify/Union/Group-by/Filter)

Step 1: union of the tables/dataframes with identification of their origin (the origin field name must be changeable). We use an integer/byte with a 0/1 value. (image attached)

Step 2: group by on every field (except the origin field), with min/max/count aggregation on the origin field (count is optional)

Step 3: filter where min = max (the row then exists in only one source) and, to catch duplicates, where count > 2

We then get all the anomalies. (image attached)
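A minimal pandas sketch of these three steps (function and column names are illustrative, not the tool's actual API):

```python
import pandas as pd

def iugf_compare(df_1: pd.DataFrame, df_2: pd.DataFrame,
                 origin_col: str = "origin") -> pd.DataFrame:
    # Step 1: union of both inputs, tagging each row with its origin (0/1)
    union = pd.concat(
        [df_1.assign(**{origin_col: 0}), df_2.assign(**{origin_col: 1})],
        ignore_index=True,
    )
    # Step 2: group by every field except the origin field,
    # with min/max/count aggregation on the origin field
    fields = [c for c in union.columns if c != origin_col]
    grouped = (union.groupby(fields, dropna=False)[origin_col]
                    .agg(["min", "max", "count"])
                    .reset_index())
    # Step 3: keep rows present in only one input (min == max)
    # and, to catch duplicates, rows seen more than twice (count > 2)
    return grouped[(grouped["min"] == grouped["max"]) | (grouped["count"] > 2)]
```

An empty result means the two inputs match at the row level (up to the optional duplicate check).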

*a mix of hashing each row (with a warning label about hash collisions) plus IUGF on the hashes
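A hedged sketch of that variant using pandas' built-in row hashing; the 64-bit hash makes collisions unlikely but not impossible, hence the warning label:

```python
import pandas as pd

def to_row_hashes(df: pd.DataFrame) -> pd.DataFrame:
    # One 64-bit hash per row; collisions are improbable but not impossible
    return pd.util.hash_pandas_object(df, index=False).rename("row_hash").to_frame()

# The IUGF steps then run on a single hash column instead of every field,
# which keeps the group-by cheap for wide tables:
# anomalies = iugf_compare(to_row_hashes(df_1), to_row_hashes(df_2))
```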

"join" algo is excluded for different reasons : • Requires a key for join, and you don't always have a key • bad detection of duplicate • not easy to spot the difference • risk of doing a cartesian product if not controlled.

====> This algorithm can be applied at most levels.

- ability to optionally provide keys

Providing keys allows some deeper analysis. For batch: keys are given in the batch format. For dataframes: on the tool.

- several levels of comparison

The user can choose several. Since comparisons take time and you don't always need the detail, it's a good way to adapt.

- table/query/view or file level (batch only): in batch mode, it checks that the tables or the files are the same
- partition level (batch only, in-db only)
- metadata level: field names, optionally types
- table/file aggregation level: checks that some statistics are the same (min/max/count/count distinct of each field)
- row level (example above)
- field level (if keys are specified): by pivoting on the keys and casting all fields to string/text/varchar. Two outputs: one for the detail and one with an aggregation (a count of rows and a count distinct of the keys concerned); see the sketch after this list

==> choice on the tool
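A minimal sketch of the field level, assuming pandas dataframes and user-provided key fields (all names are illustrative):

```python
import pandas as pd

def field_level_diff(df_1: pd.DataFrame, df_2: pd.DataFrame, keys: list[str]):
    # Unpivot on the keys and cast every value to string for a uniform compare
    def unpivot(df: pd.DataFrame, value_col: str) -> pd.DataFrame:
        m = df.melt(id_vars=keys, var_name="field", value_name=value_col)
        m[value_col] = m[value_col].astype(str)
        return m

    merged = unpivot(df_1, "value_1").merge(
        unpivot(df_2, "value_2"), on=keys + ["field"], how="outer")
    detail = merged[merged["value_1"] != merged["value_2"]].copy()

    # Aggregated output: count of rows and count distinct of keys per differing field
    detail["_key"] = detail[keys].astype(str).agg("|".join, axis=1)
    summary = (detail.groupby("field")
                     .agg(row_count=("_key", "size"),
                          distinct_keys=("_key", "nunique"))
                     .reset_index())
    return detail.drop(columns="_key"), summary
```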

- ability to select AND exclude fields

For batch: in the batch format. For dataframes: on the tool.
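A minimal sketch of that selection logic (hypothetical helper, not the actual tool configuration):

```python
import pandas as pd

def apply_field_selection(df: pd.DataFrame, select="*", exclude=()) -> pd.DataFrame:
    # '*' keeps every field; excluded fields are dropped afterwards
    selected = list(df.columns) if select == "*" else list(select)
    return df[[c for c in selected if c not in set(exclude)]]
```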

- ability to force a number of decimals for float and double

For batch: in the batch format. For dataframes: on the tool.
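A short sketch of forcing decimals on float columns before comparing (helper name is illustrative):

```python
import pandas as pd

def round_floats(df: pd.DataFrame, decimals: int = 6) -> pd.DataFrame:
    # Round every float column so last-decimal noise does not flag false differences
    float_cols = df.select_dtypes(include="float").columns
    return df.assign(**{c: df[c].round(decimals) for c in float_cols})
```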

- ability to force some formulas

(especially in batch mode; less useful in dataframe mode, where it can be done before the comparison)
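A hedged sketch of applying such formulas, assuming they arrive as a field-to-expression mapping and relying on pandas' eval; the mapping shape is an assumption, since the format below only says JSON/XML:

```python
import pandas as pd

def apply_formulas(df: pd.DataFrame, formulas: dict[str, str]) -> pd.DataFrame:
    # formulas maps an output field to a pandas-eval expression over the columns,
    # e.g. {"amount": "amount * 100"} -- this mapping shape is an assumption
    for field, expression in formulas.items():
        df = df.assign(**{field: df.eval(expression)})
    return df
```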

batch format (to be completed)

in-memory

- path (a file or a folder) or db connection + table data 1
- where clause 1
- path (a file or a folder) or db connection + table data 2
- where clause 2
- fields to select (* accepted)
- fields to exclude in 1
- fields to exclude in 2
- rounding decimals (optional)
- some formulas for 1 (in JSON/XML...)
- some formulas for 2
- key fields (optional)

in-database

- db connection
- table data 1
- where clause 1
- table data 2
- where clause 2
- fields to select (* accepted)
- fields to exclude in 1
- fields to exclude in 2
- rounding decimals (optional)
- some formulas for 1
- some formulas for 2
- maybe the enclosing character and the format for casting at field level
- key fields (optional)
- path for result

On the batch tools, the user can map their "batch dataframe" to the expected format.
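A sketch of how a batch runner could consume that format, one comparison per row; the column names mirror the in-memory spec above but remain illustrative, and loading/comparing are injected since they depend on the backend:

```python
import pandas as pd

# Column names mirror the in-memory batch format above (illustrative names)
BATCH_COLUMNS = [
    "source_1", "where_1", "source_2", "where_2",
    "fields_to_select", "fields_to_exclude_1", "fields_to_exclude_2",
    "rounding_decimals", "formulas_1", "formulas_2", "key_fields",
]

def run_batch(batch_df: pd.DataFrame, load_source, compare_pair) -> None:
    # One comparison per batch row; load_source and compare_pair are
    # injected callables (hypothetical), since loading depends on the backend
    for row in batch_df[BATCH_COLUMNS].to_dict("records"):
        df_1 = load_source(row["source_1"], row["where_1"])
        df_2 = load_source(row["source_2"], row["where_2"])
        compare_pair(df_1, df_2, row)
```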

output:

There are several outputs, one per level of comparison.

In dataframe mode => one dataframe per output.

In batch mode => it takes the "path for result" and writes the detailed results there. It can be either a folder path (we would write one file per level of comparison) or, for in-database, a schema in the database (but it can also be a path in this case). Plus a dataframe that takes the input batch dataframe and adds one column per control with a status: OK, KO, or N/A (if not computed).
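A tiny sketch of that per-control status logic:

```python
from typing import Optional

import pandas as pd

def control_status(diff: Optional[pd.DataFrame]) -> str:
    # "N/A" if the level was not computed, "OK" if no difference was found
    if diff is None:
        return "N/A"
    return "OK" if diff.empty else "KO"
```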

Some examples of output from an Alteryx draft in dataframe mode:

row differences (image attached)

metadata differences (note: the origin flag should be 0, not true) (image attached)

field-level detailed differences (the colA field is the key) (image attached)

field-level aggregated differences (there should also be a count of rows) (image attached)

Some other examples from a Qlik Sense draft in batch mode (image attached)

Best regards

Simon

simonaubertbd commented 3 weeks ago

One of the purposes is to replace the famous SAS PROC COMPARE https://www.listendata.com/2023/07/sas-proc-compare-with-examples.html