datafold / data-diff

Compare tables within or across databases
https://docs.datafold.com
MIT License
2.95k stars · 271 forks

Databricks connection does not close properly when using --dbt #873

Closed by leoebfolsom 6 months ago

leoebfolsom commented 8 months ago

Describe the bug

The connection stays open for a while after the diff finishes, until it eventually dies. Diff results are successfully returned to the user, but they are followed by an error message.

Version info:

(datafold-demo) ➜  datafold-demo git:(test-2) data-diff --version
v0.11.1

Excerpt of debug output:

Running with data-diff=0.11.1
08:28:30 INFO     Parsing file dbt_project.yml                                                                  dbt_parser.py:287
         INFO     Parsing file                                                                                  dbt_parser.py:280
                  /Users/leoebfolsom/git/leoebfolsom-databricks-demo/datafold-demo/target/manifest.json                          
         INFO     Parsing file target/run_results.json                                                          dbt_parser.py:253
         INFO     config: prod_database='dev' prod_schema='dbt_leoebfolsomdatabricksdemo_prod'                  dbt_parser.py:159
                  prod_custom_schema=None datasource_id=7498                                                                     
         INFO     Parsing file /Users/leoebfolsom/.dbt/profiles.yml                                             dbt_parser.py:294
         DEBUG    Found PKs via Tags [dim_orgs]: ['org_id']                                                     dbt_parser.py:454
         INFO     [Databricks] Starting a threadpool, size=1.                                                        base.py:1237
         DEBUG    Running SQL (Databricks):                                                                           base.py:982
                  SET TIME ZONE 'UTC'                                                                                            
08:28:34 DEBUG    Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs')                 base.py:980
                  SELECT * FROM (SELECT TRIM(`sub_plan`) FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs`)             
                  AS LIMITED_SELECT LIMIT 64                                                                                     
         DEBUG    Mixed Alphanum/Non-Alphanum values detected in column                                              base.py:1153
                  dev.dbt_leoebfolsomdatabricksdemo_prod.dim_orgs.sub_plan. It cannot be used as a key.                          
         INFO     [Databricks] Schema = {'org_id': Integer(_notes=[], precision=0, python_type=<class 'int'>),       schema.py:44
                  'created_at': Timestamp(_notes=[], precision=6, rounds=True), 'num_users': Integer(_notes=[],                  
                  precision=0, python_type=<class 'int'>), 'sub_created_at': Timestamp(_notes=[], precision=6,                   
                  rounds=True), 'sub_plan': Text(_notes=[], collation=None), 'sub_price': Integer(_notes=[],                     
                  precision=0, python_type=<class 'int'>)}                                                                       
08:28:35 DEBUG    Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')                  base.py:980
                  SELECT * FROM (SELECT TRIM(`sub_plan`) FROM `dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs`)              
                  AS LIMITED_SELECT LIMIT 64                                                                                     
         DEBUG    Mixed Alphanum/Non-Alphanum values detected in column                                              base.py:1153
                  dev.dbt_leoebfolsomdatabricksdemo_dev.dim_orgs.sub_plan. It cannot be used as a key.                           
         INFO     [Databricks] Schema = {'org_id': Integer(_notes=[], precision=0, python_type=<class 'int'>),       schema.py:44
                  'created_at': Timestamp(_notes=[], precision=6, rounds=True), 'num_users': Integer(_notes=[],                  
                  precision=0, python_type=<class 'int'>), 'sub_created_at': Timestamp(_notes=[], precision=6,                   
                  rounds=True), 'sub_plan': Text(_notes=[], collation=None), 'sub_price': Integer(_notes=[],                     
                  precision=0, python_type=<class 'int'>)}                                                                       
         DEBUG    Testing for duplicate keys: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') <> joindiff_tables.py:230
                  ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')                                                       
         DEBUG    Running SQL (Databricks):                                                                           base.py:982
                  SELECT cast(min(`org_id`) as string), cast(max(`org_id`) as string) FROM                                       
                  `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs`                                                          
         INFO     Validating that the are no duplicate keys in columns: ['org_id'] for ('dev',             joindiff_tables.py:243
                  'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs')                                                              
         DEBUG    Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs')                 base.py:980
                  SELECT count(*) AS `total`, count(distinct coalesce(cast(`org_id` as string), '<null>')) AS                    
                  `total_distinct` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs`                                    
         DEBUG    Running SQL (Databricks):                                                                           base.py:982
                  SELECT cast(min(`org_id`) as string), cast(max(`org_id`) as string) FROM                                       
                  `dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs`                                                           
         INFO     Diffing segments at key-range: (100000001)..(100000158). size: table1 <= 157, table2 <= 157  diff_tables.py:318
         INFO     . Diffing segment 1/32, key-range: (100000001)..(100000005), size <= None                joindiff_tables.py:183
         DEBUG    Collecting stats for table #1: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') joindiff_tables.py:270
         DEBUG    Querying for different rows: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs')   joindiff_tables.py:208
         DEBUG    Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs')                 base.py:980
                  SELECT sum(`sub_price`) AS `sum_sub_price`, sum(`num_users`) AS `sum_num_users`, count(*) AS                   
                  `count` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs` WHERE (`org_id` >= 100000001)               
                  AND (`org_id` < 100000005)                                                                                     
         DEBUG    Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs')                 base.py:980
                  SELECT * FROM (SELECT (`tmp2`.`org_id` IS NULL) AS `is_exclusive_a`, (`tmp1`.`org_id` IS NULL) AS              
                  `is_exclusive_b`, CASE WHEN `tmp1`.`org_id` is distinct from `tmp2`.`org_id` THEN 1 ELSE 0 END AS              
                  `is_diff_org_id`, CASE WHEN `tmp1`.`sub_price` is distinct from `tmp2`.`sub_price` THEN 1 ELSE 0               
                  END AS `is_diff_sub_price`, CASE WHEN `tmp1`.`sub_plan` is distinct from `tmp2`.`sub_plan` THEN 1              
                  ELSE 0 END AS `is_diff_sub_plan`, CASE WHEN `tmp1`.`sub_created_at` is distinct from                           
                  `tmp2`.`sub_created_at` THEN 1 ELSE 0 END AS `is_diff_sub_created_at`, CASE WHEN                               
                  `tmp1`.`created_at` is distinct from `tmp2`.`created_at` THEN 1 ELSE 0 END AS `is_diff_created_at`,            
                  CASE WHEN `tmp1`.`num_users` is distinct from `tmp2`.`num_users` THEN 1 ELSE 0 END AS                          
                  `is_diff_num_users`, cast(`tmp1`.`org_id` as string) AS `org_id_a`, cast(`tmp2`.`org_id` as string)            
                  AS `org_id_b`, cast(`tmp1`.`sub_price` as string) AS `sub_price_a`, cast(`tmp2`.`sub_price` as                 
                  string) AS `sub_price_b`, cast(`tmp1`.`sub_plan` as string) AS `sub_plan_a`, cast(`tmp2`.`sub_plan`            
                  as string) AS `sub_plan_b`,                                                                                    
                  date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp1`.`sub_created_at` as timestamp)) /              
                  1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `sub_created_at_a`,                        
                  date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp2`.`sub_created_at` as timestamp)) /              
                  1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `sub_created_at_b`,                        
                  date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp1`.`created_at` as timestamp)) /                  
                  1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `created_at_a`,                            
                  date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp2`.`created_at` as timestamp)) /                  
                  1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `created_at_b`,                            
                  cast(`tmp1`.`num_users` as string) AS `num_users_a`, cast(`tmp2`.`num_users` as string) AS                     
                  `num_users_b` FROM (SELECT * FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs` WHERE                  
                  (`org_id` >= 100000001) AND (`org_id` < 100000005)) `tmp1` FULL OUTER JOIN (SELECT * FROM                      
                  `dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs` WHERE (`org_id` >= 100000001) AND (`org_id` <             
                  100000005)) `tmp2` ON (`tmp1`.`org_id` = `tmp2`.`org_id`)) tmp3 WHERE ((`is_diff_org_id` = 1) OR               
                  (`is_diff_sub_price` = 1) OR (`is_diff_sub_plan` = 1) OR (`is_diff_sub_created_at` = 1) OR                     
                  (`is_diff_created_at` = 1) OR (`is_diff_num_users` = 1))                                                       
         INFO     Validating that the are no duplicate keys in columns: ['org_id'] for ('dev',             joindiff_tables.py:243
                  'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')                                                               
         DEBUG    Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')                  base.py:980
                  SELECT count(*) AS `total`, count(distinct coalesce(cast(`org_id` as string), '<null>')) AS                    
                  `total_distinct` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs`                                     
08:28:36 DEBUG    Done collecting stats for table #1: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod',        joindiff_tables.py:306
                  'dim_orgs')                                                                                                    
         DEBUG    Collecting stats for table #2: ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')  joindiff_tables.py:270
         DEBUG    Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')                  base.py:980
                  SELECT sum(`sub_price`) AS `sum_sub_price`, sum(`num_users`) AS `sum_num_users`, count(*) AS                   
                  `count` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs` WHERE (`org_id` >= 100000001) AND            
                  (`org_id` < 100000005)                                                                                         
         DEBUG    Done collecting stats for table #2: ('dev', 'dbt_leoebfolsomdatabricksdemo_dev',         joindiff_tables.py:306
                  'dim_orgs')                                                                                                    
         DEBUG    Testing for null keys: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') <>      joindiff_tables.py:252
                  ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')                                                       
         DEBUG    Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs')                 base.py:980
                  SELECT `org_id` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs` WHERE (`org_id` >=                  
                  100000001) AND (`org_id` < 100000005) AND (`org_id` IS NULL)                                                   

... many similar repetitive log lines omitted ...

         DEBUG    Counting differences per column: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod',           joindiff_tables.py:346
                  'dim_orgs') <> ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')                                        
         DEBUG    Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') <> ('dev',      base.py:980
                  'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')                                                               
                  SELECT sum(`is_diff_org_id`), sum(`is_diff_sub_price`), sum(`is_diff_sub_plan`),                               
                  sum(`is_diff_sub_created_at`), sum(`is_diff_created_at`), sum(`is_diff_num_users`) FROM (SELECT                
                  (`tmp2`.`org_id` IS NULL) AS `is_exclusive_a`, (`tmp1`.`org_id` IS NULL) AS `is_exclusive_b`, CASE             
                  WHEN `tmp1`.`org_id` is distinct from `tmp2`.`org_id` THEN 1 ELSE 0 END AS `is_diff_org_id`, CASE              
                  WHEN `tmp1`.`sub_price` is distinct from `tmp2`.`sub_price` THEN 1 ELSE 0 END AS                               
                  `is_diff_sub_price`, CASE WHEN `tmp1`.`sub_plan` is distinct from `tmp2`.`sub_plan` THEN 1 ELSE 0              
                  END AS `is_diff_sub_plan`, CASE WHEN `tmp1`.`sub_created_at` is distinct from                                  
                  `tmp2`.`sub_created_at` THEN 1 ELSE 0 END AS `is_diff_sub_created_at`, CASE WHEN                               
                  `tmp1`.`created_at` is distinct from `tmp2`.`created_at` THEN 1 ELSE 0 END AS `is_diff_created_at`,            
                  CASE WHEN `tmp1`.`num_users` is distinct from `tmp2`.`num_users` THEN 1 ELSE 0 END AS                          
                  `is_diff_num_users`, cast(`tmp1`.`org_id` as string) AS `org_id_a`, cast(`tmp2`.`org_id` as string)            
                  AS `org_id_b`, cast(`tmp1`.`sub_price` as string) AS `sub_price_a`, cast(`tmp2`.`sub_price` as                 
                  string) AS `sub_price_b`, cast(`tmp1`.`sub_plan` as string) AS `sub_plan_a`, cast(`tmp2`.`sub_plan`            
                  as string) AS `sub_plan_b`,                                                                                    
                  date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp1`.`sub_created_at` as timestamp)) /              
                  1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `sub_created_at_a`,                        
                  date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp2`.`sub_created_at` as timestamp)) /              
                  1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `sub_created_at_b`,                        
                  date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp1`.`created_at` as timestamp)) /                  
                  1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `created_at_a`,                            
                  date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp2`.`created_at` as timestamp)) /                  
                  1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `created_at_b`,                            
                  cast(`tmp1`.`num_users` as string) AS `num_users_a`, cast(`tmp2`.`num_users` as string) AS                     
                  `num_users_b` FROM (SELECT * FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs` WHERE                  
                  (`org_id` >= 100000125) AND (`org_id` < 100000158)) `tmp1` FULL OUTER JOIN (SELECT * FROM                      
                  `dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs` WHERE (`org_id` >= 100000125) AND (`org_id` <             
                  100000158)) `tmp2` ON (`tmp1`.`org_id` = `tmp2`.`org_id`)) tmp3 WHERE ((`is_diff_org_id` = 1) OR               
                  (`is_diff_sub_price` = 1) OR (`is_diff_sub_plan` = 1) OR (`is_diff_sub_created_at` = 1) OR                     
                  (`is_diff_created_at` = 1) OR (`is_diff_num_users` = 1))                                                       
         INFO     Diffing complete: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') <> ('dev',   joindiff_tables.py:165
                  'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')                                                               

dev.dbt_leoebfolsomdatabricksdemo_prod.dim_orgs <> dev.dbt_leoebfolsomdatabricksdemo_dev.dim_orgs 
Primary Keys: ['org_id'] 

rows       PROD    <>            DEV
---------  ------  ------------  -----------------
Total      157                   10 [-147]
Added              0
Removed            -147
Different          0
Unchanged          10

columns           # diff values
--------------  ---------------
created_at                    0
num_users                     0
sub_created_at                0
sub_plan                      0
sub_price                     0 
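For what it's worth, the added/removed/unchanged counts in the summary follow directly from comparing the two tables' primary-key sets. A minimal sketch (not data-diff's actual implementation), assuming dev retained the first 10 consecutive org_ids — the actual key distribution isn't shown in the logs:

```python
def diff_key_stats(keys_a, keys_b):
    """Compute added/removed/overlap counts from two primary-key collections."""
    a, b = set(keys_a), set(keys_b)
    return {
        "added": len(b - a),    # keys only in table B (dev)
        "removed": len(a - b),  # keys only in table A (prod)
        "in_both": len(a & b),  # candidates for 'different'/'unchanged'
    }

prod = range(100000001, 100000158)  # 157 rows, matching the key-range in the logs
dev = range(100000001, 100000011)   # hypothetical: dev kept only the first 10 rows
print(diff_key_stats(prod, dev))    # {'added': 0, 'removed': 147, 'in_both': 10}
```

With those assumed inputs the counts line up with the table above: 0 added, 147 removed, 10 unchanged.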

08:29:38 ERROR    Attempt to close session raised a local exception:                                                client.py:277
                  HTTPSConnectionPool(host='<redacted>.cloud.databricks.com', port=443): Max retries                      
                  exceeded with url: /sql/1.0/warehouses/<redacted> (Caused by SSLError(SSLError(0,                        
                  'Underlying socket connection gone (_ssl.c:2407)')))                                                           
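The exception above is raised locally when the client finally tries to close a session whose underlying socket has already gone away. A rough sketch of the defensive behavior one might expect — a hypothetical `close_session_safely` helper, illustrated with stub sessions, not data-diff's actual cleanup code:

```python
import logging

logger = logging.getLogger(__name__)

def close_session_safely(session) -> bool:
    """Try to close a session; log and swallow local errors instead of surfacing them.

    Returns True if close() succeeded, False if it raised.
    (Hypothetical helper for illustration only.)
    """
    try:
        session.close()
        return True
    except Exception as exc:
        logger.warning("Attempt to close session raised a local exception: %s", exc)
        return False

class DeadSession:
    """Stub mimicking a session whose socket is already gone."""
    def close(self):
        raise ConnectionError("Underlying socket connection gone")

class LiveSession:
    """Stub mimicking a session that closes cleanly."""
    def close(self):
        pass

close_session_safely(DeadSession())  # logs a warning, returns False
close_session_safely(LiveSession())  # returns True
```

Whether the fix belongs in data-diff or in the underlying Databricks client, the point is that a close on an already-dead connection should not bubble up to the user after a successful diff.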

Describe the environment

OS: MacBook Pro 16-inch, 2021; Apple M1 Max; 64 GB memory; macOS Ventura 13.4.1
data-diff version: 0.11.1 (I also found the same issue in 0.9.17)

github-actions[bot] commented 6 months ago

This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue and it will be added to the triage queue. Otherwise, it will be closed in 7 days.

github-actions[bot] commented 6 months ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment and it will be reopened for triage.