duckdb / duckdb_delta

DuckDB extension for Delta Lake
MIT License
139 stars 15 forks source link

delta_scan is many times slower than parquet reading #35

Closed lyogev closed 2 months ago

lyogev commented 5 months ago

Hi, just started using delta_scan (using the nightly-built extension) and I'm getting bad performance on top of remote files. I believe this is due to delta kernel not using all my threads. I'm setting my duckdb threads to 60. I'm getting a bit better performance (still not parquet level) with using the delta-rs lib:

        delta_table = DeltaTable(location, storage_options=storage_options)
        cursor.register(table, delta_table.to_pyarrow_dataset())

Attaching profiles. Delta:

{
   "name":  "Query", 
   "result": 41.423896,
   "timing": 41.423896,
   "cardinality": 0,
   "extra-info": "WITH wscs AS\n  (SELECT sold_date_sk,\n          sales_price\n   FROM\n     (SELECT ws_sold_date_sk sold_date_sk,\n             ws_ext_sales_price sales_price\n      FROM web_sales\n      UNION ALL SELECT cs_sold_date_sk sold_date_sk,\n                       cs_ext_sales_price sales_price\n      FROM catalog_sales) sq1),\n     wswscs AS\n  (SELECT d_week_seq,\n          sum(CASE\n                  WHEN (d_day_name='Sunday') THEN sales_price\n                  ELSE NULL\n              END) sun_sales,\n          sum(CASE\n                  WHEN (d_day_name='Monday') THEN sales_price\n                  ELSE NULL\n              END) mon_sales,\n          sum(CASE\n                  WHEN (d_day_name='Tuesday') THEN sales_price\n                  ELSE NULL\n              END) tue_sales,\n          sum(CASE\n                  WHEN (d_day_name='Wednesday') THEN sales_price\n                  ELSE NULL\n              END) wed_sales,\n          sum(CASE\n                  WHEN (d_day_name='Thursday') THEN sales_price\n                  ELSE NULL\n              END) thu_sales,\n          sum(CASE\n                  WHEN (d_day_name='Friday') THEN sales_price\n                  ELSE NULL\n              END) fri_sales,\n          sum(CASE\n                  WHEN (d_day_name='Saturday') THEN sales_price\n                  ELSE NULL\n              END) sat_sales\n   FROM wscs,\n        date_dim\n   WHERE d_date_sk = sold_date_sk\n   GROUP BY d_week_seq)\nSELECT d_week_seq1,\n       round(sun_sales1/sun_sales2, 2) r1,\n       round(mon_sales1/mon_sales2, 2) r2,\n       round(tue_sales1/tue_sales2, 2) r3,\n       round(wed_sales1/wed_sales2, 2) r4,\n       round(thu_sales1/thu_sales2, 2) r5,\n       round(fri_sales1/fri_sales2, 2) r6,\n       round(sat_sales1/sat_sales2, 2)\nFROM\n  (SELECT wswscs.d_week_seq d_week_seq1,\n          sun_sales sun_sales1,\n          mon_sales mon_sales1,\n          tue_sales tue_sales1,\n          wed_sales wed_sales1,\n          thu_sales thu_sales1,\n          fri_sales fri_sales1,\n          sat_sales sat_sales1\n   FROM wswscs,\n        date_dim\n   WHERE date_dim.d_week_seq = wswscs.d_week_seq\n     AND d_year = 2001) y,\n  (SELECT wswscs.d_week_seq d_week_seq2,\n          sun_sales sun_sales2,\n          mon_sales mon_sales2,\n          tue_sales tue_sales2,\n          wed_sales wed_sales2,\n          thu_sales thu_sales2,\n          fri_sales fri_sales2,\n          sat_sales sat_sales2\n   FROM wswscs,\n        date_dim\n   WHERE date_dim.d_week_seq = wswscs.d_week_seq\n     AND d_year = 2001+1) z\nWHERE d_week_seq1 = d_week_seq2-53\nORDER BY d_week_seq1 NULLS FIRST;\n", 
   "timings": [

   ],
   "children": [
    {
      "name": "RESULT_COLLECTOR",
      "timing":0.000000,
      "cardinality":0,
      "extra_info": "",
      "children": [
       {
         "name": "ORDER_BY",
         "timing":0.000199,
         "cardinality":0,
         "extra_info": "ORDERS:\ny.d_week_seq1 ASC",
         "children": [
          {
            "name": "PROJECTION",
            "timing":0.000702,
            "cardinality":2513,
            "extra_info": "d_week_seq1\nr1\nr2\nr3\nr4\nr5\nr6\nround((CAST(sat_sales1 AS DOUBLE) / CAST(sat_sales2 AS DOUBLE)), 2)\n",
            "children": [
             {
               "name": "HASH_JOIN",
               "timing":0.001816,
               "cardinality":2513,
               "extra_info": "INNER\n(d_week_seq2 - 53) = d_week_seq1\n\n[INFOSEPARATOR]\nEC: 1\n",
               "children": [
                {
                  "name": "PROJECTION",
                  "timing":0.000004,
                  "cardinality":365,
                  "extra_info": "d_week_seq2\nsun_sales2\nmon_sales2\ntue_sales2\nwed_sales2\nthu_sales2\nfri_sales2\nsat_sales2\n",
                  "children": [
                   {
                     "name": "HASH_JOIN",
                     "timing":0.000177,
                     "cardinality":365,
                     "extra_info": "INNER\nd_week_seq = d_week_seq\n\n[INFOSEPARATOR]\nEC: 1\n",
                     "children": [
                      {
                        "name": "DELTA_SCAN ",
                        "timing":0.062853,
                        "cardinality":365,
                        "extra_info": "d_week_seq\n[INFOSEPARATOR]\nFilters: d_year=2002 AND d_year IS NOT NULL\n\n[INFOSEPARATOR]\nEC: 0",
                        "children": [
                        ]
                      }
,
                      {
                        "name": "HASH_GROUP_BY",
                        "timing":4.083622,
                        "cardinality":264,
                        "extra_info": "#0\nsum(#1)\nsum(#2)\nsum(#3)\nsum(#4)\nsum(#5)\nsum(#6)\nsum(#7)",
                        "children": [
                         {
                           "name": "PROJECTION",
                           "timing":9.297863,
                           "cardinality":215260465,
                           "extra_info": "d_week_seq\nCASE  WHEN ((d_day_name = 'Sunday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Monday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Tuesday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Wednesday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Thursday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Friday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Saturday')) THEN (sales_price) ELSE NULL END\n",
                           "children": [
                            {
                              "name": "HASH_JOIN",
                              "timing":7.306370,
                              "cardinality":215260465,
                              "extra_info": "INNER\nd_date_sk = sold_date_sk\n\n[INFOSEPARATOR]\nEC: 1\n",
                              "children": [
                               {
                                 "name": "DELTA_SCAN ",
                                 "timing":0.142226,
                                 "cardinality":73049,
                                 "extra_info": "d_date_sk\nd_week_seq\nd_day_name\n[INFOSEPARATOR]\nEC: 0",
                                 "children": [
                                 ]
                               }
,
                               {
                                 "name": "UNION",
                                 "timing":0.000000,
                                 "cardinality":215998302,
                                 "extra_info": "",
                                 "children": [
                                  {
                                    "name": "DELTA_SCAN ",
                                    "timing":77.267345,
                                    "cardinality":72001237,
                                    "extra_info": "ws_sold_date_sk\nws_ext_sales_price\n[INFOSEPARATOR]\nEC: 0",
                                    "children": [
                                    ]
                                  }
,
                                  {
                                    "name": "DELTA_SCAN ",
                                    "timing":192.599458,
                                    "cardinality":143997065,
                                    "extra_info": "cs_sold_date_sk\ncs_ext_sales_price\n[INFOSEPARATOR]\nEC: 0",
                                    "children": [
                                    ]
                                  }
                                 ]
                               }
                              ]
                            }
                           ]
                         }
                        ]
                      }
                     ]
                   }
                  ]
                }
,
                {
                  "name": "PROJECTION",
                  "timing":0.000005,
                  "cardinality":365,
                  "extra_info": "d_week_seq1\nsun_sales1\nmon_sales1\ntue_sales1\nwed_sales1\nthu_sales1\nfri_sales1\nsat_sales1\n",
                  "children": [
                   {
                     "name": "HASH_JOIN",
                     "timing":0.000188,
                     "cardinality":365,
                     "extra_info": "INNER\nd_week_seq = d_week_seq\n\n[INFOSEPARATOR]\nEC: 1\n",
                     "children": [
                      {
                        "name": "DELTA_SCAN ",
                        "timing":0.068732,
                        "cardinality":365,
                        "extra_info": "d_week_seq\n[INFOSEPARATOR]\nFilters: d_year=2001 AND d_year IS NOT NULL\n\n[INFOSEPARATOR]\nEC: 0",
                        "children": [
                        ]
                      }
,
                      {
                        "name": "HASH_GROUP_BY",
                        "timing":4.083108,
                        "cardinality":264,
                        "extra_info": "#0\nsum(#1)\nsum(#2)\nsum(#3)\nsum(#4)\nsum(#5)\nsum(#6)\nsum(#7)",
                        "children": [
                         {
                           "name": "PROJECTION",
                           "timing":9.235393,
                           "cardinality":215260465,
                           "extra_info": "d_week_seq\nCASE  WHEN ((d_day_name = 'Sunday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Monday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Tuesday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Wednesday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Thursday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Friday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Saturday')) THEN (sales_price) ELSE NULL END\n",
                           "children": [
                            {
                              "name": "HASH_JOIN",
                              "timing":11.861454,
                              "cardinality":215260465,
                              "extra_info": "INNER\nd_date_sk = sold_date_sk\n\n[INFOSEPARATOR]\nEC: 1\n",
                              "children": [
                               {
                                 "name": "DELTA_SCAN ",
                                 "timing":0.094449,
                                 "cardinality":73049,
                                 "extra_info": "d_date_sk\nd_week_seq\nd_day_name\n[INFOSEPARATOR]\nEC: 0",
                                 "children": [
                                 ]
                               }
,
                               {
                                 "name": "UNION",
                                 "timing":0.000000,
                                 "cardinality":215998302,
                                 "extra_info": "",
                                 "children": [
                                  {
                                    "name": "DELTA_SCAN ",
                                    "timing":14.193713,
                                    "cardinality":72001237,
                                    "extra_info": "ws_sold_date_sk\nws_ext_sales_price\n[INFOSEPARATOR]\nEC: 0",
                                    "children": [
                                    ]
                                  }
,
                                  {
                                    "name": "DELTA_SCAN ",
                                    "timing":25.745380,
                                    "cardinality":143997065,
                                    "extra_info": "cs_sold_date_sk\ncs_ext_sales_price\n[INFOSEPARATOR]\nEC: 0",
                                    "children": [
                                    ]
                                  }
                                 ]
                               }
                              ]
                            }
                           ]
                         }
                        ]
                      }
                     ]
                   }
                  ]
                }
               ]
             }
            ]
          }
         ]
       }
      ]
    }
   ]
}

Parquet:

{
   "name":  "Query", 
   "result": 11.357916,
   "timing": 11.357916,
   "cardinality": 0,
   "extra-info": "WITH wscs AS\n  (SELECT sold_date_sk,\n          sales_price\n   FROM\n     (SELECT ws_sold_date_sk sold_date_sk,\n             ws_ext_sales_price sales_price\n      FROM web_sales\n      UNION ALL SELECT cs_sold_date_sk sold_date_sk,\n                       cs_ext_sales_price sales_price\n      FROM catalog_sales) sq1),\n     wswscs AS\n  (SELECT d_week_seq,\n          sum(CASE\n                  WHEN (d_day_name='Sunday') THEN sales_price\n                  ELSE NULL\n              END) sun_sales,\n          sum(CASE\n                  WHEN (d_day_name='Monday') THEN sales_price\n                  ELSE NULL\n              END) mon_sales,\n          sum(CASE\n                  WHEN (d_day_name='Tuesday') THEN sales_price\n                  ELSE NULL\n              END) tue_sales,\n          sum(CASE\n                  WHEN (d_day_name='Wednesday') THEN sales_price\n                  ELSE NULL\n              END) wed_sales,\n          sum(CASE\n                  WHEN (d_day_name='Thursday') THEN sales_price\n                  ELSE NULL\n              END) thu_sales,\n          sum(CASE\n                  WHEN (d_day_name='Friday') THEN sales_price\n                  ELSE NULL\n              END) fri_sales,\n          sum(CASE\n                  WHEN (d_day_name='Saturday') THEN sales_price\n                  ELSE NULL\n              END) sat_sales\n   FROM wscs,\n        date_dim\n   WHERE d_date_sk = sold_date_sk\n   GROUP BY d_week_seq)\nSELECT d_week_seq1,\n       round(sun_sales1/sun_sales2, 2) r1,\n       round(mon_sales1/mon_sales2, 2) r2,\n       round(tue_sales1/tue_sales2, 2) r3,\n       round(wed_sales1/wed_sales2, 2) r4,\n       round(thu_sales1/thu_sales2, 2) r5,\n       round(fri_sales1/fri_sales2, 2) r6,\n       round(sat_sales1/sat_sales2, 2)\nFROM\n  (SELECT wswscs.d_week_seq d_week_seq1,\n          sun_sales sun_sales1,\n          mon_sales mon_sales1,\n          tue_sales tue_sales1,\n          wed_sales wed_sales1,\n          thu_sales thu_sales1,\n          fri_sales fri_sales1,\n          sat_sales sat_sales1\n   FROM wswscs,\n        date_dim\n   WHERE date_dim.d_week_seq = wswscs.d_week_seq\n     AND d_year = 2001) y,\n  (SELECT wswscs.d_week_seq d_week_seq2,\n          sun_sales sun_sales2,\n          mon_sales mon_sales2,\n          tue_sales tue_sales2,\n          wed_sales wed_sales2,\n          thu_sales thu_sales2,\n          fri_sales fri_sales2,\n          sat_sales sat_sales2\n   FROM wswscs,\n        date_dim\n   WHERE date_dim.d_week_seq = wswscs.d_week_seq\n     AND d_year = 2001+1) z\nWHERE d_week_seq1 = d_week_seq2-53\nORDER BY d_week_seq1 NULLS FIRST;\n", 
   "timings": [

   ],
   "children": [
    {
      "name": "RESULT_COLLECTOR",
      "timing":0.000000,
      "cardinality":0,
      "extra_info": "",
      "children": [
       {
         "name": "PROJECTION",
         "timing":0.000000,
         "cardinality":0,
         "extra_info": "__internal_decompress_integral_integer(#0, 1)\n#1\n#2\n#3\n#4\n#5\n#6\n#7\n",
         "children": [
          {
            "name": "ORDER_BY",
            "timing":0.000262,
            "cardinality":0,
            "extra_info": "ORDERS:\ny.d_week_seq1 ASC",
            "children": [
             {
               "name": "PROJECTION",
               "timing":0.000020,
               "cardinality":2513,
               "extra_info": "__internal_compress_integral_usmallint(#0, 1)\n#1\n#2\n#3\n#4\n#5\n#6\n#7\n",
               "children": [
                {
                  "name": "PROJECTION",
                  "timing":0.000785,
                  "cardinality":2513,
                  "extra_info": "d_week_seq1\nr1\nr2\nr3\nr4\nr5\nr6\nround((CAST(sat_sales1 AS DOUBLE) / CAST(sat_sales2 AS DOUBLE)), 2)\n",
                  "children": [
                   {
                     "name": "HASH_JOIN",
                     "timing":0.000819,
                     "cardinality":2513,
                     "extra_info": "INNER\n(d_week_seq2 - 53) = d_week_seq1\n\n[INFOSEPARATOR]\nBuild Min: 1\nBuild Max: 10436\n\n[INFOSEPARATOR]\nEC: 2\n",
                     "children": [
                      {
                        "name": "PROJECTION",
                        "timing":0.000008,
                        "cardinality":365,
                        "extra_info": "d_week_seq2\nsun_sales2\nmon_sales2\ntue_sales2\nwed_sales2\nthu_sales2\nfri_sales2\nsat_sales2\n",
                        "children": [
                         {
                           "name": "HASH_JOIN",
                           "timing":0.000655,
                           "cardinality":365,
                           "extra_info": "INNER\nd_week_seq = d_week_seq\n\n[INFOSEPARATOR]\nBuild Min: 1\nBuild Max: 10436\n\n[INFOSEPARATOR]\nEC: 14609\n",
                           "children": [
                            {
                              "name": "PROJECTION",
                              "timing":0.000031,
                              "cardinality":264,
                              "extra_info": "__internal_decompress_integral_integer(#0, 1)\n#1\n#2\n#3\n#4\n#5\n#6\n#7\n",
                              "children": [
                               {
                                 "name": "HASH_GROUP_BY",
                                 "timing":3.573039,
                                 "cardinality":264,
                                 "extra_info": "#0\nsum(#1)\nsum(#2)\nsum(#3)\nsum(#4)\nsum(#5)\nsum(#6)\nsum(#7)",
                                 "children": [
                                  {
                                    "name": "PROJECTION",
                                    "timing":3.424849,
                                    "cardinality":215260465,
                                    "extra_info": "d_week_seq\nCASE  WHEN ((d_day_name = 'Sunday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Monday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Tuesday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Wednesday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Thursday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Friday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Saturday')) THEN (sales_price) ELSE NULL END\n",
                                    "children": [
                                     {
                                       "name": "PROJECTION",
                                       "timing":0.173538,
                                       "cardinality":215260465,
                                       "extra_info": "#0\n#1\n#2\n__internal_compress_integral_usmallint(#3, 1)\n#4\n",
                                       "children": [
                                        {
                                          "name": "HASH_JOIN",
                                          "timing":3.155816,
                                          "cardinality":215260465,
                                          "extra_info": "INNER\nsold_date_sk = d_date_sk\n\n[INFOSEPARATOR]\nEC: 213319680\n",
                                          "children": [
                                           {
                                             "name": "UNION",
                                             "timing":0.000000,
                                             "cardinality":215998302,
                                             "extra_info": "",
                                             "children": [
                                              {
                                                "name": "READ_PARQUET ",
                                                "timing":164.682717,
                                                "cardinality":72001237,
                                                "extra_info": "ws_sold_date_sk\nws_ext_sales_price\n[INFOSEPARATOR]\nEC: 72744960",
                                                "children": [
                                                ]
                                              }
,
                                              {
                                                "name": "READ_PARQUET ",
                                                "timing":325.301452,
                                                "cardinality":143997065,
                                                "extra_info": "cs_sold_date_sk\ncs_ext_sales_price\n[INFOSEPARATOR]\nEC: 140574720",
                                                "children": [
                                                ]
                                              }
                                             ]
                                           }
,
                                           {
                                             "name": "READ_PARQUET ",
                                             "timing":0.090834,
                                             "cardinality":73049,
                                             "extra_info": "d_date_sk\nd_week_seq\nd_day_name\n[INFOSEPARATOR]\nEC: 73049",
                                             "children": [
                                             ]
                                           }
                                          ]
                                        }
                                       ]
                                     }
                                    ]
                                  }
                                 ]
                               }
                              ]
                            }
,
                            {
                              "name": "READ_PARQUET ",
                              "timing":0.057674,
                              "cardinality":365,
                              "extra_info": "d_week_seq\n[INFOSEPARATOR]\nFilters: d_year=2002 AND d_year IS NOT NULL\n\n[INFOSEPARATOR]\nEC: 14609",
                              "children": [
                              ]
                            }
                           ]
                         }
                        ]
                      }
,
                      {
                        "name": "PROJECTION",
                        "timing":0.000008,
                        "cardinality":365,
                        "extra_info": "d_week_seq1\nsun_sales1\nmon_sales1\ntue_sales1\nwed_sales1\nthu_sales1\nfri_sales1\nsat_sales1\n",
                        "children": [
                         {
                           "name": "HASH_JOIN",
                           "timing":0.000585,
                           "cardinality":365,
                           "extra_info": "INNER\nd_week_seq = d_week_seq\n\n[INFOSEPARATOR]\nBuild Min: 1\nBuild Max: 10436\n\n[INFOSEPARATOR]\nEC: 14609\n",
                           "children": [
                            {
                              "name": "PROJECTION",
                              "timing":0.000034,
                              "cardinality":264,
                              "extra_info": "__internal_decompress_integral_integer(#0, 1)\n#1\n#2\n#3\n#4\n#5\n#6\n#7\n",
                              "children": [
                               {
                                 "name": "HASH_GROUP_BY",
                                 "timing":3.754965,
                                 "cardinality":264,
                                 "extra_info": "#0\nsum(#1)\nsum(#2)\nsum(#3)\nsum(#4)\nsum(#5)\nsum(#6)\nsum(#7)",
                                 "children": [
                                  {
                                    "name": "PROJECTION",
                                    "timing":3.702525,
                                    "cardinality":215260465,
                                    "extra_info": "d_week_seq\nCASE  WHEN ((d_day_name = 'Sunday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Monday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Tuesday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Wednesday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Thursday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Friday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Saturday')) THEN (sales_price) ELSE NULL END\n",
                                    "children": [
                                     {
                                       "name": "PROJECTION",
                                       "timing":0.196374,
                                       "cardinality":215260465,
                                       "extra_info": "#0\n#1\n#2\n__internal_compress_integral_usmallint(#3, 1)\n#4\n",
                                       "children": [
                                        {
                                          "name": "HASH_JOIN",
                                          "timing":3.356833,
                                          "cardinality":215260465,
                                          "extra_info": "INNER\nsold_date_sk = d_date_sk\n\n[INFOSEPARATOR]\nEC: 213319680\n",
                                          "children": [
                                           {
                                             "name": "UNION",
                                             "timing":0.000000,
                                             "cardinality":215998302,
                                             "extra_info": "",
                                             "children": [
                                              {
                                                "name": "READ_PARQUET ",
                                                "timing":54.102701,
                                                "cardinality":72001237,
                                                "extra_info": "ws_sold_date_sk\nws_ext_sales_price\n[INFOSEPARATOR]\nEC: 72744960",
                                                "children": [
                                                ]
                                              }
,
                                              {
                                                "name": "READ_PARQUET ",
                                                "timing":102.858280,
                                                "cardinality":143997065,
                                                "extra_info": "cs_sold_date_sk\ncs_ext_sales_price\n[INFOSEPARATOR]\nEC: 140574720",
                                                "children": [
                                                ]
                                              }
                                             ]
                                           }
,
                                           {
                                             "name": "READ_PARQUET ",
                                             "timing":0.101143,
                                             "cardinality":73049,
                                             "extra_info": "d_date_sk\nd_week_seq\nd_day_name\n[INFOSEPARATOR]\nEC: 73049",
                                             "children": [
                                             ]
                                           }
                                          ]
                                        }
                                       ]
                                     }
                                    ]
                                  }
                                 ]
                               }
                              ]
                            }
,
                            {
                              "name": "READ_PARQUET ",
                              "timing":0.056973,
                              "cardinality":365,
                              "extra_info": "d_week_seq\n[INFOSEPARATOR]\nFilters: d_year=2001 AND d_year IS NOT NULL\n\n[INFOSEPARATOR]\nEC: 14609",
                              "children": [
                              ]
                            }
                           ]
                         }
                        ]
                      }
                     ]
                   }
                  ]
                }
               ]
             }
            ]
          }
         ]
       }
      ]
    }
   ]
}

Using delta-rs:

{
   "name":  "Query", 
   "result": 28.112739,
   "timing": 28.112739,
   "cardinality": 0,
   "extra-info": "WITH wscs AS\n  (SELECT sold_date_sk,\n          sales_price\n   FROM\n     (SELECT ws_sold_date_sk sold_date_sk,\n             ws_ext_sales_price sales_price\n      FROM web_sales\n      UNION ALL SELECT cs_sold_date_sk sold_date_sk,\n                       cs_ext_sales_price sales_price\n      FROM catalog_sales) sq1),\n     wswscs AS\n  (SELECT d_week_seq,\n          sum(CASE\n                  WHEN (d_day_name='Sunday') THEN sales_price\n                  ELSE NULL\n              END) sun_sales,\n          sum(CASE\n                  WHEN (d_day_name='Monday') THEN sales_price\n                  ELSE NULL\n              END) mon_sales,\n          sum(CASE\n                  WHEN (d_day_name='Tuesday') THEN sales_price\n                  ELSE NULL\n              END) tue_sales,\n          sum(CASE\n                  WHEN (d_day_name='Wednesday') THEN sales_price\n                  ELSE NULL\n              END) wed_sales,\n          sum(CASE\n                  WHEN (d_day_name='Thursday') THEN sales_price\n                  ELSE NULL\n              END) thu_sales,\n          sum(CASE\n                  WHEN (d_day_name='Friday') THEN sales_price\n                  ELSE NULL\n              END) fri_sales,\n          sum(CASE\n                  WHEN (d_day_name='Saturday') THEN sales_price\n                  ELSE NULL\n              END) sat_sales\n   FROM wscs,\n        date_dim\n   WHERE d_date_sk = sold_date_sk\n   GROUP BY d_week_seq)\nSELECT d_week_seq1,\n       round(sun_sales1/sun_sales2, 2) r1,\n       round(mon_sales1/mon_sales2, 2) r2,\n       round(tue_sales1/tue_sales2, 2) r3,\n       round(wed_sales1/wed_sales2, 2) r4,\n       round(thu_sales1/thu_sales2, 2) r5,\n       round(fri_sales1/fri_sales2, 2) r6,\n       round(sat_sales1/sat_sales2, 2)\nFROM\n  (SELECT wswscs.d_week_seq d_week_seq1,\n          sun_sales sun_sales1,\n          mon_sales mon_sales1,\n          tue_sales tue_sales1,\n          wed_sales wed_sales1,\n          thu_sales thu_sales1,\n          fri_sales fri_sales1,\n          sat_sales sat_sales1\n   FROM wswscs,\n        date_dim\n   WHERE date_dim.d_week_seq = wswscs.d_week_seq\n     AND d_year = 2001) y,\n  (SELECT wswscs.d_week_seq d_week_seq2,\n          sun_sales sun_sales2,\n          mon_sales mon_sales2,\n          tue_sales tue_sales2,\n          wed_sales wed_sales2,\n          thu_sales thu_sales2,\n          fri_sales fri_sales2,\n          sat_sales sat_sales2\n   FROM wswscs,\n        date_dim\n   WHERE date_dim.d_week_seq = wswscs.d_week_seq\n     AND d_year = 2001+1) z\nWHERE d_week_seq1 = d_week_seq2-53\nORDER BY d_week_seq1 NULLS FIRST;\n", 
   "timings": [

   ],
   "children": [
    {
      "name": "RESULT_COLLECTOR",
      "timing":0.000000,
      "cardinality":0,
      "extra_info": "",
      "children": [
       {
         "name": "ORDER_BY",
         "timing":0.000090,
         "cardinality":0,
         "extra_info": "ORDERS:\ny.d_week_seq1 ASC",
         "children": [
          {
            "name": "PROJECTION",
            "timing":0.000683,
            "cardinality":2513,
            "extra_info": "d_week_seq1\nr1\nr2\nr3\nr4\nr5\nr6\nround((CAST(sat_sales1 AS DOUBLE) / CAST(sat_sales2 AS DOUBLE)), 2)\n",
            "children": [
             {
               "name": "HASH_JOIN",
               "timing":0.000249,
               "cardinality":2513,
               "extra_info": "INNER\n(d_week_seq2 - 53) = d_week_seq1\n\n[INFOSEPARATOR]\nEC: 1\n",
               "children": [
                {
                  "name": "PROJECTION",
                  "timing":0.000002,
                  "cardinality":365,
                  "extra_info": "d_week_seq2\nsun_sales2\nmon_sales2\ntue_sales2\nwed_sales2\nthu_sales2\nfri_sales2\nsat_sales2\n",
                  "children": [
                   {
                     "name": "HASH_JOIN",
                     "timing":0.000207,
                     "cardinality":365,
                     "extra_info": "INNER\nd_week_seq = d_week_seq\n\n[INFOSEPARATOR]\nEC: 1\n",
                     "children": [
                      {
                        "name": "ARROW_SCAN ",
                        "timing":0.000016,
                        "cardinality":365,
                        "extra_info": "d_week_seq\n[INFOSEPARATOR]\nFilters: d_year=2002 AND d_year IS NOT NULL\n\n[INFOSEPARATOR]\nEC: 1",
                        "children": [
                        ]
                      }
,
                      {
                        "name": "HASH_GROUP_BY",
                        "timing":3.762454,
                        "cardinality":264,
                        "extra_info": "#0\nsum(#1)\nsum(#2)\nsum(#3)\nsum(#4)\nsum(#5)\nsum(#6)\nsum(#7)",
                        "children": [
                         {
                           "name": "PROJECTION",
                           "timing":3.766164,
                           "cardinality":215260465,
                           "extra_info": "d_week_seq\nCASE  WHEN ((d_day_name = 'Sunday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Monday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Tuesday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Wednesday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Thursday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Friday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Saturday')) THEN (sales_price) ELSE NULL END\n",
                           "children": [
                            {
                              "name": "HASH_JOIN",
                              "timing":3.542502,
                              "cardinality":215260465,
                              "extra_info": "INNER\nsold_date_sk = d_date_sk\n\n[INFOSEPARATOR]\nEC: 2\n",
                              "children": [
                               {
                                 "name": "UNION",
                                 "timing":0.000000,
                                 "cardinality":215998302,
                                 "extra_info": "",
                                 "children": [
                                  {
                                    "name": "ARROW_SCAN ",
                                    "timing":110.987333,
                                    "cardinality":72001237,
                                    "extra_info": "ws_sold_date_sk\nws_ext_sales_price\n[INFOSEPARATOR]\nEC: 1",
                                    "children": [
                                    ]
                                  }
,
                                  {
                                    "name": "ARROW_SCAN ",
                                    "timing":255.804889,
                                    "cardinality":143997065,
                                    "extra_info": "cs_sold_date_sk\ncs_ext_sales_price\n[INFOSEPARATOR]\nEC: 1",
                                    "children": [
                                    ]
                                  }
                                 ]
                               }
,
                               {
                                 "name": "ARROW_SCAN ",
                                 "timing":0.000527,
                                 "cardinality":73049,
                                 "extra_info": "d_date_sk\nd_week_seq\nd_day_name\n[INFOSEPARATOR]\nEC: 1",
                                 "children": [
                                 ]
                               }
                              ]
                            }
                           ]
                         }
                        ]
                      }
                     ]
                   }
                  ]
                }
,
                {
                  "name": "PROJECTION",
                  "timing":0.000001,
                  "cardinality":365,
                  "extra_info": "d_week_seq1\nsun_sales1\nmon_sales1\ntue_sales1\nwed_sales1\nthu_sales1\nfri_sales1\nsat_sales1\n",
                  "children": [
                   {
                     "name": "HASH_JOIN",
                     "timing":0.000197,
                     "cardinality":365,
                     "extra_info": "INNER\nd_week_seq = d_week_seq\n\n[INFOSEPARATOR]\nEC: 1\n",
                     "children": [
                      {
                        "name": "ARROW_SCAN ",
                        "timing":0.000012,
                        "cardinality":365,
                        "extra_info": "d_week_seq\n[INFOSEPARATOR]\nFilters: d_year=2001 AND d_year IS NOT NULL\n\n[INFOSEPARATOR]\nEC: 1",
                        "children": [
                        ]
                      }
,
                      {
                        "name": "HASH_GROUP_BY",
                        "timing":3.542843,
                        "cardinality":264,
                        "extra_info": "#0\nsum(#1)\nsum(#2)\nsum(#3)\nsum(#4)\nsum(#5)\nsum(#6)\nsum(#7)",
                        "children": [
                         {
                           "name": "PROJECTION",
                           "timing":3.618487,
                           "cardinality":215260465,
                           "extra_info": "d_week_seq\nCASE  WHEN ((d_day_name = 'Sunday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Monday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Tuesday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Wednesday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Thursday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Friday')) THEN (sales_price) ELSE NULL END\nCASE  WHEN ((d_day_name = 'Saturday')) THEN (sales_price) ELSE NULL END\n",
                           "children": [
                            {
                              "name": "HASH_JOIN",
                              "timing":3.332908,
                              "cardinality":215260465,
                              "extra_info": "INNER\nsold_date_sk = d_date_sk\n\n[INFOSEPARATOR]\nEC: 2\n",
                              "children": [
                               {
                                 "name": "UNION",
                                 "timing":0.000000,
                                 "cardinality":215998302,
                                 "extra_info": "",
                                 "children": [
                                  {
                                    "name": "ARROW_SCAN ",
                                    "timing":319.964375,
                                    "cardinality":72001237,
                                    "extra_info": "ws_sold_date_sk\nws_ext_sales_price\n[INFOSEPARATOR]\nEC: 1",
                                    "children": [
                                    ]
                                  }
,
                                  {
                                    "name": "ARROW_SCAN ",
                                    "timing":856.614842,
                                    "cardinality":143997065,
                                    "extra_info": "cs_sold_date_sk\ncs_ext_sales_price\n[INFOSEPARATOR]\nEC: 1",
                                    "children": [
                                    ]
                                  }
                                 ]
                               }
,
                               {
                                 "name": "ARROW_SCAN ",
                                 "timing":0.000500,
                                 "cardinality":73049,
                                 "extra_info": "d_date_sk\nd_week_seq\nd_day_name\n[INFOSEPARATOR]\nEC: 1",
                                 "children": [
                                 ]
                               }
                              ]
                            }
                           ]
                         }
                        ]
                      }
                     ]
                   }
                  ]
                }
               ]
             }
            ]
          }
         ]
       }
      ]
    }
   ]
}
lyogev commented 5 months ago

I would love to help fix this. Any help on where to start would be highly appreciated.

samansmink commented 5 months ago

Hey @lyogev thanks for the report! I've done some initial benchmarking but we haven't done any thorough optimizations yet. There might be some cases where delta could be a little slower than straight up parquet.

Could you share some details on your dataset and the precise query you are running on it?

lyogev commented 4 months ago

Hi @samansmink thanks for your help! So it took 41 seconds in delta vs 11 seconds on plain parquet, which is quite a difference. My guess is that it's related to the threads used to read the data from the remote storage (s3). I got bad performance over parquet as well, but once I increased my duckdb threads to 60 I got great performance (it just increased network throughput substantially). So I have to think that it's related to that somehow. I looked at the code and it seems it's using the the mutlifilereader as parquet uses, so it's interesting why it's slower in the reads. Or perhaps it's something else. See that even using the delta-rs lib I got better performance, so it must be related to the reader somehow.

I'm basically running query 2 from TPC-DS on top TPC-DS SF=100, one is stored in delta lake and the other in plain parquet.

mervynzhang commented 2 months ago

Hi @samansmink Is this related to hive_partitioning ? read_parquet have hive_partitioning = true to skip some partitions. delta_scan has fixed false https://github.com/duckdb/duckdb_delta/blob/e0af7f6a7ba8bc121c00ec61ed442ce44f46a242/src/functions/delta_scan.cpp#L517

samansmink commented 2 months ago

@mervynzhang no that should not be related the delta extension should not use the hive partitioning mechanism at all. I think this issue may be related to cardinality estimation which should be fixed in the next delta extension release

samansmink commented 2 months ago

closing this issue, it should be fixed in nightlies now and with the upcoming DuckDB v1.1.0 release