duckdb / duckdb_azure

Azure extension for DuckDB
MIT License

HTTP stats over-counting total_bytes_received? #65

Open mmaitre314 opened 4 months ago

mmaitre314 commented 4 months ago

I am trying to optimize a query and noticed that the HTTP stats in EXPLAIN ANALYZE output seem to be off. I query a single 10.79 GiB Parquet file, yet the HTTP stats report 32.7 GiB read. I am wondering whether http_state_policy.cpp could be over-counting total_bytes_received, in particular by including the content-length HTTP header value of HEAD responses.

SET azure_transport_option_type = curl;
SET azure_http_stats = True;
SET threads = 1;
SET azure_read_transfer_concurrency = 1;
SET azure_read_transfer_chunk_size = 1024 * 1024;
SET azure_read_buffer_size = 1024 * 1024;

EXPLAIN ANALYZE SELECT col1 FROM 'az://<snip>.blob.core.windows.net/<snip>.parquet' LIMIT 1;
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││            HTTP Stats:            ││
││                                   ││
││            in: 32.7 GiB           ││
││            out: 0 bytes           ││
││              #HEAD: 3             ││
││             #GET: 354             ││
││              #PUT: 0              ││
││              #POST: 0             ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘

In the Azure SDK logs I see 3 HEAD requests with content-length : 11583653237 and 349 GET requests with content-length : 1048576. So the total input data should be around 0.34 GiB instead of 32.7 GiB.
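The reported figure can be reproduced from the logged numbers. The following is a quick arithmetic check in Python (not part of the extension, only a calculation on the values quoted above); it suggests that 32.7 GiB matches the GET bodies plus the content-length of each HEAD response:

```python
GIB = 1024 ** 3

head_requests = 3
head_content_length = 11_583_653_237  # bytes, from the HEAD responses in the SDK logs
get_requests = 349
get_content_length = 1_048_576        # bytes, from the GET responses in the SDK logs

# Bytes actually transferred: HEAD responses carry no body.
get_bytes = get_requests * get_content_length

# Bytes counted if each HEAD content-length header is added as well.
suspected_bytes = get_bytes + head_requests * head_content_length

print(f"GET bodies only:   {get_bytes / GIB:.2f} GiB")
print(f"with HEAD lengths: {suspected_bytes / GIB:.1f} GiB")
```

The second value rounds to the 32.7 GiB shown in the HTTP stats, which is consistent with the over-counting hypothesis but does not by itself prove it.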

If this analysis is correct, I can send a small PR to fix.
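One possible shape for such a fix, sketched in Python rather than the extension's C++: add the content-length to the received-bytes counter only when the response actually carries a body. The `HttpStats` class and its `record_response` method are hypothetical names used for illustration; they are not the extension's or the Azure SDK's API.

```python
from dataclasses import dataclass


@dataclass
class HttpStats:
    """Hypothetical stand-in for the extension's HTTP counters."""
    head_count: int = 0
    get_count: int = 0
    total_bytes_received: int = 0

    def record_response(self, method: str, content_length: int) -> None:
        method = method.upper()
        if method == "HEAD":
            # A HEAD response has no body; its Content-Length only describes
            # the size a GET would return, so it is not counted as received.
            self.head_count += 1
            return
        if method == "GET":
            self.get_count += 1
        self.total_bytes_received += content_length


stats = HttpStats()
stats.record_response("HEAD", 11_583_653_237)
for _ in range(3):
    stats.record_response("GET", 1_048_576)
print(stats.total_bytes_received)  # body bytes of the three GETs only
```

Whether the real policy should branch on the request method or measure the body bytes actually read is a design choice this sketch does not settle.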

mmaitre314 commented 4 months ago

@quentingodeau - does this analysis sound correct? Happy to send a fix if it does.

quentingodeau commented 3 months ago

Hello, sorry for the late reply. I will double-check and make the fix. Thanks a lot for the analysis!