Performance issue relative to other access methods

daz10000 commented 6 months ago

I have an application pulling blocks of data (typically 100-1000 rows from a 20ish million row parquet file) with ~ 15 columns of which I use maybe 10. I am querying the parquet file with a simple

SELECT col1,col2 etc from read_parquet('/path/to/parquet.pq') where key= 'value'.

The code was originally python and with the equivalent python bindings, wrapping into duckdb.sql( <query>).df() it takes ~ 150 milliseconds per query. I run 1000s of these. I'm seeing more like 2.5 seconds with duckdb.net, with the bulk of the time spent in the executereader call. I also tried just a command line invocation with duckdb -c '<query>' and it's also around 150 milliseconds, so duckdb.net is the outlier. The data set isn't shareable but assuming I can recreate something similar, I can try to share something. The data aren't super interesting - lots of float and int columns mainly with a few small string cols. My access pattern is something like

#r "nuget:DuckDB.NET.Data.Full"
open DuckDB.NET.Data
open System

let connection = new DuckDBConnection("DataSource=:memory:")
connection.Open()

let getDataBlock (path:string) (key_value:string) =
    use command = connection.CreateCommand()
    command.CommandText <- $"SELECT col1,col2,col3,col4 etc FROM read_parquet('{path}') where key = '{key_value}'"
    [|  use reader = command.ExecuteReader()
        while reader.Read() do
            yield { Col1 = reader.GetInt32(0)
                    Col2 = reader.GetString(1)
                    Col3 = reader.GetDouble(3)
                    Col4 = reader.GetFloat(4) |> float
                    // etc - 12 cols
                    }
        |]

I guess short of building and sending a synthetic data set, I had a few questions - is this surprising? Should this be more than 10x slowed than the underlying duckdb libraries. Am I doing something really dumb here that is hurting performance? I tried async versions of the API and saw no improvements. It's the ExecuteReader call that takes 99% of the time. My profiling couldn't get inside the duckdb libraries but I'm going to try to build them and profile the whole things as a next step.

Any thoughts / help appreciated.

Giorgi commented 6 months ago

I will need data to check but I think that the sql method doesn't execute/fetch the results. Can you see how much time the following takes in Python? SELECT col1,col2 etc from read_parquet('/path/to/parquet.pq') where key= 'value'

daz10000 commented 6 months ago

Yes, when I first did the benchmarking with .sql() it was suspiciously fast - 3 milliseconds per fetch, and was suspicious it wasn't pulling the actual columns (and it wasn't). If you explicitly convert result to a dataframe with .df() it does the full fetch. I even made it calculate some summary statistics like average of a few columns to ensure it had actually fetched the data. That pushed it up to 150ms from 3ms. I don't know how duckdb.net is getting out to 2500 ms still. I am trying to get it built from source so I can push the profiling down a layer right now.

duckdb.sql("select col1,col2,col3 etc from read_parquet('<path.pq>) where key = 'value'").df()

Giorgi commented 6 months ago

How much time does duckdb.sql("select col1,col2,col3 etc from read_parquet('<path.pq>) where key = 'value'").fetchall() take?

Giorgi commented 6 months ago

You should not need to recompile the library to profile it, JetBrains dotTrace should be able to show how long each takes

daz10000 commented 6 months ago

My initial attempts couldn't see much of interest below the ExecuteReader call. I just profiled a run against the pulled and build github code and it's not much more informative other than it seems to go down into the duckdb.dll itself, but doesn't account for all the time. I'll take a better look when I'm not on a zoom call :( For what it's worth, it seems like most of the time goes to InitChunkData.

Giorgi commented 6 months ago

If you share parquet file (with wrangled data) I can try to look into it. By the way, calling async methods has no benefit because those methods aren't implemented and call the non-async code behind the scenes.

daz10000 commented 6 months ago

Digging a little deeper into the duckdb.dll call stack. Again not sure if the areas it's spending time makes sense. I wonder if there are any activities that are recurring that could be done once for example. I'll try to make you a random data set to play with and example code. Thanks for being responsive.

Giorgi commented 6 months ago

InitChunkData fetches the next data chunk from the query result by calling duckdb_stream_fetch_chunk or duckdb_result_get_chunk depending on the result type. I can try running the query with plain C API to see how much time it takes compared to .NET

Giorgi commented 6 months ago

Can you also try Line-by-line profiling?

daz10000 commented 6 months ago

Here is the line by line output. Let me see if I can build you a synthetic data set that reproduces the issue. I'm still a bit paranoid that the python/duckdb cli equivalents are cheating somehow because the output looks realistic - the low level IO calls are taking all the time, but there must be some different and even if the python/cli tool are cheating, I'd like the benefit of the shortcut

daz10000 commented 6 months ago

So with preface that there is a bit of rabbit hole here (more below), I built some code that can reproduce a dataset (since the dataset is big when made, it's probably kinder and more useful to give you the code rather than the output. There are three scripts in this gist

gen_synthetic.fsx is a piece of F# that can make test.pq. Just invoke it as dotnet fsi gen_synthetic.fsx and it will make the output in the folder (took a few minutes on my test machine). It will emit an FS3511 warning you can ignore. It also makes a keys.txt file with the keys for searching. The test.pq file was 26278428987 bytes for my test so 26ish Gb.

duckdb_bench.fsx is the F# test using duckdb.net. It's even slower on the synthetic data set (but so is the python). It was around 17 seconds per query on my test machine.

duckdb_bench.py is the python equivalent. It's 4.2 seconds, so slower than my real example but still 4x faster than duckdb.net.

Example

generating data and querying via F#, python and the command line tool.

[ec2-user@ip-10-0-17-138 duckdb_profiling]$ dotnet fsi gen_synthetic.fsx

/data2/tmp/duckdb_profiling/gen_synthetic.fsx(47,1): warning FS3511: This state machine is not statically compilable. A resumable code invocation at '(47,0--47,4)' could not be reduced. An alternative dynamic implementation will be used, which may be slower. Consider adjusting your code to ensure this state machine is statically compilable, or else suppress this warning.

[ec2-user@ duckdb_profiling]$ dotnet fsi duckdb_bench.fsx
Starting
FY32NZ: 994 17.3436603
KG0a: 1035 18.164819

[ec2-user@duckdb_profiling]$ ls
duckdb_bench.fsx  duckdb_bench.py  gen_synthetic.fsx  keys.txt  test.pq
[ec2-user@ duckdb_profiling]$ python3.11 duckdb_bench.py
FY32NZ: 994 4.62 seconds
KG0a: 1035 4.23 seconds
WoPLL4f: 998 4.35 seconds
6RP0: 1005 4.27 seconds
2c3L: 1045 4.19 seconds
f: 14117 4.29 seconds
louTJZK: 1018 4.23 seconds

 time duckdb -c "SELECT col1,col2,col3,col20,col21,col34, col8,col51,col60,col72,col43,col19 FROM read_parquet('test.pq') where col71 = 'FY32NZ'"
100% ▕████████████████████████████████████████████████████████████▏
┌─────────────────────┬────────┬────────┬───────┬──────────────────────┬────────┬───┬─────────┬─────────┬───────┬─────────────────────┬───────────┐
│        col1         │  col2  │  col3  │ col20 │        col21         │ col34  │ … │  col51  │  col60  │ col72 │        col43        │   col19   │
│       double        │ int64  │ int64  │ blob  │         blob         │ int64  │   │ boolean │ boolean │ blob  │       double        │   blob    │
├─────────────────────┼────────┼────────┼───────┼──────────────────────┼────────┼───┼─────────┼─────────┼───────┼─────────────────────┼───────────┤
│ 0.49871597213279306 │ 739587 │  55085 │ K1    │ u3IX9BtZtl2Pgtt5WH…  │ 798660 │ … │ false   │ false   │ 5r    │ 0.24071176254405058 │ 3HfZ436   │
│  0.3649178067378053 │ 826936 │ 129086 │ dN7   │ 474svWPGFXrC3GPmzJ…  │ 933806 │ … │ false   │ false   │ 8     │ 0.41325685787174715 │ 2m        │
│  0.6076152736496742 │ 544147 │ 495971 │ bIK   │ qIAs6qfJLCywTIXgmu…  │ 438720 │ … │ true    │ false   │ a     │  0.6094365338596414 │ xEX9kP    │
│  0.7069580230175578 │ 263992 │ 649813 │ FJ    │ G6nOUNRw0nvIX9QC8c…  │  67596 │ … │ false   │ true    │ fN    │  0.7137351812599049 │ I         │
│ 0.41429582393584763 │ 947931 │ 665398 │ cBG   │ xLjAqiVzLhzxVI3A2m…  │ 768745 │ … │ false   │ false   │ U     │ 0.16662012829022033 │ oBKic     │
│   0.856880729937517 │ 162367 │ 773955 │ Y     │ HHKS0Msf7bgNC1hbPa…  │ 569777 │ … │ false   │ false   │ zZ    │  0.7077931347891507 │ 9RYn3J66e │
│ 0.36372773950977977 │ 317753 │ 731348 │ lb9   │ wLXW0PJiTTolxOHZ0V…  │ 287591 │ … │ true    │ true    │ LX    │  0.5201957289005421 │ ioqKANw1  │
│ 0.04953325089556626 │ 206975 │ 364628 │ gtZ   │ k6ysvKGQxOd23ZTCtb…  │ 304643 │ … │ true    │ true    │ LH    │  0.1370955340085882 │ rnF7H8    │
│  0.3931929994494947 │ 475854 │  95522 │ 4OQ   │ vmlQvBkjFH2bz4USPt…  │ 447793 │ … │ true    │ false   │ U     │ 0.24523982856709903 │ gtyu8S    │
│  0.4472059783791561 │ 404745 │ 618708 │ p7f   │ bLNwvYOl6b1cx1pRU0…  │ 916101 │ … │ false   │ false   │ V     │  0.8822853475648034 │ 2f        │
│  0.9338920339561003 │ 937194 │ 560242 │ dUc   │ 7rkuDV4cQOWSSH9Z3y…  │ 623626 │ … │ true    │ true    │ 6P    │  0.8291028460967619 │ B7M3aD0   │
│  0.1545087930177168 │ 476587 │ 286969 │ T     │ 0aJNwJFTIZTNAKHJ5p…  │ 782475 │ … │ true    │ false   │ LH    │  0.9396593051633989 │ nyxEx     │
│  0.5052093211542894 │ 903692 │ 876326 │ xH    │ VYLUri7gXvu0tNfUC8…  │ 176441 │ … │ false   │ false   │ dB    │  0.4165140349961912 │ Ev4       │
│  0.9392755574986783 │  86270 │ 302064 │ Lf    │ KaAqgZWSI27G8YZec3…  │  76472 │ … │ false   │ true    │ fI    │  0.5341451462945567 │ Y4        │
│  0.9984677587374989 │ 689631 │ 674187 │ X     │ ed8BcC9aLrqy8IDdxj…  │ 189233 │ … │ true    │ true    │ CM    │  0.8622882776420094 │ 2nDWASV74 │
│  0.5187499201114347 │ 979113 │ 142311 │ AJ    │ cOiQ1HKAMaUpdsAM2d…  │ 789877 │ … │ false   │ false   │ 9     │ 0.09423379968300094 │ RWykIvlH3 │
│  0.5412472987194475 │ 381519 │ 505615 │ Nv    │ aJAbmIGlWZ44z4mvXX…  │  60353 │ … │ true    │ false   │ F     │  0.6469023530985042 │ 0xIV7WyX  │
│ 0.26862100516912146 │ 146416 │ 347085 │ S6    │ bjLP5OUZ8za6np9G9t…  │ 798171 │ … │ true    │ false   │ I     │ 0.18710059184629224 │ 1BjAkv4r  │
│  0.4798058339143766 │ 772096 │  30132 │ d8j   │ mFbtEQNI1DT2EaCYF4…  │ 714329 │ … │ false   │ true    │ f     │ 0.27392474593381955 │ UX0Cw     │
│  0.7307656983092111 │ 662843 │ 663838 │ fAu   │ j9UVlVrKSypo6Em320…  │ 564450 │ … │ false   │ true    │ w     │  0.4075195478660085 │ YleS      │
│           ·         │    ·   │    ·   │ ·     │          ·           │    ·   │ · │  ·      │  ·      │ ·     │           ·         │ ·         │
│           ·         │    ·   │    ·   │ ·     │          ·           │    ·   │ · │  ·      │  ·      │ ·     │           ·         │ ·         │
│           ·         │    ·   │    ·   │ ·     │          ·           │    ·   │ · │  ·      │  ·      │ ·     │           ·         │ ·         │
│ 0.12107519435156688 │ 775262 │ 473226 │ CU    │ KLEbatH6T2i3zm5KiE…  │ 536433 │ … │ true    │ false   │ hb    │ 0.28212048542341683 │ F         │
│  0.9821264383155804 │ 828469 │ 502264 │ sb    │ b5cXV9GL8Ib9TRL2qC…  │ 306053 │ … │ false   │ false   │ KL    │ 0.09958910496107343 │ Ibuy7sS   │
│  0.4818099612809347 │ 670793 │ 831538 │ kF    │ N86IcMrju6eFLwOAQZ…  │ 509699 │ … │ false   │ true    │ d     │ 0.20045554151241296 │ 0hqoQz    │
│ 0.41430143917756723 │ 552751 │ 385964 │ F     │ vGUv5bcrAt9U0KMrHl…  │ 230152 │ … │ false   │ false   │ wc    │ 0.06768997904253238 │ eH3XrE3f  │
│  0.3143632006978512 │ 368376 │ 973272 │ oO    │ tovxDffoxLlfg0HpTq…  │ 957813 │ … │ true    │ false   │ y     │ 0.14381938258783222 │ w         │
│  0.6884445698812717 │ 277454 │ 813113 │ 1p    │ Zncd73xxj5HY1tO7T6…  │ 601574 │ … │ false   │ true    │ bn    │  0.9956267516904547 │ kerA8aey  │
│  0.7832630091114771 │ 562535 │ 261492 │ 5w    │ PLgfS5OwJfjvpMYNKv…  │  28242 │ … │ false   │ true    │ h     │  0.4247233656664463 │ uWt9      │
│ 0.19236408324357424 │ 260571 │ 603569 │ Y9M   │ hh3JZjWcEBgaWJJZn5…  │ 773456 │ … │ true    │ false   │ p6    │  0.6373815429259283 │ le3wd1b   │
│  0.8578899593590746 │ 799447 │ 737540 │ K     │ VKLQVZdMs5x0LftcBx…  │ 776282 │ … │ true    │ false   │ Hs    │   0.796979680711421 │ lB6Wd4Am  │
│  0.4398278885273764 │ 304916 │  27378 │ Oga   │ 1u16FOTGBeqQLfzQTO…  │ 198072 │ … │ false   │ false   │ G     │  0.5160752166373654 │ AydRxw9   │
│    0.98124506244981 │ 270859 │ 165201 │ Wv    │ TwqFpcLEplL9RKHJrc…  │   2605 │ … │ true    │ true    │ T     │   0.925190009505502 │ HXomNs    │
│  0.3734382112823206 │ 306224 │  16040 │ GL    │ US8Q97KlUreH33Zxpt…  │ 763626 │ … │ false   │ false   │ F     │  0.9234632686480612 │ 3qYMNdEO  │
│  0.5622472076312106 │ 449180 │  48244 │ r     │ vAyW6GDDGPiaYsvEzg…  │ 327301 │ … │ false   │ false   │ r0    │ 0.20752904442041065 │ 0o3SxoI   │
│  0.9777380518459254 │ 114517 │ 613571 │ rx1   │ zC2YOjIsRcS8hbL6s3…  │ 470064 │ … │ false   │ false   │ 8     │  0.4478818935882599 │ ZWsGXjkxs │
│ 0.03672253297345174 │ 558108 │ 477328 │ 6yk   │ CfKdM9ltLcivDQOxzV…  │ 705420 │ … │ true    │ false   │ O     │ 0.37019634096062026 │ 6YYe      │
│  0.6855576119469486 │ 234337 │ 798569 │ K5    │ rWyKyhxeTbFRWWdEVM…  │ 158154 │ … │ true    │ false   │ 6     │ 0.11074547530981727 │ 2TN5      │
│  0.7965877995867655 │ 590932 │ 692964 │ C     │ V5yKwYtLdMYfTUvV5G…  │ 319774 │ … │ false   │ false   │ O     │ 0.49180707087991016 │ Npm68     │
│ 0.37782951903188466 │ 458597 │ 148218 │ NX    │ 2RisqwIkdPIgj9RGla…  │ 902059 │ … │ false   │ false   │ Fk    │ 0.07228979870122754 │ 0g1U      │
│  0.5479535777882614 │ 756906 │ 162240 │ N     │ AKSfAZpjmRoxvH8Ix1…  │  46801 │ … │ true    │ true    │ j     │  0.6226207699729053 │ f         │
│ 0.47620959351265546 │ 475873 │ 727248 │ er    │ WwCLcUwSIj5ozrCzw8…  │ 836221 │ … │ true    │ false   │ hL    │  0.9150087677095542 │ ApNEuYr   │
├─────────────────────┴────────┴────────┴───────┴──────────────────────┴────────┴───┴─────────┴─────────┴───────┴─────────────────────┴───────────┤
│ 994 rows (40 shown)                                                                                                       12 columns (11 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

real    0m4.430s
user    0m7.596s
sys     0m43.200s

Rabbit hole

One thing I discovered while simulating the data, is that my parquet file is not sorted on the key I am using for retrieval. Due to the way the original file was generated, you get clumps of values with the same key but they aren't all in one place. The obvious thing for me to do here is to sort the table - that should overall be kinder to duckdb. The simulated data is more randomly ordered - not even a little bit clumped, so I think that's why the retrieval times go up. Python still has an advantage here for some reason and the command line tool is about as fast (4.2ish seconds). I am going to presort the table so it's at least presented in an optimal fashion and will see what things do but I'm still interested in where the speed gap is.

daz10000 commented 6 months ago

Using a sorted version of the same parquet file and slightly smaller rowgroups (100k vs 1M rows per group) the dotnet version is way faster 8-10ms / query

[ec2-user@ip-10-0-17-138 duckdb_profiling]$ dotnet fsi duckdb_bench.fsx
Starting
FY32NZ: 994 0.2512582
KG0a: 1035 0.1448276
WoPLL4f: 998 0.0785816
6RP0: 1005 0.085683
2c3L: 1045 0.0636014
f: 14117 0.153188
louTJZK: 1018 0.0755844
bL: 1952 0.1267469
9m: 963 0.1319137
jLHE: 1007 0.1037502
orgU: 1024 0.078231
ihl0JbP: 1002 0.1105006
e86kejW: 984 0.120137
s: 13991 0.0919954
N9: 1026 0.1350647
Y18gS0: 1025 0.0789596
Nsw: 968 0.1406332
IL3sK: 994 0.1269129
D: 21913 0.1085171
bkbfn81: 1001 0.0724217
SLfsLJ: 1002 0.0831033
6aMhN: 1056 0.110397
9n6T: 993 0.1450154
KZUnyOD: 973 0.0795596
iaTwX: 1013 0.097228
bROwBcOUu: 972 0.1422997
MX: 1011 0.1383832
23Cn2fJZm: 1001 0.0824698
Fx2fpl: 1006 0.0852351
fs3pQ69: 981 0.0534778
WdspXhYHh: 1038 0.057838

and the python is comparable - note the first few queries are slower for both libs but thenspeed up and they seem very similar. So whatever the difference, it might only matter when there are lots of different pages getting retrieved.

FY32NZ: 994 0.43 seconds
KG0a: 1035 0.15 seconds
WoPLL4f: 998 0.07 seconds
6RP0: 1005 0.07 seconds
2c3L: 1045 0.06 seconds
f: 14117 0.14 seconds
louTJZK: 1018 0.07 seconds
bL: 1952 0.11 seconds
9m: 963 0.11 seconds
jLHE: 1007 0.10 seconds
orgU: 1024 0.07 seconds
ihl0JbP: 1002 0.10 seconds
e86kejW: 984 0.11 seconds
s: 13991 0.09 seconds
N9: 1026 0.11 seconds
Y18gS0: 1025 0.07 seconds
Nsw: 968 0.13 seconds
IL3sK: 994 0.11 seconds
D: 21913 0.09 seconds
bkbfn81: 1001 0.07 seconds
SLfsLJ: 1002 0.07 seconds
6aMhN: 1056 0.08 seconds
9n6T: 993 0.11 seconds
KZUnyOD: 973 0.07 seconds
iaTwX: 1013 0.08 seconds
bROwBcOUu: 972 0.11 seconds
MX: 1011 0.10 seconds
23Cn2fJZm: 1001 0.05 seconds
Fx2fpl: 1006 0.06 seconds
fs3pQ69: 981 0.05 seconds
WdspXhYHh: 1038 0.04 seconds
VN0nL: 1041 0.04 seconds
9YQaZv: 982 0.08 seconds
TDQIBsbJn: 1002 0.09 seconds
nfswdQ4H: 1029 0.05 seconds
76: 1960 0.09 seconds

Giorgi commented 6 months ago

@daz10000 I had to remove several columns from the query because I was getting OOM otherwise. By the way, have you joined the dotnet channel on the DuckDB Discord? We can chat there for faster communication.

Giorgi commented 6 months ago

@daz10000

Does it look better?

daz10000 commented 6 months ago

That looks way better - and looking forward to the new release coming up (thanks for the discord chat too). All appreciated. I think you can close this if you're satisfied and I'll grab the new release when it comes out.

Giorgi commented 6 months ago

I'll close this when I publish the updated package. By the way, the fix is already available from the GitHub package feed if you want to try it.

Giorgi commented 6 months ago

@daz10000 Just pushed 0.10.3 to NuGet. Can you test it when you have time?

daz10000 commented 6 months ago

I saw that last night and took it for a test drive it works beautifully. much faster. thanks again !DarrenAm 5/22/24 um 1:21 PM schrieb Giorgi Dalakishvili @.***>: @daz10000 Just pushed 0.10.3 to NuGet. Can you test it when you have time?

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>

Giorgi commented 6 months ago

Nice! I'm interested in how much difference the streaming vs non-streaming approach makes on your data. If you have time can you run the following 4 benchmarks? 2 with your old data and compare EnableStreamingMode true vs false, and 2 with your newer data (sorted file) and again EnableStreamingMode true vs false.

Thanks.

Giorgi / DuckDB.NET

Performance issue relative to other access methods #188

Example

Rabbit hole