Closed daz10000 closed 5 months ago
I will need data to check but I think that the sql
method doesn't execute/fetch the results. Can you see how much time the following takes in Python? SELECT col1,col2 etc from read_parquet('/path/to/parquet.pq') where key= 'value'
Yes, when I first did the benchmarking with .sql()
it was suspiciously fast - 3 milliseconds per fetch, and was suspicious it wasn't pulling the actual columns (and it wasn't). If you explicitly convert result to a dataframe with .df()
it does the full fetch. I even made it calculate some summary statistics like average of a few columns to ensure it had actually fetched the data. That pushed it up to 150ms from 3ms. I don't know how duckdb.net
is getting out to 2500 ms still. I am trying to get it built from source so I can push the profiling down a layer right now.
duckdb.sql("select col1,col2,col3 etc from read_parquet('<path.pq>) where key = 'value'").df()
How much time does duckdb.sql("select col1,col2,col3 etc from read_parquet('<path.pq>) where key = 'value'").fetchall()
take?
You should not need to recompile the library to profile it, JetBrains dotTrace should be able to show how long each takes
My initial attempts couldn't see much of interest below the ExecuteReader
call. I just profiled a run against the pulled and build github code and it's not much more informative other than it seems to go down into the duckdb.dll
itself, but doesn't account for all the time. I'll take a better look when I'm not on a zoom call :( For what it's worth, it seems like most of the time goes to InitChunkData
.
If you share parquet file (with wrangled data) I can try to look into it. By the way, calling async methods has no benefit because those methods aren't implemented and call the non-async code behind the scenes.
Digging a little deeper into the duckdb.dll
call stack. Again not sure if the areas it's spending time makes sense. I wonder if there are any activities that are recurring that could be done once for example. I'll try to make you a random data set to play with and example code. Thanks for being responsive.
InitChunkData
fetches the next data chunk from the query result by calling duckdb_stream_fetch_chunk
or duckdb_result_get_chunk
depending on the result type. I can try running the query with plain C API to see how much time it takes compared to .NET
Can you also try Line-by-line profiling?
Here is the line by line output. Let me see if I can build you a synthetic data set that reproduces the issue. I'm still a bit paranoid that the python/duckdb cli equivalents are cheating somehow because the output looks realistic - the low level IO calls are taking all the time, but there must be some different and even if the python/cli tool are cheating, I'd like the benefit of the shortcut
So with preface that there is a bit of rabbit hole here (more below), I built some code that can reproduce a dataset (since the dataset is big when made, it's probably kinder and more useful to give you the code rather than the output. There are three scripts in this gist
gen_synthetic.fsx
is a piece of F# that can make test.pq
. Just invoke it as dotnet fsi gen_synthetic.fsx
and it will make the output in the folder (took a few minutes on my test machine). It will emit an FS3511 warning you can ignore. It also makes a keys.txt
file with the keys for searching. The test.pq file was 26278428987
bytes for my test so 26ish Gb.
duckdb_bench.fsx
is the F# test using duckdb.net
. It's even slower on the synthetic data set (but so is the python). It was around 17 seconds per query on my test machine.
duckdb_bench.py
is the python equivalent. It's 4.2 seconds, so slower than my real example but still 4x faster than duckdb.net.
generating data and querying via F#, python and the command line tool.
[ec2-user@ip-10-0-17-138 duckdb_profiling]$ dotnet fsi gen_synthetic.fsx
/data2/tmp/duckdb_profiling/gen_synthetic.fsx(47,1): warning FS3511: This state machine is not statically compilable. A resumable code invocation at '(47,0--47,4)' could not be reduced. An alternative dynamic implementation will be used, which may be slower. Consider adjusting your code to ensure this state machine is statically compilable, or else suppress this warning.
[ec2-user@ duckdb_profiling]$ dotnet fsi duckdb_bench.fsx
Starting
FY32NZ: 994 17.3436603
KG0a: 1035 18.164819
[ec2-user@duckdb_profiling]$ ls
duckdb_bench.fsx duckdb_bench.py gen_synthetic.fsx keys.txt test.pq
[ec2-user@ duckdb_profiling]$ python3.11 duckdb_bench.py
FY32NZ: 994 4.62 seconds
KG0a: 1035 4.23 seconds
WoPLL4f: 998 4.35 seconds
6RP0: 1005 4.27 seconds
2c3L: 1045 4.19 seconds
f: 14117 4.29 seconds
louTJZK: 1018 4.23 seconds
time duckdb -c "SELECT col1,col2,col3,col20,col21,col34, col8,col51,col60,col72,col43,col19 FROM read_parquet('test.pq') where col71 = 'FY32NZ'"
100% ▕████████████████████████████████████████████████████████████▏
┌─────────────────────┬────────┬────────┬───────┬──────────────────────┬────────┬───┬─────────┬─────────┬───────┬─────────────────────┬───────────┐
│ col1 │ col2 │ col3 │ col20 │ col21 │ col34 │ … │ col51 │ col60 │ col72 │ col43 │ col19 │
│ double │ int64 │ int64 │ blob │ blob │ int64 │ │ boolean │ boolean │ blob │ double │ blob │
├─────────────────────┼────────┼────────┼───────┼──────────────────────┼────────┼───┼─────────┼─────────┼───────┼─────────────────────┼───────────┤
│ 0.49871597213279306 │ 739587 │ 55085 │ K1 │ u3IX9BtZtl2Pgtt5WH… │ 798660 │ … │ false │ false │ 5r │ 0.24071176254405058 │ 3HfZ436 │
│ 0.3649178067378053 │ 826936 │ 129086 │ dN7 │ 474svWPGFXrC3GPmzJ… │ 933806 │ … │ false │ false │ 8 │ 0.41325685787174715 │ 2m │
│ 0.6076152736496742 │ 544147 │ 495971 │ bIK │ qIAs6qfJLCywTIXgmu… │ 438720 │ … │ true │ false │ a │ 0.6094365338596414 │ xEX9kP │
│ 0.7069580230175578 │ 263992 │ 649813 │ FJ │ G6nOUNRw0nvIX9QC8c… │ 67596 │ … │ false │ true │ fN │ 0.7137351812599049 │ I │
│ 0.41429582393584763 │ 947931 │ 665398 │ cBG │ xLjAqiVzLhzxVI3A2m… │ 768745 │ … │ false │ false │ U │ 0.16662012829022033 │ oBKic │
│ 0.856880729937517 │ 162367 │ 773955 │ Y │ HHKS0Msf7bgNC1hbPa… │ 569777 │ … │ false │ false │ zZ │ 0.7077931347891507 │ 9RYn3J66e │
│ 0.36372773950977977 │ 317753 │ 731348 │ lb9 │ wLXW0PJiTTolxOHZ0V… │ 287591 │ … │ true │ true │ LX │ 0.5201957289005421 │ ioqKANw1 │
│ 0.04953325089556626 │ 206975 │ 364628 │ gtZ │ k6ysvKGQxOd23ZTCtb… │ 304643 │ … │ true │ true │ LH │ 0.1370955340085882 │ rnF7H8 │
│ 0.3931929994494947 │ 475854 │ 95522 │ 4OQ │ vmlQvBkjFH2bz4USPt… │ 447793 │ … │ true │ false │ U │ 0.24523982856709903 │ gtyu8S │
│ 0.4472059783791561 │ 404745 │ 618708 │ p7f │ bLNwvYOl6b1cx1pRU0… │ 916101 │ … │ false │ false │ V │ 0.8822853475648034 │ 2f │
│ 0.9338920339561003 │ 937194 │ 560242 │ dUc │ 7rkuDV4cQOWSSH9Z3y… │ 623626 │ … │ true │ true │ 6P │ 0.8291028460967619 │ B7M3aD0 │
│ 0.1545087930177168 │ 476587 │ 286969 │ T │ 0aJNwJFTIZTNAKHJ5p… │ 782475 │ … │ true │ false │ LH │ 0.9396593051633989 │ nyxEx │
│ 0.5052093211542894 │ 903692 │ 876326 │ xH │ VYLUri7gXvu0tNfUC8… │ 176441 │ … │ false │ false │ dB │ 0.4165140349961912 │ Ev4 │
│ 0.9392755574986783 │ 86270 │ 302064 │ Lf │ KaAqgZWSI27G8YZec3… │ 76472 │ … │ false │ true │ fI │ 0.5341451462945567 │ Y4 │
│ 0.9984677587374989 │ 689631 │ 674187 │ X │ ed8BcC9aLrqy8IDdxj… │ 189233 │ … │ true │ true │ CM │ 0.8622882776420094 │ 2nDWASV74 │
│ 0.5187499201114347 │ 979113 │ 142311 │ AJ │ cOiQ1HKAMaUpdsAM2d… │ 789877 │ … │ false │ false │ 9 │ 0.09423379968300094 │ RWykIvlH3 │
│ 0.5412472987194475 │ 381519 │ 505615 │ Nv │ aJAbmIGlWZ44z4mvXX… │ 60353 │ … │ true │ false │ F │ 0.6469023530985042 │ 0xIV7WyX │
│ 0.26862100516912146 │ 146416 │ 347085 │ S6 │ bjLP5OUZ8za6np9G9t… │ 798171 │ … │ true │ false │ I │ 0.18710059184629224 │ 1BjAkv4r │
│ 0.4798058339143766 │ 772096 │ 30132 │ d8j │ mFbtEQNI1DT2EaCYF4… │ 714329 │ … │ false │ true │ f │ 0.27392474593381955 │ UX0Cw │
│ 0.7307656983092111 │ 662843 │ 663838 │ fAu │ j9UVlVrKSypo6Em320… │ 564450 │ … │ false │ true │ w │ 0.4075195478660085 │ YleS │
│ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │ · │
│ 0.12107519435156688 │ 775262 │ 473226 │ CU │ KLEbatH6T2i3zm5KiE… │ 536433 │ … │ true │ false │ hb │ 0.28212048542341683 │ F │
│ 0.9821264383155804 │ 828469 │ 502264 │ sb │ b5cXV9GL8Ib9TRL2qC… │ 306053 │ … │ false │ false │ KL │ 0.09958910496107343 │ Ibuy7sS │
│ 0.4818099612809347 │ 670793 │ 831538 │ kF │ N86IcMrju6eFLwOAQZ… │ 509699 │ … │ false │ true │ d │ 0.20045554151241296 │ 0hqoQz │
│ 0.41430143917756723 │ 552751 │ 385964 │ F │ vGUv5bcrAt9U0KMrHl… │ 230152 │ … │ false │ false │ wc │ 0.06768997904253238 │ eH3XrE3f │
│ 0.3143632006978512 │ 368376 │ 973272 │ oO │ tovxDffoxLlfg0HpTq… │ 957813 │ … │ true │ false │ y │ 0.14381938258783222 │ w │
│ 0.6884445698812717 │ 277454 │ 813113 │ 1p │ Zncd73xxj5HY1tO7T6… │ 601574 │ … │ false │ true │ bn │ 0.9956267516904547 │ kerA8aey │
│ 0.7832630091114771 │ 562535 │ 261492 │ 5w │ PLgfS5OwJfjvpMYNKv… │ 28242 │ … │ false │ true │ h │ 0.4247233656664463 │ uWt9 │
│ 0.19236408324357424 │ 260571 │ 603569 │ Y9M │ hh3JZjWcEBgaWJJZn5… │ 773456 │ … │ true │ false │ p6 │ 0.6373815429259283 │ le3wd1b │
│ 0.8578899593590746 │ 799447 │ 737540 │ K │ VKLQVZdMs5x0LftcBx… │ 776282 │ … │ true │ false │ Hs │ 0.796979680711421 │ lB6Wd4Am │
│ 0.4398278885273764 │ 304916 │ 27378 │ Oga │ 1u16FOTGBeqQLfzQTO… │ 198072 │ … │ false │ false │ G │ 0.5160752166373654 │ AydRxw9 │
│ 0.98124506244981 │ 270859 │ 165201 │ Wv │ TwqFpcLEplL9RKHJrc… │ 2605 │ … │ true │ true │ T │ 0.925190009505502 │ HXomNs │
│ 0.3734382112823206 │ 306224 │ 16040 │ GL │ US8Q97KlUreH33Zxpt… │ 763626 │ … │ false │ false │ F │ 0.9234632686480612 │ 3qYMNdEO │
│ 0.5622472076312106 │ 449180 │ 48244 │ r │ vAyW6GDDGPiaYsvEzg… │ 327301 │ … │ false │ false │ r0 │ 0.20752904442041065 │ 0o3SxoI │
│ 0.9777380518459254 │ 114517 │ 613571 │ rx1 │ zC2YOjIsRcS8hbL6s3… │ 470064 │ … │ false │ false │ 8 │ 0.4478818935882599 │ ZWsGXjkxs │
│ 0.03672253297345174 │ 558108 │ 477328 │ 6yk │ CfKdM9ltLcivDQOxzV… │ 705420 │ … │ true │ false │ O │ 0.37019634096062026 │ 6YYe │
│ 0.6855576119469486 │ 234337 │ 798569 │ K5 │ rWyKyhxeTbFRWWdEVM… │ 158154 │ … │ true │ false │ 6 │ 0.11074547530981727 │ 2TN5 │
│ 0.7965877995867655 │ 590932 │ 692964 │ C │ V5yKwYtLdMYfTUvV5G… │ 319774 │ … │ false │ false │ O │ 0.49180707087991016 │ Npm68 │
│ 0.37782951903188466 │ 458597 │ 148218 │ NX │ 2RisqwIkdPIgj9RGla… │ 902059 │ … │ false │ false │ Fk │ 0.07228979870122754 │ 0g1U │
│ 0.5479535777882614 │ 756906 │ 162240 │ N │ AKSfAZpjmRoxvH8Ix1… │ 46801 │ … │ true │ true │ j │ 0.6226207699729053 │ f │
│ 0.47620959351265546 │ 475873 │ 727248 │ er │ WwCLcUwSIj5ozrCzw8… │ 836221 │ … │ true │ false │ hL │ 0.9150087677095542 │ ApNEuYr │
├─────────────────────┴────────┴────────┴───────┴──────────────────────┴────────┴───┴─────────┴─────────┴───────┴─────────────────────┴───────────┤
│ 994 rows (40 shown) 12 columns (11 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
real 0m4.430s
user 0m7.596s
sys 0m43.200s
One thing I discovered while simulating the data, is that my parquet file is not sorted on the key I am using for retrieval. Due to the way the original file was generated, you get clumps of values with the same key but they aren't all in one place. The obvious thing for me to do here is to sort the table - that should overall be kinder to duckdb. The simulated data is more randomly ordered - not even a little bit clumped, so I think that's why the retrieval times go up. Python still has an advantage here for some reason and the command line tool is about as fast (4.2ish seconds). I am going to presort the table so it's at least presented in an optimal fashion and will see what things do but I'm still interested in where the speed gap is.
Using a sorted version of the same parquet file and slightly smaller rowgroups (100k vs 1M rows per group) the dotnet version is way faster 8-10ms / query
[ec2-user@ip-10-0-17-138 duckdb_profiling]$ dotnet fsi duckdb_bench.fsx
Starting
FY32NZ: 994 0.2512582
KG0a: 1035 0.1448276
WoPLL4f: 998 0.0785816
6RP0: 1005 0.085683
2c3L: 1045 0.0636014
f: 14117 0.153188
louTJZK: 1018 0.0755844
bL: 1952 0.1267469
9m: 963 0.1319137
jLHE: 1007 0.1037502
orgU: 1024 0.078231
ihl0JbP: 1002 0.1105006
e86kejW: 984 0.120137
s: 13991 0.0919954
N9: 1026 0.1350647
Y18gS0: 1025 0.0789596
Nsw: 968 0.1406332
IL3sK: 994 0.1269129
D: 21913 0.1085171
bkbfn81: 1001 0.0724217
SLfsLJ: 1002 0.0831033
6aMhN: 1056 0.110397
9n6T: 993 0.1450154
KZUnyOD: 973 0.0795596
iaTwX: 1013 0.097228
bROwBcOUu: 972 0.1422997
MX: 1011 0.1383832
23Cn2fJZm: 1001 0.0824698
Fx2fpl: 1006 0.0852351
fs3pQ69: 981 0.0534778
WdspXhYHh: 1038 0.057838
and the python is comparable - note the first few queries are slower for both libs but thenspeed up and they seem very similar. So whatever the difference, it might only matter when there are lots of different pages getting retrieved.
FY32NZ: 994 0.43 seconds
KG0a: 1035 0.15 seconds
WoPLL4f: 998 0.07 seconds
6RP0: 1005 0.07 seconds
2c3L: 1045 0.06 seconds
f: 14117 0.14 seconds
louTJZK: 1018 0.07 seconds
bL: 1952 0.11 seconds
9m: 963 0.11 seconds
jLHE: 1007 0.10 seconds
orgU: 1024 0.07 seconds
ihl0JbP: 1002 0.10 seconds
e86kejW: 984 0.11 seconds
s: 13991 0.09 seconds
N9: 1026 0.11 seconds
Y18gS0: 1025 0.07 seconds
Nsw: 968 0.13 seconds
IL3sK: 994 0.11 seconds
D: 21913 0.09 seconds
bkbfn81: 1001 0.07 seconds
SLfsLJ: 1002 0.07 seconds
6aMhN: 1056 0.08 seconds
9n6T: 993 0.11 seconds
KZUnyOD: 973 0.07 seconds
iaTwX: 1013 0.08 seconds
bROwBcOUu: 972 0.11 seconds
MX: 1011 0.10 seconds
23Cn2fJZm: 1001 0.05 seconds
Fx2fpl: 1006 0.06 seconds
fs3pQ69: 981 0.05 seconds
WdspXhYHh: 1038 0.04 seconds
VN0nL: 1041 0.04 seconds
9YQaZv: 982 0.08 seconds
TDQIBsbJn: 1002 0.09 seconds
nfswdQ4H: 1029 0.05 seconds
76: 1960 0.09 seconds
@daz10000 I had to remove several columns from the query because I was getting OOM otherwise. By the way, have you joined the dotnet channel on the DuckDB Discord? We can chat there for faster communication.
@daz10000
Does it look better?
That looks way better - and looking forward to the new release coming up (thanks for the discord chat too). All appreciated. I think you can close this if you're satisfied and I'll grab the new release when it comes out.
I'll close this when I publish the updated package. By the way, the fix is already available from the GitHub package feed if you want to try it.
@daz10000 Just pushed 0.10.3 to NuGet. Can you test it when you have time?
I saw that last night and took it for a test drive it works beautifully. much faster. thanks again !DarrenAm 5/22/24 um 1:21 PM schrieb Giorgi Dalakishvili @.***>: @daz10000 Just pushed 0.10.3 to NuGet. Can you test it when you have time?
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>
Nice! I'm interested in how much difference the streaming vs non-streaming approach makes on your data. If you have time can you run the following 4 benchmarks? 2 with your old data and compare EnableStreamingMode
true vs false, and 2 with your newer data (sorted file) and again EnableStreamingMode
true vs false.
Thanks.
I have an application pulling blocks of data (typically 100-1000 rows from a 20ish million row parquet file) with ~ 15 columns of which I use maybe 10. I am querying the parquet file with a simple
SELECT col1,col2 etc from read_parquet('/path/to/parquet.pq') where key= 'value'
.The code was originally python and with the equivalent python bindings, wrapping into
duckdb.sql( <query>).df()
it takes ~ 150 milliseconds per query. I run 1000s of these. I'm seeing more like 2.5 seconds with duckdb.net, with the bulk of the time spent in the executereader call. I also tried just a command line invocation withduckdb -c '<query>'
and it's also around 150 milliseconds, so duckdb.net is the outlier. The data set isn't shareable but assuming I can recreate something similar, I can try to share something. The data aren't super interesting - lots of float and int columns mainly with a few small string cols. My access pattern is something likeI guess short of building and sending a synthetic data set, I had a few questions - is this surprising? Should this be more than 10x slowed than the underlying duckdb libraries. Am I doing something really dumb here that is hurting performance? I tried async versions of the API and saw no improvements. It's the
ExecuteReader
call that takes 99% of the time. My profiling couldn't get inside the duckdb libraries but I'm going to try to build them and profile the whole things as a next step.Any thoughts / help appreciated.