Open alberttwong opened 7 months ago
Using partition statistics would be even faster, but they are new and optional to write. The default approach, which is what e.g. Trino usually does, is to plan an Iceberg file scan and read only the metadata (which essentially means opening the manifests etc.); that can take a few seconds when reading from S3. It should be quite fast if e.g. the Iceberg metadata is cached in StarRocks.
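To make the metadata-only idea concrete, here is a minimal sketch of answering count(*) from manifest entries instead of data files. It is not how StarRocks or Trino implement it; `DataFileEntry` and `count_from_metadata` are hypothetical names, and the only thing taken from the Iceberg spec is that each manifest entry carries a `record_count` for its file. The sketch assumes the entries have already been read from the manifests, and it gives up when delete files are present, since per-file counts would then overstate the live rows.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class DataFileEntry:
    """Hypothetical, simplified view of one live manifest entry."""
    path: str
    record_count: int  # mirrors the record_count field of a manifest entry's data file
    is_delete_file: bool = False


def count_from_metadata(entries: Iterable[DataFileEntry]) -> Optional[int]:
    """Return the row count from metadata alone, or None if a data scan is needed."""
    total = 0
    for entry in entries:
        if entry.is_delete_file:
            # Row-level deletes make the per-file counts an overestimate,
            # so fall back to a real scan in that case.
            return None
        total += entry.record_count
    return total


entries = [
    DataFileEntry("s3://bucket/t/data/a.parquet", 1_000_000),
    DataFileEntry("s3://bucket/t/data/b.parquet", 2_500_000),
]
print(count_from_metadata(entries))
```

The cost here is proportional to the number of manifest entries rather than the number of rows, which is why the time is dominated by fetching manifests from S3 unless they are cached.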
Looks like the infra is there, just not for Iceberg.
@stephen-shelby
Mykhail Martsyniuk 6 hours ago Hello guys. I was comparing performance of queries over an Iceberg data lake and got weird results. On a 6-billion-row table, a simple select count(*) from table query takes 4 minutes in StarRocks, while the same query on a fresh Trino cluster takes about 15 s. (Both clusters are 2 nodes in the same region, though I'm not sure whether the BE participates in data lake queries if no cache is involved.) Am I missing some config?
Albert Wong :starrocks: 26 minutes ago Count is a horrible way to test performance unless that is all your app does
Albert Wong :starrocks: 25 minutes ago There is a page in our docs on using bitmap for counts
Mykhail Martsyniuk 24 minutes ago Yes, definitely. But this is also the most basic thing people do. I wonder why there is so much difference. I don't think Trino can read 300 GB of data within 15 seconds. They are utilizing Iceberg metadata, I guess
Albert Wong :starrocks: 24 minutes ago What if I told you that there are a lot of ways for projects to play with count
Albert Wong :starrocks: 23 minutes ago I’m not even sure if Iceberg holds that metadata. Hudi does
Mykhail Martsyniuk 12 minutes ago actually it seems it has: https://iceberg.apache.org/spec/#partition-statistics-file
Albert Wong :starrocks: 8 minutes ago If it stores that data in metatable why would it still take 15 seconds? 15 seconds to look up one field????
Mykhail Martsyniuk 6 minutes ago It's per partition, with more than 100 partitions. Still a lot of time though, not sure. What I did was spawn a fresh cluster on Starburst, so there was no cache whatsoever. In comparison, AWS Athena does the same in 3 seconds
Mykhail Martsyniuk 4 minutes ago In any case, I will proceed with more relevant queries. Just thought maybe it could be helpful for you :slightly_smiling_face: