GlareDB / glaredb

GlareDB: An analytics DBMS for distributed data
https://glaredb.com
GNU Affero General Public License v3.0
654 stars 38 forks source link

improve formatting for tables with long strings #2176

Open universalmind303 opened 10 months ago

universalmind303 commented 10 months ago

Description

The width detection logic still seems to be a bit off.

It seems to sample from the front instead of evenly.

for example, i'd expect this to render a table of the same character width

\rows 1000
select * exclude(parameters) from glare_catalog.functions order by function_name asc;
select * exclude(parameters) from glare_catalog.functions order by function_name desc;

However, when descending order, it truncates APPROX_PERCENTILE_CONT_WITH_WEIGHT and APPROX_PERCENTILE_CONT.

Setting \max-width has no effect.

Additionally, on fields with very long strings and newlines, it produces nearly illegible tables.

> select example, description from function_info where function_type = 'table' and example is not null;
┌──────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────┐
│ example                                                                                          │ description │
│ ──                                                                                               │ ──          │
│ Utf8                                                                                             │ Utf8        │
╞══════════════════════════════════════════════════════════════════════════════════════════════════╪═════════════╡
│ ↵                                                                                                │ ↵           │
│ -- Read a relative path.↵                                                                        │ Syntax:↵    │
│ SELECT * FROM parquet_scan('./my_data.parquet');↵                                                │ -- Single↵  │
│ -- Read all parquet files in a directory.↵                                                       │ url or↵     │
│ SELECT * FROM parquet_scan('./directory_of_data/*.parquet');↵                                    │ path.↵      │
│ -- Read parquet files from multiple directories.↵                                                │ parquet_sc↵ │
│ SELECT * FROM parquet_scan('./**/*.parquet');↵                                                   │ an(<url>)↵  │
│ -- Read multiple explicitly provided files.↵                                                     │ --↵         │
│ SELECT * FROM parquet_scan(['./directory_of_data/1.parquet', './directory_of_data/2.parquet']);↵ │ Multiple↵   │
│                                                                                                  │ urls or↵    │
│                                                                                                  │ paths.↵     │
│                                                                                                  │ parquet_sc↵ │
│                                                                                                  │ an([<url>]↵ │
│                                                                                                  │ )↵          │
│                                                                                                  │ -- Using a↵ │
│                                                                                                  │ cloud↵      │
│                                                                                                  │ credential↵ │
│                                                                                                  │ s object.↵  │
│                                                                                                  │ parquet_sc↵ │
│                                                                                                  │ an(<url>,↵  │
│                                                                                                  │ <credentia↵ │
│                                                                                                  │ l_object>)↵ │
│                                                                                                  │ --↵         │
│                                                                                                  │ Required↵   │
│                                                                                                  │ named↵      │
│                                                                                                  │ argument↵   │
│                                                                                                  │ for S3↵     │
│                                                                                                  │ buckets.↵   │
│                                                                                                  │ parquet_sc↵ │
│                                                                                                  │ an(<url>,↵  │
│                                                                                                  │ <credentia↵ │
│                                                                                                  │ ls_object>↵ │
│                                                                                                  │ , region↵   │
│                                                                                                  │ =>↵         │
│                                                                                                  │ '<aws_regi↵ │
│                                                                                                  │ on>')↵      │
│                                                                                                  │ -- Pass S3↵ │
│                                                                                                  │ credential↵ │
│                                                                                                  │ s using↵    │
│                                                                                                  │ named↵      │
│                                                                                                  │ arguments.↵ │
│                                                                                                  │ parquet_sc↵ │
│                                                                                                  │ an(<url>,↵  │
│                                                                                                  │ access_key↵ │
│                                                                                                  │ _id =>↵     │
│                                                                                                  │ '<aws_acce↵ │
│                                                                                                  │ ss_key_id>↵ │
│                                                                                                  │ ',↵         │
│                                                                                                  │ secret_acc↵ │
│                                                                                                  │ ess_key =>↵ │
│                                                                                                  │ '<aws_secr↵ │
│                                                                                                  │ et_access_↵ │
│                                                                                                  │ key>',↵     │
│                                                                                                  │ region =>↵  │
│                                                                                                  │ '<aws_regi↵ │
│                                                                                                  │ on>')↵      │
│                                                                                                  │ -- Pass↵    │
│                                                                                                  │ GCS↵        │
│                                                                                                  │ credential↵ │
│                                                                                                  │ s using↵    │
│                                                                                                  │ named↵      │
│                                                                                                  │ arguments.↵ │
│                                                                                                  │ parquet_sc↵ │
│                                                                                                  │ an(<url>,↵  │
│                                                                                                  │ service_ac↵ │
│                                                                                                  │ count_key↵  │
│                                                                                                  │ =>↵         │
│                                                                                                  │ '<gcp_serv↵ │
│                                                                                                  │ ice_accoun↵ │
│                                                                                                  │ t_key>'     │
│ ↵                                                                                                │ ↵           │
│ -- Read a relative path.↵                                                                        │ Syntax:↵    │
│ SELECT * FROM csv_scan('./my_data.csv');↵                                                        │ -- Single↵  │
│ -- Read all csv files in a directory.↵                                                           │ url or↵     │
│ SELECT * FROM csv_scan('./directory_of_data/*.csv');↵                                            │ path.↵      │
│ -- Read csv files from multiple directories.↵                                                    │ csv_scan(<↵ │
│ SELECT * FROM csv_scan('./**/*.csv');↵                                                           │ url>)↵      │
│ -- Read multiple explicitly provided files.↵                                                     │ --↵         │
│ SELECT * FROM csv_scan(['./directory_of_data/1.csv', './directory_of_data/2.csv']);↵             │ Multiple↵   │
│                                                                                                  │ urls or↵    │
│                                                                                                  │ paths.↵     │
│                                                                                                  │ csv_scan([↵ │
│                                                                                                  │ <url>])↵    │
│                                                                                                  │ -- Using a↵ │
│                                                                                                  │ cloud↵      │
│                                                                                                  │ credential↵ │
│                                                                                                  │ s object.↵  │
│                                                                                                  │ csv_scan(<↵ │
│                                                                                                  │ url>,↵      │
│                                                                                                  │ <credentia↵ │
│                                                                                                  │ l_object>)↵ │
│                                                                                                  │ --↵         │
│                                                                                                  │ Required↵   │
│                                                                                                  │ named↵      │
│                                                                                                  │ argument↵   │
│                                                                                                  │ for S3↵     │
│                                                                                                  │ buckets.↵   │
│                                                                                                  │ csv_scan(<↵ │
│                                                                                                  │ url>,↵      │
│                                                                                                  │ <credentia↵ │
│                                                                                                  │ ls_object>↵ │
│                                                                                                  │ , region↵   │
│                                                                                                  │ =>↵         │
│                                                                                                  │ '<aws_regi↵ │
│                                                                                                  │ on>')↵      │
│                                                                                                  │ -- Pass S3↵ │
│                                                                                                  │ credential↵ │
│                                                                                                  │ s using↵    │
│                                                                                                  │ named↵      │
│                                                                                                  │ arguments.↵ │
│                                                                                                  │ csv_scan(<↵ │
│                                                                                                  │ url>,↵      │
│                                                                                                  │ access_key↵ │
│                                                                                                  │ _id =>↵     │
│                                                                                                  │ '<aws_acce↵ │
│                                                                                                  │ ss_key_id>↵ │
│                                                                                                  │ ',↵         │
│                                                                                                  │ secret_acc↵ │
│                                                                                                  │ ess_key =>↵ │
│                                                                                                  │ '<aws_secr↵ │
│                                                                                                  │ et_access_↵ │
│                                                                                                  │ key>',↵     │
│                                                                                                  │ region =>↵  │
│                                                                                                  │ '<aws_regi↵ │
│                                                                                                  │ on>')↵      │
│                                                                                                  │ -- Pass↵    │
│                                                                                                  │ GCS↵        │
│                                                                                                  │ credential↵ │
│                                                                                                  │ s using↵    │
│                                                                                                  │ named↵      │
│                                                                                                  │ arguments.↵ │
│                                                                                                  │ csv_scan(<↵ │
│                                                                                                  │ url>,↵      │
│                                                                                                  │ service_ac↵ │
│                                                                                                  │ count_key↵  │
│                                                                                                  │ =>↵         │
│                                                                                                  │ '<gcp_serv↵ │
│                                                                                                  │ ice_accoun↵ │
│                                                                                                  │ t_key>'     │
│ ↵                                                                                                │ ↵           │
│ -- Read a relative path.↵                                                                        │ Syntax:↵    │
│ SELECT * FROM json_scan('./my_data.json');↵                                                      │ -- Single↵  │
│ -- Read all json files in a directory.↵                                                          │ url or↵     │
│ SELECT * FROM json_scan('./directory_of_data/*.json');↵                                          │ path.↵      │
│ -- Read json files from multiple directories.↵                                                   │ json_scan(↵ │
│ SELECT * FROM json_scan('./**/*.json');↵                                                         │ <url>)↵     │
│ -- Read multiple explicitly provided files.↵                                                     │ --↵         │
│ SELECT * FROM json_scan(['./directory_of_data/1.json', './directory_of_data/2.json']);↵          │ Multiple↵   │
│                                                                                                  │ urls or↵    │
│                                                                                                  │ paths.↵     │
│                                                                                                  │ json_scan(↵ │
│                                                                                                  │ [<url>])↵   │
│                                                                                                  │ -- Using a↵ │
│                                                                                                  │ cloud↵      │
│                                                                                                  │ credential↵ │
│                                                                                                  │ s object.↵  │
│                                                                                                  │ json_scan(↵ │
│                                                                                                  │ <url>,↵     │
│                                                                                                  │ <credentia↵ │
│                                                                                                  │ l_object>)↵ │
│                                                                                                  │ --↵         │
│                                                                                                  │ Required↵   │
│                                                                                                  │ named↵      │
│                                                                                                  │ argument↵   │
│                                                                                                  │ for S3↵     │
│                                                                                                  │ buckets.↵   │
│                                                                                                  │ json_scan(↵ │
│                                                                                                  │ <url>,↵     │
│                                                                                                  │ <credentia↵ │
│                                                                                                  │ ls_object>↵ │
│                                                                                                  │ , region↵   │
│                                                                                                  │ =>↵         │
│                                                                                                  │ '<aws_regi↵ │
│                                                                                                  │ on>')↵      │
│                                                                                                  │ -- Pass S3↵ │
│                                                                                                  │ credential↵ │
│                                                                                                  │ s using↵    │
│                                                                                                  │ named↵      │
│                                                                                                  │ arguments.↵ │
│                                                                                                  │ json_scan(↵ │
│                                                                                                  │ <url>,↵     │
│                                                                                                  │ access_key↵ │
│                                                                                                  │ _id =>↵     │
│                                                                                                  │ '<aws_acce↵ │
│                                                                                                  │ ss_key_id>↵ │
│                                                                                                  │ ',↵         │
│                                                                                                  │ secret_acc↵ │
│                                                                                                  │ ess_key =>↵ │
│                                                                                                  │ '<aws_secr↵ │
│                                                                                                  │ et_access_↵ │
│                                                                                                  │ key>',↵     │
│                                                                                                  │ region =>↵  │
│                                                                                                  │ '<aws_regi↵ │
│                                                                                                  │ on>')↵      │
│                                                                                                  │ -- Pass↵    │
│                                                                                                  │ GCS↵        │
│                                                                                                  │ credential↵ │
│                                                                                                  │ s using↵    │
│                                                                                                  │ named↵      │
│                                                                                                  │ arguments.↵ │
│                                                                                                  │ json_scan(↵ │
│                                                                                                  │ <url>,↵     │
│                                                                                                  │ service_ac↵ │
│                                                                                                  │ count_key↵  │
│                                                                                                  │ =>↵         │
│                                                                                                  │ '<gcp_serv↵ │
│                                                                                                  │ ice_accoun↵ │
│                                                                                                  │ t_key>'     │
└──────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────┘

It seems to work fine when there is 1 column though

┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ description                                                                                                                            │
│ ──                                                                                                                                     │
│ Utf8                                                                                                                                   │
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ ↵                                                                                                                                      │
│ Syntax:↵                                                                                                                               │
│ -- Single url or path.↵                                                                                                                │
│ parquet_scan(<url>)↵                                                                                                                   │
│ -- Multiple urls or paths.↵                                                                                                            │
│ parquet_scan([<url>])↵                                                                                                                 │
│ -- Using a cloud credentials object.↵                                                                                                  │
│ parquet_scan(<url>, <credential_object>)↵                                                                                              │
│ -- Required named argument for S3 buckets.↵                                                                                            │
│ parquet_scan(<url>, <credentials_object>, region => '<aws_region>')↵                                                                   │
│ -- Pass S3 credentials using named arguments.↵                                                                                         │
│ parquet_scan(<url>, access_key_id => '<aws_access_key_id>', secret_access_key => '<aws_secret_access_key>', region => '<aws_region>')↵ │
│ -- Pass GCS credentials using named arguments.↵                                                                                        │
│ parquet_scan(<url>, service_account_key => '<gcp_service_account_key>'                                                                 │
│ ↵                                                                                                                                      │
│ Syntax:↵                                                                                                                               │
│ -- Single url or path.↵                                                                                                                │
│ csv_scan(<url>)↵                                                                                                                       │
│ -- Multiple urls or paths.↵                                                                                                            │
│ csv_scan([<url>])↵                                                                                                                     │
│ -- Using a cloud credentials object.↵                                                                                                  │
│ csv_scan(<url>, <credential_object>)↵                                                                                                  │
│ -- Required named argument for S3 buckets.↵                                                                                            │
│ csv_scan(<url>, <credentials_object>, region => '<aws_region>')↵                                                                       │
│ -- Pass S3 credentials using named arguments.↵                                                                                         │
│ csv_scan(<url>, access_key_id => '<aws_access_key_id>', secret_access_key => '<aws_secret_access_key>', region => '<aws_region>')↵     │
│ -- Pass GCS credentials using named arguments.↵                                                                                        │
│ csv_scan(<url>, service_account_key => '<gcp_service_account_key>'                                                                     │
│ ↵                                                                                                                                      │
│ Syntax:↵                                                                                                                               │
│ -- Single url or path.↵                                                                                                                │
│ json_scan(<url>)↵                                                                                                                      │
│ -- Multiple urls or paths.↵                                                                                                            │
│ json_scan([<url>])↵                                                                                                                    │
│ -- Using a cloud credentials object.↵                                                                                                  │
│ json_scan(<url>, <credential_object>)↵                                                                                                 │
│ -- Required named argument for S3 buckets.↵                                                                                            │
│ json_scan(<url>, <credentials_object>, region => '<aws_region>')↵                                                                      │
│ -- Pass S3 credentials using named arguments.↵                                                                                         │
│ json_scan(<url>, access_key_id => '<aws_access_key_id>', secret_access_key => '<aws_secret_access_key>', region => '<aws_region>')↵    │
│ -- Pass GCS credentials using named arguments.↵                                                                                        │
│ json_scan(<url>, service_account_key => '<gcp_service_account_key>'                                                                    │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
universalmind303 commented 10 months ago

Additionally, it'd be nice to do something like polars. They have the env var POLARS_FMT_STR_LEN which allows you to manually set the max str length to use.

scsmithr commented 10 months ago

It seems to sample from the front instead of evenly.

for example, i'd expect this to render a table of the same character width

\rows 1000 select exclude(parameters) from glare_catalog.functions order by function_name asc; select exclude(parameters) from glare_catalog.functions order by function_name desc;

However, when descending order, it truncates APPROX_PERCENTILE_CONT_WITH_WEIGHT and APPROX_PERCENTILE_CONT.

We do only sample from the beginning. We can extend that to sample evenly in the first record batch but we want to avoid buffering all batches in memory. This might be enough to get a more representative width.

ugly table

Yeah I agree the long string table looks pretty bad.