Command-line tool for viewing, querying, converting, and comparing files in popular data formats (CSV, Parquet, JSON, and Avro).
Powered by Apache Arrow and DataFusion.
Rust must be installed first. Follow instructions at https://rustup.rs/.
cargo install bdt
Boring Data Tool
USAGE:
bdt <SUBCOMMAND>
FLAGS:
-h, --help Prints help information
-V, --version Prints version information
SUBCOMMANDS:
compare Compare the contents of two files
convert Convert a file to a different format
count Show the row count of the file
help Prints this message or the help of the given subcommand(s)
query Run a SQL query against one or more files
schema View schema of a file
view View contents of a file
view-parquet-meta View Parquet metadata
bdt schema /mnt/bigdata/nyctaxi/yellow_tripdata_2022-01.parquet
+-----------------------+-----------------------------+-------------+
| column_name | data_type | is_nullable |
+-----------------------+-----------------------------+-------------+
| VendorID | Int64 | YES |
| tpep_pickup_datetime | Timestamp(Nanosecond, None) | YES |
| tpep_dropoff_datetime | Timestamp(Nanosecond, None) | YES |
| passenger_count | Float64 | YES |
| trip_distance | Float64 | YES |
| RatecodeID | Float64 | YES |
| store_and_fwd_flag | Utf8 | YES |
| PULocationID | Int64 | YES |
| DOLocationID | Int64 | YES |
| payment_type | Int64 | YES |
| fare_amount | Float64 | YES |
| extra | Float64 | YES |
| mta_tax | Float64 | YES |
| tip_amount | Float64 | YES |
| tolls_amount | Float64 | YES |
| improvement_surcharge | Float64 | YES |
| total_amount | Float64 | YES |
| congestion_surcharge | Float64 | YES |
| airport_fee | Float64 | YES |
+-----------------------+-----------------------------+-------------+
$ bdt view /path/to/file.parquet --limit 10
+-----------+------------------+--------+--------+----------+----------+---------+---------+-------------+-------------+
| t_time_sk | t_time_id | t_time | t_hour | t_minute | t_second | t_am_pm | t_shift | t_sub_shift | t_meal_time |
+-----------+------------------+--------+--------+----------+----------+---------+---------+-------------+-------------+
| 0 | AAAAAAAABAAAAAAA | 0 | 0 | 0 | 0 | AM | third | night | |
| 1 | AAAAAAAACAAAAAAA | 1 | 0 | 0 | 1 | AM | third | night | |
| 2 | AAAAAAAADAAAAAAA | 2 | 0 | 0 | 2 | AM | third | night | |
| 3 | AAAAAAAAEAAAAAAA | 3 | 0 | 0 | 3 | AM | third | night | |
| 4 | AAAAAAAAFAAAAAAA | 4 | 0 | 0 | 4 | AM | third | night | |
| 5 | AAAAAAAAGAAAAAAA | 5 | 0 | 0 | 5 | AM | third | night | |
| 6 | AAAAAAAAHAAAAAAA | 6 | 0 | 0 | 6 | AM | third | night | |
| 7 | AAAAAAAAIAAAAAAA | 7 | 0 | 0 | 7 | AM | third | night | |
| 8 | AAAAAAAAJAAAAAAA | 8 | 0 | 0 | 8 | AM | third | night | |
| 9 | AAAAAAAAKAAAAAAA | 9 | 0 | 0 | 9 | AM | third | night | |
+-----------+------------------+--------+--------+----------+----------+---------+---------+-------------+-------------+
Queries can be run against one or more tables. Table names are inferred from file names.
$ bdt query --table /mnt/bigdata/nyctaxi/yellow_tripdata_2022-01.parquet \
--sql "SELECT COUNT(*) FROM yellow_tripdata_2022_01"
Registering table 'yellow_tripdata_2022_01' for /mnt/bigdata/nyctaxi/yellow_tripdata_2022-01.parquet
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 2463931 |
+-----------------+
Use the --tables
option to register all files/directories in one directory as tables, and use the --sql-file
option
to load a query from disk.
$ bdt query --tables /mnt/bigdata/tpch/sf10-parquet/ --sql-file /home/andy/git/sql-benchmarks/sqlbench-h/queries/sf=10/q1.sql`
Registering table 'supplier' for /mnt/bigdata/tpch/sf10-parquet/supplier.parquet
Registering table 'part' for /mnt/bigdata/tpch/sf10-parquet/part.parquet
Registering table 'partsupp' for /mnt/bigdata/tpch/sf10-parquet/partsupp.parquet
Registering table 'nation' for /mnt/bigdata/tpch/sf10-parquet/nation.parquet
Registering table 'region' for /mnt/bigdata/tpch/sf10-parquet/region.parquet
Registering table 'orders' for /mnt/bigdata/tpch/sf10-parquet/orders.parquet
Registering table 'lineitem' for /mnt/bigdata/tpch/sf10-parquet/lineitem.parquet
Registering table 'customer' for /mnt/bigdata/tpch/sf10-parquet/customer.parquet
+--------------+--------------+--------------+------------------+--------------------+----------------------+-----------+--------------+----------+-------------+
| l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order |
+--------------+--------------+--------------+------------------+--------------------+----------------------+-----------+--------------+----------+-------------+
| A | F | 377518277.00 | 566065563002.85 | 537758943278.1740 | 559276505545.688411 | 25.500977 | 38237.155374 | 0.050006 | 14804071 |
| N | F | 9851614.00 | 14767438399.17 | 14028805792.2114 | 14590490998.366737 | 25.522448 | 38257.810660 | 0.049973 | 385998 |
| N | O | 730783087.00 | 1095795289143.27 | 1041001162690.9297 | 1082653834336.561576 | 25.497622 | 38233.198852 | 0.049999 | 28660832 |
| R | F | 377732634.00 | 566430710070.73 | 538110604499.8196 | 559634448619.890015 | 25.508381 | 38251.211480 | 0.049996 | 14808177 |
+--------------+--------------+--------------+------------------+--------------------+----------------------+-----------+--------------+----------+-------------+
Query results can also be written to disk by specifying an --output
path.
$ bdt query --table /mnt/bigdata/nyctaxi/yellow_tripdata_2022-01.parquet \
--sql "SELECT COUNT(*) FROM yellow_tripdata_2022_01" \
--output results.csv
Registering table 'yellow_tripdata_2022_01' for /mnt/bigdata/nyctaxi/yellow_tripdata_2022-01.parquet
Writing results in CSV format to results.csv
$ bdt convert /path/to/input.parquet /path/to/output.json
$ cat /path/to/output.json
{"d_date_sk":2415022,"d_date_id":"AAAAAAAAOKJNECAA","d_date":"1900-01-02","d_month_seq":0,"d_week_seq":1,"d_quarter_seq":1,"d_year":1900,"d_dow":1,"d_moy":1,"d_dom":2,"d_qoy":1,"d_fy_year":1900,"d_fy_quarter_seq":1,"d_fy_week_seq":1,"d_day_name":"Monday","d_quarter_name":"1900Q1","d_holiday":"N","d_weekend":"N","d_following_holiday":"Y","d_first_dom":2415021,"d_last_dom":2415020,"d_same_day_ly":2414657,"d_same_day_lq":2414930,"d_current_day":"N","d_current_week":"N","d_current_month":"N","d_current_quarter":"N","d_current_year":"N"}
{"d_date_sk":2415023,"d_date_id":"AAAAAAAAPKJNECAA","d_date":"1900-01-03","d_month_seq":0,"d_week_seq":1,"d_quarter_seq":1,"d_year":1900,"d_dow":2,"d_moy":1,"d_dom":3,"d_qoy":1,"d_fy_year":1900,"d_fy_quarter_seq":1,"d_fy_week_seq":1,"d_day_name":"Tuesday","d_quarter_name":"1900Q1","d_holiday":"N","d_weekend":"N","d_following_holiday":"N","d_first_dom":2415021,"d_last_dom":2415020,"d_same_day_ly":2414658,"d_same_day_lq":2414931,"d_current_day":"N","d_current_week":"N","d_current_month":"N","d_current_quarter":"N","d_current_year":"N"}
$ bdt view-parquet-meta /mnt/bigdata/tpcds/sf100-parquet/store_sales.parquet/part-00000-cff04137-32a6-4e5b-811a-668f5d4b1802-c000.snappy.parquet
+------------+----------------------------------------------------------------------------+
| Key | Value |
+------------+----------------------------------------------------------------------------+
| Version | 1 |
| Created By | parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) |
| Rows | 40016 |
| Row Groups | 1 |
+------------+----------------------------------------------------------------------------+
Row Group 0 of 1 contains 40016 rows and has 190952 bytes:
+-----------------------+--------------+---------------+-----------------+-------+-----------------------------------------------------+------------------------------------+
| Column Name | Logical Type | Physical Type | Distinct Values | Nulls | Min | Max |
+-----------------------+--------------+---------------+-----------------+-------+-----------------------------------------------------+------------------------------------+
| cd_demo_sk | N/A | INT32 | N/A | 0 | 1520641 | 1560656 |
| cd_gender | N/A | BYTE_ARRAY | N/A | 0 | [70] | [77] |
| cd_marital_status | N/A | BYTE_ARRAY | N/A | 0 | [68] | [87] |
| cd_education_status | N/A | BYTE_ARRAY | N/A | 0 | [50, 32, 121, 114, 32, 68, 101, 103, 114, 101, 101] | [85, 110, 107, 110, 111, 119, 110] |
| cd_purchase_estimate | N/A | INT32 | N/A | 0 | 500 | 10000 |
| cd_credit_rating | N/A | BYTE_ARRAY | N/A | 0 | [71, 111, 111, 100] | [85, 110, 107, 110, 111, 119, 110] |
| cd_dep_count | N/A | INT32 | N/A | 0 | 0 | 6 |
| cd_dep_employed_count | N/A | INT32 | N/A | 0 | 3 | 4 |
| cd_dep_college_count | N/A | INT32 | N/A | 0 | 5 | 5 |
+-----------------------+--------------+---------------+-----------------+-------+-----------------------------------------------------+------------------------------------+