-
**Describe the bug**
In `RowGroupPruningStatistics`, we use the statistics to prune row groups in Parquet files.
In the logic below:
https://github.com/apache/arrow-datafusion/blob/f386…
-
Hi,
I run a SQL query that contains four stages. The first stage scans the Parquet files and prepares shuffle write data for the next stage, and the mean task time is about 4s. To reduce the …
-
### What happens?
When trying to create a table like this
```sql
CREATE TABLE xxx AS SELECT tbl.*, '12345' AS dedup_group
FROM read_parquet('path/glob/*.snappy.parquet') AS tbl…
-
Currently in Pinot we don't have real NULL value support, but use some special default values for NULL. For dimensions, the default value is the minimum value for numeric types, "null" for STRING, emp…
-
Introduce a new data type `Object()`, which will take the name of the format for semi-structured data (`JSON`, `XML`, etc.).
Initially it will work only with `MergeTree` tables. Maybe later will add some oth…
-
Linked Issue: https://github.com/PyTables/PyTables/issues/319
Hi,
I have a dataframe saved in HDF5 with 6.7 million records (about 425MB). As you can see below, it gives an incorrect result when it …
-
I would like to propose some changes to the directory structure, but these might be totally irrelevant due to my misunderstanding.
The current directory entry is fixed at 17 bytes, stores x,y as individ…
-
Hi there, I am trying to hack on the Arrow IPC format, and I am confused about how Arrow differentiates between the different types in record batch buffers and parses them.
For example, now I store a data fra…
-
The issue https://github.com/open-telemetry/opentelemetry-specification/issues/2589 explains why the direction for `disk` is not the right thing, which makes total sense.
But we got that blindly an…
-
I've been trying to find out how DataFrames are created in order to try to solve #2305 with minimal memory use. I've made some tests that I put in an IPython notebook: http://nbviewer.ipython.org…