duckdb / duckdb

DuckDB is an analytical in-process SQL database management system
http://www.duckdb.org
MIT License

read_text('/dev/stdin') #14411

Open pkoppstein opened 3 weeks ago

pkoppstein commented 3 weeks ago

What happens?

This might technically be an Enhancement Request, but since read_csv() understands '/dev/stdin' and read_text() appears not to, it certainly seems more like a bug.

$ echo abc | duckdb -c "from read_csv('/dev/stdin', header=false)"
┌─────────┐
│ column0 │
│ varchar │
├─────────┤
│ abc     │
└─────────┘

BUT:

$ echo abc | duckdb -c "from read_text('/dev/stdin')"
┌────────────┬─────────┬───────┬───────────────┐
│  filename  │ content │ size  │ last_modified │
│  varchar   │ varchar │ int64 │   timestamp   │
├────────────┼─────────┼───────┼───────────────┤
│ /dev/stdin │         │     0 │               │
└────────────┴─────────┴───────┴───────────────┘

Also, please note that attempting to read both the program and its data from stdin produces weird results:

$ echo abc | duckdb <<< "from read_text('/dev/stdin');"
┌────────────┬────────────────────────────────────────────────────────────────┬───────┬─────────────────────┐
│  filename  │                            content                             │ size  │    last_modified    │
│  varchar   │                            varchar                             │ int64 │      timestamp      │
├────────────┼────────────────────────────────────────────────────────────────┼───────┼─────────────────────┤
│ /dev/stdin │ \0\0\0\0\0\0\0\0\24\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\24\0\0\0\0\0 │    30 │ 2024-10-16 21:37:24 │
└────────────┴────────────────────────────────────────────────────────────────┴───────┴─────────────────────┘

To Reproduce

echo abc | duckdb -c "from read_text('/dev/stdin')"

OS:

MacOS

DuckDB Version:

v1.1.2-dev218

DuckDB Client:

CLI

Hardware:

No response

Full Name:

Peter Koppstein

Affiliation:

Princeton University

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a source build

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

kzys commented 4 days ago

Reading /dev/stdin is tricky. read_text() reads the whole file into a single VARCHAR, and the length of a DuckDB VARCHAR must fit in a uint32_t. To enforce that limit, read_text() first checks the file size and then reads exactly that many bytes.

https://github.com/duckdb/duckdb/blob/242d3b85ea9fc7fde6e96babd65e360e9c369c2f/src/function/table/read_file.cpp#L157

But /dev/stdin's size is zero.

% ls -l /dev/stdin
lrwxrwxrwx 1 root root 15 Nov  8 07:58 /dev/stdin -> /proc/self/fd/0

read_csv() doesn't have this issue since it doesn't put the whole input into a single column value.
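
To make the difference concrete, here is a minimal Python sketch (not DuckDB's actual code) contrasting the two strategies on a pipe. The function names, the chunk size, and the `MAX_VARCHAR_BYTES` constant are all assumptions made for illustration. The cap is simply taken from the uint32_t limit mentioned above.

```python
import os

# Assumed cap, derived from the uint32_t limit mentioned above (illustrative only).
MAX_VARCHAR_BYTES = 2**32 - 1


def read_by_reported_size(fd: int) -> bytes:
    """Size-first strategy: ask the OS for the size, then read that many bytes."""
    size = os.fstat(fd).st_size  # platform-dependent for pipes; often 0
    if size > MAX_VARCHAR_BYTES:
        raise ValueError("input too large for a single value")
    # A single read may return fewer bytes than requested; kept simple here.
    return os.read(fd, size) if size > 0 else b""


def read_until_eof(fd: int, limit: int = MAX_VARCHAR_BYTES,
                   chunk_size: int = 65536) -> bytes:
    """Stream strategy: read chunks until EOF, enforcing the cap as we go."""
    chunks = []
    total = 0
    while True:
        chunk = os.read(fd, chunk_size)
        if not chunk:  # an empty read means EOF
            break
        total += len(chunk)
        if total > limit:
            raise ValueError("input too large for a single value")
        chunks.append(chunk)
    return b"".join(chunks)


def demo() -> bytes:
    """Feed b'abc\\n' through a pipe and read it back with the stream strategy."""
    read_fd, write_fd = os.pipe()
    os.write(write_fd, b"abc\n")
    os.close(write_fd)  # closing the writer lets the reader see EOF
    try:
        return read_until_eof(read_fd)
    finally:
        os.close(read_fd)
```

The stream strategy never needs to know the size up front, so it works for pipes, at the cost of only discovering an oversized input part-way through the read.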

pkoppstein commented 4 days ago

@kzys - thanks for the response. Since I’d really like to see some good ways of reading text files that do not rely on read_csv(), I think it would be worthwhile expanding the discussion a bit to include e.g. the possibility of adding a line-by-line mode for read_text(). If such a mode were introduced, then it would be reasonable for read_text() to require it if presented with STDIN.
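
As a rough illustration of what a line-by-line mode could mean, here is a hypothetical Python sketch. It is not an existing read_text() option. The `iter_lines` name, the `max_line_bytes` parameter, and the (line number, text) row shape are assumptions. The point is that the size limit would apply per line rather than to the whole input, so the total size need not be known in advance.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def iter_lines(stream: BinaryIO,
               max_line_bytes: int = 2**32 - 1) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, text) rows from a binary stream, one row per line.

    Hypothetical sketch: the limit is checked per line, after the line
    has been read, so no up-front size check on the stream is needed.
    """
    for line_number, raw in enumerate(stream, start=1):
        if len(raw) > max_line_bytes:
            raise ValueError(f"line {line_number} exceeds the per-value limit")
        # Drop trailing newline characters before decoding.
        yield line_number, raw.rstrip(b"\r\n").decode("utf-8")


# Usage with an in-memory stream; sys.stdin.buffer could be passed the same way.
rows = list(iter_lines(io.BytesIO(b"abc\ndef\n")))
```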