datafuselabs / databend

๐——๐—ฎ๐˜๐—ฎ, ๐—”๐—ป๐—ฎ๐—น๐˜†๐˜๐—ถ๐—ฐ๐˜€ & ๐—”๐—œ. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com
Other
7.68k stars 727 forks source link

Feature: CSV options do not fulfil our needs #16356

Open rad-pat opened 2 weeks ago

rad-pat commented 2 weeks ago

We are migrating over from Postgres/Greenplum where, when importing from CSV file, we are able to specify that a blank entry represent null, but a blank quoted string is an empty string. I cannot replicate this behaviour with Databend, but would very much like to. I have tried many combinations of the EMPTY_FIELD_AS and NULL_DISPLAY but I cannot get the same as the Postgres import. The ability to differentiate between blank entry and empty string would be very useful to us.

CSV Data:

some,value,,"",\N
some,other,,"",\N

Postgres Import

CREATE TABLE t_csv (
    col1 text null,
    col2 text null,
    "empty" text null,
    quoted text null,
    null_repr text null
);

COPY t_csv
FROM '/home/plaid/Downloads/test.csv'
WITH (
    FORMAT CSV,
    DELIMITER ',',
    QUOTE '"',
    ESCAPE '"',
    NULL ''
);

select * from t_csv;

Result

col1|col2 |empty |quoted|null_repr|
----+-----+------+------+---------+
some|value|<null>|      |\N       |
some|other|<null>|      |\N       |

Databend Import

CREATE TABLE t_csv (
    col1 string null,
    col2 string null,
    empty string null,
    quoted string null,
    null_repr string null
);
COPY INTO t_csv
FROM 'gcs://<snip>/test.csv'
CONNECTION = (
    CREDENTIAL='<snip>'
)
--PATTERN='test.csv'
FILE_FORMAT = (
    TYPE = CSV,
    FIELD_DELIMITER = ',', 
    QUOTE = '"', 
    ESCAPE = '\\', 
    NULL_DISPLAY = '\N', 
    EMPTY_FIELD_AS = string, 
    SKIP_HEADER = 0,
    COMPRESSION = AUTO 
)
FORCE = TRUE;

Result

"some","value",null,"","\N"
"some","other",null,"","\N"
col1|col2 |empty|quoted|null_repr|
----+-----+-----+------+---------+
some|value|     |      |<null>   |
some|other|     |      |<null>   |
youngsofun commented 2 weeks ago

@rad-pat thanks for feedback, it is a known problem.

  1. The CSV spec doesnโ€™t differentiate them
  2. the author of rust-csv, which databend depends on, sticks to the spec https://github.com/BurntSushi/rust-csv/issues/114
  3. spark do not distinguish them after 2.0.1, https://mrpowers.medium.com/sparks-treatment-of-empty-strings-and-null-values-in-csv-files-80748893451f (it is interesting that the article state that spark <=2.0.0 read blank/missing as empty string and the quoted one as null, reverse to what we expected)

    a workarounds is to dump the csv with special strings for null instead of "".

we may consider to support distinguish them too, recently python 13 seems to support it with some options, https://github.com/python/cpython/issues/113732

BTW, why do you have to distinguish null and empty string in your application?

inviscid commented 2 weeks ago

We receive CSVs from upstream customer systems that we have no control over. For the most part NULL is the correct default. However, there are real world situations where a blank string has a different meaning than null.