chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse
https://clickhouse.com/docs/en/chdb
Apache License 2.0
2.03k stars 72 forks source link

TSV parsing issue returns an empty dataframe rather than indication of parsing issue #138

Closed earlev4 closed 6 months ago

earlev4 commented 10 months ago

Describe the unexpected behaviour

Hello CHDB creators and contributors! I am very new to CHDB and ClickHouse. I am thoroughly enjoying it. The performance is impressive. Thank you very much for the excellent project!

I am testing out CHDB with various file formats. So far, the Parquet file formats work perfectly and flawlessly. However, I have run into an issue with a TSV file. When reading a TSV file, I was receiving an empty dataframe. Unsure of why I was receiving an empty dataframe, I tried the same query on ClickHouse Local and received an error. There appears to be an issue parsing the TSV. Sure enough using CHDB, if I do a LIMIT up to the line with the parsing issue, I get a populated dataframe with rows up to the LIMIT. ClickHouse Local shows the last row returned and also provides a "Receive exception" message. Rather than producing an empty dataframe, should the last row and a message similar to ClickHouse Local be returned in CHDB indicating an issue? Basically, without ClickHouse Local, I wouldn't have known the issue. Please forgive me, if I am doing something wrong.

There definitely is a problem with parsing the file. I tried another embedded database and it also errored out due to parsing (but on another line). However, with that other embedded database, if I increase the sample size or scan the entire file, it can be read successfully. But the issue being reported is not the parsing issue, it is the ability to receive an error similar to ClickHouse Local.

I can open an issue on the parsing problem with ClickHouse, but wanted to get this error message addressed first, since it might be a good example.

How to reproduce

The dataset is from the IMDb Non-Commercial Datasets. Please also ensure the attached file is used since the dataset is refreshed daily, which could present problems in reproducing the issue. https://drive.google.com/file/d/1zQN-1qa49oh6opa3zAlVF5e_oE8cRu-e/view?usp=sharing

Using Google Colab High-RAM runtime:

  1. Upload TSV file to Google Colab - name.basics.tsv.gz.
  2. A query with LIMIT 10 returns a dataframe of 10 rows as expected.
    chdb.query("""select * from file('/content/name.basics.tsv.gz', TabSeparatedWithNames) LIMIT 10""", 'Dataframe')
  3. A query with no LIMIT returns an empty dataframe.
    chdb.query("""select * from file('/content/name.basics.tsv.gz', TabSeparatedWithNames)""", 'Dataframe')
  4. Confirmed the result is an empty dataframe.
    
    res = chdb.query("""select * from file('/content/name.basics.tsv.gz', TabSeparatedWithNames)""", 'Dataframe')

type(res) pandas.core.frame.DataFrame

res.shape (0, 0)

5. A query with the LIMIT set to the row before the parsing issue returns a dataframe with rows prior to the LIMIT.
```python
chdb.query("""select * from file('/content/name.basics.tsv.gz', TabSeparatedWithNames) LIMIT 65408""", 'Dataframe')

Expected behavior

Expected behavior would be an error message similar to the error message produced in ClickHouse Local. ClickHouse Local provides the following message:

...Previously rows omitted for brevity.

nm0068534       Suzanne Bell    \N      \N      make_up_department      tt0099134,tt0092527,tt0099092,tt0099817
nm0068535       Tamara Bell     \N      \N      producer,executive      tt11388822,tt8244578
↙ Progress: 65.41 thousand rows, 1.68 MB (37.40 thousand rows/s., 962.72 KB/s.) ▎                                                  0%
65408 rows in set. Elapsed: 1.750 sec. Processed 65.41 thousand rows, 1.68 MB (37.38 thousand rows/s., 962.28 KB/s.)
Peak memory usage: 39.30 MiB.

Received exception:
Code: 27. DB::ParsingException: Cannot parse input: expected '\t' before: '001\t\\N\tassistant_director,producer,director\ttt2909116,tt1701210,tt0338512,tt0870921\nnm0144483\tTomas Castanos\t\\N\t\\N\tmiscellaneous\ttt0138467\nnm0144484\tDan Castano':
Row 6568:
Column 0,   name: nconst,            type: Nullable(String), parsed text: "nm0144481"
Column 1,   name: primaryName,       type: Nullable(String), parsed text: "Mauricio Castano"
Column 2,   name: birthYear,         type: Nullable(Int64),  parsed text: "<BACKSLASH>N"
Column 3,   name: deathYear,         type: Nullable(Int64),  parsed text: "<BACKSLASH>N"
Column 4,   name: primaryProfession, type: Nullable(String), parsed text: "actor"
Column 5,   name: knownForTitles,    type: Nullable(String), parsed text: "tt0109747,tt3190448"

Row 6569:
Column 0,   name: nconst,            type: Nullable(String), parsed text: "nm0144482"
Column 1,   name: primaryName,       type: Nullable(String), parsed text: "Xavier Castano"
Column 2,   name: birthYear,         type: Nullable(Int64),  parsed text: "0"
ERROR: garbage after Nullable(Int64): "001<TAB><BACKSLASH>N<TAB>ass"

: While executing ParallelParsingBlockInputFormat: While executing File: (in file/uri /home/earlev4/name.basics.tsv.gz): (at row 137387)
. (CANNOT_PARSE_INPUT_ASSERTION_FAILED)

I am unsure why row 65409 was not reported in the exception, but that can be addressed later with a ClickHouse GitHub issue regarding the parsing problem.

Also looked into input_format_allow_errors_num and input_format_allow_errors_ratio. The adjustment of input_format_allow_errors_num to 1 or input_format_allow_errors_ratio to 0.1 allow the query to bypass the errors and omits rows in ClickHouse Local. However, I was unsuccessful in changing these format settings in CHDB. This could be another issue; I am just not knowledgeable enough to know.   Additional context To further clarify, the issue being reported is not with parsing the TSV, but the expectation of receiving an error message to indicate an issue rather than an empty dataframe or return a dataframe with only the rows that could be read.

Looking forward to your insight. Thank you in advance for your assistance.

lmangani commented 10 months ago

Thanks for the report @earlev4 getting stderr from the process would be the only way. We do something similar in chdb-server to return stderr to the clients when queries fail.

There surely are better ways to do this in python and others might be able to provide examples, but in the meantime this is how to use the SET parameters and get your query to complete despite the format errors:

res = chdb.query("SET input_format_allow_errors_num  = 100; SELECT * from file('name.basics.tsv.gz', TabSeparatedWithNames)", 'Dataframe')
<class 'pandas.core.frame.DataFrame'>
(13051340, 6)
earlev4 commented 10 months ago

Hi @lmangani! Good day to you! Thanks so much for the very fast response! I sincerely appreciate it.

Without using SET input_format_allow_errors_num = 100 is there a way I would have known there was an issue?

When I get an empty dataframe, should I assume that there are errors and adjust the setting? Or should there be a response instead of an empty dataframe stating that there were errors reading the file?

For example this would have been very helpful:

chdb.query("""select * from file('/content/name.basics.tsv.gz', TabSeparatedWithNames)""", 'Dataframe')

Instead of an empty dataframe, provide a message:

The query encountered errors.
Please consider adjusting SET input_format_allow_errors_num or SET input_format_allow_errors_ratio to allow errors.

This error message would then prompt the user to try:

res = chdb.query("SET input_format_allow_errors_num  = 1; SELECT * from file('name.basics.tsv.gz', TabSeparatedWithNames)", 'Dataframe')

Without an error message, a new user might not know why there is an empty dataframe.

Looking forward to your feedback. Thanks again!

lmangani commented 10 months ago

👍 We understand the reported issue, and i've pointed(one of many ways to resolve it in chdb-server. You need to capture stderr.

earlev4 commented 10 months ago

Thanks again, @lmangani. I am grateful for the help. Did not notice the link you provided in chdb-server example.

The function in the chdb-server example provides the desired output.

import os
import chdb
import tempfile

driver = chdb

def chdb_query_with_errmsg(query, format):
    output = None
    errmsg = None

    try:
        # Create a temporary file for stderr and redirect stderr to it
        new_stderr = tempfile.TemporaryFile()
        old_stderr_fd = os.dup(2)
        os.dup2(new_stderr.fileno(), 2)

        # Execute the query
        output = driver.query(query, format)

        # Read from the temporary stderr file
        new_stderr.flush()
        new_stderr.seek(0)
        errmsg = new_stderr.read().decode()

        # Restore the original stderr
        new_stderr.close()
        os.dup2(old_stderr_fd, 2)

    except Exception as e:
        print(f"An error occurred: {e}")

    return output, errmsg

And received the following output:

(Empty DataFrame
 Columns: []
 Index: [],
 'Code: 27. DB::ParsingException: Cannot parse input: expected \'\\t\' before: \'001\\t\\\\N\\tassistant_director,producer,director\\ttt2909116,tt1701210,tt0338512,tt0870921\\nnm0144483\\tTomas Castanos\\t\\\\N\\t\\\\N\\tmiscellaneous\\ttt0138467\\nnm0144484\\tDan Castano\': \nRow 6568:\nColumn 0,   name: nconst,            type: Nullable(String), parsed text: "nm0144481"\nColumn 1,   name: primaryName,       type: Nullable(String), parsed text: "Mauricio Castano"\nColumn 2,   name: birthYear,         type: Nullable(Int64),  parsed text: "<BACKSLASH>N"\nColumn 3,   name: deathYear,         type: Nullable(Int64),  parsed text: "<BACKSLASH>N"\nColumn 4,   name: primaryProfession, type: Nullable(String), parsed text: "actor"\nColumn 5,   name: knownForTitles,    type: Nullable(String), parsed text: "tt0109747,tt3190448"\n\nRow 6569:\nColumn 0,   name: nconst,            type: Nullable(String), parsed text: "nm0144482"\nColumn 1,   name: primaryName,       type: Nullable(String), parsed text: "Xavier Castano"\nColumn 2,   name: birthYear,         type: Nullable(Int64),  parsed text: "0"\nERROR: garbage after Nullable(Int64): "001<TAB><BACKSLASH>N<TAB>ass"\n\n: While executing ParallelParsingBlockInputFormat: While executing File: (in file/uri /content/name.basics.tsv.gz): (at row 137387)\n. (CANNOT_PARSE_INPUT_ASSERTION_FAILED)\n')
lmangani commented 10 months ago

@earlev4 thanks for the confirmation. this is of course a workaround but it should be functional, and i'm confident we'll have a more elegant method become available sooner than later.

earlev4 commented 10 months ago

Thanks so much, @lmangani! That's wonderful news. I sincerely appreciate your assistance and time!

auxten commented 9 months ago

Capture the stderr is an issue that we will manage to fix. @nmreadelf has a PR #105 to fix that, but it'll cause some back compatible problem. I will try to come up an better way based on #105.

Again, thanks @earlev4 report the issue.

earlev4 commented 9 months ago

Thanks so much, @auxten! I appreciate the response and grateful for all the excellent work by you and the chDB contributors!

auxten commented 6 months ago

Fixed with #105 If any error in engine, an exception will be raised with clickhouse internal error message

earlev4 commented 5 months ago

Thank you very much, @auxten! I appreciate the fix!