Closed earlev4 closed 6 months ago
Thanks for the report @earlev4 getting stderr from the process would be the only way. We do something similar in chdb-server to return stderr to the clients when queries fail.
There surely are better ways to do this in python and others might be able to provide examples, but in the meantime this is how to use the SET parameters and get your query to complete despite the format errors:
res = chdb.query("SET input_format_allow_errors_num = 100; SELECT * from file('name.basics.tsv.gz', TabSeparatedWithNames)", 'Dataframe')
<class 'pandas.core.frame.DataFrame'>
(13051340, 6)
Hi @lmangani! Good day to you! Thanks so much for the very fast response! I sincerely appreciate it.
Without using SET input_format_allow_errors_num = 100
is there a way I would have known there was an issue?
When I get an empty dataframe, should I assume that there are errors and adjust the setting? Or should there be a response instead of an empty dataframe stating that there were errors reading the file?
For example this would have been very helpful:
chdb.query("""select * from file('/content/name.basics.tsv.gz', TabSeparatedWithNames)""", 'Dataframe')
Instead of an empty dataframe, provide a message:
The query encountered errors.
Please consider adjusting SET input_format_allow_errors_num or SET input_format_allow_errors_ratio to allow errors.
This error message would then prompt the user to try:
res = chdb.query("SET input_format_allow_errors_num = 1; SELECT * from file('name.basics.tsv.gz', TabSeparatedWithNames)", 'Dataframe')
Without an error message, a new user might not know why there is an empty dataframe.
Looking forward to your feedback. Thanks again!
👍 We understand the reported issue, and i've pointed(one of many ways to resolve it in chdb-server. You need to capture stderr.
Thanks again, @lmangani. I am grateful for the help. Did not notice the link you provided in chdb-server example.
The function in the chdb-server example provides the desired output.
import os
import chdb
import tempfile
driver = chdb
def chdb_query_with_errmsg(query, format):
output = None
errmsg = None
try:
# Create a temporary file for stderr and redirect stderr to it
new_stderr = tempfile.TemporaryFile()
old_stderr_fd = os.dup(2)
os.dup2(new_stderr.fileno(), 2)
# Execute the query
output = driver.query(query, format)
# Read from the temporary stderr file
new_stderr.flush()
new_stderr.seek(0)
errmsg = new_stderr.read().decode()
# Restore the original stderr
new_stderr.close()
os.dup2(old_stderr_fd, 2)
except Exception as e:
print(f"An error occurred: {e}")
return output, errmsg
And received the following output:
(Empty DataFrame
Columns: []
Index: [],
'Code: 27. DB::ParsingException: Cannot parse input: expected \'\\t\' before: \'001\\t\\\\N\\tassistant_director,producer,director\\ttt2909116,tt1701210,tt0338512,tt0870921\\nnm0144483\\tTomas Castanos\\t\\\\N\\t\\\\N\\tmiscellaneous\\ttt0138467\\nnm0144484\\tDan Castano\': \nRow 6568:\nColumn 0, name: nconst, type: Nullable(String), parsed text: "nm0144481"\nColumn 1, name: primaryName, type: Nullable(String), parsed text: "Mauricio Castano"\nColumn 2, name: birthYear, type: Nullable(Int64), parsed text: "<BACKSLASH>N"\nColumn 3, name: deathYear, type: Nullable(Int64), parsed text: "<BACKSLASH>N"\nColumn 4, name: primaryProfession, type: Nullable(String), parsed text: "actor"\nColumn 5, name: knownForTitles, type: Nullable(String), parsed text: "tt0109747,tt3190448"\n\nRow 6569:\nColumn 0, name: nconst, type: Nullable(String), parsed text: "nm0144482"\nColumn 1, name: primaryName, type: Nullable(String), parsed text: "Xavier Castano"\nColumn 2, name: birthYear, type: Nullable(Int64), parsed text: "0"\nERROR: garbage after Nullable(Int64): "001<TAB><BACKSLASH>N<TAB>ass"\n\n: While executing ParallelParsingBlockInputFormat: While executing File: (in file/uri /content/name.basics.tsv.gz): (at row 137387)\n. (CANNOT_PARSE_INPUT_ASSERTION_FAILED)\n')
@earlev4 thanks for the confirmation. this is of course a workaround but it should be functional, and i'm confident we'll have a more elegant method become available sooner than later.
Thanks so much, @lmangani! That's wonderful news. I sincerely appreciate your assistance and time!
Capture the stderr is an issue that we will manage to fix. @nmreadelf has a PR #105 to fix that, but it'll cause some back compatible problem. I will try to come up an better way based on #105.
Again, thanks @earlev4 report the issue.
Thanks so much, @auxten! I appreciate the response and grateful for all the excellent work by you and the chDB contributors!
Fixed with #105 If any error in engine, an exception will be raised with clickhouse internal error message
Thank you very much, @auxten! I appreciate the fix!
Describe the unexpected behaviour
Hello CHDB creators and contributors! I am very new to CHDB and ClickHouse. I am thoroughly enjoying it. The performance is impressive. Thank you very much for the excellent project!
I am testing out CHDB with various file formats. So far, the Parquet file formats work perfectly and flawlessly. However, I have run into an issue with a TSV file. When reading a TSV file, I was receiving an empty dataframe. Unsure of why I was receiving an empty dataframe, I tried the same query on ClickHouse Local and received an error. There appears to be an issue parsing the TSV. Sure enough using CHDB, if I do a LIMIT up to the line with the parsing issue, I get a populated dataframe with rows up to the LIMIT. ClickHouse Local shows the last row returned and also provides a "Receive exception" message. Rather than producing an empty dataframe, should the last row and a message similar to ClickHouse Local be returned in CHDB indicating an issue? Basically, without ClickHouse Local, I wouldn't have known the issue. Please forgive me, if I am doing something wrong.
There definitely is a problem with parsing the file. I tried another embedded database and it also errored out due to parsing (but on another line). However, with that other embedded database, if I increase the sample size or scan the entire file, it can be read successfully. But the issue being reported is not the parsing issue, it is the ability to receive an error similar to ClickHouse Local.
I can open an issue on the parsing problem with ClickHouse, but wanted to get this error message addressed first, since it might be a good example.
How to reproduce
The dataset is from the IMDb Non-Commercial Datasets. Please also ensure the attached file is used since the dataset is refreshed daily, which could present problems in reproducing the issue. https://drive.google.com/file/d/1zQN-1qa49oh6opa3zAlVF5e_oE8cRu-e/view?usp=sharing
Using Google Colab High-RAM runtime:
name.basics.tsv.gz
.LIMIT 10
returns a dataframe of 10 rows as expected.LIMIT
returns an empty dataframe.type(res) pandas.core.frame.DataFrame
res.shape (0, 0)
Expected behavior
Expected behavior would be an error message similar to the error message produced in ClickHouse Local. ClickHouse Local provides the following message:
...Previously rows omitted for brevity.
I am unsure why row 65409 was not reported in the exception, but that can be addressed later with a ClickHouse GitHub issue regarding the parsing problem.
Also looked into
input_format_allow_errors_num
andinput_format_allow_errors_ratio
. The adjustment ofinput_format_allow_errors_num
to1
orinput_format_allow_errors_ratio
to0.1
allow the query to bypass the errors and omits rows in ClickHouse Local. However, I was unsuccessful in changing these format settings in CHDB. This could be another issue; I am just not knowledgeable enough to know. Additional context To further clarify, the issue being reported is not with parsing the TSV, but the expectation of receiving an error message to indicate an issue rather than an empty dataframe or return a dataframe with only the rows that could be read.Looking forward to your insight. Thank you in advance for your assistance.