Open igorcalabria opened 1 year ago
@igorcalabria That's very little info to go on. Does it reproduce when you use streaming? Can you reproduce it in Cube Cloud? We can take a look then.
@paveltiunov I've managed to reproduce this with a single CSV file. The effect I'm seeing is that lines_stream
https://github.com/cube-js/cube/blob/2a0769f3bdf87695c3fd0a09b12352f7133f414d/rust/cubestore/cubestore/src/import/mod.rs#L102 ends before the end of the file (and I can't find anything odd at that point in the stream).
I can't send you the file since it contains internal data, so I'm digging a little deeper to either reproduce this with artificial data or fix it myself.
Got it. The problem occurs when the file has an unmatched quote ("). Here's a sample:
quotes.csv
one
two
three
four
"five
six
seven
eight
CREATE TABLE test.quote_test ("number" text) WITH (input_format = 'csv_no_header', delimiter = '^A') LOCATION 'http://localhost:8000/quotes.csv';
MySQL [(none)]> SELECT * FROM test.quote_test;
+--------+
| number |
+--------+
| four |
| one |
| three |
| two |
+--------+
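The data loss above can be illustrated with a minimal quote-aware line splitter (a simplified sketch, not Cube Store's actual code): once the unmatched " flips the parser into the "inside quotes" state, every following newline is treated as part of the field, and the buffered tail is silently discarded at end of file.

```rust
// Sketch of a quote-aware line splitter that reproduces the symptom:
// after an unmatched `"`, the rest of the input is buffered as "inside
// quotes" and silently dropped at EOF instead of being emitted or
// reported as an error.
fn split_lines(input: &str) -> Vec<String> {
    let mut lines = Vec::new();
    let mut buf = String::new();
    let mut in_quotes = false;
    for c in input.chars() {
        match c {
            '"' => {
                in_quotes = !in_quotes;
                buf.push(c);
            }
            // Newlines only terminate a record outside of quotes.
            '\n' if !in_quotes => lines.push(std::mem::take(&mut buf)),
            _ => buf.push(c),
        }
    }
    // EOF while still "inside" a quote: the buffered tail is lost here.
    if !in_quotes && !buf.is_empty() {
        lines.push(buf);
    }
    lines
}

fn main() {
    let data = "one\ntwo\nthree\nfour\n\"five\nsix\nseven\neight\n";
    let lines = split_lines(data);
    // Only the four records before the unmatched quote survive.
    println!("{:?}", lines); // ["one", "two", "three", "four"]
}
```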
Everything after the unmatched "
is lost. Maybe quote handling should be an option? Not every CSV wraps values in quotes. This also raises the question of whether binary formats (Parquet or ORC) will be supported. They would be much more efficient and safer (no gotchas like this issue) to write and load, especially for higher-cardinality aggregations.
If you are interested in working on this issue, please leave a comment below and we will be happy to assign the issue to you. If this is the first time you are contributing a Pull Request to Cube.js, please check our contribution guidelines. You can also post any questions while contributing in the #contributors channel in the Cube.js Slack.
@igorcalabria I agree we need to support more options. As a workaround, you can configure Presto/Trino to produce a format Cube Store supports: https://stackoverflow.com/questions/58726670/presto-syntax-for-csv-external-table-with-array-in-one-of-the-fields. Parquet as an import format is outside our core team's roadmap, but such a contribution is very welcome!
Shouldn't this issue be marked as a bug, or another one created? I just tested Athena's UNLOAD and it doesn't quote data by default either. If Cube Cloud's code is not dramatically different from what I'm seeing in the Cube Store source, there's a high chance your customers are affected by this too.
The core issue is that data is being silently dropped, and that's a pretty big problem in a data stack.
@igorcalabria Yep. It might be. If it's reproducible in Cube Cloud we can change it to bug then.
@paveltiunov I can confirm that this issue happens on Cube Cloud. Tested on a development instance on 0.33.41 using Athena. To reproduce it, create a table like in my previous message and enable the export bucket for pre-aggregations. The result should be the same as in my message: you'll silently lose all records after the unmatched "
. The issue is around these lines: https://github.com/cube-js/cube/blob/master/rust/cubestore/cubestore/src/import/mod.rs#L385. The code should error out if it doesn't find a matching ". This was likely introduced when support for multiline CSV was added: https://github.com/cube-js/cube/commit/860cf39cfc7a7302efb79d8e850aa793be264b00
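A defensive variant of a quote-aware splitter (a hypothetical sketch, not a patch against the actual mod.rs code) would surface the problem instead of swallowing it: reaching EOF while still inside a quoted field becomes an import error rather than silent truncation.

```rust
// Hypothetical strict splitter: instead of silently discarding the
// buffered tail, fail the import when EOF is reached inside a quote.
fn split_lines_strict(input: &str) -> Result<Vec<String>, String> {
    let mut lines = Vec::new();
    let mut buf = String::new();
    let mut in_quotes = false;
    for c in input.chars() {
        match c {
            '"' => {
                in_quotes = !in_quotes;
                buf.push(c);
            }
            '\n' if !in_quotes => lines.push(std::mem::take(&mut buf)),
            _ => buf.push(c),
        }
    }
    if in_quotes {
        // Surfacing the error turns silent data loss into a visible failure.
        return Err("unterminated quoted field at end of input".to_string());
    }
    if !buf.is_empty() {
        lines.push(buf);
    }
    Ok(lines)
}

fn main() {
    // The sample from the thread now fails loudly instead of importing
    // a truncated table.
    let res = split_lines_strict("one\ntwo\nthree\nfour\n\"five\nsix\n");
    assert!(res.is_err());
}
```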
@igorcalabria It has nothing to do with the Cube Store at this point, as it's a malformed CSV from the Cube Store's perspective. It doesn't support non-quoted CSV right now. So we should either add an option for the Cube Store to handle non-quoted CSVs or make the driver export quoted CSVs.
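One possible shape for such an option (the enum, function, and mode names here are hypothetical, not an existing Cube Store setting): a per-table quote-handling mode where Raw treats " as an ordinary character, so unquoted exports such as Athena's UNLOAD output import intact.

```rust
// Hypothetical per-table quote-handling option. `Raw` disables quote
// parsing entirely, so a `"` is just another byte in the value.
#[derive(Clone, Copy, PartialEq)]
enum QuoteHandling {
    Quoted,
    Raw,
}

fn split_lines_opt(input: &str, mode: QuoteHandling) -> Vec<String> {
    let mut lines = Vec::new();
    let mut buf = String::new();
    let mut in_quotes = false;
    for c in input.chars() {
        match c {
            // Quote tracking only applies in Quoted mode.
            '"' if mode == QuoteHandling::Quoted => {
                in_quotes = !in_quotes;
                buf.push(c);
            }
            '\n' if !in_quotes => lines.push(std::mem::take(&mut buf)),
            _ => buf.push(c),
        }
    }
    if !in_quotes && !buf.is_empty() {
        lines.push(buf);
    }
    lines
}

fn main() {
    let data = "one\ntwo\nthree\nfour\n\"five\nsix\nseven\neight\n";
    // Raw mode keeps all eight records; Quoted mode reproduces the loss.
    assert_eq!(split_lines_opt(data, QuoteHandling::Raw).len(), 8);
    assert_eq!(split_lines_opt(data, QuoteHandling::Quoted).len(), 4);
}
```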
Hi, I think I've expressed myself very badly in the last few comments. In reality, this should be two separate issues, but I didn't know the proper etiquette for opening new ones (I didn't want to spam). The issues are:
1. Cube Store silently drops every record after an unmatched quote instead of failing the import.
2. Quote handling should be configurable so that non-quoted CSV exports can be imported safely.
The first one is a bug and the second is an important (in my eyes) enhancement. Being defensive when ingesting these kinds of text formats is very important, especially for databases like Cube Store.
@igorcalabria Yep. Agree. Let's create two separate issues.
Hi, I'm implementing unload support for the Presto/Trino driver and I'm seeing an odd issue with Cube Store's import behavior. I've followed the same strategy as Athena's driver, but using
CREATE TABLE
instead of UNLOAD
(since Trino doesn't implement it). What I'm seeing is that sometimes Cube Store drops an arbitrary number of records from each file it imports. I've double-checked its logs and all files are reported as complete.
To give you guys more context, the unload strategy is as follows: a
CREATE TABLE ... AS ...
query on Trino that writes the data using the TEXTFILE format with gzip compression. This is basically the same strategy as Athena's driver, but using CREATE TABLE instead. I've compared Cube's imported data with the data in the table, but haven't found anything obvious yet. When the issue happens, about 50% of the records are missing in Cube, but broken down per imported file it varies a lot (from about 20% to almost 80%).
I realize this issue is basically impossible to reproduce as described, so I'm looking for advice on the points where Cube Store might report that it successfully imported a file when the import was only partially successful.