cube-js / cube

📊 Cube — The Semantic Layer for Building Data Applications
https://cube.dev

Issue with cubestore TableImport during preAggregation with unload: support more CSV quote options #7071

Open igorcalabria opened 1 year ago

igorcalabria commented 1 year ago

Hi, I'm implementing unload support for the Presto/Trino driver and I'm seeing an odd issue with cubestore's import behavior. The way I've implemented this follows the same strategy as the Athena driver, but uses CREATE TABLE instead of UNLOAD (since Trino doesn't implement UNLOAD).

What I'm seeing is that sometimes cubestore drops an arbitrary number of records from each file that it imports. I've double-checked its logs, and all files are reported as complete:

<pid:1> Running job completed (14.535796782s): IdRow { ...

To give you guys more context, the unload strategy is roughly: run a CREATE TABLE ... AS SELECT that writes the pre-aggregation to the export bucket as delimited text files, then point cubestore at those files to import.
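The export step looks something like this (just a sketch of the idea, not the exact driver code; the catalog, schema, table, and bucket names are placeholders):

CREATE TABLE hive.cube_export.preagg_main
WITH (
    format = 'TEXTFILE',
    textfile_field_separator = U&'\0001',
    -- assumes the Hive connector is configured to allow writes to external locations
    external_location = 's3://cube-export-bucket/preagg_main/'
)
AS
SELECT *
FROM analytics.orders_rollup;

The files written under that location are then handed to cubestore to import.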

This is basically the same strategy as the Athena driver but using CREATE TABLE instead. I've compared cube's imported data with the data on the source table, but couldn't find anything obvious yet. When the issue happens, about 50% of the records are missing on the cube side, but broken down per imported file it varies a lot (from about 20% to almost 80%).

I realize that this issue is basically impossible to reproduce, but I'm looking for advice on where cubestore might report that it successfully imported a file when the import was only partial.

paveltiunov commented 1 year ago

@igorcalabria That's very little info to go on. Does it reproduce when you use streaming? Can you reproduce it in Cube Cloud? We can take a look then.

igorcalabria commented 1 year ago

@paveltiunov I've managed to reproduce this with a single CSV file. The effect that I'm seeing is that lines_stream https://github.com/cube-js/cube/blob/2a0769f3bdf87695c3fd0a09b12352f7133f414d/rust/cubestore/cubestore/src/import/mod.rs#L102 ends before the end of the file (and I can't find anything odd at that point in the stream).

I can't send you the file since it contains internal data, so I'm digging a little deeper to either reproduce this with artificial data or fix it myself.

igorcalabria commented 1 year ago

Got it. The problem is when the file has unmatched quotes ("). Here's a sample quotes.csv:

one
two
three
four 
"five
six
seven
eight
Importing it into cubestore:

CREATE TABLE test.quote_test ("number" text) WITH (input_format = 'csv_no_header', delimiter = '^A') LOCATION 'http://localhost:8000/quotes.csv';
MySQL [(none)]> SELECT * FROM test.quote_test;
+--------+
| number |
+--------+
| four   |
| one    |
| three  |
| two    |
+--------+

Everything after the unmatched " is lost. Maybe quote handling should be an option? Not every CSV wraps values in quotes. This also raises the question of whether any binary formats (Parquet or ORC) will be supported. They would be much more efficient and safer (no gotchas like this issue) to write and load, especially for higher-cardinality aggregations.
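Just to illustrate the shape such an option could take on the import statement above (the quote parameter below is hypothetical and does not exist in cubestore today):

CREATE TABLE test.quote_test ("number" text)
    WITH (input_format = 'csv_no_header', delimiter = '^A', quote = 'none') -- 'quote' is a hypothetical option
    LOCATION 'http://localhost:8000/quotes.csv';

With something like quote = 'none', the importer would read each delimited value verbatim instead of treating " as the start of a quoted field.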

github-actions[bot] commented 1 year ago

If you are interested in working on this issue, please leave a comment below and we will be happy to assign the issue to you. If this is the first time you are contributing a Pull Request to Cube.js, please check our contribution guidelines. You can also post any questions while contributing in the #contributors channel in the Cube.js Slack.

paveltiunov commented 1 year ago

@igorcalabria I agree we need to support more options. As a workaround, you can configure Presto/Trino to produce a format Cube Store supports: https://stackoverflow.com/questions/58726670/presto-syntax-for-csv-external-table-with-array-in-one-of-the-fields. Parquet as an import format is outside of our core team's roadmap, but such a contribution is very welcome!
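For example, with the Hive connector the unload table could be written in the CSV (OpenCSVSerde) format instead of plain TEXTFILE, which should produce values wrapped in quotes that Cube Store parses correctly. This is only a sketch; names are placeholders, and note that this format only accepts varchar columns, so values need to be cast:

CREATE TABLE hive.cube_export.preagg_main_csv
WITH (
    format = 'CSV',
    csv_separator = ',',
    csv_quote = '"',
    external_location = 's3://cube-export-bucket/preagg_main_csv/'
)
AS
SELECT CAST(dim AS varchar) AS dim, CAST(total AS varchar) AS total
FROM analytics.orders_rollup;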

igorcalabria commented 1 year ago

Shouldn't this issue be marked as a bug, or another one be created? I just tested Athena's unload and it doesn't quote data by default either. If Cube Cloud's code is not dramatically different from what I'm seeing in the cubestore source, there's a high chance that your customers are affected by this too.

The core issue is that data is being silently dropped and that's a pretty big problem in a data stack.

paveltiunov commented 1 year ago

@igorcalabria Yep. It might be. If it's reproducible in Cube Cloud, we can change it to a bug then.

igorcalabria commented 1 year ago

@paveltiunov I can confirm that this issue happens on Cube Cloud. Tested on a development instance on 0.33.41 using Athena. To reproduce this, create a table like the one in my previous message and enable the export bucket for pre-aggregations. The result should be the same as in my message: you'll silently lose all records after the ". The issue is around these lines: https://github.com/cube-js/cube/blob/master/rust/cubestore/cubestore/src/import/mod.rs#L385. The code should error out if it doesn't find a matching ". This was likely introduced when you added support for multiline CSV: https://github.com/cube-js/cube/commit/860cf39cfc7a7302efb79d8e850aa793be264b00

paveltiunov commented 1 year ago

@igorcalabria It has nothing to do with the Cube Store at this point, as it's a malformed CSV from the Cube Store's perspective. It doesn't support non-quoted CSV right now. So we should either add an option for the Cube Store to handle non-quoted CSVs or make the driver export quoted CSVs.

igorcalabria commented 1 year ago

Hi, I think I've expressed myself very badly in the last few comments. In reality, this should be two separate issues, but I didn't know the proper etiquette for opening new ones (I didn't want to spam). The issues are:

1. Cube Store silently drops everything after an unmatched quote when importing a CSV file, instead of erroring out.
2. Cube Store should support more CSV quote options (such as unquoted exports, which are the default for Athena and Trino), and ideally binary formats like Parquet or ORC.

The first one is a bug and the second one is an important (to my eyes) enhancement. Being defensive when ingesting these kinds of text formats is very important, especially in the case of databases like cubestore.

paveltiunov commented 1 year ago

@igorcalabria Yep. Agree. Let's create two separate issues.