apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.31k stars 1.19k forks source link

Bug with csv type inference #3174

Closed andygrove closed 1 month ago

andygrove commented 2 years ago

Describe the bug

❯ CREATE EXTERNAL TABLE test
STORED AS CSV
WITH HEADER ROW
LOCATION '../testing/data/csv/aggregate_test_100.csv';
0 rows in set. Query took 0.102 seconds.
❯ select * from test;
ArrowError(ParseError("Error while parsing value 14857091259186476033 for column 9 at line 3"))

To Reproduce Steps to reproduce the behavior:

Expected behavior A clear and concise description of what you expected to happen.

Additional context Add any other context about the problem here.

bezbac commented 2 years ago

Hello 👋🏻.

I haven't contributed here before but checked out this issue because of its "Good First Issue" label. Maybe you could help me figure out what the expected behavior is.

The cell's value inside the CSV file (14857091259186476033) overflows the inferred type Int64. So assuming the above query is supposed to work given the aggregate_test_100.csv file as input, I would suspect the column should be inferred as UInt64. To fix this, changes would likely be needed within the arrow crate, as the CSV infer_schema functions utilizes the arrow::civ::reader::infer_reader_schema function.

Do you have any direction on how to proceed?

alamb commented 2 years ago

👋 @bezbac

I think looking into improving the inference in the arrow crate would be a good step

Another potential alternative to using UInt64 might be Decimal128 (though I don't want to mislead you too much here)

kmitchener commented 2 years ago

There's also a TODO directly related to this that could be cleaned up as part of this PR -> https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sql/mod.rs#L624

alamb commented 2 months ago

This still happens as of datafusion 42

> CREATE EXTERNAL TABLE test
STORED AS CSV
LOCATION 'testing/data/csv/aggregate_test_100.csv';
0 row(s) fetched.
Elapsed 0.009 seconds.

> select * from test;
Arrow error: Parser error: Error while parsing value 11720144131976083864 for column 9 at line 2
CookiePieWw commented 1 month ago

Hi, I took a look based on @bezbac's findings, and found that arrow-csv uses regex expr to match strings and then infer ther types. The regex for Int64 accept all possible numbers instead of numbers in ranges of Int64, which caused that the inferred type actually cannot accept the strings. (Seems github cannot attach codes through links that does not belong to this repo, so I copied here)

/// See https://github.com/apache/arrow-rs/blob/ebcc4a585136cd1d9696c38c41f71c9ced181f57/arrow-csv/src/reader/mod.rs
/// #L146-L158
    /// Order should match [`InferredDataType`]
    static ref REGEX_SET: RegexSet = RegexSet::new([
        // ...
        r"^-?(\d+)$", //INTEGER
        // ...
    ]).unwrap();

/// See https://github.com/apache/arrow-rs/blob/ebcc4a585136cd1d9696c38c41f71c9ced181f57/arrow-csv/src/reader/mod.rs
/// #L214-L223
    /// Updates the [`InferredDataType`] with the given string
    fn update(&mut self, string: &str) {
        self.packed |= if string.starts_with('"') {
            1 << 8 // Utf8
        } else if let Some(m) = REGEX_SET.matches(string).into_iter().next() {
            1 << m
        } else {
            1 << 8 // Utf8
        }
    }

One of the solutions is to change the regex expr for Int64 to match the range of it, but it seems to be a very complicate one. I told gpt to give me one but the regex expr's more than 300 chars, so I wonder if there're alternatives :)

alamb commented 1 month ago

I wonder if you could use a length check on the characters too (it wouldn't be perfect, but it might catch more)

As in you could bound the number of characters matched by the regexp to the number of characters in Int64::MAX (20 or so??)

CookiePieWw commented 1 month ago

I wonder if you could use a length check on the characters too (it wouldn't be perfect, but it might catch more)

Thanks for your suggestion! But if we want to precisely match it, we have to iter through the prefix of Int64::MAX and then we got:

922337203685477580[0-7]
9223372036854775[0-7]\d{2}
922337203685477[0-4]\d{3}
92233720368547[0-6]\d{4}

and so on...

Perhaps we can use a function for match strings here instead of direct regex here, which seems more flexible for me. If you agree on that, I'll try to submit a pr to arrow-rs for further discussions :)

alamb commented 1 month ago

Perhaps we can use a function for match strings here instead of direct regex here, which seems more flexible for me. If you agree on that, I'll try to submit a pr to arrow-rs for further discussions :)

That seems like a good idea to me -- with sufficient care it should also be significantly faster than a regexp based matching approach.

alamb commented 1 month ago

I verified that this behavior now works well after upgrading to arrow 53.1.0 due to https://github.com/apache/arrow-rs/pull/6481 (thanks @CookiePieWw ❤️ )

DataFusion CLI v42.0.0
> CREATE EXTERNAL TABLE test
STORED AS CSV
LOCATION '../testing/data/csv/aggregate_test_100.csv'
;
0 row(s) fetched.
Elapsed 0.061 seconds.

> select * from test;
+----+----+------+--------+-------------+----------------------+-----+-------+------------+----------------------+-------------+---------------------+--------------------------------+
| c1 | c2 | c3   | c4     | c5          | c6                   | c7  | c8    | c9         | c10                  | c11         | c12                 | c13                            |
+----+----+------+--------+-------------+----------------------+-----+-------+------------+----------------------+-------------+---------------------+--------------------------------+
| c  | 2  | 1    | 18109  | 2033001162  | -6513304855495910254 | 25  | 43062 | 1491205016 | 5863949479783605708  | 0.110830784 | 0.9294097332465232  | 6WfVFBVGJSQb7FhA7E0lBwdvjfZnSW |
| d  | 5  | -40  | 22614  | 706441268   | -7542719935673075327 | 155 | 14337 | 3373581039 | 11720144131976083864 | 0.69632107  | 0.3114712539863804  | C2GT5KVyOPZpgKVl110TyZO0NcJ434 |
| b  | 1  | 29   | -18218 | 994303988   | 5983957848665088916  | 204 | 9489  | 3275293996 | 14857091259186476033 | 0.53840446  | 0.17909035118828576 | AyYVExXK6AR2qUTxNZ7qRHQOVGMLcz |
| a  | 1  | -85  | -15154 | 1171968280  | 1919439543497968449  | 77  | 52286 | 774637006  | 12101411955859039553 | 0.12285209  | 0.6864391962767343  | 0keZ5G8BffGwgF2RwQD59TFzMStxCB |
| b  | 5  | -82  | 22080  | 1824882165  | 7373730676428214987  | 208 | 34331 | 3342719438 | 3330177516592499461  | 0.82634634  | 0.40975383525297016 | Ig1QcuKsjHXkproePdERo2w0mYzIqd |
| b  | 4  | -111 | -1967  | -4229382    | 1892872227362838079  | 67  | 9832  | 1243785310 | 8382489916947120498  | 0.06563997  | 0.152498292971736   | Sfx0vxv1skzZWT1PqVdoRDdO6Sb6xH |
| e  | 3  | 104  | -25136 | 1738331255  | 300633854973581194   | 139 | 20807 | 3577318119 | 13079037564113702254 | 0.40154034  | 0.7764360990307122  | DuJNG8tufSqW0ZstHqWj3aGvFLMg4A |
| a  | 3  | 13   | 12613  | 1299719633  | 2020498574254265315  | 191 | 17835 | 3998790955 | 14881411008939145569 | 0.041445434 | 0.8813167497816289  | Amn2K87Db5Es3dFQO9cw9cvpAM6h35 |
| d  | 1  | 38   | 18384  | -335410409  | -1632237090406591229 | 26  | 57510 | 2712615025 | 1842662804748246269  | 0.6064476   | 0.6404495093354053  | 4HX6feIvmNXBN7XGqgO4YVBkhu8GDI |
| a  | 4  | -38  | 20744  | 762932956   | 308913475857409919   | 7   | 45465 | 1787652631 | 878137512938218976   | 0.7459874   | 0.02182578039211991 | ydkwycaISlYSlEq3TlkS2m15I2pcp8 |
| d  | 1  | 57   | 28781  | -1143802338 | 2662536767954229885  | 202 | 62167 | 879082834  | 4338034436871150616  | 0.7618384   | 0.42950521730777025 | VY0zXmXeksCT8BzvpzpPLbmU9Kp9Y4 |
| a  | 4  | -54  | -2376  | 434021400   | 5502271306323260832  | 113 | 15777 | 2502326480 | 7966148640299601101  | 0.5720931   | 0.30585375151301186 | KJFcmTVjdkCMv94wYCtfHMFhzyRsmH |
| e  | 3  | 112  | -6823  | -421042466  | 8535335158538929274  | 129 | 32712 | 3759340273 | 9916295859593918600  | 0.6424343   | 0.6316565296547284  | BsM5ZAYifRh5Lw3Y8X1r53I0cTJnfE |
| d  | 2  | 113  | 3917   | -108973366  | -7220140168410319165 | 197 | 24380 | 63044568   | 4225581724448081782  | 0.11867094  | 0.2944158618048994  | 90gAtmGEeIqUTbo1ZrxCvWtsseukXC |
| b  | 1  | 54   | -18410 | 1413111008  | -7145106120930085900 | 249 | 5382  | 1842680163 | 17818611040257178339 | 0.8881188   | 0.24899794314659673 | 6FPJlLAcaQ5uokyOWZ9HGdLZObFvOZ |
| c  | 1  | 103  | -22186 | 431378678   | 1346564663822463162  | 146 | 12393 | 3766999078 | 10901819591635583995 | 0.064453244 | 0.7784918983501654  | 2T3wSlHdEmASmO0xcXHnndkKEt6bz8 |
| e  | 2  | 49   | 24495  | -587831330  | 9178511478067509438  | 129 | 12757 | 1289293657 | 10948666249269100825 | 0.5610077   | 0.5991138115095911  | bgK1r6v3BCTh0aejJUhkA1Hn6idXGp |
| d  | 1  | -98  | 13630  | -1991133944 | 1184110014998006843  | 220 | 2986  | 225513085  | 9634106610243643486  | 0.89651865  | 0.1640882545084913  | y7C453hRWd4E7ImjNDWlpexB8nUqjh |
| d  | 3  | 77   | 15091  | -1302295658 | 8795481303066536947  | 154 | 35477 | 2093538928 | 17419098323248948387 | 0.11952883  | 0.7035635283169166  | O66j6PaYuZhEUtqV6fuU7TyjM2WxC5 |
| e  | 2  | 97   | 18167  | 1593800404  | -9112448817105133638 | 163 | 45185 | 3188005828 | 2792105417953811674  | 0.38175434  | 0.4094218353587008  | ukOiFGGFnQJDHFgZxHMpvhD3zybF0M |
| e  | 4  | -56  | -31500 | 1544188174  | 3096047390018154410  | 220 | 417   | 557517119  | 2774306934041974261  | 0.15459597  | 0.19113293583306745 | IZTkHMLvIKuiLjhDjYMmIHxh166we4 |
| d  | 1  | -99  | 5613   | 1213926989  | -8863698443222021480 | 19  | 18736 | 4216440507 | 14933742247195536130 | 0.6067944   | 0.33639590659276175 | aDxBtor7Icd9C5hnTvvw5NrIre740e |
| a  | 5  | 36   | -16974 | 623103518   | 6834444206535996609  | 71  | 29458 | 141047417  | 17448660630302620693 | 0.17100024  | 0.04429073092078406 | OF7fQ37GzaZ5ikA2oMyvleKtgnLjXh |
| e  | 4  | -53  | 13788  | 2064155045  | -691093532952651300  | 243 | 35106 | 2778168728 | 9463973906560740422  | 0.34515214  | 0.27159190516490006 | 0VVIHzxWtNOFLtnhjHEKjXaJOSLJfm |
| c  | 2  | -29  | 25305  | -537142430  | -7683452043175617798 | 150 | 31648 | 598822671  | 11759014161799384683 | 0.8315913   | 0.946325164889271   | 9UbObCsVkmYpJGcGrgfK90qOnwb2Lj |
| a  | 1  | -25  | 15295  | 383352709   | 4980135132406487265  | 231 | 102   | 3276123488 | 12763583666216333412 | 0.53796273  | 0.17592486905979987 | XemNcT1xp61xcM1Qz3wZ1VECCnq06O |
| c  | 4  | 123  | 16620  | 852509237   | -3087630526856906991 | 196 | 33715 | 3566741189 | 4546434653720168472  | 0.07606989  | 0.819715865079681   | 8LIh0b6jmDGm87BmIyjdxNIpX4ugjD |
| a  | 5  | -31  | -12907 | 586844478   | -4862189775214031241 | 170 | 28086 | 1013876852 | 11005002152861474932 | 0.35319167  | 0.05573662213439634 | MeSTAXq8gVxVjbEjgkvU9YLte0X9uE |
| a  | 2  | 45   | 15673  | -1899175111 | 398282800995316041   | 99  | 2555  | 145294611  | 8554426087132697832  | 0.17333257  | 0.6405262429561641  | b3b9esRhTzFEawbs6XhpKnD9ojutHB |
| b  | 3  | 17   | 14457  | 670497898   | -2390782464845307388 | 255 | 24770 | 1538863055 | 12662506238151717757 | 0.34077626  | 0.7614304100703713  | 6x93sxYioWuq5c9Kkk8oTAAORM7cH0 |
| e  | 4  | 97   | -13181 | 2047637360  | 6176835796788944083  | 158 | 53000 | 2042457019 | 9726016502640071617  | 0.7085086   | 0.12357539988406441 | oHJMNvWuunsIMIWFnYG31RCfkOo2V7 |
| c  | 2  | -60  | -16312 | -1808210365 | -3368300253197863813 | 71  | 39635 | 2844041986 | 7045482583778080653  | 0.805363    | 0.6425694115212065  | BJqx5WokrmrrezZA0dUbleMYkG5U2O |
| e  | 1  | 36   | -21481 | -928766616  | -3471238138418013024 | 150 | 52569 | 2610290479 | 7788847578701297242  | 0.2578469   | 0.7670021786149205  | gpo8K5qtYePve6jyPt6xgJx4YOVjms |
| b  | 5  | -5   | 24896  | 1955646088  | 2430204191283109071  | 118 | 43655 | 2424630722 | 11429640193932435507 | 0.87989986  | 0.7328050041291218  | JafwVLSVk5AVoXFuzclesQ000EE2k1 |
| a  | 3  | 13   | 32064  | 912707948   | 3826618523497875379  | 42  | 21463 | 2214035726 | 10771380284714693539 | 0.6133468   | 0.7325106678655877  | i6RQVXKUh7MzuGMDaNclUYnFUAireU |
| c  | 1  | 41   | -4667  | -644225469  | 7049620391314639084  | 196 | 48099 | 2125812933 | 15419512479294091215 | 0.5780736   | 0.9255031346434324  | mzbkwXKrPeZnxg2Kn1LRF5hYSsmksS |
| d  | 2  | 93   | -12642 | 2053379412  | 6468763445799074329  | 147 | 50842 | 1000948272 | 5536487915963301239  | 0.4279275   | 0.28534428578703896 | lqhzgLsXZ8JhtpeeUWWNbMz8PHI705 |
| c  | 3  | 73   | -9565  | -382483011  | 1765659477910680019  | 186 | 1535  | 1088543984 | 2906943497598597237  | 0.680652    | 0.6009475544728957  | Ow5PGpfTm4dXCfTDsXAOTatXRoAydR |
| c  | 3  | -2   | -18655 | -2141999138 | -3154042970870838072 | 251 | 34970 | 3862393166 | 13062025193350212516 | 0.034291923 | 0.7697753383420857  | IWl0G3ZlMNf7WT8yjIB49cx7MmYOmr |
| c  | 3  | 22   | 13741  | -2098805236 | 8604102724776612452  | 45  | 2516  | 1362369177 | 196777795886465166   | 0.94669616  | 0.0494924465469434  | 6oIXZuIPIqEoPBvFmbt2Nxy3tryGUE |
| .                                                                                                                                                                                   |
| .                                                                                                                                                                                   |
| .                                                                                                                                                                                   |
+----+----+------+--------+-------------+----------------------+-----+-------+------------+----------------------+-------------+---------------------+--------------------------------+
100 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 0.037 seconds.