Closed andygrove closed 1 month ago
Hello 👋🏻.
I haven't contributed here before but checked out this issue because of its "Good First Issue" label. Maybe you could help me figure out what the expected behavior is.
The cell's value inside the CSV file (14857091259186476033
) overflows the inferred type Int64
.
So assuming the above query is supposed to work given the aggregate_test_100.csv
file as input, I would suspect the column should be inferred as UInt64
.
To fix this, changes would likely be needed within the arrow
crate, as the CSV infer_schema
functions utilizes the arrow::civ::reader::infer_reader_schema
function.
Do you have any direction on how to proceed?
👋 @bezbac
I think looking into improving the inference in the arrow crate would be a good step
Another potential alternative to using UInt64 might be Decimal128 (though I don't want to mislead you too much here)
There's also a TODO directly related to this that could be cleaned up as part of this PR -> https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sql/mod.rs#L624
This still happens as of datafusion 42
> CREATE EXTERNAL TABLE test
STORED AS CSV
LOCATION 'testing/data/csv/aggregate_test_100.csv';
0 row(s) fetched.
Elapsed 0.009 seconds.
> select * from test;
Arrow error: Parser error: Error while parsing value 11720144131976083864 for column 9 at line 2
Hi, I took a look based on @bezbac's findings, and found that arrow-csv
uses regex expr to match strings and then infer ther types. The regex for Int64
accept all possible numbers instead of numbers in ranges of Int64
, which caused that the inferred type actually cannot accept the strings.
(Seems github cannot attach codes through links that does not belong to this repo, so I copied here)
/// See https://github.com/apache/arrow-rs/blob/ebcc4a585136cd1d9696c38c41f71c9ced181f57/arrow-csv/src/reader/mod.rs
/// #L146-L158
/// Order should match [`InferredDataType`]
static ref REGEX_SET: RegexSet = RegexSet::new([
// ...
r"^-?(\d+)$", //INTEGER
// ...
]).unwrap();
/// See https://github.com/apache/arrow-rs/blob/ebcc4a585136cd1d9696c38c41f71c9ced181f57/arrow-csv/src/reader/mod.rs
/// #L214-L223
/// Updates the [`InferredDataType`] with the given string
fn update(&mut self, string: &str) {
self.packed |= if string.starts_with('"') {
1 << 8 // Utf8
} else if let Some(m) = REGEX_SET.matches(string).into_iter().next() {
1 << m
} else {
1 << 8 // Utf8
}
}
One of the solutions is to change the regex expr for Int64
to match the range of it, but it seems to be a very complicate one. I told gpt to give me one but the regex expr's more than 300 chars, so I wonder if there're alternatives :)
I wonder if you could use a length check on the characters too (it wouldn't be perfect, but it might catch more)
As in you could bound the number of characters matched by the regexp to the number of characters in Int64::MAX (20 or so??)
I wonder if you could use a length check on the characters too (it wouldn't be perfect, but it might catch more)
Thanks for your suggestion! But if we want to precisely match it, we have to iter through the prefix of Int64::MAX and then we got:
922337203685477580[0-7]
9223372036854775[0-7]\d{2}
922337203685477[0-4]\d{3}
92233720368547[0-6]\d{4}
and so on...
Perhaps we can use a function for match strings here instead of direct regex here, which seems more flexible for me. If you agree on that, I'll try to submit a pr to arrow-rs
for further discussions :)
Perhaps we can use a function for match strings here instead of direct regex here, which seems more flexible for me. If you agree on that, I'll try to submit a pr to
arrow-rs
for further discussions :)
That seems like a good idea to me -- with sufficient care it should also be significantly faster than a regexp based matching approach.
I verified that this behavior now works well after upgrading to arrow 53.1.0
due to https://github.com/apache/arrow-rs/pull/6481 (thanks @CookiePieWw ❤️ )
DataFusion CLI v42.0.0
> CREATE EXTERNAL TABLE test
STORED AS CSV
LOCATION '../testing/data/csv/aggregate_test_100.csv'
;
0 row(s) fetched.
Elapsed 0.061 seconds.
> select * from test;
+----+----+------+--------+-------------+----------------------+-----+-------+------------+----------------------+-------------+---------------------+--------------------------------+
| c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 | c10 | c11 | c12 | c13 |
+----+----+------+--------+-------------+----------------------+-----+-------+------------+----------------------+-------------+---------------------+--------------------------------+
| c | 2 | 1 | 18109 | 2033001162 | -6513304855495910254 | 25 | 43062 | 1491205016 | 5863949479783605708 | 0.110830784 | 0.9294097332465232 | 6WfVFBVGJSQb7FhA7E0lBwdvjfZnSW |
| d | 5 | -40 | 22614 | 706441268 | -7542719935673075327 | 155 | 14337 | 3373581039 | 11720144131976083864 | 0.69632107 | 0.3114712539863804 | C2GT5KVyOPZpgKVl110TyZO0NcJ434 |
| b | 1 | 29 | -18218 | 994303988 | 5983957848665088916 | 204 | 9489 | 3275293996 | 14857091259186476033 | 0.53840446 | 0.17909035118828576 | AyYVExXK6AR2qUTxNZ7qRHQOVGMLcz |
| a | 1 | -85 | -15154 | 1171968280 | 1919439543497968449 | 77 | 52286 | 774637006 | 12101411955859039553 | 0.12285209 | 0.6864391962767343 | 0keZ5G8BffGwgF2RwQD59TFzMStxCB |
| b | 5 | -82 | 22080 | 1824882165 | 7373730676428214987 | 208 | 34331 | 3342719438 | 3330177516592499461 | 0.82634634 | 0.40975383525297016 | Ig1QcuKsjHXkproePdERo2w0mYzIqd |
| b | 4 | -111 | -1967 | -4229382 | 1892872227362838079 | 67 | 9832 | 1243785310 | 8382489916947120498 | 0.06563997 | 0.152498292971736 | Sfx0vxv1skzZWT1PqVdoRDdO6Sb6xH |
| e | 3 | 104 | -25136 | 1738331255 | 300633854973581194 | 139 | 20807 | 3577318119 | 13079037564113702254 | 0.40154034 | 0.7764360990307122 | DuJNG8tufSqW0ZstHqWj3aGvFLMg4A |
| a | 3 | 13 | 12613 | 1299719633 | 2020498574254265315 | 191 | 17835 | 3998790955 | 14881411008939145569 | 0.041445434 | 0.8813167497816289 | Amn2K87Db5Es3dFQO9cw9cvpAM6h35 |
| d | 1 | 38 | 18384 | -335410409 | -1632237090406591229 | 26 | 57510 | 2712615025 | 1842662804748246269 | 0.6064476 | 0.6404495093354053 | 4HX6feIvmNXBN7XGqgO4YVBkhu8GDI |
| a | 4 | -38 | 20744 | 762932956 | 308913475857409919 | 7 | 45465 | 1787652631 | 878137512938218976 | 0.7459874 | 0.02182578039211991 | ydkwycaISlYSlEq3TlkS2m15I2pcp8 |
| d | 1 | 57 | 28781 | -1143802338 | 2662536767954229885 | 202 | 62167 | 879082834 | 4338034436871150616 | 0.7618384 | 0.42950521730777025 | VY0zXmXeksCT8BzvpzpPLbmU9Kp9Y4 |
| a | 4 | -54 | -2376 | 434021400 | 5502271306323260832 | 113 | 15777 | 2502326480 | 7966148640299601101 | 0.5720931 | 0.30585375151301186 | KJFcmTVjdkCMv94wYCtfHMFhzyRsmH |
| e | 3 | 112 | -6823 | -421042466 | 8535335158538929274 | 129 | 32712 | 3759340273 | 9916295859593918600 | 0.6424343 | 0.6316565296547284 | BsM5ZAYifRh5Lw3Y8X1r53I0cTJnfE |
| d | 2 | 113 | 3917 | -108973366 | -7220140168410319165 | 197 | 24380 | 63044568 | 4225581724448081782 | 0.11867094 | 0.2944158618048994 | 90gAtmGEeIqUTbo1ZrxCvWtsseukXC |
| b | 1 | 54 | -18410 | 1413111008 | -7145106120930085900 | 249 | 5382 | 1842680163 | 17818611040257178339 | 0.8881188 | 0.24899794314659673 | 6FPJlLAcaQ5uokyOWZ9HGdLZObFvOZ |
| c | 1 | 103 | -22186 | 431378678 | 1346564663822463162 | 146 | 12393 | 3766999078 | 10901819591635583995 | 0.064453244 | 0.7784918983501654 | 2T3wSlHdEmASmO0xcXHnndkKEt6bz8 |
| e | 2 | 49 | 24495 | -587831330 | 9178511478067509438 | 129 | 12757 | 1289293657 | 10948666249269100825 | 0.5610077 | 0.5991138115095911 | bgK1r6v3BCTh0aejJUhkA1Hn6idXGp |
| d | 1 | -98 | 13630 | -1991133944 | 1184110014998006843 | 220 | 2986 | 225513085 | 9634106610243643486 | 0.89651865 | 0.1640882545084913 | y7C453hRWd4E7ImjNDWlpexB8nUqjh |
| d | 3 | 77 | 15091 | -1302295658 | 8795481303066536947 | 154 | 35477 | 2093538928 | 17419098323248948387 | 0.11952883 | 0.7035635283169166 | O66j6PaYuZhEUtqV6fuU7TyjM2WxC5 |
| e | 2 | 97 | 18167 | 1593800404 | -9112448817105133638 | 163 | 45185 | 3188005828 | 2792105417953811674 | 0.38175434 | 0.4094218353587008 | ukOiFGGFnQJDHFgZxHMpvhD3zybF0M |
| e | 4 | -56 | -31500 | 1544188174 | 3096047390018154410 | 220 | 417 | 557517119 | 2774306934041974261 | 0.15459597 | 0.19113293583306745 | IZTkHMLvIKuiLjhDjYMmIHxh166we4 |
| d | 1 | -99 | 5613 | 1213926989 | -8863698443222021480 | 19 | 18736 | 4216440507 | 14933742247195536130 | 0.6067944 | 0.33639590659276175 | aDxBtor7Icd9C5hnTvvw5NrIre740e |
| a | 5 | 36 | -16974 | 623103518 | 6834444206535996609 | 71 | 29458 | 141047417 | 17448660630302620693 | 0.17100024 | 0.04429073092078406 | OF7fQ37GzaZ5ikA2oMyvleKtgnLjXh |
| e | 4 | -53 | 13788 | 2064155045 | -691093532952651300 | 243 | 35106 | 2778168728 | 9463973906560740422 | 0.34515214 | 0.27159190516490006 | 0VVIHzxWtNOFLtnhjHEKjXaJOSLJfm |
| c | 2 | -29 | 25305 | -537142430 | -7683452043175617798 | 150 | 31648 | 598822671 | 11759014161799384683 | 0.8315913 | 0.946325164889271 | 9UbObCsVkmYpJGcGrgfK90qOnwb2Lj |
| a | 1 | -25 | 15295 | 383352709 | 4980135132406487265 | 231 | 102 | 3276123488 | 12763583666216333412 | 0.53796273 | 0.17592486905979987 | XemNcT1xp61xcM1Qz3wZ1VECCnq06O |
| c | 4 | 123 | 16620 | 852509237 | -3087630526856906991 | 196 | 33715 | 3566741189 | 4546434653720168472 | 0.07606989 | 0.819715865079681 | 8LIh0b6jmDGm87BmIyjdxNIpX4ugjD |
| a | 5 | -31 | -12907 | 586844478 | -4862189775214031241 | 170 | 28086 | 1013876852 | 11005002152861474932 | 0.35319167 | 0.05573662213439634 | MeSTAXq8gVxVjbEjgkvU9YLte0X9uE |
| a | 2 | 45 | 15673 | -1899175111 | 398282800995316041 | 99 | 2555 | 145294611 | 8554426087132697832 | 0.17333257 | 0.6405262429561641 | b3b9esRhTzFEawbs6XhpKnD9ojutHB |
| b | 3 | 17 | 14457 | 670497898 | -2390782464845307388 | 255 | 24770 | 1538863055 | 12662506238151717757 | 0.34077626 | 0.7614304100703713 | 6x93sxYioWuq5c9Kkk8oTAAORM7cH0 |
| e | 4 | 97 | -13181 | 2047637360 | 6176835796788944083 | 158 | 53000 | 2042457019 | 9726016502640071617 | 0.7085086 | 0.12357539988406441 | oHJMNvWuunsIMIWFnYG31RCfkOo2V7 |
| c | 2 | -60 | -16312 | -1808210365 | -3368300253197863813 | 71 | 39635 | 2844041986 | 7045482583778080653 | 0.805363 | 0.6425694115212065 | BJqx5WokrmrrezZA0dUbleMYkG5U2O |
| e | 1 | 36 | -21481 | -928766616 | -3471238138418013024 | 150 | 52569 | 2610290479 | 7788847578701297242 | 0.2578469 | 0.7670021786149205 | gpo8K5qtYePve6jyPt6xgJx4YOVjms |
| b | 5 | -5 | 24896 | 1955646088 | 2430204191283109071 | 118 | 43655 | 2424630722 | 11429640193932435507 | 0.87989986 | 0.7328050041291218 | JafwVLSVk5AVoXFuzclesQ000EE2k1 |
| a | 3 | 13 | 32064 | 912707948 | 3826618523497875379 | 42 | 21463 | 2214035726 | 10771380284714693539 | 0.6133468 | 0.7325106678655877 | i6RQVXKUh7MzuGMDaNclUYnFUAireU |
| c | 1 | 41 | -4667 | -644225469 | 7049620391314639084 | 196 | 48099 | 2125812933 | 15419512479294091215 | 0.5780736 | 0.9255031346434324 | mzbkwXKrPeZnxg2Kn1LRF5hYSsmksS |
| d | 2 | 93 | -12642 | 2053379412 | 6468763445799074329 | 147 | 50842 | 1000948272 | 5536487915963301239 | 0.4279275 | 0.28534428578703896 | lqhzgLsXZ8JhtpeeUWWNbMz8PHI705 |
| c | 3 | 73 | -9565 | -382483011 | 1765659477910680019 | 186 | 1535 | 1088543984 | 2906943497598597237 | 0.680652 | 0.6009475544728957 | Ow5PGpfTm4dXCfTDsXAOTatXRoAydR |
| c | 3 | -2 | -18655 | -2141999138 | -3154042970870838072 | 251 | 34970 | 3862393166 | 13062025193350212516 | 0.034291923 | 0.7697753383420857 | IWl0G3ZlMNf7WT8yjIB49cx7MmYOmr |
| c | 3 | 22 | 13741 | -2098805236 | 8604102724776612452 | 45 | 2516 | 1362369177 | 196777795886465166 | 0.94669616 | 0.0494924465469434 | 6oIXZuIPIqEoPBvFmbt2Nxy3tryGUE |
| . |
| . |
| . |
+----+----+------+--------+-------------+----------------------+-----+-------+------------+----------------------+-------------+---------------------+--------------------------------+
100 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 0.037 seconds.
Describe the bug
To Reproduce Steps to reproduce the behavior:
Expected behavior A clear and concise description of what you expected to happen.
Additional context Add any other context about the problem here.