Closed billylanchantin closed 9 months ago
This issue is a blocker for the property test PR. Should I fix it there or do it separately?
Lists in polars are tricky:
For [[] , [[]] , [[[[]]]], [[2]] ] => dtype will be list[list[s64]]
[[[[]]]] => dtype will be list[list[list[list[null]]]]
What is the rationale behind this, any clue ?
@lkarthee Interesting I didn't realize. Is this the behavior of Rust Polars or Python Polars?
I have tested it in py polars - from your rust example it looks like rust is no different from py.
I ask because I'm not sure you could make it in Rust because the compiler might prevent you.
Also, what's resulting series from [[] , [[]] , [[[[]]]], [[2]]]
? I'd assume this:
[[] , [[]] , [[]], [[2]]]
^^^^
Inner nesting removed
Otherwise the dtypes wouldn't match, right?
I'll try and test these in Rust later today.
Yeah true.
s = pl.Series([[[[[[[]]]]]], [], [[]]])
shape: (3,)
Series: '' [list[list[list[list[list[list[null]]]]]]]
[
[[[[[[]]]]]]
[]
[[]]
]
s = pl.Series([[[[[[[]]]]]], [], [[]], [[]], [[0], [], []]])
shape: (5,)
Series: '' [list[list[i64]]]
[
[null]
[]
[[]]
[[]]
[[0], [], []]
]
Interesting to note that first element reduced to [null] not [[]].
Interesting to note that first element reduced to [null] not [[]].
Hm yeah I would've expected [[]]
or [[null]]
, not that.
Hm yeah I would've expected [[]] or [[null]], not that.
That can potentially be a polars bug?
Yeah it might be a bug. Although it's technically legal since every element can potentially be null
. Who knows, maybe it's actually the correct way to do it for some non-obvious reason 🤷
(But I think bug is most likely)
s = pl.Series([[], [[[3]]], [[2], [2]]])
shape: (3,)
Series: '' [list[list[list[i64]]]]
[
[]
[[[3]]]
[[[2]], [[2]]]
]
This makes me believe there is a bug - a list 2 level nesting [[2]] is converted to 3 level nesting [[[2]]]. Better to log an issue with polars before attempting a fix ? It only considers dtype of first non empty element.
s = pl.Series([[], [3], [[2], [2]]])
shape: (3,)
Series: '' [list[i64]]
[
[]
[3]
[null, null]
]
I started to make an issue, but now I think Polars is more correct than we initially thought. Consider this example:
pl.Series([["a"], [1]])
# shape: (2,)
# Series: '' [list[str]]
# [
# ["a"]
# ["1"]
# ]
pl.Series([[1], ["a"]])
# shape: (2,)
# Series: '' [list[i64]]
# [
# [1]
# [null]
# ]
The dtypes of these series are either list[str]
or list[i64]
. Polars needs a tie-breaker, so letting the first element decide seems like a sensible choice. I think that's what's happening with some of our examples.
However, I still can't account for some other stuff we're seeing. These cases still seem wrong even if we use the first-non-empty-element-wins rule:
# Wrapping elements in additional layers of nesting to make the dtypes compatible
pl.Series([[[2]], [1, 1]])
# shape: (2,)
# Series: '' [list[list[i64]]]
# [
# [[2]]
# [[1], [1]]
# ]
# An empty list as the errant element pushes the null up one layer of nesting
pl.Series([ [[2, 2]], [["b"]] ])
# shape: (2,)
# Series: '' [list[list[i64]]]
# [
# [[2, 2]]
# [[null]]
# ]
pl.Series([ [[2, 2]], [[[]]] ])
# shape: (2,)
# Series: '' [list[list[i64]]]
# [
# [[2, 2]]
# [null]
# ]
Then there's this:
pl.Series([ [[2, 2]], [[2, 2]] ])
# shape: (2,)
# Series: '' [list[list[i64]]]
# [
# [[2, 2]]
# [[2, 2]]
# ]
pl.Series([ [[2, 2]], [[2, []]] ])
# shape: (2,)
# Series: '' [o][object]
# [
# [[2, 2]]
# [[2, []]]
# ]
Not sure what to make of that. Does anyone know if the intended behavior is documented anywhere? I didn't see it in the Python docs, but I might've just missed it.
The dtypes of these series are either list[str] or list[i64]. Polars needs a tie-breaker, so letting the first element decide seems like a sensible choice. I think that's what's happening with some of our examples.
You are right, that makes sense. This means we should raise on these cases, including in several of the examples below:
pl.Series([[[2]], [1, 1]])
Should raise.
pl.Series([ [[2, 2]], [["b"]] ])
Should raise.
pl.Series([ [[2, 2]], [[[]]] ])
Should raise. You can think that [[[]]]
means it is at least a three element deep list, which violates the first value.
pl.Series([ [[2, 2]], [[2, []]] ])
Should raise.
The example you originally reported is still wrong and has to be fixed though. [[], [[]], [[1]]]
means that the 1+ level list, then 2+ level list, and officially confirmed with a {:list, {:list, :s64}}
list. I can try to work on a PR, since I wrote the current version of the code.
@josevalim I think Polars is making non-matching elements null
instead of raising. So for:
pl.Series([[[2]], [1, 1]])
Should raise.
I think it should yield [[[2]], [null, null]]
instead since each 1
it encounters ought to be a list. I have no idea why they decided to wrap each 1
to yield [[[2]], [[1], [1]]]
.
That said, we can also chose to differ from how Python Polars is resolving these situations by raising.
I think we should be stricter, unless we have a valid use case. Returning nil
or wrapping are both undesired IMO.
Found via:
This panics:
with:
It appears that this is a legal series:
https://www.rustexplorer.com/b/0ga2cs