elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License
1.13k stars 123 forks source link

Nested lists of lists can panic #857

Closed billylanchantin closed 9 months ago

billylanchantin commented 9 months ago

Found via:

This panics:

test "list of list of lists of integers with two levels of empty lists" do
  values = [[], [[]], [[1]]]
  series = Series.from_list(values)

  assert series.dtype == {:list, {:list, {:s, 64}}}
  assert Series.to_list(series) == values
end

with:

  1) test from_list/2 list of list of lists of integers with two levels of empty lists (Explorer.Series.ListTest)
     test/explorer/series/list_test.exs:122
     ** (ErlangError) Erlang error: :nif_panicked
     code: series = Series.from_list(values)
     stacktrace:
       (explorer 0.9.0-dev) Explorer.PolarsBackend.Native.s_from_list_of_series("", [#Explorer.PolarsBackend.Series<
    shape: (0,)
    Series: '' [list[null]]
    [
    ]
  >, #Explorer.PolarsBackend.Series<
    shape: (1,)
    Series: '' [list[i64]]
    [
        []
    ]
  >, #Explorer.PolarsBackend.Series<
    shape: (1,)
    Series: '' [list[i64]]
    [
        [1]
    ]
  >])
       (explorer 0.9.0-dev) lib/explorer/polars_backend/series.ex:24: Explorer.PolarsBackend.Series.from_list/2
       (explorer 0.9.0-dev) lib/explorer/series.ex:418: Explorer.Series.from_list/2
       test/explorer/series/list_test.exs:124: (test)

It appears that this is a legal series:

https://www.rustexplorer.com/b/0ga2cs

shape: (3,)
Series: 's' [list[list[i32]]]
[
    []
    [[]]
    [[1]]
]
billylanchantin commented 9 months ago

This issue is a blocker for the property test PR. Should I fix it there or do it separately?

lkarthee commented 9 months ago

Lists in polars are tricky:

For [[] , [[]] , [[[[]]]], [[2]] ] => dtype will be list[list[s64]]

[[[[]]]] => dtype will be list[list[list[list[null]]]]

What is the rationale behind this, any clue ?

billylanchantin commented 9 months ago

@lkarthee Interesting I didn't realize. Is this the behavior of Rust Polars or Python Polars?

lkarthee commented 9 months ago

I have tested it in py polars - from your rust example it looks like rust is no different from py.

billylanchantin commented 9 months ago

I ask because I'm not sure you could make it in Rust because the compiler might prevent you.

Also, what's resulting series from [[] , [[]] , [[[[]]]], [[2]]]? I'd assume this:

[[] , [[]] , [[]], [[2]]] 
             ^^^^
             Inner nesting removed

Otherwise the dtypes wouldn't match, right?

I'll try and test these in Rust later today.

lkarthee commented 9 months ago

Yeah true.

s = pl.Series([[[[[[[]]]]]], [], [[]]])
shape: (3,)
Series: '' [list[list[list[list[list[list[null]]]]]]]
[
    [[[[[[]]]]]]
    []
    [[]]
]
s = pl.Series([[[[[[[]]]]]], [], [[]], [[]], [[0], [], []]])
shape: (5,)
Series: '' [list[list[i64]]]
[
    [null]
    []
    [[]]
    [[]]
    [[0], [], []]
]

Interesting to note that first element reduced to [null] not [[]].

billylanchantin commented 9 months ago

Interesting to note that first element reduced to [null] not [[]].

Hm yeah I would've expected [[]] or [[null]], not that.

josevalim commented 9 months ago

Hm yeah I would've expected [[]] or [[null]], not that.

That can potentially be a polars bug?

billylanchantin commented 9 months ago

Yeah it might be a bug. Although it's technically legal since every element can potentially be null. Who knows, maybe it's actually the correct way to do it for some non-obvious reason 🤷

(But I think bug is most likely)

lkarthee commented 9 months ago
s = pl.Series([[], [[[3]]], [[2], [2]]])
shape: (3,)
Series: '' [list[list[list[i64]]]]
[
    []
    [[[3]]]
    [[[2]], [[2]]]
]

This makes me believe there is a bug - a list 2 level nesting [[2]] is converted to 3 level nesting [[[2]]]. Better to log an issue with polars before attempting a fix ? It only considers dtype of first non empty element.

s = pl.Series([[], [3], [[2], [2]]])
shape: (3,)
Series: '' [list[i64]]
[
    []
    [3]
    [null, null]
]
billylanchantin commented 9 months ago

I started to make an issue, but now I think Polars is more correct than we initially thought. Consider this example:

pl.Series([["a"], [1]])
# shape: (2,)
# Series: '' [list[str]]
# [
#         ["a"]
#         ["1"]
# ]

pl.Series([[1], ["a"]])
# shape: (2,)
# Series: '' [list[i64]]
# [
#         [1]
#         [null]
# ]

The dtypes of these series are either list[str] or list[i64]. Polars needs a tie-breaker, so letting the first element decide seems like a sensible choice. I think that's what's happening with some of our examples.

However, I still can't account for some other stuff we're seeing. These cases still seem wrong even if we use the first-non-empty-element-wins rule:

# Wrapping elements in additional layers of nesting to make the dtypes compatible
pl.Series([[[2]], [1, 1]])
# shape: (2,)
# Series: '' [list[list[i64]]]
# [
#         [[2]]
#         [[1], [1]]
# ]

# An empty list as the errant element pushes the null up one layer of nesting
pl.Series([ [[2, 2]], [["b"]] ])
# shape: (2,)
# Series: '' [list[list[i64]]]
# [
#         [[2, 2]]
#         [[null]]
# ]

pl.Series([ [[2, 2]], [[[]]] ])
# shape: (2,)
# Series: '' [list[list[i64]]]
# [
#         [[2, 2]]
#         [null]
# ]

Then there's this:

pl.Series([ [[2, 2]], [[2, 2]] ])
# shape: (2,)
# Series: '' [list[list[i64]]]
# [
#         [[2, 2]]
#         [[2, 2]]
# ]

pl.Series([ [[2, 2]], [[2, []]] ])
# shape: (2,)
# Series: '' [o][object]
# [
#         [[2, 2]]
#         [[2, []]]
# ]

Not sure what to make of that. Does anyone know if the intended behavior is documented anywhere? I didn't see it in the Python docs, but I might've just missed it.

josevalim commented 9 months ago

The dtypes of these series are either list[str] or list[i64]. Polars needs a tie-breaker, so letting the first element decide seems like a sensible choice. I think that's what's happening with some of our examples.

You are right, that makes sense. This means we should raise on these cases, including in several of the examples below:

pl.Series([[[2]], [1, 1]])

Should raise.

pl.Series([ [[2, 2]], [["b"]] ])

Should raise.

pl.Series([ [[2, 2]], [[[]]] ])

Should raise. You can think that [[[]]] means it is at least a three element deep list, which violates the first value.

pl.Series([ [[2, 2]], [[2, []]] ])

Should raise.


The example you originally reported is still wrong and has to be fixed though. [[], [[]], [[1]]] means that the 1+ level list, then 2+ level list, and officially confirmed with a {:list, {:list, :s64}} list. I can try to work on a PR, since I wrote the current version of the code.

billylanchantin commented 9 months ago

@josevalim I think Polars is making non-matching elements null instead of raising. So for:

pl.Series([[[2]], [1, 1]])

Should raise.

I think it should yield [[[2]], [null, null]] instead since each 1 it encounters ought to be a list. I have no idea why they decided to wrap each 1 to yield [[[2]], [[1], [1]]].

That said, we can also chose to differ from how Python Polars is resolving these situations by raising.

josevalim commented 9 months ago

I think we should be stricter, unless we have a valid use case. Returning nil or wrapping are both undesired IMO.