elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License

DataFrame cast string to Time #119

Closed hoetmaaiers closed 2 years ago

hoetmaaiers commented 2 years ago

I am kinda stuck with a simple problem. Since I can't find the solution in the documentation and SO is quiet, maybe this is the place? Even if only this issue serves as documentation for someone else, it might be worth it?

So my simplified scenario:

I'm trying to cast the values of an Explorer.Series from strings to a Time type.

alias Explorer.{DataFrame, Series}

df = DataFrame.from_map(%{a: ["00:30:00", "01:00:00", "05:30:00"]})

# Convert an ISO 8601 time string into fractional hours
transform_duration = fn duration ->
  time = Time.from_iso8601!(duration)
  time.hour + (time.minute / 60) + (time.second / (60 * 60))
end

DataFrame.mutate(df, a: Series.transform(df["a"], fn x -> transform_duration.(x) end))

The DataFrame.mutate function won't allow me to transform to a different type. Should I combine this Series.transform with a Series.cast?

Maybe my whole approach ain't good? It ain't working in any way I've tried.

philss commented 2 years ago

Hi @hoetmaaiers :wave:

I believe you can't change the type of the series in the transformation. Since you are trying to execute Elixir code for this, how about using Enum.map/2 for the transformation?

Something like this:

alias Explorer.{DataFrame, Series}

df = DataFrame.from_map(%{a: ["00:30:00", "01:00:00", "05:30:00"]})

transform_duration = fn duration ->
  time = Time.from_iso8601!(duration)
  time.hour + (time.minute / 60) + (time.second / (60 * 60))
end

new_a_series = df["a"] |> Series.to_list() |> Enum.map(transform_duration) |> Series.from_list()

DataFrame.mutate(df, a: new_a_series)

WDYT?

PS: you don't need that |> Series.from_list() part for this to work.
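
For anyone who wants to check the conversion itself without Explorer, here is a minimal sketch in plain Elixir. The name `to_hours` is only illustrative and not part of any library. Note that the seconds are divided by 3600 directly: `/` and `*` have the same precedence and associate left to right, so dividing by `60*60` without parentheses would first divide by 60 and then multiply by 60.

```elixir
# Plain Elixir only; no Explorer calls are made here.
durations = ["00:30:00", "01:00:00", "05:30:00"]

# Convert an ISO 8601 time string into fractional hours.
# `to_hours` is a hypothetical helper name used for illustration.
to_hours = fn duration ->
  time = Time.from_iso8601!(duration)
  time.hour + time.minute / 60 + time.second / 3600
end

hours = Enum.map(durations, to_hours)
IO.inspect(hours)
```

The resulting list of floats is what would then be handed back to Explorer as the new column.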

cigrainger commented 2 years ago

This is a bug! We shouldn't assume you're not changing dtypes when using Series.transform/2. I'll fix it. :+1:

hoetmaaiers commented 2 years ago

Thank you @philss , thinking outside Explorer with regular Elixir, why didn't I think of this myself...

@cigrainger, does this mean @philss's suggestion isn't the only way of doing this?

cigrainger commented 2 years ago

@hoetmaaiers the solution suggested by @philss is basically what Explorer is doing under the hood anyway. The fix I pushed today means your original code will work now -- it was a bug I introduced when implementing Series.transform/2. I accidentally had it so there was an assumption that transform would retain the original dtype and that's not very useful 😀. So now we check the dtype of the new list before creating a series from it.

hoetmaaiers commented 2 years ago

The combination of DataFrame.mutate and Series.transform isn't working for me. Probably it is my bad, but the documentation seems to miss this combination. What am I doing wrong?

df = DataFrame.from_map(%{a: ["00:30:00", "01:00:00", "05:30:00"]})

transform_duration = fn duration ->
  time = Time.from_iso8601!(duration)
  time.hour + (time.minute / 60) + (time.second / (60 * 60))
end

DataFrame.mutate(df, a: &Series.transform(&transform_duration.(&1["a"])))

This gives me the following compile error: ** (CompileError) dataframe.exs:36: nested captures via & are not allowed: &transform_duration.(&1["a"])

josevalim commented 2 years ago

You are not allowed to use & inside &, that’s what the compile error is telling you. Could the error message have been clearer in this case?
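
To illustrate the restriction with plain Elixir (no Explorer involved), here is a small sketch of two common ways to avoid a nested capture. The names `double`, `with_fn` and `with_capture` are made up for this example.

```elixir
double = fn x -> x * 2 end

# Not allowed (nested capture), fails to compile:
#   &Enum.map(&1, &double.(&1))

# Option 1: write the outer function with fn and keep a capture inside.
with_fn = fn list -> Enum.map(list, &double.(&1)) end

# Option 2: keep the outer capture and pass the inner function by name.
with_capture = &Enum.map(&1, double)

IO.inspect(with_fn.([1, 2, 3]))
IO.inspect(with_capture.([1, 2, 3]))
```

Applied to the snippet above, the same idea would presumably mean writing the outer callback with `fn` and passing `transform_duration` directly, assuming `DataFrame.mutate` accepts a one-argument function as the capture in that snippet implies.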

hoetmaaiers commented 2 years ago

No, the error message is clear, no doubt. I'm struggling with the proper combination of transform and mutate and have tried several approaches.

Maybe the documentation for this use case can be clearer? I based my attempt on the notebook example in this repo.