elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License

DataFrame.load_csv!/2 seems to fail on certain options #963

Closed: chgeuer closed this issue 3 months ago

chgeuer commented 3 months ago

After downloading a CSV from the web, I want to load it into a DataFrame.

  1. Loading the CSV directly from memory using DataFrame.load_csv!/2 results in a RuntimeError.
  2. Storing the CSV binary in a temporary file, and then loading it using DataFrame.from_csv!/2 works fine.

My assumption was that load_csv! and from_csv! should behave identically (apart from touching disk).

The code below, using load_csv!/2, gives me:

%RuntimeError{
   message: "load_csv failed: 
       %RuntimeError{
         message: "Polars Error: found more fields than defined in 'Schema'
        Consider setting 'truncate_ragged_lines=true'.

Here's a quick repro for Livebook:

Mix.install([
  {:req, "~> 0.5.6"},
  {:explorer, "~> 0.9.0"},
  {:kino_explorer, "~> 0.1.21"},
  {:iconv, "~> 1.0"}
])

require Explorer.DataFrame, as: DF
require Explorer.Series, as: S

url = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/precipitation/recent/stundenwerte_RR_01078_akt.zip"

%Req.Response{status: 200, body: zip} = Req.get!(url: url)

# Convert every archive entry from ISO 8859-15 to UTF-8, then pick the "produkt*" data file.
csv =
  zip
  |> Enum.into(%{}, fn {name, content} ->
    {to_string(name), :iconv.convert("iso8859-15", "utf-8", content)}
  end)
  |> Enum.find(fn {name, _} -> String.starts_with?(name, "produkt") end)
  |> elem(1)

csv_opts = [
  header: true,
  delimiter: ";",
  infer_schema_length: 10,
  nil_values: for(n <- 0..10, do: String.duplicate(" ", n) <> "-999"),
  dtypes: [
    {"STATIONS_ID", {:u, 16}},
    {"MESS_DATUM", :string},
    {"QN_8", {:u, 16}},
    {"  R1", {:f, 32}},
    {"RS_IND", {:f, 32}},
    {"WRTR", {:f, 32}}
  ],
]

try do
  DF.load_csv!(csv, csv_opts)
rescue
  err -> IO.puts("load_csv!/2 resulted in #{inspect err}")
end

File.write!("1.csv", csv)
df = DF.from_csv!("1.csv", csv_opts)
IO.puts("from_csv!/2 works well")
df |> DF.dtypes()

Maybe that's not a bug but a user error on my side. Is there a better place to ask that question?

josevalim commented 3 months ago

Which version are you using? We had a bug like this, but it was fixed in 0.9.1. Try forcing ~> 0.9.1 instead.

chgeuer commented 3 months ago

Thanks, @josevalim, that works perfectly in v0.9.1. Closing this issue.
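
For readers hitting the same error, the fix discussed in this thread amounts to raising the Explorer requirement in the repro's setup cell. A minimal sketch, assuming the other dependencies stay exactly as in the original report:

```elixir
# Assumption: same dependency list as the repro, with only the Explorer
# requirement changed. "~> 0.9.1" allows any version >= 0.9.1 and < 0.10.0,
# so the resolver can no longer pick 0.9.0.
Mix.install([
  {:req, "~> 0.5.6"},
  {:explorer, "~> 0.9.1"},
  {:kino_explorer, "~> 0.1.21"},
  {:iconv, "~> 1.0"}
])
```

To check which version was actually resolved, `Application.spec(:explorer, :vsn)` can be evaluated after installation.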