elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License

LazyFrame from http endpoint #993

Closed by ceyhunkerti 1 month ago

ceyhunkerti commented 1 month ago

I tried to add a simple test like the one below, but it hangs:

  describe "from_parquet/2 - HTTP" do
    setup do
      [bypass: Bypass.open(), df: Explorer.Datasets.wine()]
    end

    test "reads a parquet file from an HTTP server", %{bypass: bypass, df: df} do
      Bypass.expect(bypass, "GET", "/path/to/file.parquet", fn conn ->
        bytes = Explorer.DataFrame.dump_parquet!(df)
        Plug.Conn.resp(conn, 200, bytes)
      end)

      url = http_endpoint(bypass.port) <> "/path/to/file.parquet"

      assert {:ok, ldf} = Explorer.DataFrame.from_parquet(url, lazy: true)
      df1 = DF.compute(ldf)

      assert DF.to_columns(df1) == DF.to_columns(df)
    end
  end

  defp http_endpoint(port), do: "http://localhost:#{port}"

It eventually fails with something like:

     test/explorer/data_frame/lazy_test.exs:223
     match (=) failed
     code:  assert {:ok, ldf} = DF.from_parquet(url, lazy: true)
     left:  {:ok, ldf}
     right: {
              :error,
              %RuntimeError{message: "Polars Error: Generic HTTP error: Request error: Error after 2 retries in 180.2322859s, max_retries:10, retry_timeout:180s, source:error sending request for url (http://localhost:46017/test): 'parquet scan' failed: 'select' input failed to resolve"}
            }
     stacktrace:
       test/explorer/data_frame/lazy_test.exs:232: (test)

philss commented 1 month ago

> Couldn't add a test. Maybe someone more knowledgeable about the repo can point me to how to add it; the Bypass setup doesn't seem to work with the lazy option.

I think the problem may be due to a restriction in Polars: it may require HTTPS. If you tested manually and it's working, we can move on without the test :)

ceyhunkerti commented 1 month ago

Yes, I tested it with the following from the original issue, and it works:

frame = Explorer.DataFrame.from_parquet!("https://huggingface.co/datasets/aqubed/kub_tickets_small/resolve/main/data/train-00000-of-00001-47868532d4f55873.parquet", lazy: true)

philss commented 1 month ago

@ceyhunkerti awesome! Thank you!

josevalim commented 1 month ago

:green_heart: :blue_heart: :purple_heart: :yellow_heart: :heart: