elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License

Performance of `DataFrame.new/2` on dataframes containing list columns #922

Closed spencerkent closed 3 months ago

spencerkent commented 4 months ago

I'm noticing that creating a new DataFrame with a list column is veeeerrry sloooow (in Explorer 0.8.2), so I thought I'd share a little benchmark and see if there are any ideas for how to make it faster :)

Starting with two objects that support Table.Reader — one with no list columns, the other with one (synthetic data; a sketch of a generator for rows like these follows the two examples):

iex> IO.inspect(tr_object_no_list_columns)
[
  %{
    billing_code: "E1070",
    billing_code_type_version: "2024",
    billing_code_description: "rxndfe4t",
    negotiated_rate: 4348.933298145181
  },
  %{
    billing_code: "G9143",
    billing_code_type_version: "2024",
    billing_code_description: "3dc8r70lej2g",
    negotiated_rate: 3008.7836399549005
  },
  %{
    billing_code: "G0482",
    billing_code_type_version: "2024",
    billing_code_description: "bhq0y163",
    negotiated_rate: 7891.783042709335
  }
...
]
iex> IO.inspect(tr_object_with_list_columns)
[
  %{
    billing_code: "G9143",
    billing_code_type_version: "2024",
    service_codes: ["57", "81", "50", "34", "32"],
    negotiated_rate: 4596.469175710615
  },
  %{
    billing_code: "A4220",
    billing_code_type_version: "2024",
    service_codes: ["71"],
    negotiated_rate: 2718.823985442407
  },
  %{
    billing_code: "A4220",
    billing_code_type_version: "2024",
    service_codes: ["62", "1", "56", "18", "23", "13", "14", "19"],
    negotiated_rate: 7437.797724472413
  }
...
]
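
For reference, a generator for rows shaped like these might look as follows; the `SyntheticRows` module and its helpers are hypothetical stand-ins, not the code that actually produced the data above:

defmodule SyntheticRows do
  @moduledoc false
  # Hypothetical generator for rows shaped like the examples above.
  # Only the shape matters for the benchmark; the values are random.

  def build(n, with_list_column? \\ false) do
    for _ <- 1..n do
      base = %{
        billing_code: random_code(),
        billing_code_type_version: "2024",
        negotiated_rate: :rand.uniform() * 10_000
      }

      if with_list_column? do
        Map.put(base, :service_codes, for(_ <- 1..:rand.uniform(8), do: "#{:rand.uniform(99)}"))
      else
        Map.put(base, :billing_code_description, random_string(8))
      end
    end
  end

  defp random_code do
    <<Enum.random(?A..?Z)>> <> String.pad_leading("#{:rand.uniform(9999)}", 4, "0")
  end

  defp random_string(len), do: for(_ <- 1..len, into: "", do: <<Enum.random(?a..?z)>>)
end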

These objects are essentially the same size:

iex>IO.puts("The size of the object without a list column is: #{
  :erlang.term_to_binary(tr_object_no_list_columns) 
  |> :erlang.byte_size()) / 1_000_000} MB"
)
"The size of the object without a list column is: 0.14509 MB"
iex>IO.puts("The size of the object *with* a list column is: #{
  :erlang.term_to_binary(tr_object_with_list_columns) 
  |> :erlang.byte_size()) / 1_000_000} MB"
)
"The size of the object *with* a list column is: 0.14586 MB"

The comparison:

# manually specifying the dtypes, just in case
iex> dtypes = %{
  no_list: %{
    billing_code: :string,
    billing_code_type_version: :string,
    billing_code_description: :string,
    negotiated_rate: {:f, 32}
  },
  with_list: %{
    billing_code: :string,
    billing_code_type_version: :string,
    negotiated_rate: {:f, 32},
    service_codes: {:list, :string}
  }
}

iex> Benchee.run(
  %{
    "no_list_column" => fn -> Explorer.DataFrame.new(tr_object_no_list_columns, dtypes: dtypes.no_list) end,
    "with_list_column" => fn -> Explorer.DataFrame.new(tr_object_with_list_columns, dtypes: dtypes.with_list) end
  },
  time: 60,
  memory_time: 2
)

Output from Benchee:

Warning: the benchmark no_list_column is using an evaluated function.
  Evaluated functions perform slower than compiled functions.
  You can move the Benchee caller to a function in a module and invoke `Mod.fun()` instead.
  Alternatively, you can move the benchmark into a benchmark.exs file and run mix run benchmark.exs

Warning: the benchmark with_list_column is using an evaluated function.
  Evaluated functions perform slower than compiled functions.
  You can move the Benchee caller to a function in a module and invoke `Mod.fun()` instead.
  Alternatively, you can move the benchmark into a benchmark.exs file and run mix run benchmark.exs

Operating System: Linux
CPU Information: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
Number of Available Cores: 16
Available memory: 19.53 GB
Elixir 1.16.0
Erlang 26.2.1
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 1 min
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 2 min 8 s

Benchmarking no_list_column ...
Warning: The function you are trying to benchmark is super fast, making measurements more unreliable!
This holds especially true for memory measurements or when running with hooks.

See: https://github.com/bencheeorg/benchee/wiki/Benchee-Warnings#fast-execution-warning

You may disable this warning by passing print: [fast_warning: false] as configuration options.

Benchmarking with_list_column ...
Calculating statistics...
Formatting results...

Name                       ips        average  deviation         median         99th %
no_list_column          917.12        1.09 ms    ±11.38%        1.08 ms        1.40 ms
with_list_column          9.50      105.27 ms     ±4.16%      104.67 ms      118.95 ms

Comparison: 
no_list_column          917.12
with_list_column          9.50 - 96.55x slower +104.18 ms

Memory usage statistics:

Name                Memory usage
no_list_column         465.27 KB
with_list_column       612.75 KB - 1.32x memory usage +147.48 KB

**All measurements for memory usage were the same**
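
Regarding the "evaluated function" warnings above: a minimal sketch of the module-based setup Benchee suggests might look like the block below (`NewBench` and its arguments are arbitrary names; the numbers reported here were measured with the evaluated iex version).

defmodule NewBench do
  # Wrapping the Benchee call in a compiled module avoids the
  # "evaluated function" overhead the warnings mention.
  def run(no_list_rows, with_list_rows, dtypes) do
    Benchee.run(
      %{
        "no_list_column" => fn -> Explorer.DataFrame.new(no_list_rows, dtypes: dtypes.no_list) end,
        "with_list_column" => fn -> Explorer.DataFrame.new(with_list_rows, dtypes: dtypes.with_list) end
      },
      time: 60,
      memory_time: 2
    )
  end
end
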
billylanchantin commented 4 months ago

Interesting. ~100x slower is certainly surprising. I think the {:list, :string} dtype needs a fair bit of tricky memory allocation which might account for the difference, but I wouldn't expect it to be that stark.

By chance, are you able to test this same thing in Polars? Python Polars would be easiest.

josevalim commented 4 months ago

@billylanchantin we had some past discussions about this. The main issue, I think, is that we always traverse the given data, even when a dtype is provided. We should have a mechanism to say "this is the dtype, don't try to check or cast it", so we can send the data as-is without inferring or casting (a rough call-site sketch follows below). The big question is which behavior we want as the default: to cast or not to cast? We can probably do this in time for the upcoming v0.9; it is not a big change.

The improvements from #863 should also help here.
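
A purely illustrative call-site sketch of such an opt-out (the option name is made up; nothing in this thread fixes what it would actually be called):

# Hypothetical: `check: false` is not an existing Explorer option; it only
# illustrates the idea of "trust the given dtypes, skip the Elixir-side
# inference and casting pass".
Explorer.DataFrame.new(tr_object_with_list_columns, dtypes: dtypes.with_list, check: false)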

billylanchantin commented 4 months ago

@josevalim Ah gotcha, thanks. Yeah I see #863 has the todo:

  • [ ] Remove the type checking Elixir-side

When you say "it is not a big change", do you mean finishing #863 isn't big? Or adding a separate mechanism to skip the type check isn't big?

josevalim commented 4 months ago

I am not sure how hard the PR is; I meant that removing the casting and making the inference on the Elixir side optional is not a big change.

I pushed a commit that removes the casting: https://github.com/elixir-explorer/explorer/commit/c5c2eb7b215be0d33ef6e28fbc06c02209c75ccb

And this PR removes the inference: https://github.com/elixir-explorer/explorer/pull/923. @spencerkent, can you please try the PR and let us know how it impacts the numbers? Altogether, those changes should remove at least two traversals on the Elixir side.
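
If it helps anyone trying a branch like this before a release, one common approach is a git dependency in `mix.exs`; the `branch:` value below is just a placeholder for whichever branch or commit carries the change:

# mix.exs -- point the dependency at the Explorer repo to try unreleased changes;
# replace `branch: "main"` with the PR's branch or a specific `ref:` as needed.
defp deps do
  [
    {:explorer, github: "elixir-explorer/explorer", branch: "main"}
  ]
end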

spencerkent commented 4 months ago

Hi all, thanks for the quick follow-up! I was out of commission this weekend, but I had a chance to test PR #923 this morning, and it looks like it makes a large improvement:

Operating System: Linux                                                                                       
CPU Information: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz                                                     
Number of Available Cores: 16                                                                                 
Available memory: 19.53 GB                                                                                    
Elixir 1.16.0                                                                                                 
Erlang 26.2.1                                                                                                 
JIT enabled: true                                                                                             

Benchmark suite executing with the following configuration:                                                   
warmup: 2 s                                                                                                   
time: 1 min                                                                                                   
memory time: 2 s                                                                                              
reduction time: 0 ns                                                                                          
parallel: 1                                                                                                   
inputs: none specified                                                                                        
Estimated total run time: 2 min 8 s                                                                           

Benchmarking no_list_column ...                                                                               
Benchmarking with_list_column ...                                                                             
Calculating statistics...                                                                                     
Formatting results...                                                                                         

Name                       ips        average  deviation         median         99th %                        
no_list_column          366.50        2.73 ms    ±54.29%        2.42 ms        7.29 ms                        
with_list_column         60.24       16.60 ms    ±30.71%       15.37 ms       28.83 ms                        

Comparison:                                                                                                   
no_list_column          366.50                                                                                
with_list_column         60.24 - 6.08x slower +13.87 ms                                                       

Memory usage statistics:                                                                                      

Name                Memory usage                                                                              
no_list_column         449.64 KB                                                                              
with_list_column       558.02 KB - 1.24x memory usage +108.38 KB                                              

**All measurements for memory usage were the same**                                                           

So, still slower, but much better :)

josevalim commented 3 months ago

Hi @spencerkent, feel free to try main; we merged that PR and a couple of other improvements by @philss. We have one more down the pipeline. :)

josevalim commented 3 months ago

Closing this based on the latest PRs. There may be some list-specific optimizations we could still look into, but there should be a sizable improvement already. Thank you!