apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.48k stars 1.01k forks source link

array_agg panics on large rows. #6524

Open dadepo opened 1 year ago

dadepo commented 1 year ago

Describe the bug

    let df = ctx.sql(r#"
    SELECT count(payload) from demo
    "#).await?;

+---------------------+
| COUNT(demo.payload) |
+---------------------+
| 68934                          |
+---------------------+

Then

    let df = ctx.sql(r#"
    SELECT array_agg(payload) from demo
    "#).await?;

After running for about 2 mins, panics

thread 'main' panicked at 'attempt to add with overflow', /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-6.1.4/src/utils/mod.rs:44:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/core/src/panicking.rs:114:5
   3: comfy_table::utils::ColumnDisplayInfo::width
             at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-6.1.4/src/utils/mod.rs:44:9
   4: comfy_table::utils::formatting::borders::draw_top_border
             at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-6.1.4/src/utils/formatting/borders.rs:54:40
   5: comfy_table::utils::formatting::borders::draw_borders
             at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-6.1.4/src/utils/formatting/borders.rs:21:20
   6: comfy_table::utils::build_table
             at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-6.1.4/src/utils/mod.rs:51:5
   7: comfy_table::table::Table::lines
             at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-6.1.4/src/table.rs:95:9
   8: <comfy_table::table::Table as core::fmt::Display>::fmt
             at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-6.1.4/src/table.rs:47:25
   9: core::fmt::write
             at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/core/src/fmt/mod.rs:1230:17
  10: std::io::Write::write_fmt
             at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/std/src/io/mod.rs:1682:15
  11: <&std::io::stdio::Stdout as std::io::Write>::write_fmt
             at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/std/src/io/stdio.rs:715:9
  12: <std::io::stdio::Stdout as std::io::Write>::write_fmt
             at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/std/src/io/stdio.rs:689:9
  13: std::io::stdio::print_to
             at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/std/src/io/stdio.rs:1007:21
  14: std::io::stdio::_print
             at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/std/src/io/stdio.rs:1074:5
  15: arrow_cast::pretty::print_batches
             at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-cast-38.0.0/src/pretty.rs:62:5
  16: datafusion::dataframe::DataFrame::show::{{closure}}
             at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-24.0.0/src/dataframe.rs:682:12
  17: query_playground::main::{{closure}}
             at ./src/query_playground.rs:785:14
  18: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.25.0/src/runtime/park.rs:283:63
  19: tokio::runtime::coop::with_budget
             at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.25.0/src/runtime/coop.rs:102:5
  20: tokio::runtime::coop::budget
             at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.25.0/src/runtime/coop.rs:68:5
  21: tokio::runtime::park::CachedParkThread::block_on
             at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.25.0/src/runtime/park.rs:283:31
  22: tokio::runtime::context::BlockingRegionGuard::block_on
             at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.25.0/src/runtime/context.rs:315:13
  23: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.25.0/src/runtime/scheduler/multi_thread/mod.rs:66:9
  24: tokio::runtime::runtime::Runtime::block_on
             at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.25.0/src/runtime/runtime.rs:284:45
  25: query_playground::main
             at ./src/query_playground.rs:812:5
  26: core::ops::function::FnOnce::call_once
             at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

To Reproduce

Create a table with a large number of rows and use array_agg on one of its columns

Expected behavior

array_agg should not panic on large rows.

Additional context

datafusion = { version = "24.0.0" }
comphead commented 1 year ago

@dadepo are you trying to show the result? reg to the stacktrace it failed on console output , not on the computation itself

14: std::io::stdio::_print
             at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/std/src/io/stdio.rs:1074:5
  15: arrow_cast::pretty::print_batches
             at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-cast-38.0.0/src/pretty.rs:62:5
  16: datafusion::dataframe::DataFrame::show::{{closure}}
             at /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-24.0.0/src/dataframe.rs:682:12
alamb commented 1 year ago

This may be related to the fact that datafusion-cli is trying to write the entire result to an inmemory buffer 🤔

comphead commented 1 year ago

This may be related to the fact that datafusion-cli is trying to write the entire result to an inmemory buffer 🤔

That gives the idea to show only first N rows and make a param on that.

dadepo commented 1 year ago

@dadepo are you trying to show the result? reg to the stacktrace it failed on console output , not on the computation itself

@comphead I just checked again, and yeah it blows up with the df.show().await? call.

df.collect().await? on the other hand, does not give the error

thomas-k-cameron commented 1 year ago

Isn't it because comfy-table uses u16?

Your error says,

thread 'main' panicked at 'attempt to add with overflow', /Users/dadepo/.cargo/registry/src/github.com-1ecc6299db9ec823/comfy-table-6.1.4/src/utils/mod.rs:44:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/c18a5e8a5b1afb0d7a582fe9ebad4c1996c90da3/library/std/src/panicking.rs:575:5

and on the comfy-table, you can find https://github.com/Nukesor/comfy-table/blob/main/src/utils/mod.rs

    pub fn width(&self) -> u16 {
        self.content_width + self.padding.0 + self.padding.1
    }

I explored a way to fix this from arrow-datafusion. But relevant data is not exposed to the outside of the crate.

dadepo commented 1 year ago

Is having arrow-datafusion truncate the width, to a max value, that does not overflow an acceptable fix? Is that easily doable? If so, I'll say that is better than blowing up as it currently does.

thomas-k-cameron commented 1 year ago

I think the best way to do is to handle it on comfy-table's side. They are widely adopted and I'm pretty sure someone would end up with the same issue.

If we are going to do it on datafusion's side, few things concerns me.

  1. Performance: You don't really know how wide a column has to be until you go to the very end; which can turn out of be time-consuming.
  2. Comfy-table: Say we implement an algorithm to calculate the comfy-table's width, but you don't really know when they are going to change the implementation.

I'm not one of the maintainer so I appreciate advice.

alamb commented 1 year ago

I think this is likely something that needs to be fixed in arrow-rs or comfy-table.

I agree with the tradeoffs identified by @thomas-k-cameron but don't know enough about comfy-table and its maintenance status to know how easy it is to fix or how likely they would be to accept such a change.

Thus, my suggestion is:

  1. File a ticket in comfy table explaining the issue and see what, if any, response comes back
  2. File a ticket in arrow-rs about the panic on large width columns in display

Depending on the response you get from those two crates may help decide which approach to take.

alamb commented 1 year ago

That gives the idea to show only first N rows and make a param on that.

This is a neat idea @comphead -- what I recall psql doing is buffering the first 1000 rows or something and using that sample to determine column widths for the rest of the output, which is streamed.

I have always wanted to do something similar for datafusion-cli (and IOx) but I haven't gotten around to doing so

thomas-k-cameron commented 1 year ago

I created a ticket on comfy-table. They are pretty active so I should be able to hear from them.

thomas-k-cameron commented 1 year ago

This may be related to the fact that datafusion-cli is trying to write the entire result to an inmemory buffer 🤔

That gives the idea to show only first N rows and make a param on that.

This doesn't work because the overflow happens on calculating width. I think you can restrict the number letters you pass to each row though.

Nukesor commented 1 year ago

Hey :)

In comfy-table, I went with the approach to choose a u16 for table widths, since this covered all use-cases I could come up with. This allows the user to build tables that're 65535 characters wide.

If there's some content that's longer than 65535 chars, the table width is internally fixed to u16::MAX, but the formatting can look a bit wonky, since the actual content can be even longer than 65535 chars. However, I consider this a non-issue, since no user will ever have a screen wide enough to display that many chars and the output on your terminal will be botched anyway. In case a user wants proper formatting, there's the Dynamic content adjustment mode, which introduces line-breaks to fit the content to the current/specified terminal width.

I already tried to use the appropriate saturating_[sub|add] functions everywhere, but it seems I overlooked a few places. A PR https://github.com/Nukesor/comfy-table/pull/114 has been opened, but it would be awesome if somebdoy could test this or even write a regression test case to trigger the panic on the current code-base.

Nukesor commented 1 year ago

New release with a bunch of fixes has been published :)

https://github.com/Nukesor/comfy-table/releases/tag/v7.0.1 https://crates.io/crates/comfy-table/7.0.1