apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0

Render tables using HTML in notebooks. #713

Open timsaucer opened 1 month ago

timsaucer commented 1 month ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Many users, especially those trying DataFusion for the first time, work in notebooks such as Jupyter or Databricks. It would be a nice feature to have dataframes in these notebooks rendered as HTML tables, as some other dataframe libraries do.

Describe the solution you'd like

To do this, we need to implement _repr_html_ on the PyDataFrame object; this is the method that notebook front ends such as Jupyter call to obtain an HTML representation of an object. It can operate in the same manner as show() and limit the output to a few rows. Additional enhancements could include config parameters that control how much data is shown.

Describe alternatives you've considered

The alternative is to continue using show() to inspect the data. Users can also convert the dataframe to pandas and then use its rendering capability.

Additional context

Here is a minimal demonstrable version we could start with in PyDataFrame:

    fn _repr_html_(&self, py: Python) -> PyResult<String> {
        let mut html_str = "<table border='1'>\n".to_string();

        let df = self.df.as_ref().clone().limit(0, Some(10))?;
        let batches = wait_for_future(py, df.collect())?;

        if batches.is_empty() {
            html_str.push_str("</table>\n");
            return Ok(html_str);
        }

        let schema = batches[0].schema();

        let mut header = Vec::new();
        for field in schema.fields() {
            header.push(format!("<th>{}</th>", field.name()));
        }
        let header_str = header.join("");
        html_str.push_str(&format!("<tr>{}</tr>\n", header_str));

        for batch in batches {
            let formatters = batch
                .columns()
                .iter()
                .map(|c| ArrayFormatter::try_new(c.as_ref(), &FormatOptions::default()))
                .map(|c| c.map_err(|e| PyValueError::new_err(format!("Error: {}", e))))
                .collect::<Result<Vec<_>, _>>()?;

            for row in 0..batch.num_rows() {
                let mut cells = Vec::new();
                for formatter in &formatters {
                    cells.push(format!("<td>{}</td>", formatter.value(row)));
                }
                let row_str = cells.join("");
                html_str.push_str(&format!("<tr>{}</tr>\n", row_str));
            }
        }

        html_str.push_str("</table>\n");

        Ok(html_str)
    }

This produces the following example: [Screenshot 2024-05-22 at 3:02:07 PM]
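
One gap in the minimal version above is that column names and cell values are written into the markup unescaped, so a value containing < or & could break the table. Below is a rough sketch of escaping combined with a configurable row limit, using only the standard library. The names escape_html, HtmlReprConfig, and render_table are hypothetical and chosen for illustration; they are not part of DataFusion or datafusion-python, and plain strings stand in for Arrow record batches here.

```rust
/// Hypothetical display settings; not an existing DataFusion config.
pub struct HtmlReprConfig {
    /// Maximum number of data rows to render.
    pub max_rows: usize,
}

/// Escape the characters that are significant in HTML text and attributes.
pub fn escape_html(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"),
            _ => out.push(ch),
        }
    }
    out
}

/// Render a header and pre-formatted rows as an HTML table,
/// emitting at most `config.max_rows` data rows.
pub fn render_table(header: &[&str], rows: &[Vec<String>], config: &HtmlReprConfig) -> String {
    let mut html = String::from("<table border='1'>\n");

    // Header row: one <th> per column name, escaped.
    let head: String = header
        .iter()
        .map(|name| format!("<th>{}</th>", escape_html(name)))
        .collect();
    html.push_str(&format!("<tr>{}</tr>\n", head));

    // Data rows: stop after the configured limit.
    for row in rows.iter().take(config.max_rows) {
        let cells: String = row
            .iter()
            .map(|value| format!("<td>{}</td>", escape_html(value)))
            .collect();
        html.push_str(&format!("<tr>{}</tr>\n", cells));
    }

    html.push_str("</table>\n");
    html
}
```

In the real method, the escaping step would presumably wrap field.name() and formatter.value(row).to_string(), and max_rows would replace the hard-coded 10 passed to limit().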

timsaucer commented 1 month ago

The example above is a very simple approach, but I think it could add some immediate value. Even better would be to do something like pandas, which has a Styler class that allows for nuanced and expressive displays.

https://pandas.pydata.org/docs/user_guide/style.html

https://github.com/pandas-dev/pandas/blob/main/pandas/io/formats/style.py

I don't think we necessarily need to support all of the output formats they do, but it would be nice at least to give users some formatting ability on their tables. These are some of the features I think we need to gain wider adoption.
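
To make the styler idea a little more concrete, here is a rough sketch of what a small builder-style type could look like. TableStyler and its methods are purely hypothetical; they mirror neither the pandas Styler API nor anything in DataFusion. The sketch assumes cell values and column names are already HTML-escaped, and it supports only a caption and per-column inline CSS.

```rust
use std::collections::HashMap;

/// Hypothetical styler, loosely inspired by the idea behind pandas' Styler.
/// Not an existing DataFusion or pandas API.
pub struct TableStyler {
    caption: Option<String>,
    /// Inline CSS to apply to every cell of a column, keyed by column index.
    css_by_column: HashMap<usize, String>,
}

impl TableStyler {
    pub fn new() -> Self {
        TableStyler {
            caption: None,
            css_by_column: HashMap::new(),
        }
    }

    /// Set a table caption.
    pub fn caption(mut self, text: &str) -> Self {
        self.caption = Some(text.to_string());
        self
    }

    /// Attach inline CSS to every cell in the given column.
    pub fn set_column_css(mut self, column: usize, css: &str) -> Self {
        self.css_by_column.insert(column, css.to_string());
        self
    }

    /// Render to HTML. Inputs are assumed to be HTML-escaped already.
    pub fn to_html(&self, header: &[&str], rows: &[Vec<String>]) -> String {
        let mut html = String::from("<table>\n");
        if let Some(caption) = &self.caption {
            html.push_str(&format!("<caption>{}</caption>\n", caption));
        }

        let head: String = header
            .iter()
            .map(|name| format!("<th>{}</th>", name))
            .collect();
        html.push_str(&format!("<tr>{}</tr>\n", head));

        for row in rows {
            let mut cells = String::new();
            for (index, value) in row.iter().enumerate() {
                // Styled columns get a style attribute; others stay plain.
                match self.css_by_column.get(&index) {
                    Some(css) => {
                        cells.push_str(&format!("<td style=\"{}\">{}</td>", css, value))
                    }
                    None => cells.push_str(&format!("<td>{}</td>", value)),
                }
            }
            html.push_str(&format!("<tr>{}</tr>\n", cells));
        }

        html.push_str("</table>\n");
        html
    }
}
```

Keeping the styling state separate from the rendering step is what would make other output formats (LaTeX, for example) possible later: each format would be another method reading the same settings.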

timsaucer commented 1 month ago

A follow-on question: if we were to build a styler to output things like HTML (or LaTeX, etc.), does it make sense to do so in the datafusion-python repo, or to push it up into the datafusion repo?

Michael-J-Ward commented 1 month ago

Rounding out options.

I recently came across great-tables, a Python library dedicated to creating nicely formatted HTML tables. It currently works with polars and pandas, so any datafusion user today could call df.to_polars() or df.to_pandas() and then use it.

Of course, the conversion feels clunky, so if we went this route, we could explore adding support for datafusion tables upstream.

Again, just rounding out options. I don't have any strong thoughts on this feature request.