deephaven / deephaven-core

Deephaven Community Core

`import_table` from Python client (and likely other clients) should widen acceptable input type #5484

Open alexpeters1208 opened 4 months ago

alexpeters1208 commented 4 months ago

Currently, the Python client can import data that was read into a pandas DataFrame. However, Session.import_table() only accepts pyarrow tables, so the following workaround is required:

    from pandas import read_csv
    import pyarrow
    crypto = client_session.import_table(pyarrow.Table.from_pandas(read_csv(
        "https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/FakeCryptoTrades_20230209.csv"
    )))

It would be nice to be able to do this directly:

    from pandas import read_csv
    crypto = client_session.import_table(read_csv(
        "https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/FakeCryptoTrades_20230209.csv"
    ))

The R client accomplishes this by checking the class of the input object, converting it to an arrow representation if possible, and ultimately calling a single method that converts an arrow RecordBatchReader to a Deephaven table:

    import_table = function(table_object) {
      table_object_class <- class(table_object)
      if (table_object_class[[1]] == "data.frame") {
        return(TableHandle$new(private$df_to_dh_table(table_object)))
      } else if (table_object_class[[1]] == "tbl_df") {
        return(TableHandle$new(private$tibble_to_dh_table(table_object)))
      } else if (table_object_class[[1]] == "RecordBatchReader") {
        return(TableHandle$new(private$rbr_to_dh_table(table_object)))
      } else if ((length(table_object_class) == 4 &&
        table_object_class[[1]] == "Table" &&
        table_object_class[[3]] == "ArrowObject")) {
        return(TableHandle$new(private$arrow_to_dh_table(table_object)))
      } else {
        stop(paste0("'table_object' must be a single data frame, tibble, arrow table, or record batch reader. Got an object of class ", table_object_class[[1]], "."))
      }
    },

And here are the methods that are being called:

    rbr_to_dh_table = function(rbr) {
      ptr <- self$.internal_rcpp_object$new_arrow_array_stream_ptr()
      rbr$export_to_c(ptr)
      return(self$.internal_rcpp_object$new_table_from_arrow_array_stream_ptr(ptr))
    },
    arrow_to_dh_table = function(arrow_tbl) {
      rbr <- as_record_batch_reader(arrow_tbl)
      return(private$rbr_to_dh_table(rbr))
    },
    tibble_to_dh_table = function(tibbl) {
      arrow_tbl <- arrow_table(tibbl)
      return(private$arrow_to_dh_table(arrow_tbl))
    },
    df_to_dh_table = function(data_frame) {
      arrow_tbl <- arrow_table(data_frame)
      return(private$arrow_to_dh_table(arrow_tbl))
    }

Different clients may have different idiomatic solutions for this. The R OOP framework that was used does not support method overloading, so this dispatch-on-class workaround was the best option available there.
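In Python, one idiomatic equivalent of the R class-dispatch chain above would be `functools.singledispatch`, which gives overloading-style behavior per input type. This is only a dependency-free sketch: `ArrowTable` and `DataFrame` are hypothetical stand-ins for `pyarrow.Table` and `pandas.DataFrame`, and the return strings stand in for the real arrow import path.

```python
from functools import singledispatch

# Hypothetical stand-ins so the sketch runs without pyarrow/pandas installed;
# the real client would register the actual pyarrow.Table / pandas.DataFrame
# classes instead.
class ArrowTable: ...
class DataFrame: ...

@singledispatch
def import_table(table_object):
    # Fallback for unregistered types, analogous to the R stop(...) branch.
    raise TypeError(f"unsupported input type: {type(table_object).__name__}")

@import_table.register
def _(table_object: ArrowTable):
    # Existing behavior: hand the arrow table to the current import path.
    return "imported arrow table"

@import_table.register
def _(table_object: DataFrame):
    # Widened behavior: convert the DataFrame to arrow first (in the real
    # client, pyarrow.Table.from_pandas), then re-dispatch on the result.
    return import_table(ArrowTable())
```

One advantage over a single isinstance chain is that new input types (e.g. a RecordBatchReader) can be supported by registering another handler, without touching the existing ones.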

rcaudy commented 4 months ago

I kind of prefer this the way it is. The client really only works with arrow. If we're going to add a pandas dependency for the client, it needs to be optional.

alexpeters1208 commented 4 months ago

> I kind of prefer this the way it is. The client really only works with arrow. If we're going to add a pandas dependency for the client, it needs to be optional.

It may be possible to check for a pandas-typed input without importing or depending on pandas. After that, only pyarrow is needed to make the conversion.
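One way to do that check, sketched below under the assumption that `sys.modules` inspection is acceptable (this is not the actual client code, and the helper name is made up): if pandas has never been imported in the process, no object can be a DataFrame, so pandas remains a purely optional dependency.

```python
import sys

def is_pandas_dataframe(obj):
    """Return True if obj is a pandas DataFrame, without importing pandas.

    Looks pandas up in sys.modules instead of importing it: if pandas was
    never imported, the check short-circuits to False, so pandas stays an
    optional dependency of the client.
    """
    pandas = sys.modules.get("pandas")
    return pandas is not None and isinstance(obj, pandas.DataFrame)
```

A widened `import_table` could call this first and, only on a match, convert via `pyarrow.Table.from_pandas` before continuing down the existing arrow path.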