chmp / serde_arrow

Convert sequences of Rust objects to Arrow tables
MIT License

ARROW-8817 #183

Closed jmelo11 closed 2 months ago

jmelo11 commented 3 months ago

Hi guys,

I'm trying to use this lib but I'm getting the following error:

not implemented: See ARROW-8817.

From what I've seen, it has something to do with union types (Option). What can be done in these cases?

For reference, the struct that I'm trying to serialize is the following (most of the fields are simple enums, except for Date, which is a wrapper around chrono with a proper de/serializer):

#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct LoanDepoOutput {
    pub mis_id: String,
    pub reference_date: Date,
    pub loandepo_configuration_id: usize,
    pub notional: f64,
    pub issue_date: Option<Date>,
    pub start_date: Date,
    pub end_date: Date,
    pub credit_status: String,
    pub structure: Structure,
    pub side: Side,
    pub account_type: AccountType,
    pub segment: String,
    pub area: String,
    pub product_family: ProductFamily,
    pub payment_frequency: Frequency,

    pub first_ftp_rate: f64,
    pub first_client_rate: f64,
    pub second_ftp_rate: Option<f64>,
    pub second_client_rate: Option<f64>,

    // pre-calculated fields
    pub notional_local_ccy: Option<f64>,
    pub outstanding: Option<f64>,
    pub outstanding_local_ccy: Option<f64>,
    pub avg_outstanding: Option<f64>,
    pub avg_outstanding_local_ccy: Option<f64>,
    pub avg_readjustment: Option<f64>,
    pub avg_interest: Option<f64>,
    pub avg_interest_local_ccy: Option<f64>,
    pub ftp_interest: Option<f64>,
    pub ftp_interest_local_ccy: Option<f64>,
    pub earned_interest: Option<f64>,
    pub earned_interest_local_ccy: Option<f64>,
    pub margin: Option<f64>,
    pub margin_local_ccy: Option<f64>,

    pub rate_type: RateType,
    pub first_rate_frequency: Frequency,
    pub first_rate_day_counter: DayCounter,
    pub first_rate_compounding: Compounding,

    pub months_to_first_coupon: Option<i16>,
    pub second_rate_frequency: Option<Frequency>,
    pub second_rate_day_counter: Option<DayCounter>,
    pub second_rate_compounding: Option<Compounding>,

    pub currency: Currency,
    pub discount_curve_id: usize,
    pub forecast_curve_id: Option<usize>,

    pub cashflows: String,
    pub evaluation_mode: Option<EvaluationMode>,
    pub rate_change_date: Option<Date>,
    pub cashflows_source: String,
}

Regards,

chmp commented 3 months ago

Hi @jmelo11,

the error code ARROW-8817 does not originate from serde_arrow, but seems to be related to missing Union support in the parquet implementation of arrow (see here).

The Option fields are handled directly and should not be the cause of the trouble. My guess is that you have other enums in your types, e.g., EvaluationMode. The best course of action depends on the details of these enums. Could you share the other types as well, at least the enums?

In particular, the question would be whether the enums carry additional data or only have unit variants. I.e., enum E { A(..), B { ..} } vs enum E { A, B }.

jmelo11 commented 3 months ago

Thanks for the answer @chmp. Most of the enums contain only keys but some have values:

/// # EvaluationMode
/// Contains the information on how the evaluation should be executed.
#[derive(Debug, Serialize, Deserialize, Clone, Copy)]
pub enum EvaluationMode {
    FTPRate,
    ClientRate,
}

/// # Frequency
/// Enum representing a financial frequency.
#[derive(Serialize, Deserialize, Debug, PartialEq, Eq, Clone, Copy, PartialOrd, Ord, Hash)]
pub enum Frequency {
    NoFrequency = -1,
    Once = 0,
    Annual = 1,
    Semiannual = 2,
    EveryFourthMonth = 3,
    Quarterly = 4,
    Bimonthly = 6,
    Monthly = 12,
    EveryFourthWeek = 13,
    Biweekly = 26,
    Weekly = 52,
    Daily = 365,
    OtherFrequency = 999,
}

On the other hand, Date is just a wrapper, as I said before:

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct Date {
    base_date: NaiveDate,
}

chmp commented 2 months ago

Both enums should encode into arrays without data, e.g., into Union arrays with each child array having NULL type. Most likely this is not the encoding that you would like to have. #44 discussed a related feature. My guess would be that having the enum value encoded as a dictionary would be preferable.

E.g., currently the values [EvaluationMode::FTPRate, EvaluationMode::ClientRate, EvaluationMode::FTPRate] are represented as these arrays:

The encoding discussed in #44 would result in the equivalent of

So far, I did not start working on #44, though. If this would be relevant to you as well, I could prioritize this issue.
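
As a rough illustration of the dictionary idea mentioned above: the exact layout proposed in #44 is not shown in this thread, so the following is only a sketch under the assumption that each enum value is stored as an integer key pointing into a list of distinct variant names. `DictionaryColumn` and `encode_dictionary` are hypothetical names used for illustration and are not part of `serde_arrow`.

```rust
/// Dictionary-style column: `keys[i]` is an index into `values`.
/// (Hypothetical type for illustration, not a serde_arrow or arrow type.)
#[derive(Debug, PartialEq)]
struct DictionaryColumn {
    keys: Vec<u32>,
    values: Vec<String>,
}

/// Encode a sequence of variant names so that each distinct name is stored once.
fn encode_dictionary(names: &[&str]) -> DictionaryColumn {
    let mut values: Vec<String> = Vec::new();
    let mut keys: Vec<u32> = Vec::with_capacity(names.len());

    for name in names {
        // Reuse the key of an already seen name, otherwise append a new entry.
        let key = match values.iter().position(|v| v.as_str() == *name) {
            Some(idx) => idx,
            None => {
                values.push((*name).to_string());
                values.len() - 1
            }
        };
        keys.push(key as u32);
    }

    DictionaryColumn { keys, values }
}
```

Calling `encode_dictionary(&["FTPRate", "ClientRate", "FTPRate"])` yields the keys `[0, 1, 0]` and the values `["FTPRate", "ClientRate"]`, i.e., a single column that formats such as parquet can typically store, in contrast to a union.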

Motivation for the current encoding

The reason for this complex encoding is to allow enums similar to `enum E { A(i32), B(u8) }`. The values `[E::A(13), E::B(21), E::A(42)]` would be represented as:

- type: `[0, 1, 0]`
- offsets: `[0, 0, 1]`
- values A: `[13, 42]`
- values B: `[21]`

Only by encoding the enum values separately can `serde_arrow` support enums with data.
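
The type / offsets / values layout described above can be reproduced with a small sketch in plain Rust. This is not how `serde_arrow` is implemented internally; `DenseUnion` and `encode_dense_union` are hypothetical names, and the field types (`i8` type ids, `i32` offsets) are an assumption modeled on Arrow's dense union layout.

```rust
/// The example enum with data from the comment above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum E {
    A(i32),
    B(u8),
}

/// Buffers of a dense-union-style encoding: one type id and one offset per
/// element, plus one child array per variant.
/// (Hypothetical type for illustration only.)
#[derive(Debug, Default, PartialEq)]
struct DenseUnion {
    type_ids: Vec<i8>,
    offsets: Vec<i32>,
    values_a: Vec<i32>,
    values_b: Vec<u8>,
}

fn encode_dense_union(items: &[E]) -> DenseUnion {
    let mut out = DenseUnion::default();
    for item in items {
        match *item {
            E::A(v) => {
                // The type id selects the child array, the offset is the
                // position of this value inside that child array.
                out.type_ids.push(0);
                out.offsets.push(out.values_a.len() as i32);
                out.values_a.push(v);
            }
            E::B(v) => {
                out.type_ids.push(1);
                out.offsets.push(out.values_b.len() as i32);
                out.values_b.push(v);
            }
        }
    }
    out
}
```

For `[E::A(13), E::B(21), E::A(42)]` this produces the type ids `[0, 1, 0]`, the offsets `[0, 0, 1]`, and the child arrays `[13, 42]` and `[21]`. For enums without data the child arrays carry no information, which is why a string or dictionary encoding is the simpler choice there.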

chmp commented 2 months ago

Started work in #185

chmp commented 2 months ago

I implemented the option to serialize enums without data as strings. To use this option, you can either:

Example:

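// Import paths are assumed here and may differ between serde_arrow / arrow versions.
use arrow::datatypes::FieldRef;
use serde::{Deserialize, Serialize};
use serde_arrow::schema::{SchemaLike, TracingOptions};
use serde_arrow::utils::Item;

// An enum with unit variants only, i.e., without data
#[derive(Serialize, Deserialize)]
enum U {
    A,
    B,
    C,
}

let items = [Item(U::A), Item(U::B), Item(U::C), Item(U::A)];

// Trace the schema with enums without data encoded as strings
let tracing_options = TracingOptions::default().enums_without_data_as_strings(true);
let fields = Vec::<FieldRef>::from_type::<Item<U>>(tracing_options)?;
let batch = serde_arrow::to_record_batch(&fields, &items)?;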

Would this work for you?

chmp commented 2 months ago

Merged #185