cube-js / cube

📊 Cube — The Semantic Layer for Building Data Applications
https://cube.dev
Other
17.71k stars 1.75k forks source link

BigQuery driver seems does not cast all numerical values equally to strings #1835

Open philippefutureboy opened 3 years ago

philippefutureboy commented 3 years ago

EDIT: I realize that I am two minor versions behind, so I will update to 0.25 to check if the issue is solved EDIT 2: I confirm the bug is still present on version 0.25.21.

Describe the bug

Six days ago, we upgraded our version of the @cubejs-backend/bigquery-driver from "0.23.8" to "0.23.14" and yesterday we stumbled on a type cast error for the results of the query which caused our following code to fail due to a type assumption.

Following a discussion on slack with @paveltiunov, turns out that Cube is transitioning towards returning all types as string for simplicity and for supporting large numbers.

The problem here is twofolds:

  1. This is a fairly major change which can introduce breakage in production systems, and thus should at least be a minor patch, and I'd even go as far as say as part of a major patch. If the team chooses to use a different approach to versioning than the standard semver, it would be important to express it somewhere that cannot be missed.
  2. The feature does not seem to be working reliably; two sum typed measures resulted in different types - one a string and one a float.

To Reproduce Steps to reproduce the behavior:

Given a BigQuery view with two sum typed measures:

SELECT 
  `schedule_track`.clinic_key `schedule_track__clinic_key`, 
  CAST(`calendar`.date_key AS TIMESTAMP) `calendar__date_key`, 
  sum(
    `schedule_track_facts`.amount_cad
  ) `schedule_track_facts__charges_amount`, 
  sum(
    `schedule_track_objective_calendar`.charge_per_open_time_slot_hour_objective * `schedule_track_objective_calendar`._open_time_slot_hours_value
  ) `schedule_track_objective_calendar__charges_amount_objective` 
FROM ...

the result returned is like the following:

{
      "schedule_track.clinic_key": "cli91dcb088415e4d2d8bd058655d2",
      "calendar.date_key": "2021-01-06T00:00:00.000",
      "schedule_track_facts.charges_amount": "8796.5",
      "schedule_track_objective_calendar.charges_amount_objective": 8534.475
},

where "schedule_track_facts.charges_amount" should be a float or "schedule_track_objective_calendar.charges_amount_objective" a string, but not both different types.

Schema:

cube(`schedule_track_facts`, {
  sql: `
    SELECT
        schedule_track_calendar_view.clinic_key,
        schedule_track_calendar_view.date_key,
        schedule_track_calendar_view.schedule_track_key,
        schedule_track_facts_keyview.* EXCEPT(clinic_key, date_key, schedule_track_key)
    FROM
        \`${project}\`.${dataset_key}.schedule_track_calendar_view
        LEFT JOIN
        \`${project}\`.${dataset_key}.schedule_track_facts_keyview
            ON schedule_track_calendar_view.clinic_key = schedule_track_facts_keyview.clinic_key
            AND schedule_track_calendar_view.date_key = schedule_track_facts_keyview.date_key
            AND schedule_track_calendar_view.schedule_track_key = schedule_track_facts_keyview.schedule_track_key
  `,
  // ...
  measures: {
    // ...
    charges_amount: {
      sql: `${CUBE}.amount_cad`,
      type: `sum`,
    },
  },
})

Expected behavior

All FLOAT64, INT64, NUMERIC types should be returned as the same type in the resulting records, be it string or number. It is very possible that the problem is that amount_cad is NUMERIC NULLABLE whereas

`schedule_track_objective_calendar`.charge_per_open_time_slot_hour_objective * `schedule_track_objective_calendar`._open_time_slot_hours_value

may be implicitly returned as FLOAT64 by the BQ engine due to _open_time_slot_hours_value being implicitly FLOAT NULLABLE.

Learning on our side

Moving forward we will cast explicitly all of our types coming out of Cube.js (or any other service) to the expected type so that we can handle them properly in pandas, and catch unexpected type conversions earlier in the code.

philippefutureboy commented 3 years ago

Given that you are planning to move towards standardization of your data exports as Map<string, string>, could you also consider adding the complete type information for the returned data? Right now in the annotation.measures[key].type and annotation.dimensions[key].type you seem to have string, number and time, which leave some precious information out if I want to dynamically cast that data. Could you add a property like native_type which would include the type of the field from the driver itself? Then I could parse a float differently than an int, than a bool, etc.

hassankhan commented 3 years ago

Hi @philippefutureboy :wave:

I've created a "parent" issue to track this and other similar issues. Thank you for filing the issue 🙌