GoogleCloudPlatform / DataflowTemplates

Cloud Dataflow Google-provided templates for solving in-Cloud data tasks
https://cloud.google.com/dataflow/docs/guides/templates/provided-templates
Apache License 2.0
1.16k stars 976 forks source link

[Bug]: The Foreign Keys are getting broken down to their component pieces for MongoDB to BigQuery #911

Open JoshAubrey opened 1 year ago

JoshAubrey commented 1 year ago

Related Template(s)

MongoDB to BigQuery

What happened?

The Reference Ids (aka Foreign Keys), which are the same format as the primary MongoId / ObjectId / _id, are getting broken down into their individual component pieces.

Input accountId: 63e131fb63ccf7e561f96920

Output to source_data in BigQuery: "accountId":{"timestamp":1675702779,"counter":16345376,"randomValue1":6540535,"randomValue2":-6815},"

I figured out a temporary workaround with a GoogleSQL javascript UDF, but handling this in the template library will be much better. Thank you!

Beam Version

Newer than 2.46.0

Relevant log output

No response

bvolpato commented 1 year ago

@theshanbhag Can you please take a look / triage this?

JoshAubrey commented 1 year ago

Here's an example of what I have been doing and how I worked around it for the moment if this helps:

  1. Reusable function for rebuilding the mongoIds
  2. De-duplicating the data in the table loaded from mongoDB
  3. Transforming the raw JSON data into a table for use in Google / Looker Data Studio using 1. above.

# rebuild MongoIds from their component pieces
CREATE TEMP FUNCTION
  convertId(input JSON)
  RETURNS STRING
  LANGUAGE js AS r"""

function getId(mongoId) {
  if (mongoId === null) {return null}

    const result = 
        DecimalHexTwosComplement(mongoId?.timestamp) +
        DecimalHexTwosComplement(mongoId?.randomValue1) +
        DecimalHexTwosComplement(mongoId?.randomValue2) +
        DecimalHexTwosComplement(mongoId?.counter);

    return result;
}

function DecimalHexTwosComplement(decimal) {

  if (decimal >= 0) {
    const hexadecimal = decimal.toString(16);

    return hexadecimal;
  } else {
    const hexadecimal = Math.abs(decimal).toString(16);

    let output = '';
    for (i = 0; i < hexadecimal.length; i++) {
      output += (0x0F - parseInt(hexadecimal[i], 16)).toString(16);
    }

    output = (0x01 + parseInt(output, 16)).toString(16);
    return output;
  }
}

return getId(input)

""";

  # Transactions de-deuplicate data
CREATE OR REPLACE TABLE
  `example-data.mongodb.transactions` AS (
  SELECT
    * EXCEPT(row_num)
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC ) row_num
    FROM
      `example-data.mongodb.transactions`) t
  WHERE
    row_num=1 );

  # Transactions Parse source_data and create transformed for use in data studio
CREATE OR REPLACE TABLE
  `example-data.mongodb.transactions-transformed` AS(
  SELECT
    timestamp,
    id AS id,
    SAFE_CAST(JSON_EXTRACT(source_data, "$.amount") AS DECIMAL) AS amount,
    SAFE_CAST(JSON_EXTRACT(source_data, "$.fee") AS DECIMAL) AS fee,
    SAFE_CAST(JSON_EXTRACT(source_data, "$.status") AS INT) AS status,
    SAFE_CAST(JSON_EXTRACT(source_data, "$.type") AS INT) AS type,
    SAFE_CAST(JSON_EXTRACT(source_data, "$.data.paymentMethod.type") AS INT) AS paymentType,
    convertId(PARSE_JSON(JSON_QUERY(source_data, "$.fid"))) AS accountId,
  FROM
    `example-data.mongodb.transactions`);
github-actions[bot] commented 6 months ago

This issue has been marked as stale due to 180 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the issue at any time. Thank you for your contributions.

JoshAubrey commented 6 months ago

Bump

Marcello09 commented 5 months ago

Although it is not a definitive solution, I've created a UDF that recursively transforms mongo objectid into strings. Here the code for my UDF. Hope it helps...

function parseOid(obj) {
    if (obj instanceof Array) {
        for(var i = 0; i < obj.length; i++) {
            obj[i] = parseOid(obj[i]);
        }
    } else {
        for(var prop in obj) {
           if (prop == "$oid") {
                obj = obj.$oid
                return obj
           }
           if (obj[prop] instanceof Object || obj[prop] instanceof Array) {
                obj[prop] = parseOid(obj[prop])
           }
        }
    }
    return obj
}

function process(inJson) {
    var obj = JSON.parse(inJson)
    obj = parseOid(obj)
    return JSON.stringify(obj)
}