firebase / extensions

Source code for official Firebase extensions
https://firebase.google.com/products/extensions
Apache License 2.0

Firestore BigQuery Export - Document properties exported to individual fields #113

Closed madmacc closed 4 years ago

madmacc commented 4 years ago

[REQUIRED] Step 2: Extension name

firestore-bigquery-export

What feature would you like to see?

Currently all document properties are exported to a single string field/column called "data" as a JSON string in the members_raw_changelog or members_raw_latest tables. I am assuming this is by design and not just a bug in my case. This data is not particularly usable; I would need to schedule a job to extract and manipulate it into another table.

If I "Transfer" data from a Firestore export/backup file, it is imported with each document property in its own field/column and is usable right away. This is what I was expecting from the extension.

So my request is to have each imported document's properties in their own fields/columns so they can be queried with SQL.

How would you use it?

Well, at the moment the extension data is unusable. Say I want to use some data in a data visualization tool: this is not possible without a lot of manipulation. With the requested change this would work out of the box.

I have attached an image with an example of the data and another image with the way it should look. The 2nd image is from data transferred from a command line export.

I am actually hoping this is a bug and it will be fixed shortly!

(Attached screenshots: BigQuery-Issue, BigQuery-Issue2)

JonesM87 commented 4 years ago

I would guess the data has to be in this format to get it into BigQuery? But I can't confirm. There is a way to get a more usable output, though: using the JSON functions in BigQuery you can extract the data into a more digestible format for your BI tool, e.g.:

SELECT JSON_EXTRACT_SCALAR(data, '$.yourfirebasedocumentfield') yourfirebasedocumentfield FROM `yourfirebaseproject.datasetId.yourfirebasecollection_raw_changelog`;

https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions

jmwski commented 4 years ago

We are working on documenting a solution that transforms Firestore schemas into SQL queries like the ones posted by @JonesM87. These queries will leverage BigQuery's JSON parsing functions to generate richly typed views of your raw collection data in BigQuery.

The source for this script lives in this repository, but you can also install it with npm (or npx): https://www.npmjs.com/package/@firebaseextensions/fs-bq-schema-views.

As I mentioned earlier, we're still working on documentation for this, but you can try it out with something like:

npx @firebaseextensions/fs-bq-schema-views \
  --non-interactive \
  --project <your-gcp-project> \
  --dataset <bigquery-dataset-with-raw-changelog> \
  --table-name-prefix <view-prefix> \
  --schema-files <path-to-firestore-schema>

Here, <path-to-firestore-schema> would be a path to a file on your local filesystem containing a Firestore schema with the fields you'd like to be present in the resulting view. A contrived example of such a schema is:

{
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "favorite_numbers",
      "type": "array"
    },
    {
      "name": "last_login",
      "type": "timestamp"
    },
    {
      "name": "last_location",
      "type": "geopoint"
    },
    {
      "fields": [
        {
          "name": "name",
          "type": "string"
        }
      ],
      "name": "friends",
      "type": "map"
    }
  ]
}

Each name should correspond to some data that appears in your Firestore document, and each type is a valid Firestore type. After running the schema-views script, you'll see a view in the same dataset called <table-name-prefix>_<schema-file-name>. This view has a column for each named field you specified in your schema, and the contents of each column are the typed values for those fields as they appear in your mirrored Firestore documents. One important caveat: if you specify a field that doesn't appear in a document, the resulting cell in the view will be NULL.

In any case, once you have this view, you can query it with SQL. We are still working on documenting this tool comprehensively, but feel free to try it out!
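Conceptually, the generated view applies one JSON extraction per named schema field, with missing fields becoming NULL. A minimal JavaScript sketch of that behaviour (the field names and the sample row are made up for illustration; this is not the script's actual implementation):

```javascript
// Sketch of the per-field extraction a schema view performs.
// "name" and "last_login" are illustrative field names only.
const schemaFields = ["name", "last_login"];

// A mirrored row as it appears in the raw changelog: all document
// properties serialized into the single "data" column as a JSON string.
const row = { document_name: "users/alice", data: '{"name": "Alice"}' };

// Build one column per schema field; fields absent from the document
// become null, matching the NULL-cell caveat described above.
function applySchema(row, fields) {
  const doc = JSON.parse(row.data);
  const out = { document_name: row.document_name };
  for (const f of fields) {
    out[f] = doc[f] !== undefined ? doc[f] : null;
  }
  return out;
}

console.log(applySchema(row, schemaFields));
// → { document_name: 'users/alice', name: 'Alice', last_login: null }
```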

madmacc commented 4 years ago

Thanks @JonesM87 and thanks @IanWyszynski for the very detailed response. I will give it a try with the suggested methods.

madmacc commented 4 years ago

@IanWyszynski I am having the following error when trying to run the script: Access Denied: Project myprojectname: User does not have bigquery.jobs.create permission in project myprojectname

I am running the script as follows: npx @firebaseextensions/fs-bq-schema-views --non-interactive --project "myprojectname" --dataset "datasetname" --table-name-prefix "members" --schema-files "bigquery-schema-members.json"

I have tried adding the BigQuery Admin and BigQuery Job User roles to the following accounts: owner account, BigQuery Connection Service Agent, App Engine default service account.

Apologies if it is obvious as I am relatively new to this.

rphlmr commented 4 years ago

@IanWyszynski Thank you for this script!

It works like a charm :)

@madmacc To deploy it, I installed the gcloud SDK on my Mac and I used my admin account.

$ gcloud config set account my@email.com  
$ gcloud auth application-default login 
madmacc commented 4 years ago

@rphlmr Thanks for that. I tried:

$ gcloud config set account my@email.com  
$ gcloud auth application-default login

Warning here: The environment variable [GOOGLE_APPLICATION_CREDENTIALS] is set to: [R:\myproject\service-account.json] Credentials will still be generated to the default location: [C:\Users\username\AppData\Roaming\gcloud\application_default_credentials.json] To use these credentials, unset this environment variable before running your application.

$ set GOOGLE_APPLICATION_CREDENTIALS="C:\Users\username\AppData\Roaming\gcloud\application_default_credentials.json" 
$ gcloud auth list

Here it returns my admin/owner account email.

I still get the same issue. I am pretty sure I was using my admin/owner account previously just the credentials json file was in a different location.

rphlmr commented 4 years ago

@madmacc Hmm, you are on Windows, so maybe there is different behaviour. When you set a new env variable, you have to quit your console and open a new one (to reload the environment).

In this case, the warning says to unset the env var if you want to use your account.

I would do:

set GOOGLE_APPLICATION_CREDENTIALS=

Then close the terminal and reopen it.

Verify that the env var is cleared with:

echo %GOOGLE_APPLICATION_CREDENTIALS%

(On Windows, when the variable is unset this prints the literal text %GOOGLE_APPLICATION_CREDENTIALS%.)

Then retry:

$ gcloud config set account my@email.com  
$ gcloud auth application-default login

When you do that, you aren't using the service account JSON. A browser should open and ask for a Google sign-in.

I'm not 100% sure of that :s

madmacc commented 4 years ago

Appreciate your help @rphlmr. I have it working now. I used the following to reset it, and then I ran the script interactively this time:

export GOOGLE_APPLICATION_CREDENTIALS=

The only thing is that it has quotes in the string fields; otherwise it is all looking good. (Attached screenshot: bigquery-view)

rphlmr commented 4 years ago

@madmacc Yes, quotes are the expected behaviour (I think). It's due to the SQL query that builds this view.

I changed the query on my project: select your view's Details tab, then edit the query.

Replace "JSON_EXTRACT" with "JSON_EXTRACT_SCALAR".

NB: I did this replacement only on columns that are not typed by BigQuery.

Example:

SELECT
....
`my_project.all_transactions_export.firestoreTimestamp`(JSON_EXTRACT(data, '$._createdAt')) AS _createdAt,
 JSON_EXTRACT_SCALAR(data, '$.transactionId') AS transactionId
....

Then, save your view (you can override the existing view).

madmacc commented 4 years ago

@rphlmr Can I ask if you have managed to run the script to import historical data successfully? npx @firebaseextensions/fs-bq-import-collection

I am having issues with that: https://github.com/firebase/extensions/issues/80

rphlmr commented 4 years ago

@madmacc

I just ran npx @firebaseextensions/fs-bq-import-collection and answered the questions it asked. (I have to say that it took me three tries to find what to answer ^^).

After that, if your BigQuery export cloud function is already in production, you'll have to deal with duplicated data ;)

If someone is interested: to drop duplicated data I created another view:

SELECT
  _createdAt as createdAt,
  transactionId,
  sellerName,
  ... etc ...
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY document_name) row_number
  FROM
    `my_project.data_set.my_collection_exported_schema_changelog`)
WHERE
  row_number = 1
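The same keep-one-row-per-document_name logic can be sanity-checked in plain JavaScript (the rows below are made up; note that ROW_NUMBER() without an ORDER BY keeps an arbitrary row per partition, whereas this sketch keeps the first one seen):

```javascript
// Keep only one row per document_name, mirroring the
// ROW_NUMBER() OVER (PARTITION BY document_name) ... = 1 filter above.
function dedupeByDocument(rows) {
  const seen = new Set();
  return rows.filter(r => {
    if (seen.has(r.document_name)) return false;
    seen.add(r.document_name);
    return true;
  });
}

// Hypothetical changelog rows: two entries for the same document.
const rows = [
  { document_name: "orders/a", operation: "CREATE" },
  { document_name: "orders/a", operation: "UPDATE" },
  { document_name: "orders/b", operation: "IMPORT" },
];

console.log(dedupeByDocument(rows).length);
// → 2
```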
madmacc commented 4 years ago

@rphlmr OK, well it is useful to hear that it works. Maybe it is just an issue with my environment, or a Windows issue. I will set up an old Mac I have and give it a try on that.

rphlmr commented 4 years ago

To import, this is what I answered:

➜ npx @firebaseextensions/fs-bq-import-collection
? What is your Firebase project ID? my_project_id
? What is the path of the the Cloud Firestore Collection you would like to import from? (This may, or may not, be the same Collection for which you plan to mirror changes.) my_collection_name
? What is the ID of the BigQuery dataset that you would like to use? (A dataset will be created if it doesn't already exist) firestore_export
? What is the identifying prefix of the BigQuery table that you would like to import to? (A table will be created if one doesn't already exist) my_collection_name
? How many documents should the import stream into BigQuery at once? 300
Importing data from Cloud Firestore Collection: my_collection_name, to BigQuery Dataset: firestore_export, Table: my_collection_name_raw_changelog
BigQuery dataset already exists: firestore_export
BigQuery table with name my_collection_name_raw_changelog already exists in dataset firestore_export!
View with id my_collection_name_raw_latest already exists in dataset firestore_export.
Inserting 300 row(s) of data into BigQuery
Inserted 300 row(s) of data into BigQuery
Inserting 300 row(s) of data into BigQuery
Inserted 300 row(s) of data into BigQuery
Inserting 300 row(s) of data into BigQuery
Inserted 300 row(s) of data into BigQuery
Inserting 210 row(s) of data into BigQuery
Inserted 210 row(s) of data into BigQuery
---------------------------------------------------------
Finished importing 1110 Firestore rows to BigQuery
---------------------------------------------------------
colinrw commented 4 years ago

Is it possible to import from multiple Collections at once? Or do I have to run this for all my Collections? @rphlmr

rphlmr commented 4 years ago

@colinrw I don't know, but we do: 1 collection = 1 BigQuery extension. We think it's clearer for us :)

colinrw commented 4 years ago

@rphlmr But what if I need to get data from multiple collections? Do I have to set up a separate extension for each collection?

rphlmr commented 4 years ago

@colinrw Yes, when you configure a BigQuery extension, it allows only one collection at a time. Under the hood, a BigQuery extension is just a cloud function that responds to the "onWrite" trigger on your collection.

laurenzlong commented 4 years ago

@fredzqm and @IanWyszynski can you comment on the feasibility of this?

fredzqm commented 4 years ago
Is it possible to import from multiple Collections at once? Or do I have to run this for all my Collections?

Yes, just create two extensions writing to the same DATASET_ID and TABLE_ID.

It will just work. (That is the nice part about the schema-less export format.)

laurenzlong commented 4 years ago

Thanks for confirming! I'm leaving this issue open to track documenting the script @IanWyszynski outlined in https://github.com/firebase/extensions/issues/113#issuecomment-555679921. When you get a chance, it would be great if you could make a PR to add it.

larssn commented 4 years ago

Cool script, works like a charm (I'm brand new at this).

I have a pretty complex object, and was wondering if the array type supports objects (records)?

Here's the gist of it:

export class Order {
    /// The order items. Also called `order lines` in some systems.
    orderItems: OrderItem[];

    /// A list of payments since multiple payments can be made per order.
    payments: Payment[];
}

export class OrderItem {
    /// The name from the product.
    name: string;

    /// The quantity of the product
    quantity: number;
}

Is it possible to query a single OrderItem on a given Order by name, for example?

rphlmr commented 4 years ago

Hi @larssn. In your case I would probably have a collection each for Orders, OrderItems and Payments. In Orders I would store orderItem ids and payment ids (data normalization). It's easier to make queries, manage state in the front end, etc.

With this example, in BigQuery you will have stringified JSON in the 'orderItems' and 'payments' fields. BigQuery is just normal SQL, which only knows primitive types.

larssn commented 4 years ago

@rphlmr I figured a bit out. I modified the array function to be the following:

CREATE OR REPLACE FUNCTION `staging_database.firestore_export.firestoreArray`(json STRING) RETURNS ARRAY<STRUCT<name STRING, quantity NUMERIC>> LANGUAGE js AS """
return json ? JSON.parse(json).map(x => ({name: x.name, quantity: x.quantity})) : [];
""";

That seems to allow me to query order items by name or quantity.
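Since the UDF body is plain JavaScript, you can sanity-check the parsing logic locally with Node before deploying it (the sample input below is made up):

```javascript
// Same logic as the body of the firestoreArray UDF above:
// parse the JSON array and keep only name/quantity per element.
function firestoreArray(json) {
  return json
    ? JSON.parse(json).map(x => ({ name: x.name, quantity: x.quantity }))
    : [];
}

// Hypothetical serialized orderItems value, including a field the
// UDF is expected to drop.
const sample = JSON.stringify([
  { name: "Test product 2", quantity: 2, sku: "ignored-field" },
]);

console.log(firestoreArray(sample));
// → [ { name: 'Test product 2', quantity: 2 } ]
console.log(firestoreArray(null));
// → []
```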

The query looks fairly standard:

SELECT
  *
FROM
  (
    SELECT
      document_name,
      operation,
      `staging_database.firestore_export.firestoreTimestamp`(JSON_EXTRACT_SCALAR(data, '$.created')) AS created,
      `staging_database.firestore_export.firestoreTimestamp`(JSON_EXTRACT_SCALAR(data, '$.modified')) AS modified,
      `staging_database.firestore_export.firestoreArray`(JSON_EXTRACT(data, '$.orderItems')) AS orderItems
    FROM
      `staging_database.firestore_export.orders_raw_latest`
  ) orders_raw_latest
  CROSS JOIN UNNEST(orders_raw_latest.orderItems)
WHERE name = 'Test product 2'
AND quantity = 2

Do you know how all the JSON_EXTRACT operations perform?

EDIT: This answer seems to indicate that it isn't a problem: https://stackoverflow.com/a/56450196/536485

rphlmr commented 4 years ago

Nice :p

I don't know about the performance of JSON_EXTRACT, but that's what the script does to produce its view. So it seems fine to me.

karayu commented 4 years ago

The detailed instructions for the script have now been merged in #161. I'm closing this now. Please reopen if there are any other issues!