
GCP (Google Cloud Platform) is one of the leading cloud platforms worldwide. Therefore, it's essential for computer science professionals, especially data engineers, data analysts, and data scientists, to learn tools such as BigQuery, Dataflow, Cloud SQL, Bigtable, and more.

GCP_Data_Engineering

Capture d’écran 2024-04-28 191003

Chapter 1: -------------------------- BigQuery ---------------------------------:

BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real time. With BigQuery, there's no infrastructure to set up or manage, letting you focus on finding meaningful insights using GoogleSQL and taking advantage of flexible pricing models across on-demand and flat-rate options.
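As a quick illustration of running GoogleSQL from code, here is a minimal sketch using the google-cloud-bigquery Python client against a BigQuery public dataset (it assumes the library is installed and default credentials are configured):

```python
from google.cloud import bigquery

# The client picks up the project and credentials from the environment
# (for example, GOOGLE_APPLICATION_CREDENTIALS).
client = bigquery.Client()

# Run a GoogleSQL query against a BigQuery public dataset.
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(row.name, row.total)
```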

There are several ways to ingest data into BigQuery:

Batch load a set of data records.
Stream individual records or batches of records.
Use queries to generate new data and append or overwrite the results to a table.
Use a third-party application or service.

---------------------------- Batch loading data ------------------------------:

With batch loading, you load the source data into a BigQuery table in a single batch operation. For example, the data source could be a CSV file, an external database, or a set of log files. Traditional extract, transform, and load (ETL) jobs fall into this category.

Get started :

  1. Open the BigQuery page in the Google Cloud console (Go to the BigQuery page).

  2. Once BigQuery is open, in the Explorer panel, select your project ID and then choose Create dataset.
    Capture d’écran 2024-08-01 115931

  3. On your right, a window opens prompting you to fill in the following:

    In Dataset ID, enter a name for the dataset.
    In Location, choose Region or Multi-region and select the region where you want to store your dataset.
    In Schedule, set the number of days before your tables expire if you enable table expiration.
    Next, click Create dataset.
    Capture d’écran 2024-08-01 121214

  4. Once the dataset is created, find it and click on it to create a table, as follows:

    In Source, choose Google Cloud Storage from the options (Empty table, Google Cloud Storage, Upload, Drive, Google Bigtable, Amazon S3, Azure Blob Storage).
    Browse to select the source file and choose the file format.
    In the Destination section, select the project ID, the dataset that contains the table, and set the table name.
    In the Schema section, select "Automatically detect."
    Leave other options as default and click on "Create Table."
    Capture d’écran 2024-08-01 122547

  5. Once the table is created, click on it to open it, run a query, or create a notebook.
    Capture d’écran 2024-08-01 123513 Capture d’écran 2024-08-01 160317
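The same batch load can also be performed programmatically. Below is a minimal sketch with the google-cloud-bigquery Python client; the project, dataset, table, and bucket names are placeholders, and it assumes a CSV file already sits in Cloud Storage:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholders: replace with your own project, dataset, table, and bucket.
table_id = "my-project.my_dataset.my_table"
uri = "gs://my-bucket/data.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # same as "Automatically detect" in the console
)

# Start the batch load job and wait for it to finish.
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()

table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}.")
```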

---------------------------- Use a third-party application or service -----------------------------------:

Get started :

  1. Open your IDE (VS Code is used in this tutorial) and create a workspace folder.

  2. Check whether virtualenv is installed; if not, install it.
    Capture d’écran 2024-08-01 144538

  3. Create your virtual environment with virtualenv venv and then activate it.

  4. Install the google-cloud-bigquery library as shown in the picture below.
    Capture d’écran 2024-08-01 150232

  5. Create a Python script to ingest data into BigQuery, as follows (a minimal sketch is given after the screenshots):
    Capture d’écran 2024-08-01 150426 Capture d’écran 2024-08-01 152136
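The exact script shown in the screenshots is not reproduced here, but a minimal sketch of such an ingestion script could look like the following; the table ID and file name are placeholders, and it loads a local CSV file into BigQuery as a batch job:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholders: adjust to your own dataset, table, and local file.
table_id = "my-project.my_dataset.my_table"
source_file = "data.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

# Upload the local CSV file as a batch load job and wait for completion.
with open(source_file, "rb") as f:
    job = client.load_table_from_file(f, table_id, job_config=job_config)
job.result()

print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}.")
```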

------------------------ Stream individual records or batches of records -----------------------------:

Get started :

  1. On your left, click on Data transfers, then select Create a transfer and enable the BigQuery Data Transfer API.

    Capture d’écran 2024-08-06 163254 Capture d’écran 2024-08-06 163404

  2. For Source type, choose the desired source. In this tutorial, choose Google Cloud Storage.

    Capture d’écran 2024-08-06 163601

  3. In Transfer Setting Name, enter a name for the transfer.

  4. For Frequency of data collection, choose On demand.

  5. In Destination configuration, choose a Dataset.

    Capture d’écran 2024-08-06 164951

  6. In Data source details, follow these steps:

    Set a destination table.
    For Cloud Storage URI, browse your bucket and set the path that contains all the files to transfer.
    For Write preference, choose one of the options (APPEND or MIRROR).
    For File format, select the appropriate file format.

Capture d’écran 2024-08-06 165031

  7. In Format type, complete the following:

    Set the field delimiter.
    Set the number of header rows to skip.

Capture d’écran 2024-08-06 165111

  8. Leave other options as default and click on Save.

Capture d’écran 2024-08-06 165137

  9. At the top right, select Run transfer now to execute the transfer.

Capture d’écran 2024-08-06 165528 Capture d’écran 2024-08-06 165618 Capture d’écran 2024-08-06 170026
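The same transfer can also be created programmatically with the BigQuery Data Transfer Service client library (pip install google-cloud-bigquery-datatransfer). The sketch below mirrors the console steps above; the project, bucket, dataset, and table names are placeholders, and the parameter keys for the Cloud Storage source are assumptions based on the documented transfer options:

```python
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
parent = client.common_project_path("my-project")  # placeholder project ID

# Placeholder names; the params keys below are assumptions based on the
# documented Cloud Storage transfer options.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",       # destination dataset
    display_name="gcs_to_bigquery_transfer",   # transfer setting name
    data_source_id="google_cloud_storage",     # source type: Cloud Storage
    params={
        "data_path_template": "gs://my-bucket/*.csv",    # Cloud Storage URI
        "destination_table_name_template": "my_table",   # destination table
        "file_format": "CSV",
        "write_disposition": "APPEND",                   # or "MIRROR"
        "skip_leading_rows": "1",
        "field_delimiter": ",",
    },
)

created = client.create_transfer_config(
    parent=parent, transfer_config=transfer_config
)
print(f"Created transfer config: {created.name}")
```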

Back in BigQuery, we can also create a scheduled query and a repository.

Capture d’écran 2024-08-06 170348 Capture d’écran 2024-08-06 170436
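The transfer above runs in batches; for streaming individual records or small batches of records, as the section title suggests, the google-cloud-bigquery client provides insert_rows_json. A minimal sketch, assuming the destination table already exists with a matching schema (the table and field names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table; it must already exist with a matching schema.
table_id = "my-project.my_dataset.my_table"

rows = [
    {"name": "alice", "age": 30},
    {"name": "bob", "age": 25},
]

# Stream the records; the returned list is empty when every row is accepted.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Encountered errors:", errors)
else:
    print(f"Streamed {len(rows)} rows into {table_id}.")
```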

Chapter 2: --------------- Data Transformation in BigQuery --------------------------:

This tutorial describes the different ways you can transform data in your BigQuery tables.

Methods of transforming data

You can transform data in BigQuery in the following ways:

Use data manipulation language (DML) to transform data in your BigQuery tables.
Use Dataform to develop, test, version control, and schedule SQL workflows in BigQuery.

  1. Transform data with DML
    You can use data manipulation language (DML) to transform data in your BigQuery tables. DML statements are GoogleSQL queries that manipulate existing table data to add or delete rows, modify data in existing rows, or merge data with values from another table. DML transformations are also supported in partitioned tables.

You can run multiple DML statements concurrently; BigQuery queues the DML statements that transform your data and runs them one after the other. BigQuery manages how concurrent DML statements are run, based on the transformation type.
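As a concrete illustration, here is a minimal sketch of running a DML statement through the Python client; the table and column names are placeholders and assume a table with status and created_at columns:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table and columns; adjust the statement to your own schema.
dml = """
    UPDATE `my-project.my_dataset.my_table`
    SET status = 'archived'
    WHERE created_at < DATE '2024-01-01'
"""

# DML statements run like any other query job.
job = client.query(dml)
job.result()
print(f"Modified {job.num_dml_affected_rows} rows.")
```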

For practice with DML, you can use the scripts in the BigQuery folder of this repository. However, to use them as I do, you'll need to generate a credentials file configured with permissions to edit, update, create, and delete in BigQuery. This is done as follows:

Open IAM & Admin and select Service accounts. Click Create service account.

Capture d’écran 2024-08-08 135139

In the service account details window:

Set a name for the service account. Provide a description (optional), then click Create and continue.

Capture d’écran 2024-08-08 140301

In the Grant this service account access to the project section, add a relevant role and click Continue.

Capture d’écran 2024-08-08 140457

Leave the remaining settings as they are (optional), and click Done.

Capture d’écran 2024-08-08 140536

In the next window, on the right, click the service account to create a key.

Capture d’écran 2024-08-08 140630

Choose Add key, then Create new key, and click Create to generate the key.

Capture d’écran 2024-08-08 140721 Capture d’écran 2024-08-08 140757 Capture d’écran 2024-08-08 140829
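Once the JSON key is downloaded, the scripts in this repository can authenticate with it. A minimal sketch, assuming the key file was saved as service-account.json (the path is a placeholder):

```python
from google.cloud import bigquery

# Point the client at the downloaded key file instead of relying on
# application-default credentials. The path is a placeholder.
client = bigquery.Client.from_service_account_json("service-account.json")

# Quick sanity check: list the datasets the service account can access.
for dataset in client.list_datasets():
    print(dataset.dataset_id)
```

Alternatively, setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to the key file path lets a plain bigquery.Client() pick up the credentials automatically.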