GoogleCloudPlatform / training-data-analyst

Labs and demos for courses for GCP Training (http://cloud.google.com/training).
Apache License 2.0

"Serverless Data Processing with Dataflow - Writing an ETL Pipeline using Apache Beam and Cloud Dataflow (Python)" job fails because of short schema #2541

Open MrCsabaToth opened 4 months ago

MrCsabaToth commented 4 months ago

When following the instructions at https://www.cloudskillsboost.google/course_sessions/11591045/labs/433174 (part of 09 Serverless Data Processing with Dataflow: Develop Pipelines, under `Data Engineer Learning Path > Serverless Data Processing with Dataflow: Develop Pipelines > Beam Concepts Review`), Task 5, "Write to a sink", cites a schema that is too short:

    table_schema = {
        "fields": [
            {
                "name": "name",
                "type": "STRING"
            },
            {
                "name": "id",
                "type": "INTEGER",
                "mode": "REQUIRED"
            },
            {
                "name": "balance",
                "type": "FLOAT",
                "mode": "REQUIRED"
            }
        ]
    }

However, anyone who digs deeper can see at https://github.com/GoogleCloudPlatform/training-data-analyst/blob/989aa2d423f17647b20e2e02382b5d0f7b467193/quests/dataflow_python/batch_event_generator.py#L47 that the event generator defines log_fields = ["ip", "user_id", "lat", "lng", "timestamp", "http_request", "http_response", "num_bytes", "user_agent"], and consequently the solution file has:

    table_schema = {
        "fields": [
            {
                "name": "ip",
                "type": "STRING"
            },
            {
                "name": "user_id",
                "type": "STRING"
            },
            {
                "name": "lat",
                "type": "FLOAT"
            },
            {
                "name": "lng",
                "type": "FLOAT"
            },
            {
                "name": "timestamp",
                "type": "STRING"
            },
            {
                "name": "http_request",
                "type": "STRING"
            },
            {
                "name": "http_response",
                "type": "INTEGER"
            },
            {
                "name": "num_bytes",
                "type": "INTEGER"
            },
            {
                "name": "user_agent",
                "type": "STRING"
            }
        ]
    }

However, without peeking into the solution, the job fails. The instructions could be updated for better student success.
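For illustration, the mismatch can be checked locally before the job is submitted. The sketch below is not part of the lab or the solution file; `schema_mismatch` is a hypothetical helper name. It only compares the field names in the Task 5 schema against the `log_fields` list quoted above, and does not check types or modes.

```python
# Field names emitted by batch_event_generator.py, as quoted in this issue.
log_fields = ["ip", "user_id", "lat", "lng", "timestamp",
              "http_request", "http_response", "num_bytes", "user_agent"]

# Schema shown in Task 5 of the lab instructions.
lab_schema = {
    "fields": [
        {"name": "name", "type": "STRING"},
        {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
        {"name": "balance", "type": "FLOAT", "mode": "REQUIRED"},
    ]
}


def schema_mismatch(schema, record_fields):
    """Hypothetical helper: compare schema field names with record keys."""
    schema_names = {field["name"] for field in schema["fields"]}
    record_names = set(record_fields)
    return {
        # Keys present in the records but absent from the schema.
        "missing_from_schema": sorted(record_names - schema_names),
        # Schema fields that no record will ever populate.
        "missing_from_record": sorted(schema_names - record_names),
    }


diff = schema_mismatch(lab_schema, log_fields)
print("In records but not in schema:", diff["missing_from_schema"])
print("In schema but not in records:", diff["missing_from_record"])
```

With the Task 5 schema, none of the nine record fields appear in the schema, and all three schema fields (two of them `REQUIRED`) are absent from the records. Running the same check against the solution schema should report no differences.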