Module 1, Activity 1: nyc_trips csv and parquet tables differ

emrspecialistsamer / aws-glue-workshop

Repository for AWS Glue Workshop

Apache License 2.0

30 stars 29 forks source link

Module 1, Activity 1: nyc_trips csv and parquet tables differ #4

Closed eebomarsi closed 2 years ago

eebomarsi commented 2 years ago

I ran the module 1 csv to parquet exercise without any issues. The crawlers were successful, schemas matched, and no exceptions. However, I checked the row counts and see the parquet table has 10 rows less that the source csv table.

spark.sql("select count(*) from nyc_trips_parquet").show()
+--------+
|count(1)|
+--------+
|25955596|
+--------+

spark.sql("select count(*) from nyc_trips_csv").show()
+--------+
|count(1)|
+--------+
|25955606|
+--------+

I re-ran the entire workshop from scratch and had the same results. I'll compare later and figure out the difference, but just wanted to note this in case anyone has already analyzed it.

eebomarsi commented 2 years ago

I double checked with Athena which says they match: SELECT count() FROM "nyc_trips"."nyc_trips_csv"; 25955596 SELECT count() FROM "nyc_trips"."nyc_trips_parquet"; 25955596

So, the question is why does spark sql show 10 more rows in nyc_trips_csv

eebomarsi commented 2 years ago

Checking for unmatched rows via Athena and Glue spark.sql results in no rows:

SELECT vendor_name, trip_pickup_datetime, trip_dropoff_datetime FROM "nyc_trips"."nyc_trips_csv" EXCEPT SELECT vendor_name, trip_pickup_datetime, trip_dropoff_datetime FROM "nyc_trips"."nyc_trips_parquet"
SELECT vendor_name, trip_pickup_datetime, trip_dropoff_datetime FROM "nyc_trips"."nyc_trips_parquet" EXCEPT SELECT vendor_name, trip_pickup_datetime, trip_dropoff_datetime FROM "nyc_trips"."nyc_trips_csv"

eebomarsi commented 2 years ago

The reason for the discrepancy is because the spark sql query is including the csv header row for each of the 10 input files. By filtering header rows, I get the same counts:

spark.sql("select count(*) from nyc_trips_csv").show()
25955606

spark.sql("select count(*) from nyc_trips_csv where vendor_name != 'vendor_name'").show()
25955596

spark.sql("select count(*) from nyc_trips_parquet").show()
25955596

Reference

eebomarsi commented 2 years ago

Likewise, using the DataFrame header option:

df_csv = spark.read.csv("s3://ebo-glue-labs-1108-724598390090/data/nyc_trips_csv/")
df_csv.count()
25955606

df_csv = spark.read.option("header","true").csv("s3://ebo-glue-labs-1108-724598390090/data/nyc_trips_csv/")
df_csv.count()
25955596

vivshrivastava commented 2 years ago

yeah by default head is false, so the workshop should have the header turned on in the statement.