EMCECS / presto-s3-connector

Apache License 2.0

S3 connect - how to handle bad/incomplete/incorrect data records #8

Open chipmaurer opened 3 years ago

chipmaurer commented 3 years ago

First, check what the Hive connector does with bogus data, and do something similar. Test with things like a CSV that has blank rows, missing fields, incorrect types for columns, etc.
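A minimal sketch of such a test fixture, in Python for brevity. The three-column schema, the `classify` helper, and the category names are all hypothetical and only illustrate the kinds of bad rows listed above; none of this is the connector's code, and it makes no claim about what the Hive connector actually does with each case.

```python
import csv
import io

# Hypothetical schema for illustration only: (column name, parser) pairs.
SCHEMA = [("id", str), ("title", str), ("release_year", int)]

# One good row, one blank row, one row missing a field, one row with a bad type.
FIXTURE = "s1,Good Row,2017\n\ns2,Missing Year\ns3,Bad Type,not-a-year\n"


def classify(fields):
    """Return a label describing what is wrong with a parsed CSV row."""
    if not fields:
        return "blank"
    if len(fields) != len(SCHEMA):
        return "wrong_field_count"
    for (name, parse), raw in zip(SCHEMA, fields):
        try:
            parse(raw)
        except ValueError:
            return "bad_type:" + name
    return "ok"


results = [classify(row) for row in csv.reader(io.StringIO(FIXTURE))]
for label in results:
    print(label)
```

Each label marks a case where the connector needs a defined behavior (skip the row, return NULL, or fail the query), ideally matching whatever the Hive connector turns out to do.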

chipmaurer commented 3 years ago

Here is a problem row that needs to be addressed.

s94,Movie,27: Gone Too Soon,Simon Napier-Bell,"Janis Joplin, Jimi Hendrix, Amy Winehouse, Jim Morrison, Kurt Cobain",United Kingdom,1-May-18,2017,TV-MA,70 min,Documentaries,"Explore the circumstances surrounding the tragic deaths at 27 of Jimi Hendrix, Jim Morrison, Brian Jones, Janis Joplin, Kurt Cobain and Amy Winehouse."

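When a CSV cell contains a quoted, comma-separated list, the S3 column decoder gets confused. It appears to split on the commas inside the quotes, so the eighth field of this row becomes ` Jim Morrison` instead of `2017`, and you can end up with this error: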

Query 20210921_190308_00129_7i234 failed: For input string: " Jim Morrison"
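
To illustrate the suspected cause, here is a small Python sketch that parses the row above two ways: a plain split on commas, and a quote-aware parse with the standard `csv` module. That the decoder does a plain split is an assumption inferred from the error message, not something confirmed from the connector source.

```python
import csv
import io

# The problem row from this issue, reproduced verbatim.
ROW = (
    's94,Movie,27: Gone Too Soon,Simon Napier-Bell,'
    '"Janis Joplin, Jimi Hendrix, Amy Winehouse, Jim Morrison, Kurt Cobain",'
    'United Kingdom,1-May-18,2017,TV-MA,70 min,Documentaries,'
    '"Explore the circumstances surrounding the tragic deaths at 27 of '
    'Jimi Hendrix, Jim Morrison, Brian Jones, Janis Joplin, Kurt Cobain '
    'and Amy Winehouse."'
)

# Naive split: ignores quoting, so the quoted cast list is broken apart.
naive_fields = ROW.split(",")

# Quote-aware parse: commas inside double quotes stay within one field.
quoted_fields = next(csv.reader(io.StringIO(ROW)))

print(repr(naive_fields[7]))   # the text that lands where the year should be
print(repr(quoted_fields[7]))  # the year, as intended
print(len(quoted_fields))      # number of columns in the row

# Parsing the naively split field as a number fails, much like the query did.
try:
    int(naive_fields[7])
    naive_parse_failed = False
except ValueError:
    naive_parse_failed = True
```

With the naive split, the eighth field is exactly the string from the error message, which is consistent with the decoder not honoring quotes. A fix would presumably need the decoder to use a quote-aware CSV parser before mapping fields to columns.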