TresAmigosSD / SMV

Spark Modularized View
Apache License 2.0

need to handle `^M` in string fields when persisting data to csv files. #1547

Closed AliTajeldin closed 5 years ago

AliTajeldin commented 5 years ago

If a string column field containing a ^M (carriage return) is saved to a csv file, the ^M confuses the hadoop text file input splitter, which treats it as a line ending, so we read partial records. To avoid this issue, we will translate ^M to a token that shouldn't exist in normal user data (e.g. __smv%M%smv__) on write and translate it back on read.
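A minimal sketch of the translate-on-write / translate-back-on-read idea, in plain Python. This is not the actual SMV code: the function names `escape_cr` and `unescape_cr` are hypothetical, and only the token `__smv%M%smv__` comes from the description above.

```python
CR = "\r"                      # the ^M character
CR_TOKEN = "__smv%M%smv__"     # placeholder token taken from the issue text


def escape_cr(value):
    """Replace every ^M with the token before the field is written to csv.

    Hypothetical helper; None (null field) is passed through unchanged.
    """
    if value is None:
        return None
    return value.replace(CR, CR_TOKEN)


def unescape_cr(value):
    """Restore ^M from the token after the field is read back from csv.

    Hypothetical helper; None (null field) is passed through unchanged.
    """
    if value is None:
        return None
    return value.replace(CR_TOKEN, CR)


if __name__ == "__main__":
    original = "first part\rsecond part"
    stored = escape_cr(original)
    # The stored form has no ^M left, so a line splitter cannot cut the record.
    assert CR not in stored
    # Reading back restores the original field.
    assert unescape_cr(stored) == original
```

The round trip is only lossless if the token never occurs in the original data, which is why the token is chosen to be something unlikely to show up in a real field.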

Note: This only needs to be fixed in smv 1.6, as we are switching to parquet files for intermediate results in 2.x.

AliTajeldin commented 5 years ago

This is fixed in release 1.6.2.4.p1. This will not be ported to master as we will use parquet on 2.x