coursera / dataduct

DataPipeline for humans.

Split does not do proper thing on lines with escaped newlines #228

Open cliu587 opened 8 years ago

cliu587 commented 8 years ago

Just to keep track of this issue, which was introduced in https://github.com/coursera/dataduct/pull/227/files. If you set the split property of an extract-rds step to anything other than the default value of 1, rows are split incorrectly when a column contains a string with newlines.

This is because we are using the Unix split command, which splits on every newline and cannot tell an escaped newline inside a row from the newline that ends the row. I think it might be possible to fix this by replacing escaped newlines with a token character before splitting and then transforming the token back afterwards.
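
A rough sketch of the token idea, not the actual dataduct code. It assumes an escaped newline is a backslash immediately followed by a newline character, and that the token (here the ASCII unit separator `\x1f`, a hypothetical choice) never appears in the data. The helper names and the in-memory splitting are illustrative stand-ins for the real step, which shells out to split.

```python
import re

# Assumption: an escaped newline is a backslash followed by "\n".
ESCAPED_NEWLINE = "\\\n"
# Assumption: this token never occurs in the extracted data.
TOKEN = "\x1f"


def mask_newlines(text):
    """Replace escaped newlines with TOKEN so each row is one physical line."""
    return text.replace(ESCAPED_NEWLINE, TOKEN)


def unmask_newlines(text):
    """Restore the escaped newlines that mask_newlines replaced."""
    return text.replace(TOKEN, ESCAPED_NEWLINE)


def split_rows(text, n_parts):
    """Split text into at most n_parts chunks without cutting a row in half.

    Hypothetical stand-in for running the unix split command on masked data.
    """
    masked = mask_newlines(text)
    # Break on "\n" only, keeping the terminator attached to each row.
    rows = re.findall(r"[^\n]*\n|[^\n]+", masked)
    per_part = max(1, -(-len(rows) // n_parts))  # ceiling division
    return [
        unmask_newlines("".join(rows[i:i + per_part]))
        for i in range(0, len(rows), per_part)
    ]
```

One caveat with this sketch: a row that ends in an escaped backslash (two backslashes, then the real row terminator) would be misread as an escaped newline, so a real fix would need to count the preceding backslashes or use a proper parser.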