1086-Maria-Big-Data / JobAdAnalytics

Output for FilteredIndex should not delimit by "," #79

Closed by vinceecws 3 years ago

vinceecws commented 3 years ago

The SURT URL column contains "," in its values:

[Screenshot: Screen Shot 2021-09-02 at 8 33 17 PM]

So a comma-delimited output format for FilteredIndex is not ideal, because a reader could split a SURT URL into several columns. A better choice would be to delimit by whitespace.

Ahimsaka commented 3 years ago

I believe we resolved this yesterday when Vince added a delimiter parameter to IndexUtil.write.

vinceecws commented 3 years ago

I opened this issue without knowing how Spark handles field values that contain the delimiter: it wraps the entire field in quotation marks.

Using the above example, if the delimiter specified in Spark is "," and a field contains ",", as in com,hellosummerville)/jobs/law-enforcement-security, Spark's DataFrameWriter automatically wraps that field in quotation marks, like so: "com,hellosummerville)/jobs/law-enforcement-security".
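For anyone who wants to see the quoting convention in isolation, here is a minimal sketch. It uses Python's standard csv module rather than Spark, on the assumption that its minimal-quoting mode follows the same convention described above (quote only the fields that contain the delimiter). The helper names write_rows and read_rows and the second column value "200" are hypothetical; only the SURT URL comes from this thread.

```python
import csv
import io


def write_rows(rows, delimiter=","):
    """Serialize rows to delimited text, quoting only the fields that need it."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter=delimiter,
        quoting=csv.QUOTE_MINIMAL,  # quote a field only if it contains a special character
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buf.getvalue()


def read_rows(text, delimiter=","):
    """Parse delimited text back into rows, honoring quoted fields."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


if __name__ == "__main__":
    # Hypothetical two-column row: the SURT URL from the thread plus a made-up status value.
    rows = [["com,hellosummerville)/jobs/law-enforcement-security", "200"]]
    text = write_rows(rows)
    print(text, end="")
    # The quoted field survives a round trip as a single column.
    assert read_rows(text) == rows
```

The comma inside the SURT URL stays within one column after the round trip because the reader treats the quoted span as a single field. The same holds for any other delimiter, so switching to whitespace would not remove the need for quoting if a value ever contained a space.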