Azure / usql

U-SQL Examples and Issue Tracking
http://usql.io
MIT License
234 stars · 683 forks

USQL and embedded newline characters #84

Open JRSAzure opened 7 years ago

JRSAzure commented 7 years ago

Team,

EXTRACT ISSUE:

I have a comma-delimited text file with a COMMENT field that includes some newline characters. This field is inside a double-quoted string, such as "This is my \n comment \n". I have tried to escape them, and even wrote a custom C# extractor, but I can't seem to prevent the row from being cut short.

Running the EXTRACT with the silent switch on works, but I'm throwing away valuable data.

Any help is appreciated, I have spent many hours on a solution.

Thanks,

Jim

sinaakhtar commented 6 years ago

I have the same problem. Could you share the code for what you have done and I can try to take it forward?

Grayincarnation commented 6 years ago

Exact same issue here. Any word on a solution?

ghost commented 6 years ago

Same issue, how can we escape line breaks inside a string?

sinaakhtar commented 6 years ago

I had to write a custom extractor to solve this.

Grayincarnation commented 6 years ago

@sinaakhtar It would greatly help if you could share your custom extractor solution.

MKadaner commented 6 years ago

If you can do some escaping (and I assume you can, since you say you tried to), please try replacing carriage return (U+000D) with two characters: backslash followed by lowercase 'r' (U+005C U+0072), and line feed (U+000A) with backslash followed by lowercase 'n' (U+005C U+006E). Then use the escapeCharacter parameter of the built-in text extractor:

USING Extractors.Csv(quoting : true, escapeCharacter : '\\') // quoting is true by default, but it does not hurt to repeat.

You will also need to escape the escape character itself, i.e., to double all backslashes. Here is a sample C# expression (an interpolated string), which will do the trick:

$"\"{COMMENT.Replace(@"\", @"\\").Replace("\"", "\"\"").Replace("\r", @"\r").Replace("\n", @"\n")}\""
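A minimal, self-contained sketch of how that expression behaves on sample data (the helper name and input are illustrative, not from the issue):

```csharp
using System;

// Sketch of the escaping suggested above: double the backslashes first,
// double embedded quotes per CSV convention, then turn raw CR/LF into the
// two-character sequences \r and \n.
static string EscapeField(string comment) =>
    $"\"{comment.Replace("\\", "\\\\").Replace("\"", "\"\"").Replace("\r", @"\r").Replace("\n", @"\n")}\"";

string raw = "This is my\r\ncomment";
string escaped = EscapeField(raw);

// The escaped field fits on one physical line, so a parallel reader can
// no longer mistake the embedded newline for a record delimiter.
Console.WriteLine(escaped);                                    // "This is my\r\ncomment"
Console.WriteLine(!escaped.Contains("\r") && !escaped.Contains("\n")); // True
```

Order matters: backslashes must be doubled before CR/LF are rewritten, or the freshly inserted escape characters would themselves get doubled.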
JRSAzure commented 6 years ago

Team,

Sorry, but I never found a solution to this issue. It is very disappointing for the ADLA platform that a simple thing like embedded newlines inside a quoted string can cause such a problem. I know the distributed nature of the file adds complexity, but I'm not sure why this has not been fixed yet.

MKadaner commented 6 years ago

@JRSAzure will the solution suggested in my comment above work for you? The only precondition is that you need to escape the data in a special way. Based on your original question, I assume you can do this.

I know the distributed nature of the file adds complexity, but I'm not sure why this has not been fixed yet.

It's not about complexity. This is not possible in general. We have some overlap between the distributions, so that the previous and the next extractors both see the same (limited) segment of the file. The previous extractor reads its last record in full, even if it spans into the next distribution. The next extractor skips the tail of the last record of the previous distribution. To do so, it looks for a record delimiter within the overlap. If the delimiter was inside a quoted string and was escaped, the escape sequence may lie outside (before) the overlap area, so the next extractor has no chance to see it. This case is indistinguishable from a normal end-of-record delimiter.
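A toy illustration of that ambiguity (invented data, not ADLA internals): the second extractor starts scanning at the split point and cannot see the escaping backslash that precedes it.

```csharp
using System;

// Invented CSV: record 1 has a quoted field containing an escaped newline
// (backslash + LF), followed by a normal second record.
string data = "1,\"part one\\\npart two\",x\n2,b,c\n";

// Suppose the distribution boundary falls right on the escaped LF.
int split = data.IndexOf('\n');

// The escaping backslash sits at split - 1, i.e. before the window the
// second extractor is given, so from its point of view this LF is
// indistinguishable from a genuine end-of-record delimiter.
Console.WriteLine(data[split - 1] == '\\'); // True: the LF was escaped
string secondWindow = data.Substring(split + 1);
Console.WriteLine(secondWindow.Split('\n')[0]); // half a record: part two",x
```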

One may think of more sophisticated "solutions", but please believe me, none of them works. One has to choose between parallel extraction and record delimiters inside quoted strings.

JRSAzure commented 6 years ago

Hello @MKadaner ,

Thank you for your insight. So I need to build a custom C# extractor and use this type of escaping logic in a StreamReader loop when I get to that column position?

$"\"{COMMENT.Replace(@"\", @"\\").Replace("\"", "\"\"").Replace("\r", @"\r").Replace("\n", @"\n")}\""

Please advise,

Jim

MKadaner commented 6 years ago

Hello @JRSAzure, I apologize for the confusion. With the data format you described in the first post, a custom extractor won't work either, for the same reasons it cannot be done in the built-in extractor.

The main question is where the data comes from and whether you can escape the strings before they get into your input files. If you do not have control over the data cooking process, your only solution is to sacrifice parallel extraction and set AtomicFileProcessing = true on the [SqlUserDefinedExtractor] attribute of the custom extractor.
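A sketch of such an atomic extractor, assuming standard CSV quoting with doubled quotes (the class name and all-string column handling are illustrative; real code would need error handling and a configurable schema):

```csharp
using Microsoft.Analytics.Interfaces;
using System.Collections.Generic;
using System.IO;
using System.Text;

// AtomicFileProcessing = true forces a single instance to read the whole
// file sequentially, so quoted fields may safely span physical lines.
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class QuotedNewlineExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        var field = new StringBuilder();
        var fields = new List<string>();
        bool inQuotes = false;

        using (var reader = new StreamReader(input.BaseStream, Encoding.UTF8))
        {
            int ch;
            while ((ch = reader.Read()) >= 0)
            {
                char c = (char)ch;
                if (inQuotes)
                {
                    if (c == '"' && reader.Peek() == '"') { field.Append('"'); reader.Read(); }
                    else if (c == '"') inQuotes = false;
                    else field.Append(c); // embedded \r\n is kept, not treated as a delimiter
                }
                else if (c == '"') inQuotes = true;
                else if (c == ',') { fields.Add(field.ToString()); field.Clear(); }
                else if (c == '\n') // end of record (outside quotes only)
                {
                    fields.Add(field.ToString().TrimEnd('\r'));
                    field.Clear();
                    for (int i = 0; i < fields.Count; i++)
                        output.Set<string>(i, fields[i]);
                    fields.Clear();
                    yield return output.AsReadOnly();
                }
                else field.Append(c);
            }
            // Note: a final record without a trailing newline is ignored in this sketch.
        }
    }
}
```

The price is that the whole file is processed by one vertex, so extraction no longer scales with file size.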

lalithakiran commented 6 years ago

I have the same issue. I have been trying to extract JSON from Avro files, and I am getting \r\n embedded in the nested JSON. I am not sure how to replace it with a new line. Here is what I am getting: "source": "{\r\n \"vendorCd\": \"G\",\r\n \"brandCd\": \"4\",\r\n \"channelCd\": \"C\",\r\n \"accessTypeCd\": \"I\"\r\n}"

ChrisJr commented 6 years ago

@lalithakiran have you been able to figure out a solution? I am currently having the same issue as you and looking for possible solutions.

sinaakhtar commented 5 years ago

@sinaakhtar It would greatly help if you could share your custom extractor solution.

Apologies for the late response. Here's the extractor code. You might need to modify it to fit your use case.

ExtractTsv.zip

lemay007 commented 3 years ago

Is this thread still live? I have an issue in my EXTRACT statement. I'm trying to use escapeCharacter : '\r\n\' as this is the pattern that keeps splitting my rows, but I can't seem to get the code to work. Is there a way to escape a pattern of characters in U-SQL?