elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

Double, double quoting #3878

Open bobbyhubbard opened 9 years ago

bobbyhubbard commented 9 years ago

While attempting to parse a perfectly legit CSV, we're getting exceptions like the following:

{:timestamp=>"2015-09-04T17:50:39.634000-0500", :message=>"Trouble parsing csv", :source=>"message", :raw=>"<P style=\"\"MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px\"\">&nbsp;</P>", :exception=>#<CSV::MalformedCSVError: Illegal quoting in line 1.>, :level=>:warn}

As you can see, our CSV field contains embedded HTML, and that HTML uses double quotes in its content. Each of those quotes is therefore written as two double quotes, so that it can be told apart from the quotes that open and close the field. The RFC explains it better. Per RFC4180 (paragraph 7),

If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.
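
To make the rule concrete, here is a minimal sketch using Ruby's standard `csv` library, the library the `CSV::MalformedCSVError` in the log comes from. The two-field record and the helper name `parse_rfc4180_line` are made up for illustration, and this does not show how the Logstash filter itself calls the parser.

```ruby
require 'csv'

# Hypothetical helper: parse one complete CSV record into an array of fields.
def parse_rfc4180_line(line)
  CSV.parse_line(line)
end

# Hypothetical record: the first field is enclosed in double quotes, and each
# literal double quote inside it is written as two double quotes.
record = '"<P style=""MARGIN-TOP: 0px"">&nbsp;</P>",42'

fields = parse_rfc4180_line(record)
puts fields[0]  # the doubled quotes come back as single literal quotes
puts fields[1]
```

When the record is handed to the parser whole, with the enclosing quotes in place, the doubled quotes are read as literal quotes inside the field.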

The logstash csv filter doesn't seem to account for the doubled double-quote scenario, so it ends the field prematurely and reports these malformed CSV errors. I honestly haven't had time to dig into the logstash code, but I think this is a bug.
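
As a way to reproduce the error outside Logstash, the sketch below feeds the `:raw` value from the log to Ruby's standard `csv` library directly. The constant and helper names are made up for illustration. It only shows how the Ruby parser treats that exact string on its own; it does not establish what the Logstash filter does internally or where the fault lies.

```ruby
require 'csv'

# The :raw value from the log above, with the inspect-style backslash escapes removed.
RAW_FRAGMENT = '<P style=""MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px"">&nbsp;</P>'

# Hypothetical helper: true when Ruby's CSV rejects the text as a complete record.
def malformed_csv?(text)
  CSV.parse_line(text)
  false
rescue CSV::MalformedCSVError
  true
end

# The string exactly as logged: it does not start with a quote, so the
# quotes inside it appear in an unquoted field.
puts malformed_csv?(RAW_FRAGMENT)

# The same text with enclosing quotes added, as a full RFC4180 quoted field.
puts malformed_csv?(%("#{RAW_FRAGMENT}"))
```

The logged string is rejected as it stands, while the same text wrapped in enclosing quotes is accepted. That may be worth checking against the full original input line, since the log shows only the value that reached the parser.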

This is forcing us to use the elasticsearch-csv-river when we'd rather move to logstash. Has this been identified before? Any suggested workarounds?

purefield commented 9 years ago

+1

eugeneduvenage commented 8 years ago

I am having the same issue with database errors that refer to objects by quoted identifiers. My input data looks like "some exception querying ""sometable"" occurred", which, as mentioned above, is valid CSV according to RFC4180 (paragraph 7).

brunocopelli commented 7 years ago

+1