BurntSushi / rust-csv

A CSV parser for Rust, with Serde support.
The Unlicense

Disable line terminator config #331

Closed scMarkus closed 1 year ago

scMarkus commented 1 year ago

What version of the csv crate are you using?

csv: 1.2.2 csv-core: 0.1.10

Briefly describe the question, bug or feature request.

When using wtr.write_field(), the following code fails because wtr.write_record(None::<&[u8]>)?; has not been called, as documented. Adding that line not only adds the missing closing quote but also appends \n to the line. Both escape and terminator are configurable in csv::WriterBuilder, but is there a way to disable them entirely?

Include a complete program demonstrating a problem.

no closing quote

// Imports shared by all three examples.
use std::error::Error;
use csv::{Writer, WriterBuilder};

fn example1() -> Result<(), Box<dyn Error>> {
    let mut wtr = Writer::from_writer(vec![]);
    wtr.write_field("hallo \" world")?;

    let data = String::from_utf8(wtr.into_inner()?)?;
    assert_eq!(data, "\"hallo \"\" world\""); // fails: actual output is `"\"hallo \"\" world"` (no closing quote)
    Ok(())
}

additional line feed

fn example2() -> Result<(), Box<dyn Error>> {
    let mut wtr = Writer::from_writer(vec![]);
    wtr.write_field("hallo \" world")?;
    wtr.write_record(None::<&[u8]>)?;

    let data = String::from_utf8(wtr.into_inner()?)?;
    assert_eq!(data, "\"hallo \"\" world\""); // fails: actual output is `"\"hallo \"\" world\"\n"` (extra line feed)
    Ok(())
}

desired behavior

fn example3() -> Result<(), Box<dyn Error>> {
    let mut wtr = WriterBuilder::new()
        .terminator(None::<&[u8]>) // hypothetical API, or similar ???
        .from_writer(vec![]);

    wtr.write_field("hallo \" world")?;
    wtr.write_record(None::<&[u8]>)?;

    let data = String::from_utf8(wtr.into_inner()?)?;
    assert_eq!(data, "\"hallo \"\" world\""); // \"hallo \"\" world\"
    Ok(())
}

some background for this request

At the moment I am trying to contribute to the open source tool vector, specifically #17261. The tricky part is that the csv crate is only used as a line encoder, since the so-called framing is done independently by that tool.

https://github.com/vectordotdev/vector/blob/4b80c714b68bb9acc2869c48b71848d11954c6aa/lib/codecs/src/encoding/format/csv.rs#L78-L100

Therefore I am curious whether there is a way to disable the terminator. If not, I would be quite happy if the Terminator enum gained an additional special value implementing this behavior.

Regarding the escape() character: from what I am seeing, csv::QuoteStyle::Never would likely disable all quoting / escaping? This would have been my second request, to handle corner cases where such behavior is desired.

https://github.com/BurntSushi/rust-csv/blob/574ae1ff64693b42ae0ce153926d9a0a5d546936/csv-core/src/lib.rs#L174-L175

BurntSushi commented 1 year ago

I don't understand why you would not want to write a terminator. It is part of writing data in a CSV format.

BurntSushi commented 1 year ago

You're going to need to spell this out in more detail for me. I understand you've mentioned some tool called vector, but your motivation is described in terms of that project's vocabulary. It's high context and I do not have the time to obtain that context. So you're going to need to break this down at a more fundamental level before I can understand the actual problem being solved here.

scMarkus commented 1 year ago

@BurntSushi Agreed. Let me try to state the issue in more general terms, better highlighting my intention in the context of handling CSV data:

When serializing a stream of events in a larger context, the termination of such events may be handled in a different place. rust-csv can still be used for encoding a single event, but at the moment this forcibly adds a termination character to the event that is not part of that event's data.
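To make the separation concrete, here is a minimal sketch using only the standard library. It is not how rust-csv or vector work internally; `encode_field`, `encode_record`, and `frame` are hypothetical names, and the quoting rule is a simplified RFC 4180 style one assumed for illustration. The encoder produces one record with no terminator, and a separate framing step decides how records are delimited.

```rust
/// Quote a field if it contains a comma, double quote, CR, or LF,
/// doubling any embedded double quotes (simplified RFC 4180 style).
fn encode_field(field: &str) -> String {
    if field.contains(|c: char| matches!(c, ',' | '"' | '\r' | '\n')) {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}

/// Encode one event as a single CSV record, without a trailing terminator.
fn encode_record(fields: &[&str]) -> String {
    fields
        .iter()
        .map(|f| encode_field(f))
        .collect::<Vec<_>>()
        .join(",")
}

/// Hypothetical framer: owned by the surrounding pipeline, not by the
/// CSV encoder. It places the delimiter between records only.
fn frame(records: &[String], delimiter: &str) -> String {
    records.join(delimiter)
}

fn main() {
    let events = vec![
        encode_record(&["hallo \" world", "1"]),
        encode_record(&["plain", "2"]),
    ];
    // The framing step, not the encoder, inserts the newline.
    let framed = frame(&events, "\n");
    assert_eq!(framed, "\"hallo \"\" world\",1\nplain,2");
}
```

With this split, swapping the framing (for example to a length-delimited scheme) would not require touching the record encoder.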

As the grammar of RFC 4180 shows, the terminator is not part of a record but of a file. This is further emphasized by paragraph 2.2 of said RFC, which states that the last line does not need a terminator.

The ABNF grammar [[2](https://datatracker.ietf.org/doc/html/rfc4180#ref-2)] appears as follows:
   file = [header CRLF] record *(CRLF record) [CRLF]
   header = name *(COMMA name)
   record = field *(COMMA field)
   name = field
   field = (escaped / non-escaped)
   escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
   non-escaped = *TEXTDATA
   COMMA = %x2C
   CR = %x0D ;as per [section 6.1 of RFC 2234](https://datatracker.ietf.org/doc/html/rfc2234#section-6.1)

BurntSushi commented 1 year ago

The terminators are record terminators. They aren't file terminators. The last terminator being optional doesn't change that.

I still don't understand the need here. It doesn't make sense to me to use csv in a context where it writes some part of the format and something else writes another part of the format.

If you need this level of control, you can use csv-core and just not call Writer::terminator.

scMarkus commented 1 year ago

I tinkered around with csv-core and it seems to work out for me. Thanks a lot for that hint, @BurntSushi. Furthermore, I have come across the finish() method, which I am using now. If this were made available in the csv API, it would similarly solve my initial issue of each event in a stream effectively being the last CSV line every time. Anyhow, thanks for your time.