BurntSushi / rust-csv

A CSV parser for Rust, with Serde support.

Single record deserialization is 250 times slower than json #253

Closed. imbolc closed this issue 2 years ago.

imbolc commented 2 years ago

Maybe I'm doing something wrong, but it seems to work 250 times slower compared to serde_json. I've tried both the latest release (1.1.6) and master. Here's the code of the benchmark:

use serde::Deserialize;

const CSV_RECORD: &[u8] = "1,foo".as_bytes();
const JSON_RECORD: &str = r#"{ "id": 1, "text": "foo" }"#;

#[derive(Debug, Deserialize, PartialEq)]
struct Record {
    id: u32,
    text: String,
}

fn main() {
    let expected = Record {
        id: 1,
        text: "foo".into(),
    };
    assert_eq!(Record::from_csv(), expected);
    assert_eq!(Record::from_json(), expected);

    let times = 100_000;
    bench("csv", Record::from_csv, times);
    bench("json", Record::from_json, times);
}

impl Record {
    fn from_csv() -> Self {
        // Note: this builds a fresh reader (and its parser) on every call.
        let mut rdr = csv::ReaderBuilder::new()
            .has_headers(false)
            .from_reader(CSV_RECORD);
        rdr.deserialize().next().unwrap().unwrap()
    }
    }

    fn from_json() -> Self {
        serde_json::from_str(JSON_RECORD).unwrap()
    }
}

fn bench(name: &str, f: fn() -> Record, times: usize) {
    let start = std::time::Instant::now();
    for _ in 0..times {
        let _row = f();
    }
    let per_sec = times as f64 / start.elapsed().as_secs_f64();
    println!("{:>5}{:>10.0} / sec", name, per_sec);
}

And here's the result on my laptop:

$ cargo run --release -q --bin serde-performance
  csv     54865 / sec
 json  13123199 / sec
BurntSushi commented 2 years ago

Your characterization of your benchmark is misleading. You're not so much benchmarking deserialization as you are measuring the initialization of a parser and the deserialization of a tiny sample.

Given that your description is not in sync with your benchmark, it's unclear to me what it is you're concerned about.

imbolc commented 2 years ago

I'm trying to deserialize a single CSV line. What would be a more appropriate way of doing this?

imbolc commented 2 years ago

Just to give some context: I'm working with big CSV files by indexing line offsets. Then, after retrieving a particular line using this index, I need to deserialize it.

BurntSushi commented 2 years ago

Build the reader once and seek with it. xsv does this.

I'm on mobile or else I would write something more helpful.

CSV reader construction has never been optimized. It does all manner of things, including building a DFA. I'm sure there's room to optimize aspects of it (since I don't think anyone has tried), but I would expect you to hit an unsatisfying ceiling pretty soon. In theory we could expose something that uses an NFA to parse at the expense of parsing speed, but it's not clear to me that it's worth it.
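A minimal sketch of that approach (not from the thread itself): build the reader once, remember each record's Position during an indexing pass, then seek back and deserialize only the record you need. The data.csv path, the Record type, and the in-memory Vec index are illustrative assumptions.

use std::error::Error;
use std::fs::File;

use csv::{Position, ReaderBuilder, StringRecord};
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Record {
    id: u32,
    text: String,
}

fn main() -> Result<(), Box<dyn Error>> {
    // Build the reader once; its DFA is constructed here, not per record.
    let mut rdr = ReaderBuilder::new()
        .has_headers(false)
        .from_reader(File::open("data.csv")?);

    // Indexing pass: the position *before* reading a record is the
    // position of that record.
    let mut index: Vec<Position> = Vec::new();
    let mut record = StringRecord::new();
    loop {
        let pos = rdr.position().clone();
        if !rdr.read_record(&mut record)? {
            break;
        }
        index.push(pos);
    }

    // Later: jump straight to, say, the third record (assuming it exists)
    // and deserialize just that one.
    rdr.seek(index[2].clone())?;
    if rdr.read_record(&mut record)? {
        let rec: Record = record.deserialize(None)?;
        println!("{:?}", rec);
    }
    Ok(())
}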

BurntSushi commented 2 years ago

There are examples for seeking in the docs. See also the csv-index crate.
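For reference, a rough sketch of the csv-index route, following the RandomAccessSimple pattern from its docs; the data.csv path, the in-memory index buffer, and the record number are placeholders.

use std::error::Error;
use std::io;

use csv_index::RandomAccessSimple;

fn main() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .from_path("data.csv")?;

    // Indexing pass: write the byte offset of every record into `idx`.
    let mut idx = io::Cursor::new(Vec::new());
    RandomAccessSimple::create(&mut rdr, &mut idx)?;

    // Lookup: fetch the position of record 2 and seek straight to it.
    let mut idx = RandomAccessSimple::open(idx)?;
    let pos = idx.get(2)?;
    rdr.seek(pos)?;
    if let Some(result) = rdr.deserialize::<(u32, String)>().next() {
        println!("{:?}", result?);
    }
    Ok(())
}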

imbolc commented 2 years ago

Thank you, but I'm also bound to tokio::fs :)

imbolc commented 2 years ago

I found other people using the builder in performance-critical places, e.g. decoding query strings: https://github.com/samscott89/serde_qs/blob/main/examples/csv_vectors.rs#L53

imbolc commented 2 years ago

The issue seems to be with csv-core itself rather than with the builder:

test csv_builder ... bench:      16,003 ns/iter (+/- 914)
test csv_core    ... bench:      15,695 ns/iter (+/- 1,155)
test csv_line    ... bench:         240 ns/iter (+/- 14)
test serde_json  ... bench:         124 ns/iter (+/- 5)

Here's the benchmark code: https://github.com/imbolc/csv-line/blob/main/benches/csv-line.rs

BurntSushi commented 2 years ago

Yes, that's consistent with what I said. csv-core is where the DFA is built.
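A hedged sketch of amortizing that cost with csv-core directly: construct one csv_core::Reader (so the DFA is built a single time) and reuse it for every line. The parse_line helper, the buffer sizes, and the trailing-newline requirement are assumptions for illustration, not necessarily how the csv-line crate does it.

use csv_core::{ReadRecordResult, Reader};

fn parse_line(rdr: &mut Reader, line: &[u8]) -> Option<Vec<String>> {
    rdr.reset(); // clear parser state left over from the previous line
    let mut out = [0u8; 4096]; // field bytes, concatenated
    let mut ends = [0usize; 64]; // end offset of each field within `out`
    // `line` must end with '\n' so a single call yields a complete record.
    let (result, _read, _written, nend) = rdr.read_record(line, &mut out, &mut ends);
    if !matches!(result, ReadRecordResult::Record) {
        return None; // incomplete record, or scratch buffers too small
    }
    let mut fields = Vec::with_capacity(nend);
    let mut start = 0;
    for &end in &ends[..nend] {
        fields.push(String::from_utf8_lossy(&out[start..end]).into_owned());
        start = end;
    }
    Some(fields)
}

fn main() {
    let mut rdr = Reader::new(); // built once, reused for every line
    assert_eq!(
        parse_line(&mut rdr, b"1,foo\n"),
        Some(vec!["1".to_string(), "foo".to_string()])
    );
}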

imbolc commented 2 years ago

Ah, OK. So everything works as intended, and since I've already found a solution, I'll close this. Thank you :pray: