georust / transitfeed

Public transit serializer/deserializer and manipulation library for Rust
Apache License 2.0

Move to CSV crate #3

Closed · teburd closed this 6 years ago

teburd commented 7 years ago

The latest csv crate seems to have incorporated what made quick-csv so fast, and it also enables serde to serialize/deserialize rows. Serde aside, the csv crate is far more popular and actively maintained, and is probably worth switching to.
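For context, the serde integration works roughly like this. A minimal sketch assuming a struct mirroring a few agency.txt columns; the field set, names, and error handling are simplified for illustration, not transitfeed's actual types:

```rust
use serde::Deserialize;

// Hypothetical row type mirroring a few agency.txt columns.
#[derive(Debug, Deserialize)]
struct Agency {
    agency_id: Option<String>,
    agency_name: String,
    agency_url: String,
    agency_timezone: String,
}

fn read_agencies(path: &str) -> Result<Vec<Agency>, csv::Error> {
    let mut reader = csv::Reader::from_path(path)?;
    // `deserialize` yields one Result<Agency, csv::Error> per row,
    // matching columns to struct fields by header name.
    reader.deserialize().collect()
}
```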

medwards commented 7 years ago

I had to throw out all my code, but this is pretty promising. However, I just ran cargo bench and the results aren't good:

before

running 9 tests
test bench_agency_iterator        ... bench:         964 ns/iter (+/- 100) = 117 MB/s
test bench_calendar_date_iterator ... bench:         874 ns/iter (+/- 1) = 53 MB/s
test bench_calendar_iterator      ... bench:       2,607 ns/iter (+/- 71) = 61 MB/s
test bench_frequency_iterator     ... bench:       3,623 ns/iter (+/- 73) = 95 MB/s
test bench_route_iterator         ... bench:       3,221 ns/iter (+/- 48) = 96 MB/s
test bench_shape_iterator         ... bench:         821 ns/iter (+/- 11) = 87 MB/s
test bench_stop_iterator          ... bench:       4,794 ns/iter (+/- 90) = 124 MB/s
test bench_stop_time_iterator     ... bench:       7,378 ns/iter (+/- 129) = 151 MB/s
test bench_trip_iterator          ... bench:       4,762 ns/iter (+/- 32) = 86 MB/s

after

running 11 tests
test bench_agency_iterator          ... bench:      21,758 ns/iter (+/- 416) = 5 MB/s
test bench_calendar_date_iterator   ... bench:      21,557 ns/iter (+/- 1,031) = 2 MB/s
test bench_calendar_iterator        ... bench:      23,141 ns/iter (+/- 114) = 6 MB/s
test bench_fare_attributes_iterator ... bench:      22,173 ns/iter (+/- 106) = 4 MB/s
test bench_fare_rules_iterator      ... bench:      22,441 ns/iter (+/- 203) = 4 MB/s
test bench_frequency_iterator       ... bench:      25,033 ns/iter (+/- 150) = 13 MB/s
test bench_route_iterator           ... bench:      24,054 ns/iter (+/- 696) = 12 MB/s
test bench_shape_iterator           ... bench:      21,194 ns/iter (+/- 186) = 3 MB/s
test bench_stop_iterator            ... bench:      26,219 ns/iter (+/- 208) = 22 MB/s
test bench_stop_time_iterator       ... bench:      31,360 ns/iter (+/- 94) = 35 MB/s
test bench_trip_iterator            ... bench:      26,242 ns/iter (+/- 60) = 15 MB/s

A couple of things could be at work here.

Overall though, it's a solid 300 fewer lines of code to maintain (400 if you measure from 0.2.0). It's also dirt simple to add support for new files now (fare_rules.txt was already supported, I just added tests; fare_attributes.txt just needed custom deserializers for a few fields).
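For illustration, the kind of custom field deserializer fare_attributes.txt needed might look like this; the `PaymentMethod` enum and `FareAttribute` struct here are assumptions, not transitfeed's actual types:

```rust
use serde::{Deserialize, Deserializer};

// GTFS encodes payment_method as 0 (pay on board) or 1 (pay before boarding).
#[derive(Debug)]
enum PaymentMethod {
    PaidOnboard,
    PrePaid,
}

fn deserialize_payment_method<'de, D>(de: D) -> Result<PaymentMethod, D::Error>
where
    D: Deserializer<'de>,
{
    match u8::deserialize(de)? {
        0 => Ok(PaymentMethod::PaidOnboard),
        1 => Ok(PaymentMethod::PrePaid),
        other => Err(serde::de::Error::custom(format!(
            "unknown payment_method: {}",
            other
        ))),
    }
}

#[derive(Debug, Deserialize)]
struct FareAttribute {
    fare_id: String,
    price: f64,
    currency_type: String,
    #[serde(deserialize_with = "deserialize_payment_method")]
    payment_method: PaymentMethod,
}
```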

medwards commented 7 years ago

I just tweaked bench_agency_iterator to use Reader::records instead of deserializing, to see if it's the serde support that's killing performance, and I still got similar benchmark numbers.
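Roughly what that tweak looks like; a sketch, with the counting and field access made up for illustration:

```rust
// Iterate raw StringRecords instead of deserializing into a struct,
// to isolate serde's share of the cost.
fn count_agencies_raw(path: &str) -> Result<usize, csv::Error> {
    let mut reader = csv::Reader::from_path(path)?;
    let mut count = 0;
    for record in reader.records() {
        let record = record?;
        let _name = record.get(1); // touch a field so the parse isn't a no-op
        count += 1;
    }
    Ok(count)
}
```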

Maybe I'm using it wrong, but csv is just fundamentally still slower than quick_csv.

Lemme know if you still want a pull request for this.

teburd commented 6 years ago

That is incredibly unfortunate. I was hoping to save on maintenance costs because serde would automatically match up the fields during serialization/deserialization. Taking a closer look today at why it's so much slower.

teburd commented 6 years ago

The biggest issue seems to be that in our benchmarks we are creating a new CSV reader each time we read the file. I kid you not when I say that constructing that new csv Reader type is the bane of these benchmarks. I'm updating some code to avoid that, as that's not really what I meant to benchmark to begin with.
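One way to restructure that (a sketch, not necessarily what ended up in the repo): load the file into memory once, outside b.iter, so each iteration measures parsing rather than opening and reading the file; the path and row handling here are assumptions:

```rust
#![feature(test)]
extern crate test;

// Hypothetical bench layout using nightly's built-in bench harness.
#[bench]
fn bench_stop_time_iterator(b: &mut test::Bencher) {
    // File I/O happens once, outside the measured closure.
    let data = std::fs::read_to_string("examples/stop_times.txt").unwrap();
    b.iter(|| {
        let mut reader = csv::Reader::from_reader(data.as_bytes());
        reader.records().map(|r| r.unwrap()).count()
    });
}
```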

teburd commented 6 years ago

[screenshot from 2017-11-05 18-41-53]

teburd commented 6 years ago

Even more concerning, the newer csv crate actually has a benchmark showing MBTA stop times parsing much faster:

https://github.com/BurntSushi/rust-csv/blob/master/benches/bench.rs#L22

medwards commented 6 years ago

I tried to make a version of GTFSIterator::new that accepted a reference to csv::Reader, but the deserializing iterator moves content and the other version wants lifetimes, so I gave up on trying to pull it out of the loop. (With this approach I'd also have to reset the Reader every iteration, but I think that's doable.)
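For reference, the lifetime tangle looks something like this (names hypothetical): the deserializing iterator mutably borrows the reader, so a struct holding it has to thread the reader's lifetime through, and the reader must be kept alive somewhere else:

```rust
use std::fs::File;
use serde::de::DeserializeOwned;

// The iterator borrows the Reader for 'r, so GtfsIterator can't own
// the Reader itself without self-referential gymnastics.
struct GtfsIterator<'r, T> {
    records: csv::DeserializeRecordsIter<'r, File, T>,
}

impl<'r, T: DeserializeOwned> GtfsIterator<'r, T> {
    fn new(reader: &'r mut csv::Reader<File>) -> Self {
        GtfsIterator { records: reader.deserialize() }
    }
}
```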

That is probably the right choice here, though, because yeah, it's all Reader-building time. Here is the MBTA deserialize-to-owned-String bench result:

test count_mbta_deserialize_owned_str      ... bench:   4,698,632 ns/iter (+/- 444,004) = 153 MB/s

If I hack up our benchmark to use the MBTA data, then we get:

test bench_stop_time_iterator       ... bench:   6,412,457 ns/iter (+/- 395,219) = 112 MB/s

So pretty comparable, and considering the extra work that has to be put into enum fields, I think adequate for now.

Ideally I'd like to pull the iterator building out entirely, but I don't see a way to do that (and apparently neither does BurntSushi ;)). Failing that: we're still slower than the quick_csv StopTimes iteration, but I don't think it's a totally fair comparison (TimePoint parsing was added, shape distance is optional), nor am I 100% confident my serde deserializers are optimal. I think it's still close enough, but it's also worse enough that I understand if you still have concerns.

I'm trying to resolve the other major regression (leading/trailing whitespace handling) upstream (BurntSushi/rust-csv#78), but I'm stuck there too XD
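(For what it's worth, later releases of the csv crate did grow a trim option on ReaderBuilder, which covers this case; it wasn't available at the time of this thread. A sketch:)

```rust
use csv::{ReaderBuilder, Trim};

// Strip leading/trailing whitespace from headers and fields while parsing.
fn trimming_reader(path: &str) -> Result<csv::Reader<std::fs::File>, csv::Error> {
    ReaderBuilder::new().trim(Trim::All).from_path(path)
}
```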

teburd commented 6 years ago

I think going this route is much nicer for other reasons; I think we're OK. I've simplified the benchmarks using a macro very similar to the csv crate's. Feel free to modify. You are a contributor now and should be able to do whatever you need directly on this repo if you like :+1:
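A rough sketch of that kind of macro, modeled on the csv crate's benches; the file names and row types here are assumptions:

```rust
// Generates one #[bench] per GTFS file; assumes nightly's `test` crate
// and row types like `Agency` defined elsewhere.
macro_rules! gtfs_bench {
    ($name:ident, $file:expr, $row:ty) => {
        #[bench]
        fn $name(b: &mut test::Bencher) {
            let data = std::fs::read_to_string($file).unwrap();
            b.iter(|| {
                let mut reader = csv::Reader::from_reader(data.as_bytes());
                reader.deserialize::<$row>().map(|r| r.unwrap()).count()
            });
        }
    };
}

gtfs_bench!(bench_agency_iterator, "examples/agency.txt", Agency);
```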

medwards commented 6 years ago

OK, it's pushed, along with some other changes I had queued. I couldn't find the macro you mentioned in any of your branches, so I'm assuming you'll apply it from your local copy.

Thank you for making me a contributor, BTW. Unless you want to change the process, I plan on submitting issues for features to get feedback and signpost where I'm going. I'll just go ahead with changes in general if I'm not conflicted about how to approach them. I'm a big fan of code reviews, so feel free to file issues post hoc.