georust / transitfeed

Public transit serializer/deserializer and manipulation library for Rust
Apache License 2.0

Move to CSV crate #3

Closed · teburd closed this 6 years ago

teburd commented 7 years ago

The latest csv crate seems to have incorporated what made quick-csv so fast, and it also enables serde to serialize/deserialize rows. Serde aside, the csv crate is far more popular and actively maintained, and is probably worth switching to.
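For context, the serde integration works roughly like this. A minimal sketch assuming a struct mirroring a few agency.txt columns; the field set, names, and error handling are simplified for illustration, not transitfeed's actual types:

```rust
use serde::Deserialize;

// Hypothetical row type mirroring a few agency.txt columns.
#[derive(Debug, Deserialize)]
struct Agency {
    agency_id: Option<String>,
    agency_name: String,
    agency_url: String,
    agency_timezone: String,
}

fn read_agencies(path: &str) -> Result<Vec<Agency>, csv::Error> {
    let mut reader = csv::Reader::from_path(path)?;
    // `deserialize` yields one Result<Agency, csv::Error> per row,
    // matching columns to struct fields by header name.
    reader.deserialize().collect()
}
```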

medwards commented 7 years ago

I had to throw out all my code, but this is pretty promising. However, I just ran cargo bench and the results aren't good:

before

running 9 tests
test bench_agency_iterator        ... bench:         964 ns/iter (+/- 100) = 117 MB/s
test bench_calendar_date_iterator ... bench:         874 ns/iter (+/- 1) = 53 MB/s
test bench_calendar_iterator      ... bench:       2,607 ns/iter (+/- 71) = 61 MB/s
test bench_frequency_iterator     ... bench:       3,623 ns/iter (+/- 73) = 95 MB/s
test bench_route_iterator         ... bench:       3,221 ns/iter (+/- 48) = 96 MB/s
test bench_shape_iterator         ... bench:         821 ns/iter (+/- 11) = 87 MB/s
test bench_stop_iterator          ... bench:       4,794 ns/iter (+/- 90) = 124 MB/s
test bench_stop_time_iterator     ... bench:       7,378 ns/iter (+/- 129) = 151 MB/s
test bench_trip_iterator          ... bench:       4,762 ns/iter (+/- 32) = 86 MB/s

after

running 11 tests
test bench_agency_iterator          ... bench:      21,758 ns/iter (+/- 416) = 5 MB/s
test bench_calendar_date_iterator   ... bench:      21,557 ns/iter (+/- 1,031) = 2 MB/s
test bench_calendar_iterator        ... bench:      23,141 ns/iter (+/- 114) = 6 MB/s
test bench_fare_attributes_iterator ... bench:      22,173 ns/iter (+/- 106) = 4 MB/s
test bench_fare_rules_iterator      ... bench:      22,441 ns/iter (+/- 203) = 4 MB/s
test bench_frequency_iterator       ... bench:      25,033 ns/iter (+/- 150) = 13 MB/s
test bench_route_iterator           ... bench:      24,054 ns/iter (+/- 696) = 12 MB/s
test bench_shape_iterator           ... bench:      21,194 ns/iter (+/- 186) = 3 MB/s
test bench_stop_iterator            ... bench:      26,219 ns/iter (+/- 208) = 22 MB/s
test bench_stop_time_iterator       ... bench:      31,360 ns/iter (+/- 94) = 35 MB/s
test bench_trip_iterator            ... bench:      26,242 ns/iter (+/- 60) = 15 MB/s

A couple of things could be at work here.

Overall though, it's a solid 300 fewer lines of code to maintain (400 if you measure from 0.2.0). It's also dirt simple to add support for new files now (fare_rules.txt was already supported, I just added tests; fare_attributes.txt just needed custom deserializers for a few fields).
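For illustration, the kind of custom field deserializer fare_attributes.txt needed might look like this; the `PaymentMethod` enum and `FareAttribute` struct here are assumptions, not transitfeed's actual types:

```rust
use serde::{Deserialize, Deserializer};

// GTFS encodes payment_method as 0 (pay on board) or 1 (pay before boarding).
#[derive(Debug)]
enum PaymentMethod {
    PaidOnboard,
    PrePaid,
}

fn deserialize_payment_method<'de, D>(de: D) -> Result<PaymentMethod, D::Error>
where
    D: Deserializer<'de>,
{
    match u8::deserialize(de)? {
        0 => Ok(PaymentMethod::PaidOnboard),
        1 => Ok(PaymentMethod::PrePaid),
        other => Err(serde::de::Error::custom(format!(
            "unknown payment_method: {}",
            other
        ))),
    }
}

#[derive(Debug, Deserialize)]
struct FareAttribute {
    fare_id: String,
    price: f64,
    currency_type: String,
    #[serde(deserialize_with = "deserialize_payment_method")]
    payment_method: PaymentMethod,
}
```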

medwards commented 7 years ago

I just tweaked bench_agency_iterator to use Reader::records instead of deserializing, to see if it's the serde support that's killing performance, and I still got similar benchmark numbers.
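Roughly what that tweak looks like; a sketch, with the counting and field access made up for illustration:

```rust
// Iterate raw StringRecords instead of deserializing into a struct,
// to isolate serde's share of the cost.
fn count_agencies_raw(path: &str) -> Result<usize, csv::Error> {
    let mut reader = csv::Reader::from_path(path)?;
    let mut count = 0;
    for record in reader.records() {
        let record = record?;
        let _name = record.get(1); // touch a field so the parse isn't a no-op
        count += 1;
    }
    Ok(count)
}
```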

Maybe I'm using it wrong, but csv is just fundamentally still slower than quick_csv.

Lemme know if you still want a pull request for this.

teburd commented 6 years ago

That is incredibly unfortunate. I was hoping to save on maintenance costs because serde would automatically match up the fields during serialization/deserialization. Taking a closer look today at why it's so much slower.

teburd commented 6 years ago

The biggest issue seems to be that in our benchmarks we are creating a new CSV reader each time we read the file. I kid you not when I say that constructing that new csv Reader type is the bane of these benchmarks. I'm updating some code to avoid that, as that's not really what I meant to benchmark to begin with.
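One way to restructure that (a sketch, not necessarily what ended up in the repo): load the file into memory once, outside b.iter, so each iteration measures parsing rather than opening and reading the file; the path and row handling here are assumptions:

```rust
#![feature(test)]
extern crate test;

// Hypothetical bench layout using nightly's built-in bench harness.
#[bench]
fn bench_stop_time_iterator(b: &mut test::Bencher) {
    // File I/O happens once, outside the measured closure.
    let data = std::fs::read_to_string("examples/stop_times.txt").unwrap();
    b.iter(|| {
        let mut reader = csv::Reader::from_reader(data.as_bytes());
        reader.records().map(|r| r.unwrap()).count()
    });
}
```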

teburd commented 6 years ago

[screenshot from 2017-11-05 18-41-53]

teburd commented 6 years ago

Even more concerning, the newer csv crate actually has a benchmark showing MBTA stop times parsing much faster:

https://github.com/BurntSushi/rust-csv/blob/master/benches/bench.rs#L22

medwards commented 6 years ago

I tried to make a version of GTFSIterator::new that accepted a reference to csv::Reader, but the deserializing iterator moves content and the other version wants lifetimes, so I gave up on trying to pull it out of the loop. (With this approach I'd also have to reset the Reader every iteration, but I think that's doable.)
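For reference, the lifetime tangle looks something like this (names hypothetical): the deserializing iterator mutably borrows the reader, so a struct holding it has to thread the reader's lifetime through, and the reader must be kept alive somewhere else:

```rust
use std::fs::File;
use serde::de::DeserializeOwned;

// The iterator borrows the Reader for 'r, so GtfsIterator can't own
// the Reader itself without self-referential gymnastics.
struct GtfsIterator<'r, T> {
    records: csv::DeserializeRecordsIter<'r, File, T>,
}

impl<'r, T: DeserializeOwned> GtfsIterator<'r, T> {
    fn new(reader: &'r mut csv::Reader<File>) -> Self {
        GtfsIterator { records: reader.deserialize() }
    }
}
```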

That is probably the right choice here, though, because yeah, it's all Reader-building time. Here is the MBTA deserialize-to-owned-String bench result:

test count_mbta_deserialize_owned_str      ... bench:   4,698,632 ns/iter (+/- 444,004) = 153 MB/s

If I hack up our benchmark to use the MBTA data, then we get:

test bench_stop_time_iterator       ... bench:   6,412,457 ns/iter (+/- 395,219) = 112 MB/s

So pretty comparable, and considering the extra work that has to be put into enum fields, I think adequate for now.

Ideally I'd like to pull the iterator building out entirely, but I don't see a way to do that (and apparently neither does BurntSushi ;)). Failing that: we're still slower than the quick_csv StopTimes iteration, but I don't think it's a totally fair comparison (TimePoint parsing was added, shape distance is optional), nor am I 100% confident my serde deserializers are optimal. I think it's still close enough, but it's also worse enough that I understand if you still have concerns.

I'm trying to resolve the other major regression (leading/trailing whitespace handling) upstream (BurntSushi/rust-csv#78), but I'm stuck there too XD
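(For what it's worth, later releases of the csv crate did grow a trim option on ReaderBuilder, which covers this case; it wasn't available at the time of this thread. A sketch:)

```rust
use csv::{ReaderBuilder, Trim};

// Strip leading/trailing whitespace from headers and fields while parsing.
fn trimming_reader(path: &str) -> Result<csv::Reader<std::fs::File>, csv::Error> {
    ReaderBuilder::new().trim(Trim::All).from_path(path)
}
```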

teburd commented 6 years ago

I think going this route is much nicer for other reasons; I think we're OK. I've simplified the benchmarks using a macro very similar to the csv crate's. Feel free to modify. You are a contributor now and should be able to do whatever you need directly on this repo if you like :+1:
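A rough sketch of that kind of macro, modeled on the csv crate's benches; the file names and row types here are assumptions:

```rust
// Generates one #[bench] per GTFS file; assumes nightly's `test` crate
// and row types like `Agency` defined elsewhere.
macro_rules! gtfs_bench {
    ($name:ident, $file:expr, $row:ty) => {
        #[bench]
        fn $name(b: &mut test::Bencher) {
            let data = std::fs::read_to_string($file).unwrap();
            b.iter(|| {
                let mut reader = csv::Reader::from_reader(data.as_bytes());
                reader.deserialize::<$row>().map(|r| r.unwrap()).count()
            });
        }
    };
}

gtfs_bench!(bench_agency_iterator, "examples/agency.txt", Agency);
```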

medwards commented 6 years ago

OK, it's pushed, along with some other changes I had queued. I couldn't find the macro you mentioned in any of your branches, so I'm assuming you'll apply it from your local copy.

Thank you for making me a contributor, BTW. Unless you want to change the process, I plan on submitting issues for features to get feedback and signpost where I'm going. I'll just go ahead with changes in general if I'm not conflicted about how to approach them. I'm a big fan of code reviews, so feel free to file issues post hoc.