JoshClose / CsvHelper

Library to help reading and writing CSV files
http://joshclose.github.io/CsvHelper/

CSVReader performance degradation #637

Closed h-hasanov closed 7 years ago

h-hasanov commented 7 years ago

I noticed something interesting: when loading a CSV file (not mapped), I can see the average bytes read per second decrease over time. I created a very simple solution that reads CSV files using StreamReader and ReadLine. For a file with 2M rows and 50 columns, the simple solution takes 20 seconds whereas CsvReader takes 67 seconds. All 50 columns are integers (int32).

Can you suggest a way of improving the performance?
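
For reference, a hypothetical reproduction of the kind of unmapped read being timed; the file name, the 50-column loop, and the v3-era CsvReader(TextReader) constructor are assumptions:

```csharp
using System.IO;
using CsvHelper;

// Hypothetical reproduction of the unmapped read being timed:
// every field of every row is pulled out as int via GetField<int>.
using (var reader = new StreamReader("data.csv"))
using (var csv = new CsvReader(reader))
{
    while (csv.Read())
    {
        for (var i = 0; i < 50; i++)
        {
            var value = csv.GetField<int>(i);
        }
    }
}
```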

h-hasanov commented 7 years ago

In the latest version the degradation is even worse: it takes 210 seconds to read 2M rows and 50 columns.

Looking at the code, I can see that the degradation comes from the creation of CsvPropertyMapData and TypeConverterOptions every time we call GetField. We could create a field __propertyMapData and only update the type converter, index, and culture info rather than recreating it every time.
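
Something along these lines; a minimal sketch with simplified stand-ins, not CsvHelper's real classes:

```csharp
using System;
using System.Globalization;

// Simplified stand-in, not CsvHelper's real CsvPropertyMapData.
class PropertyMapData
{
    public int Index;
    public IFormatProvider CultureInfo;
    public Func<string, object> TypeConverter;
}

class Reader
{
    // Created once and mutated per call, instead of allocating a new
    // instance for every one of the 2M * 50 GetField calls.
    private readonly PropertyMapData __propertyMapData = new PropertyMapData();

    public T GetField<T>(string rawField, int index, Func<string, object> converter)
    {
        __propertyMapData.Index = index;
        __propertyMapData.TypeConverter = converter;
        __propertyMapData.CultureInfo = CultureInfo.InvariantCulture;
        return (T)__propertyMapData.TypeConverter(rawField);
    }
}
```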

JoshClose commented 7 years ago

Good find.

The parser should be a lot faster in version 3.0.

h-hasanov commented 7 years ago

Thanks for the reply. Do you think the speed of the CSV reader can match (or come very close to) the speed of StreamReader with simple line splits? Is the release date of v3.0 set?

JoshClose commented 7 years ago

Simple line splits? As in, split on commas and split on line breaks?

h-hasanov commented 7 years ago

Yes, ReadLine() for reading lines and then line.Split(delimiter) for the cell entries.
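
The naive approach being described is roughly this (file name and delimiter assumed):

```csharp
using System.IO;

// The naive read: fast, but see the next comment for why it is not
// a correct CSV parser.
using (var reader = new StreamReader("data.csv"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        var cells = line.Split(',');
        // ...convert and consume cells...
    }
}
```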

JoshClose commented 7 years ago

If it were that easy, there would be no need for this library. :) There are some special cases that ruin that approach. What if there is a comma or a newline in a field? It needs to follow these rules: https://www.ietf.org/rfc/rfc4180.txt
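
For example, a quoted field may legally contain the delimiter, which a naive Split gets wrong:

```csharp
// A quoted field may legally contain the delimiter:
var line = "\"Smith, John\",42";   // one name field and one int field
var cells = line.Split(',');
// cells is { "\"Smith", " John\"", "42" }: three pieces instead of two.
```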

h-hasanov commented 7 years ago

Fair enough :) I guess my ambition exceeded my talent :) I will see whether I can find some bits here and there that can improve the performance.

JoshClose commented 7 years ago

The parser itself should be very fast. Try using CsvParser on its own. It's possible there is room for improvement there, but not much. CsvReader definitely has areas for speed improvement.

GeekyEggo commented 7 years ago

@JoshClose, I was wondering if there had been any considerations towards a multi-threaded approach?

JoshClose commented 7 years ago

How would that work? Not sure how you could chunk up a file to process it in multiple threads.

GeekyEggo commented 7 years ago

I'm away from my desk right now so can't test it, but could a possible starting point be:

Parallel.ForEach(File.ReadLines(...), row =>
{
   // read row action
});

JoshClose commented 7 years ago

A line in the file doesn't represent a CSV line though. A field could have a \r\n in it.

GeekyEggo commented 7 years ago

Good point; it would be interesting to run some analysis on what is consuming time. It may be possible to run parallelism deeper down, i.e. at the type conversion stage.

JoshClose commented 7 years ago

That stuff is generally really fast. There were definitely a few oversights from some of the changes made in 3.0, though. I think some profiling should bring most of those to light.

If there were a way to chunk up the file (I can't think of one), that would definitely make things faster on multi-core. You would somehow have to have multiple readers open on the file and match the ending of one chunk with the beginning of the next, since a field can cross that boundary. CsvParser itself is very fast. CsvReader has a good amount of room for speed improvements.
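
For context, the usual chunking trick for plain line-oriented files looks something like the sketch below; it is exactly what quoted fields defeat, since a line boundary is not necessarily a record boundary in CSV.

```csharp
using System.IO;

// Seek to a byte offset, then scan forward to the next '\n' so each
// worker starts on a line boundary. A quoted field containing \r\n
// breaks this for CSV: a line boundary is not necessarily a record
// boundary.
static long AlignToNextLine(Stream stream, long offset)
{
    stream.Seek(offset, SeekOrigin.Begin);
    int b;
    while ((b = stream.ReadByte()) != -1 && b != '\n')
    {
    }
    return stream.Position;
}
```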

jamesbascle commented 7 years ago

The only thing I can think of to make it multi-threaded is a single-producer, multi-consumer queue approach. One implementation of this kind of approach is PLINQ, where the first IEnumerable is single-threaded, but after the AsParallel call the downstream consumers of that IEnumerable execute in parallel, each being allocated a T to work on in its own thread (technically a task, and that gets complicated).

I think it's certainly possible, but I believe it would only be feasible for the .GetRecords() call, as the other methods give the user too much control over when to advance the stream and get the results of the parsing/mapping process.
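
A minimal sketch of that producer/consumer shape with PLINQ; ParseRecords and the naive Split producer are illustrative stand-ins, not CsvHelper code:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ParallelRead
{
    // Hypothetical sequential producer yielding raw records. The naive
    // Split keeps the sketch self-contained; a real producer would be
    // the RFC 4180-aware parser.
    static IEnumerable<string[]> ParseRecords(string path)
    {
        foreach (var line in File.ReadLines(path))
            yield return line.Split(',');
    }

    // Parsing stays single-threaded; the CPU-bound mapping of each raw
    // record to T runs in parallel after AsParallel().
    public static IEnumerable<T> Read<T>(string path, Func<string[], T> mapRecord)
    {
        return ParseRecords(path)
            .AsParallel()
            .AsOrdered()            // keep records in file order
            .Select(mapRecord);
    }
}
```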

JoshClose commented 7 years ago

@vSkullv Do you still have your speed test? I'm curious where things stand today. Are your fields quoted in your example? What are the types being used in GetField?

I'll have to test and see how much time is saved when reusing an object instead of newing it up each time.

JoshClose commented 7 years ago

So it looks like newing up the PropertyMapData is taking a lot of time, but the bigger issue is that TypeConverterOptions.Merge is really slow.

I'm testing with a file with 2 million records and 50 columns of all ints.

Parsing alone takes 13 seconds. Reading the same file:

- Reusable PropertyMapData, setting a reusable TypeConverterOptions, no merging: 35 seconds.
- Reusable PropertyMapData, not setting the TypeConverterOptions, no merging: 39 seconds.
- Reusable PropertyMapData, setting a new TypeConverterOptions, no merging: 1:01.
- Reading normally: 4:03.

Reusing the objects saves a lot of time. The merging is taking up the most time, so I'll need to find a good way to optimize that.

JoshClose commented 7 years ago

I can reuse the same PropertyMapData each time, which speeds it up a lot.

I can create a cache and reuse a TypeConverterOptions per type, which speeds it up more.

I can change TypeConverterOptions.Merge to not create a new instance, which speeds it up further.

With these 3 improvements, it's down to 35 seconds from 4:03 minutes.

I'm going to continue profiling and finding more. This is just the GetField<T> scenario, so I'm sure there are a lot of improvements to be had in the GetRecords<T> scenario too.
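
A minimal sketch of the per-type cache idea; the TypeConverterOptions class here is an illustrative stand-in, not CsvHelper's actual internals:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Illustrative stand-in, not CsvHelper's actual options class.
class TypeConverterOptions
{
    public CultureInfo CultureInfo { get; set; }
    public string Format { get; set; }
}

static class TypeConverterOptionsCache
{
    // One options instance per member type, built once and reused so
    // nothing has to be allocated (or merged) per field.
    private static readonly Dictionary<Type, TypeConverterOptions> cache
        = new Dictionary<Type, TypeConverterOptions>();

    public static TypeConverterOptions For(Type type)
    {
        if (!cache.TryGetValue(type, out var options))
        {
            options = new TypeConverterOptions { CultureInfo = CultureInfo.InvariantCulture };
            cache[type] = options;
        }
        return options;
    }
}
```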

JoshClose commented 7 years ago

GetRecords<T> only takes 23 seconds with no optimizations.

JoshClose commented 7 years ago

The speed optimizations are out on NuGet now.

JoshClose commented 7 years ago

I put the optimizations in for WriteField too.