In the latest version the degradation is even worse: it takes 210 seconds to read 2M rows and 50 columns.
Looking at the code, I can see that the degradation appears to come from creating CsvPropertyMapData and TypeConverterOptions every time GetField is called.
Good find.
The parser should be a lot faster in version 3.0.
Thanks for the reply. Do we think the speed of the CSV reader can match (or come very close to) the speed of StreamReader with simple line splits? Is the release date of v3.0 set?
Simple line splits? As in, split on commas and split on line breaks?
Yes, ReadLine() for reading lines and then using line.Split(delimiter) for the cell entries
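For reference, a minimal sketch of that approach, with "data.csv" as a placeholder path:
using System.IO;

using (var reader = new StreamReader("data.csv"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // Naive split: fast for simple files, but not a real CSV parser.
        var cells = line.Split(',');
    }
}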
If it were that easy, there would be no need for this library. :) There are some special cases that ruin that. What if there is a comma or a newline in a field? It needs to follow these rules. https://www.ietf.org/rfc/rfc4180.txt
Fair enough :) I guess my ambition exceeded my talent :) I will see whether I can find some bits here and there that can improve the performance.
The parser itself should be very fast. Try using CsvParser on its own. It's possible there is room for improvement there, but not much. CsvReader will definitely have areas for speed improvements.
@JoshClose, I was wondering if any consideration had been given to a multi-threaded approach?
How would that work? Not sure how you could chunk up a file to process it in multiple threads.
I'm away from my desk right now so can't test it, but could a possible starting point be:
Parallel.ForEach(File.ReadLines(...), row =>
{
    // read row action
});
A line in the file doesn't represent a CSV line though. A field could have a \r\n in it.
Good point; it would be interesting to run some analysis on what is consuming the time. It may be possible to apply parallelism deeper down, i.e. at the type conversion stage.
That stuff is generally really fast. There were definitely a few oversights from some of the changes made in 3.0 though. I think profiling some should bring most of those to light.
If there were a way to chunk up the file (I can't think of one), that would definitely make things faster on multi-core. You would somehow have to have multiple readers open on the file and match the ending of one chunk with the beginning of the next, since a field could cross that boundary. CsvParser itself is very fast. CsvReader has a good amount of room for speed improvements.
The only thing I can think of to make it multi-threaded is to create a single-producer, multi-consumer queue approach. One implementation of this kind of approach is PLINQ, where the first IEnumerable acts as the single producer. I think it's certainly possible, but it would not be feasible for anything other than GetRecords.
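Roughly, a sketch of that idea, where ParseRecords stands in for a sequential CSV parser and Row for the target record type (both hypothetical placeholders, not CsvHelper APIs):
using System.Collections.Generic;
using System.Linq;

class Row
{
    public int Id { get; set; }
    public int Value { get; set; }
}

static class ParallelRead
{
    // Single producer: one parser walks the file sequentially and yields
    // one string[] per CSV record (it has to stay sequential, since a
    // record can span physical lines).
    static IEnumerable<string[]> ParseRecords(string path)
    {
        // ... parse according to RFC 4180 and yield each record ...
        yield break;
    }

    // Multiple consumers: PLINQ fans the type conversion out across
    // threads while AsOrdered preserves the original record order.
    public static List<Row> Read(string path) =>
        ParseRecords(path)
            .AsParallel()
            .AsOrdered()
            .Select(fields => new Row
            {
                Id = int.Parse(fields[0]),
                Value = int.Parse(fields[1])
            })
            .ToList();
}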
@vSkullv Do you still have your speed test? I'm curious where the state is today. Are your fields quoted in your example? What are the types that are being used in GetField?
I'll have to test and see how much time is saved when reusing an object instead of newing it up each time.
So it looks like newing up the PropertyMapData is taking up a lot of time, but the bigger issue is that TypeConverterOptions.Merge is really slow.
I'm testing with a file with 2 million records and 50 columns of all ints.
Parsing takes 13 seconds.
Reading with a reusable PropertyMapData and setting a reusable TypeConverterOptions but no merging takes 35 seconds.
Reading with a reusable PropertyMapData and not setting the TypeConverterOptions and no merging takes 39 seconds.
Reading with a reusable PropertyMapData and setting a new TypeConverterOptions but no merging takes 1:01 minutes.
Reading normally takes 4:03 minutes.
Reusing the objects saves a lot of time. The merging is taking up the most time, so I'll need to find a good way to optimize that.
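For example, something along these lines (FieldOptions is a simplified stand-in for TypeConverterOptions, not the real class; the point is only that merging writes into an existing instance instead of allocating a new one per call):
using System.Globalization;

class FieldOptions
{
    public CultureInfo CultureInfo { get; set; }
    public NumberStyles? NumberStyle { get; set; }
    public string Format { get; set; }

    // Copy any non-null settings from the overriding options into this
    // instance in place, rather than building a new object.
    public void MergeFrom(FieldOptions overrides)
    {
        if (overrides.CultureInfo != null) CultureInfo = overrides.CultureInfo;
        if (overrides.NumberStyle != null) NumberStyle = overrides.NumberStyle;
        if (overrides.Format != null) Format = overrides.Format;
    }
}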
I can reuse the same PropertyMapData each time, which speeds it up a lot.
I can create a cache and reuse TypeConverterOptions per type, and that speeds it up more.
I can change TypeConverterOptions.Merge to not create a new options object, and that speeds it up also.
With these 3 improvements, it's down to 35 seconds from 4:03 minutes.
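The per-type cache is roughly this shape (an illustration, not the library's actual internals; it only assumes TypeConverterOptions has a parameterless constructor):
using System;
using System.Collections.Generic;
using CsvHelper.TypeConversion;

class OptionsCache
{
    readonly Dictionary<Type, TypeConverterOptions> cache =
        new Dictionary<Type, TypeConverterOptions>();

    // Create the options for a member type once and hand back the same
    // instance on every subsequent GetField call.
    public TypeConverterOptions GetOptions(Type memberType)
    {
        if (!cache.TryGetValue(memberType, out var options))
        {
            options = new TypeConverterOptions();
            cache[memberType] = options;
        }

        return options;
    }
}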
I'm going to continue profiling and finding more. This is just for the GetField<T> scenario, so I'm sure there are a lot of improvements on the GetRecord<T> scenario too.
GetRecords<T> only takes 23 seconds with no optimizations.
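For reference, the two scenarios look roughly like this (constructor signatures vary by CsvHelper version, and Foo is a stand-in record type with int properties):
using System.IO;
using CsvHelper;

class Foo
{
    public int Column1 { get; set; }
    public int Column2 { get; set; }
}

class Example
{
    // Field-by-field: one GetField<T> call per cell (the path profiled above).
    // Header handling omitted for brevity.
    static void ReadFields(string path)
    {
        using (var reader = new StreamReader(path))
        using (var csv = new CsvReader(reader))
        {
            while (csv.Read())
            {
                for (var i = 0; i < 50; i++)
                {
                    var value = csv.GetField<int>(i);
                }
            }
        }
    }

    // Whole records: GetRecords<T> materializes one typed object per row.
    static void ReadRecords(string path)
    {
        using (var reader = new StreamReader(path))
        using (var csv = new CsvReader(reader))
        {
            foreach (var record in csv.GetRecords<Foo>())
            {
                // consume record
            }
        }
    }
}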
The speed optimizations are out on NuGet now.
I put the optimizations in for WriteField too.
I noticed something interesting: when loading a CSV file (not mapped), I can see the average bytes read per second decrease over time. I created a very simple solution that reads CSV files using StreamReader and ReadLine. For a file with 2M rows and 50 columns, the simple solution takes 20 seconds whereas CsvReader takes 67 seconds. All 50 columns are integers (Int32).
Can you suggest a way of improving the performance?