JoshClose / CsvHelper

Library to help reading and writing CSV files
http://joshclose.github.io/CsvHelper/
Other
4.71k stars 1.06k forks source link

Modify how RecordHydrator works #1485

Open oicguy opened 4 years ago

oicguy commented 4 years ago

Is your feature request related to a problem? Please describe. Hydrating a record to a class is the slowest part of my application. This seems to stem from the fact that the RecordHydrator requests a CsvReader to do its job. However, when I take a closer look at the code, most of the uses of the CsvReader only use the Context (CsvReader.Context). There's only 1 statement that doesn't use reader.Context and that to test if it 'can read' (CsvReader.CanRead).

Now, i get that ReadingContext is like the state of the CsvReader so I'd imagine changing things would be difficult to do. And I'll say upfront that I only spent a half hour looking through this section to understand the problem and come up with a potential solution, so please forgive me for my naive explanation for a change.

What if instead of taking a CsvReader it took an IHydrationContext where IHydrationContext an interface containing the parts of ReadingContext needed for Hydration. ReadingContext could implement the interface but I don't believe it would be required. Assuming Read or ReadAsync has been called, the current record should be in the ReadingContext. We could then create an instance of the IHydrationContext from the ReadingContext and the values would then become independent of the CsvReader. The CsvReader is now free to move onto the next record while the RecordHydrator works on hydrating the previous record in a separate task.

I realize that because hydration becomes asynchronous and has partial copies of past ReadingContexts in memory that it will cause a memory spike. I believe that is for the developers to solve as I already have a solution for it but is specific to my application.

I'm not sure if this could also apply to GetRecordsAsync but it would enable a GetRecordAsync. The code using it could look like the following:

  while(reader.Read()){
    var task = reader.GetRecordAsync<MyClass>()
      .ContinueWith(t=> DAL.LoadData(t.Result));  //DAL is a Data Access Layer
    queue.Add(task);
  }
  await Task.WhenAll(queue.ToArray());

Without being required for a row to be fully hydrated, going through a file becomes as fast as just reading it. I'm willing to help if what I've suggested is at all doable. Please point out any flaws and let's work through them. Great job with this library thus far.

JoshClose commented 4 years ago

Interesting... Every row would needs it's own context. I'm re-writing the parser right now using MemoryPool, Memory, and Span, so I'll have to see how much it would slow things down to copy the memory so that it could be used later.

oicguy commented 4 years ago

After writing this I had the thought to see if I could execute the hydration code separately from using GetRecord<> and I now see how difficult a task that would be considering how tightly coupled everything is. But as I said, I'm willing to help if I can. Just let me know.

oicguy commented 4 years ago

I'll say once more thing about this. In a project I inherited not using CsvHelper, there was no model hydration of any sort so I had to implement it myself. It was nowhere near as fancy as the system CsvHelper has and was customized for exactly 1 object, so the comparison isn't exact. But initially I identified my hydration code as slow as well. What I ended you doing was making a Dictionary<string,string> of the row and calling a method that would do the hydration. Now because the method returned a task, the hydration was on a separate thread and file consumption sped up something ridiculous. A job that was estimated to take 12 hours to complete finished in 4... well the reading part did.

I'm not sure why the ReadingContext has so much 'stuff' on it but it feels like it's performing too many duties. What does HydrationActions have to do with parsing? Or ReusableMemberMapData? Maybe simply creating a HydrationContext that is focused on hydration and possibly dehydration you can then just pass the simple Dictionary<string,string> or whatever that makes that system faster and more asynchronous.

JoshClose commented 3 years ago

I'm releasing a new version soon that has a lot of speed improvements on both the parser and the reader. Maybe this is fast enough that it doesn't matter anymore. Otherwise I can look into options after this.

https://www.nuget.org/packages/CsvHelper/20.0.0-beta0020

JoshClose commented 3 years ago

Lots of things have changed. The speed is way faster, and the context is almost empty now. What are your thoughts on the current state?