airbreather / Cursively

A CSV reader for .NET. Fast, RFC 4180 compliant, and fault tolerant. UTF-8 only.
MIT License
38 stars 2 forks source link

Read sync input as IEnumerable<ImmutableArray<string>> #21

Open airbreather opened 4 years ago

airbreather commented 4 years ago

On CsvSyncInputBase:

public IEnumerable<ImmutableArray<string>> AsEnumerableFromUTF8()
{
    return AsEnumerableFromUTF8(UTF8FieldDecodingParameters.Default); // CsvReaderVisitorWithUTF8HeadersBase.DefaultMaxHeaderLength
}

public virtual IEnumerable<ImmutableArray<string>> AsEnumerableFromUTF8(UTF8FieldDecodingParameters fieldDecodingParameters)
{
    // ...
}
airbreather commented 4 years ago

Base implementation probably needs to spin up a thread so that the visitor methods can block while our consumer processes the results.

That said, I think every subclass can do a lot better by reimplementing their input reading logic.

airbreather commented 4 years ago

Hmm, a more flexible API would be to have just one virtual method that returns something that can be configured like how the input classes work.

I kinda want to limit the default max field length and let callers control it.

airbreather commented 4 years ago

Then again, I've come down elsewhere on the side of, "if you have some encoding other than UTF-8, then most helpers won't work for you". So only one parameter is needed.

airbreather commented 4 years ago

Ugh. The ideal would be IEnumerable<IEnumerable<ReadOnlySpan<char>>> so that the consumer can choose whether or not to do a .ToString() themselves, at no meaningful extra cost on top of us allocating the string.

But of course, ReadOnlySpan<char> is a ref struct, so I can't do exactly that. ReadOnlyMemory<char> can get captured and stored, so I can't safely pool allocations there.

Custom ref structs with GetEnumerator() can't be used with LINQ, and this feature comes down on the "productivity" side of things rather than the "bleeding-edge performance" side of things.

And IEnumerable<IEnumerable<string>> is slightly less productive than IEnumerable<string[]>, but the latter suggests a parameter to control the max number of fields in a record.

Ugh.

airbreather commented 4 years ago

So only one parameter is needed.

Ugh. Of course that's not true: we also have DecoderFallback.

Ugh.

airbreather commented 4 years ago

IEnumerable<ImmutableArray<string>> might make more sense than IEnumerable<IEnumerable<string>>?

Someone might pull the next element from the outer sequence before fully consuming the previous element from the inner sequence (which is really easy to do with .AsParallel), so the latter approach would require the outer enumerable's state machine to ensure that the current inner enumerable's records get buffered before moving on.