WIP: allow users to manage a user-defined state while parsing rows

dashbitco / nimble_csv

A simple and fast CSV parsing and dumping library for Elixir

https://hexdocs.pm/nimble_csv

772 stars 51 forks source link

WIP: allow users to manage a user-defined state while parsing rows #77

Closed diasbruno closed 8 months ago

diasbruno commented 1 year ago

This PR allows user to define and control some state (different from the parse state) while parsing each line from the CSV.

Users can pass a state and a function that runs for each row.

New options are:

init_user_state - the state that goes along with the parse state
state_transform_function - function with arity 2 state and parse state (:line or :header) and it should return the next state

To avoid breaking change, for the moment, parse_string_with_state was introduced to keep the previous parse_* functions to run exactly as they were.

diasbruno commented 1 year ago

I'm moving this to "Ready for review" just so we can discuss if this is an ok idea.

diasbruno commented 1 year ago

Doc is pending and tests using streams.

josevalim commented 1 year ago

I am not sure this is how I would go. My suggestion is for the parse_stream function to return a NimbleCSV.Struct that implements the enumerable protocol, including the count function, and the count function itself is optimized. :)

diasbruno commented 1 year ago

"Rebranded" this PR to be more general than just count. But, counting can be a functionality baked in using this NimbleCSV.Struct you are suggesting.

So, I'll leave this PR for the user state and open a new one for the count.

For this PR, I'll give to the state_transform function the current parsed row, and, add the remaining tests and documentation.

Thanks, @josevalim.

josevalim commented 1 year ago

Generally speaking, I don't want to expose the parser state or its transformation. I am concerned both about performance or the possibility we will expose too much of the internals, making it impossible to refactor later without breaking user code.

diasbruno commented 8 months ago

It doesn't introduces much of an overhead (and it must be optional).

It just enhances the state with a user state (a pair of {internal state, user state}), but user is not allowed to modify the internal state.

This patch allows the package to stream each row, so the user can run aggregates and other operations as parser is stream lines. I can be an optimization is some cases.

I'll have a look at this patch this weekend, to finish the remaining things (docs, tests...).

diasbruno commented 8 months ago

The patch doesn't receives the parsed row, because it was written for the counter issue (#12). It must be generalized to perform not only counting, but any aggregating operation.

josevalim commented 8 months ago

Hi, I understand the user is not allowed to modify the internal state, but the issue is, if we expose it, users are going to read it, and potentially modify it. And I’m not comfortable going down this route. If this is essential to your solution, then I believe it is not a good fit for this library, which on purpose was designed to have a small API surface

diasbruno commented 8 months ago

No problem, @josevalim. I understand your point.

I'll close this for now and run a fork of this repo, but I'll make this streaming line idea. Later on, if make sense, we merge libraries (don't wanna keep a separare library just for this).

Thanks for taking time to answer.