getodk / central-frontend

Vue.js based frontend for ODK Central
https://docs.getodk.org/central-intro/
Apache License 2.0
32 stars 57 forks source link

Add CSV utility functions #956

Closed matthew-white closed 5 months ago

matthew-white commented 6 months ago

This PR makes progress on getodk/central#589, adding utility functions related to CSV files. parseCSVHeader() parses the header row of a CSV file, parseCSV() parses the data, and formatRow() displays a row in CSV format.

What has been done to verify that this works as intended?

This PR just adds utility functions; it doesn't make user-facing changes. I've added unit tests of the utility functions.

Why is this the best possible solution? Were any other approaches considered?

npm package

One important decision was around which npm package to use for CSV parsing. I first tried csv-parse, since that's what we use on Backend. However, I ran into a couple of issues:

I gave a different package Papa Parse a try instead, and I've been liking it. It's easy to stream the file, and I didn't have trouble parsing the header row. It also seems markedly faster than what I was doing with csv-parse (where I was calling text() on the File, then passing the resulting string to the parse() function). I tried parsing a file with 4 columns (3 properties + label):

# rows size csv-parse time Papa Parse time
100,000 8 MB 1.5s 0.4s
1,000,000 80 MB 14s 2s

Another little benefit of Papa Parse is the ability to easily abort parsing. I'm planning to set it up so that if the user leaves the upload modal, any ongoing CSV parsing is aborted.

Parsing the header and data separately

As mentioned above, there are separate functions for parsing the header vs. parsing the data. parseCSV() parses the data and expects to receive an array of column headers. That array comes from parsing the header row using parseCSVHeader().

I've set it up this way so that if there is an error in the header, the rest of the CSV file is not parsed. The upload modal will show errors about the header, and I want those to be shown as soon as the header has been parsed and validated.

That could still have been done in a single function, say, by passing a validateHeader option to the function. However, parseCSV() is already fairly complex, and I didn't want to make it even more so. parseCSVHeader() and parseCSV() also handle errors pretty differently. If Papa Parse finds errors in the header (of code MissingQuotes or InvalidQuotes), parseCSVHeader() will not return a rejected promise and will instead return the errors along with Papa Parse's best attempt to parse the header. That's because the upload modal will display both the errors and the header: the two are needed together. On the other hand, parseCSV() returns a rejected promise if there is an error with the data. In that case, the upload modal simply shows a danger alert. parseCSV() only returns data if it was successful. At that point, it may also return warnings about the data. parseCSVHeader() has no concept of warnings.

I don't think we lose much by separating parseCSVHeader() and parseCSV(). There end up being some conceptual differences between the two, so I think it's useful to do so.

Returning promises

Another part of the approach I took was to wrap Papa Parse in a promise. Papa Parse follows more of a callback pattern, but it is more convenient to work with promises. parseCSVHeader() and parseCSV() return promises, using Papa Parse callbacks under the hood. Also, Papa Parse provides its own API for aborting parsing, but in other places in Frontend (around requests), we use AbortSignals. parseCSVHeader() and parseCSV() will accept an optional AbortSignal, then use the AbortSignal to trigger the abort mechanism in Papa Parse. That way, most code can just use promises and AbortSignals, and for the most part, only src/util/csv.js needs to be concerned with the mechanics of Papa Parse.

Before submitting this PR, please make sure you have:

matthew-white commented 5 months ago

For some reason, I thought Papa Parse only returned quote errors for the header row. I'd tried passing Papa Parse a CSV string with an invalid quote after the header and didn't see an error. However, I must have missed something, because I'm now seeing that Papa Parse does return quote errors about the data after the header. I think that's a good development, as that's the better behavior. I've pushed a commit that specifically handles those errors, rejecting parsing with a better error message compared to before.