getodk / central-frontend

Vue.js based frontend for ODK Central
https://docs.getodk.org/central-intro/
Apache License 2.0
32 stars 56 forks source link

Show error if selected CSV file contains null character #982

Closed matthew-white closed 3 months ago

matthew-white commented 3 months ago

In getodk/central#589, we briefly discussed what would happen if the user uses the entity upload modal to select a binary file. This PR tries to detect that case and show a better error message.

The "choose one" button to select a file limits the selection to .csv and .tsv files. However, a user could rename an Excel file to .csv to try to get it to pass. Also, the drop zone doesn't have a similar restriction, so a user could drag-and-drop an Excel file. This is what it looks like if an Excel file is dropped in:

You can see that Papa Parse is so confused that it concludes that the file is tab-delimited. šŸ˜… But you can see that Papa Parse isn't throwing an error here: it's returning binary data parsed as text. Here's an example of how Papa Parse is perfectly happy to accept a null character:

> Papa.parse('f\0o,bar')
{
  data: [ [ 'f\x00o', 'bar' ] ],
  errors: [],
  meta: {
    ...
  }
}

I also tried what happened if the header was valid, but the data after it contained a null character. It looked like the null character was inserted into the DOM, but wasn't displayed. The null character was sent to Backend: it looks like JSON is perfectly capable of encoding a null character. However, Backend returned a 500, and I saw a Postgres error in the Backend logs. The Postgres error was different depending on whether the null character was in the label or in a data property.

I found this article helpful in understanding how Postgres handles null characters: https://www.commandprompt.com/blog/null-characters-workarounds-arent-good-enough/. It sounds like the "unsupported Unicode escape sequence" error is associated with JSON data types.

All that said, I think it'd be nice to avoid showing binary data in Frontend as in the screenshot above. It'd be nice to show a better error instead. It doesn't seem completely trivial to determine whether a file is binary. However, I figure that most binary files will contain a null character. In this PR, I check for a null character when parsing the header, then check for a null character when parsing the rest of the data. Some corners of the Internet say that a null character is valid in a text file, but I can't think of a real use case.

If the header contains a null character, I show a simple error alert. I'm not rendering EntityUploadHeaderErrors because I don't want it to show the binary data from the header. Here's an example of selecting an Excel file:

Note that it's possible for a binary file not to contain a null character in its first line. That was true of a .pdf file I tried. In that case, EntityUploadHeaderErrors will be rendered (assuming the first line doesn't match the expected header).

If the file matches the expected header, but there's a null character in the data, then we will show the error in the "Data error" panel. It would've been more consistent to show the error at the top of the modal in an alert, but that's not easy to do with the way things are correctly plumbed. I also think this case is very unlikely.

What has been done to verify that this works as intended?

I wrote tests and tried it locally.

Why is this the best possible solution? Were any other approaches considered?

One thing I'm wondering is whether we should check for control characters other than a null character. It sounds like Postgres is uniquely displeased by a null character, which is maybe a reason to check for that specifically. All the binary files I tried contained a null character, and my understanding is that that's typical.

Instead of checking for a null character, we could take a totally different approach and check the MIME type of the file. However, apparently browsers determine the MIME type just by looking at the file extension (MDN). It wouldn't help in the case where a user renames an Excel file to .csv.

In a similar vein, we could use an npm package like file-type that actually reads a portion of the file to determine whether the file is a common binary format. However, I'd prefer to avoid an additional package if we don't really need it.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

The change to src/ is pretty small. I implemented the check as just an additional validation step. It shouldn't break existing code.

Before submitting this PR, please make sure you have: