brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

Loading text with BOM encoding #5268

Open philrz opened 1 month ago

philrz commented 1 month ago

tl;dr

A community zync user mentioned the following in a Zui context, though it seems addressing it would likely have to start at the Zed layer since that's what handles data loading. In their own words:

I noticed Zui does not properly ingest JSON files which have been saved using UTF8BOM encoding, while UTF8NOBOM works well. I was also asking myself whether it would useful to be able to specify the file encoding in Zui, à-la iconv.

Details

Repro is with Zui Insiders 1.17.1-insiders.20 which uses Zed commit 556f586.

I was not personally familiar with byte-order mark (BOM) but gave myself a quick crash course so I could repro the user's experience.

This first video shows an example in the Windows context. Notepad's default Save As encoding of "UTF-8" creates a text file that Zui loads without complaint. However, if the "UTF-8 with BOM" option is selected, Zed's auto-detect no longer recognizes the file as a format it can read.

https://github.com/user-attachments/assets/dabf3a7e-237b-4f6c-819c-597a70c0b07f
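For anyone else unfamiliar with the BOM, here is a minimal sketch of why those three extra bytes matter to a strict parser. It uses Go's standard `encoding/json` purely as a stand-in; Zed has its own readers and auto-detect logic, so this is only an illustration of the general failure mode and not a trace of Zed's code path. The helper name `parses` is made up for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// utf8BOM is the three-byte UTF-8 encoding of U+FEFF that editors such as
// Notepad prepend when saving as "UTF-8 with BOM".
const utf8BOM = "\xef\xbb\xbf"

// parses is a hypothetical helper that reports whether Go's standard JSON
// validator accepts the input. It stands in for a strict format detector.
func parses(input string) bool {
	return json.Valid([]byte(input))
}

func main() {
	doc := `{"a":1}`
	fmt.Println("without BOM:", parses(doc))
	fmt.Println("with BOM:   ", parses(utf8BOM+doc))
}
```

The same document is accepted without the prefix and rejected with it, since the first byte seen is `0xEF` rather than the start of a JSON value.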

The Wikipedia page on the byte-order mark also mentions that Google Docs adds the BOM when selecting the "Plain Text" download format, and indeed, this second video shows such a file causing the same error message we just saw with the file saved from Notepad.

https://github.com/user-attachments/assets/c1207904-9656-4071-98e5-cb109305a0a8

I'm not certain if it's feasible for Zed's auto-detect to recognize and react to the BOM without disturbing the ability to auto-detect the other formats supported by Zed. If not, I suppose we could add an explicit reader option indicating to expect the BOM and read in the specified format, per the user's iconv comment.
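As a rough idea of what "recognize and react" could look like for the UTF-8 case only, the sketch below wraps an `io.Reader`, peeks at the first three bytes, and discards them if they are the UTF-8 BOM, leaving every other input untouched. It uses only the Go standard library. The names `skipUTF8BOM` and `readAllSkippingBOM` are hypothetical, and where such a wrapper would sit relative to Zed's auto-detect is an assumption, not something taken from the Zed code.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// utf8BOM is the byte sequence EF BB BF (U+FEFF encoded as UTF-8).
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// skipUTF8BOM (hypothetical name) returns a reader that yields the same
// bytes as r, minus a leading UTF-8 BOM if one is present.
func skipUTF8BOM(r io.Reader) io.Reader {
	br := bufio.NewReader(r)
	// Peek does not consume input. On inputs shorter than three bytes it
	// returns what is available plus an error, which simply fails the match.
	if head, _ := br.Peek(len(utf8BOM)); bytes.Equal(head, utf8BOM) {
		_, _ = br.Discard(len(utf8BOM))
	}
	return br
}

// readAllSkippingBOM is a small demo helper built on skipUTF8BOM.
func readAllSkippingBOM(s string) string {
	b, err := io.ReadAll(skipUTF8BOM(strings.NewReader(s)))
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Printf("%q\n", readAllSkippingBOM("\xef\xbb\xbf{\"a\":1}"))
	fmt.Printf("%q\n", readAllSkippingBOM("{\"a\":1}"))
}
```

Because input without a BOM passes through byte-for-byte, a wrapper like this placed ahead of format detection should not change what the detector sees for files that load fine today.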

I don't know if it would do the trick, but I did some web searches and found https://pkg.go.dev/github.com/dimchansky/utfbom which sounds like it may be what's needed:

The package utfbom implements the detection of the BOM (Unicode Byte Order Mark) and removing as necessary. It can also return the encoding detected by the BOM.

If the returned encoding is accurate, then perhaps this is the info that would be needed to automatically react to the BOM when it's present and read in the encoded format.
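To make "the encoding detected by the BOM" concrete, here is a standard-library-only sketch of that kind of detection. It is not the utfbom package's API or implementation; `detectBOM` and `bomTable` are names invented for this example. The byte signatures themselves are the standard Unicode BOM encodings.

```go
package main

import (
	"bytes"
	"fmt"
)

// bomSig pairs an encoding name with its BOM byte signature.
type bomSig struct {
	name string
	sig  []byte
}

// bomTable lists the standard Unicode BOM signatures. The UTF-32 entries
// come first because the UTF-32LE signature begins with the UTF-16LE one.
var bomTable = []bomSig{
	{"UTF-32LE", []byte{0xFF, 0xFE, 0x00, 0x00}},
	{"UTF-32BE", []byte{0x00, 0x00, 0xFE, 0xFF}},
	{"UTF-8", []byte{0xEF, 0xBB, 0xBF}},
	{"UTF-16LE", []byte{0xFF, 0xFE}},
	{"UTF-16BE", []byte{0xFE, 0xFF}},
}

// detectBOM (hypothetical name) inspects the first bytes of a file and
// returns the encoding implied by its BOM plus the BOM length in bytes.
// It returns ("", 0) when no BOM is present.
func detectBOM(head []byte) (string, int) {
	for _, e := range bomTable {
		if bytes.HasPrefix(head, e.sig) {
			return e.name, len(e.sig)
		}
	}
	return "", 0
}

func main() {
	name, n := detectBOM([]byte("\xef\xbb\xbf{\"a\":1}"))
	fmt.Println(name, n)
	name, n = detectBOM([]byte("{\"a\":1}"))
	fmt.Printf("%q %d\n", name, n)
}
```

One caveat with any BOM-based detection: the bytes `FF FE 00 00` are treated as UTF-32LE here, but they could also be a UTF-16LE BOM followed by a NUL character, so the result is a strong hint rather than a guarantee.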

Per the iconv comment, https://github.com/djimenez/iconv-go also looks like it might be relevant.

This issue also reminded me of #4348, though that one appears to be only about encoding in a non-BOM context.