YarnSpinnerTool / YarnSpinner-Rust

The friendly tool for writing game dialogue in Rust
https://janhohenheim.itch.io/yarnspinner-rust-demo
Apache License 2.0
173 stars 15 forks source link

Yarn files with UTF-8 with BOM encoding cause compiler to panic or return errors #191

Closed JoshLee0915 closed 5 months ago

JoshLee0915 commented 5 months ago

Overview

I am working on a GDExtension for Yarn using the Rust lib since it looks to be more actively maintained and supported then the C++ one in the unreal project. While working on it I was unable to get some files to compile. I tracked it to if the yarn file is encoded with UTF-8 with BOM.

If a BOM is present in the file the compile would fail with either a panic: [panic] called 'Option::unwrap()' on a 'None' value or a compile error of Missing title header for node.

I created a branch for the error here with two test yarn files. HellWorld.yarn will throw the compile error if set with UTF-8 with BOM and VNExampleDialogue.yarn throws the panic.

https://github.com/JoshLee0915/YarnSpinner-GDExtension/tree/bug/utf-8-bom-issue

Steps to replicate:

  1. Create a Yarn file with the encoding UTF-8 with BOM
  2. Create a new YarnCompiler and set the following options a. compilation_type: FullCompilation b. library: Library::standard_library()
  3. call read_file to load the file with the UTF-8 with BOM encoding
  4. call compile to compile the yarn program. This should either cause a panic or throw a compile error

It seems like it should be a simple error to fix and would be happy to do it myself but would like to know first if its preferred to be throwing an error like with if you try to load a UTF-16 encoding or if we should just convert it to UTF-8 with no BOM on the fly if it is detected?

stargazing-dino commented 5 months ago

Just a passerby (I have no idea what BOM is) but I looked at the official spec and it has this to say

File Format Yarn files must be a UTF-8 text file and should not have the BOM set. File extension should be .yarn.

JoshLee0915 commented 5 months ago

Ah, thanks! I was not aware of that doc (its been a bit since I worked with Yarn Spinner). In that case the code checking to ensure the file is UTF-8 just needs to be updated to also check that it does not have a BOM

stargazing-dino commented 5 months ago

Yeah sounds reasonable :)

janhohenheim commented 5 months ago

@stargazing-dino thanks for answering the question :) FYI, a BOM (Byte Order Mark) is a magic number at the very beginning of a file to say whether a UTF-16 or UTF-32 file uses little or big endian. Software that only accepts UTF-8 usually uses the BOM as a cheap detection to see if the input is in a wrong format.

JoshLee0915 commented 5 months ago

Oh shoot, I forgot the main reason I asked how best to handle this. The bad yarn file in it is from the Yarn-Godot repo (the VN example file to be exact) which uses the same lib as the unity plugin does. It does not mess with them bc C# strips the BOM from strings when they are read in, but Rust does not: https://github.com/rust-lang/rfcs/issues/2428

So, while their docs say UTF-8 only its not a hard requirement and the core Yarn Lib does accept files with and without the BOM. To maintain parity with the library we should actually be stripping the BOM out of the string when detected.

janhohenheim commented 5 months ago

@JoshLee0915 Ooooh, I see. Good point!