cake-contrib / Cake.AddinDiscoverer

Tool to aid with discovering information about Cake Addins
MIT License

Support unicode characters in description #202

Closed pascalberger closed 9 months ago

pascalberger commented 2 years ago

Addin discoverer currently does not support unicode characters in the description. See https://github.com/cake-build/website/pull/2255/files (the merged description has been fixed manually, but addin discoverer now wants to update it)

Jericho commented 2 years ago

Somehow I feel this is not the first time this has come up

Jericho commented 2 years ago

It's not that the discoverer does not support unicode characters per se. It's more that the way the file was manually edited is not compatible with the way the discoverer serializes and deserializes unicode characters.

The discoverer uses YamlDotNet to serialize objects to YAML and to deserialize the content of YAML files back into C# objects. This process works perfectly fine and unicode characters are absolutely supported, but it's important to understand that YamlDotNet will serialize a unicode character such as \U0001F996 like so: \\U0001F996 (notice that the backslash is escaped, which results in a "double" backslash). YamlDotNet handles this double backslash properly upon deserialization and converts it into the appropriate unicode character.

If the YAML file is manually modified by removing one of the backslashes, YamlDotNet will interpret the resulting content as containing the following individual characters: \, U, 0, 0, 0, 1, F, 9, 9 and 6, as opposed to a single unicode character.

So the better question is: why was this file manually modified? I am guessing it's because the process that uses the content of the YAML file to generate the documentation on the web site uses a different deserialization process and therefore handles unicode characters differently.

Jericho commented 2 years ago

By the way, here's code very similar to the logic in AddinDiscoverer to demonstrate that unicode characters are indeed supported:

using System.IO;
using System.Text;
using Xunit;
using YamlDotNet.RepresentationModel;

// This is the string that contains unicode characters
var originalString = "\U0001f996 Hello world \U0001f996";

// Let's create an object that we will subsequently serialize.
// This is a simplified version of the object that AddinDiscoverer serializes to the YAML files
var obj = new
{
    Description = originalString
};

// Serialize the object to a string
var sb = new StringBuilder();
using (var sw = new StringWriter(sb))
{
    var serializer = new YamlDotNet.Serialization.Serializer();
    serializer.Serialize(sw, obj);
}

// This is the string that AddinDiscoverer saves in the YAML files
var serializedContent = sb.ToString();

// Deserialize the content of the YAML file
var sr = new StringReader(serializedContent);
var yaml = new YamlStream();
yaml.Load(sr);

// Extract the content of the 'Description' node
var yamlRootNode = (YamlMappingNode)yaml.Documents[0].RootNode;
var description = yamlRootNode.Children[new YamlScalarNode("Description")].ToString();

// Assert that description matches the original string
Assert.Equal(originalString, description);

Jericho commented 2 years ago

I'm continuing my investigation and I just discovered that YamlDotNet's deserialization is capable of handling both the \\U... format and also the \U... format (see code sample below to see how I was able to demonstrate this).

This is great news: it means that manually modifying the YAML files to "fix" the unicode issue does not break the addin discoverer. The remaining issue is to figure out if we can alter YamlDotNet's serialization to use the "single backslash" format rather than the "double backslash" so that YAML content containing unicode characters can be used by the website documentation generation process as well as the discoverer.

using System.Diagnostics;
using System.IO;
using YamlDotNet.RepresentationModel;

// Local function to deserialize a string using YamlDotNet and return the content of a node called 'Description'
string DeserializeAndGetDescription(string serializedString)
{
    var sr = new StringReader(serializedString);
    var yaml = new YamlStream();
    yaml.Load(sr);
    var yamlRootNode = (YamlMappingNode)yaml.Documents[0].RootNode;
    var description = yamlRootNode.Children[new YamlScalarNode("Description")].ToString();
    return description;
}

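// In C# source, "\\U0001F996" is a backslash followed by U0001F996, so this YAML text contains the escaped form
var serializedWithYamlDotNet = "Description: \"\\U0001F996 Hello world \\U0001F996\"\r\n";
// In C# source, "\U0001F996" is the T-Rex character itself, so this YAML text contains the raw character
var manuallyModified = "Description: \"\U0001F996 Hello world \U0001F996\"\r\n";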

var descriptionWhenSerializedWithYamlDotNet = DeserializeAndGetDescription(serializedWithYamlDotNet);
var descriptionWhenManuallyModified = DeserializeAndGetDescription(manuallyModified);

Debug.Assert(descriptionWhenSerializedWithYamlDotNet == descriptionWhenManuallyModified);

By the way, as I hope I made clear earlier, this would not be an issue if the documentation generation process used the same deserialization method as the discoverer. This issue is solely due to the fact that we seem to use different methods of deserialization which appear not to be 100% compatible.
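
For illustration only, here is a minimal sketch of what a consumer that does not go through YamlDotNet would have to do to read the escaped form. It assumes the escape is a backslash, an uppercase U and exactly eight hex digits. `YamlEscapes.DecodeUnicode` is a hypothetical helper, not something that exists in the discoverer or in the website code, and it deliberately ignores every other YAML escape sequence (including an escaped backslash in front of the U).

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper: not part of AddinDiscoverer, YamlDotNet or the website.
public static class YamlEscapes
{
    // Matches a backslash, an uppercase U and exactly eight hex digits
    private static readonly Regex Escape32 = new Regex(@"\\U([0-9A-Fa-f]{8})");

    public static string DecodeUnicode(string text)
    {
        return Escape32.Replace(text, m =>
        {
            var codePoint = int.Parse(m.Groups[1].Value, NumberStyles.HexNumber, CultureInfo.InvariantCulture);

            // Leave the text untouched when the value is not a valid Unicode scalar value
            var isValid = codePoint >= 0 && codePoint <= 0x10FFFF && (codePoint < 0xD800 || codePoint > 0xDFFF);
            return isValid ? char.ConvertFromUtf32(codePoint) : m.Value;
        });
    }
}
```

Calling `YamlEscapes.DecodeUnicode` on a description that still contains the escaped T-Rex would give back the same text with the actual character in its place. Text without any such escape is returned unchanged.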

Jericho commented 2 years ago

You can decorate a class property with a YamlDotNet attribute called YamlMember. This allows you to set the ScalarStyle to somewhat control how the content of this property gets serialized. There are six possible scalar styles. I tested all six and none results in the exact format we want. Some of them result in the content being wrapped in double quotes with unicode characters formatted with a "double backslash", while other styles result in unicode formatted with a "single backslash" but the whole string is preceded by either >- or |- and the content is on a separate line.

See the sample in the screenshot below:

image
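
Since the screenshot is not reproduced here, the following is a rough sketch of the kind of test described above, not the exact code that was used. `AddinInfo` and `ScalarStyleDemo` are hypothetical names made up for this example. Only one of the six styles is shown; trying the others means changing the value passed to the attribute.

```csharp
using System;
using YamlDotNet.Core;
using YamlDotNet.Serialization;

// Hypothetical stand-in for the object that AddinDiscoverer serializes
public class AddinInfo
{
    // Ask YamlDotNet to emit this property as a double-quoted scalar.
    // The other values of the enum are Any, Plain, SingleQuoted, Literal and Folded.
    [YamlMember(ScalarStyle = ScalarStyle.DoubleQuoted)]
    public string Description { get; set; } = string.Empty;
}

public static class ScalarStyleDemo
{
    // Serialize a description with the style requested by the attribute above
    public static string Serialize(string description)
    {
        var serializer = new SerializerBuilder().Build();
        return serializer.Serialize(new AddinInfo { Description = description });
    }

    public static void Main()
    {
        // Print the YAML so the resulting format can be inspected by eye
        Console.WriteLine(Serialize("\U0001F996 Hello world \U0001F996"));
    }
}
```

The output is printed rather than asserted because the exact formatting is the thing under investigation and may differ between YamlDotNet versions.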

I'm about ready to give up.

@pascalberger

Jericho commented 9 months ago

Evidently, we are not the only ones struggling with serializing an object containing unicode characters to a YAML file. Someone asked a question in the YamlDotNet repo and the YamlDotNet author responded that, according to the YAML spec, only printable characters are allowed in a YAML file and any other character (such as the T-Rex character in our case) is not allowed and must be encoded.

This means that the person who created Cake.Tyrannoport.yml did not respect the YAML spec, and the fact that AddinDiscoverer wants to convert this character to its encoded value is actually the right thing to do.