Open oliverjanik opened 4 years ago
There's a constructor that lets you switch off the bom. See https://docs.microsoft.com/en-us/dotnet/api/system.text.utf8encoding.-ctor?view=netcore-3.1#System_Text_UTF8Encoding__ctor_System_Boolean_
True, it's easy for somebody to construct this themselves. But a devil's advocate argument: we define an internal property UTF8NoBom
in around a half-dozen places throughout our codebase. If we're doing this frequently then it stands to reason that our customers are doing it frequently as well.
True. I'm in favor of this as well. It was more me making sure people immediately see that there is an easy workaround when they land on this page via google.
@tarekgh Do you have any thoughts on this?
No objection here. I am just wondering, do we need to add properties like UnicodeNoBom
and UTF32NoBom
?
Minor issue, regarding naming. I think 'Bom' is ok but, is it better to use UTF8NoBOM
? we did a similar thing when we used ISO
and UTF
. Or use the full name UTF8NoByteOrderMark
.
I don't know if there's much evidence that people use UnicodeEncoding
or UTF32Encoding
anywhere near as much as UTF8Encoding
. But if we had such evidence then an accelerator on those might also make sense.
Unicode encoding usually used with System.IO. UTF32 is rarely used. We can wait not exposing any properties for these till we get a demand.
we define an internal property UTF8NoBom in around a half-dozen
Most of these internal places define this property as NoBom + throwing. Do we want the encoding returned by this property to be non-throwing or throwing? Do we want to have both variants (throwing and non-throwing)?
I think providing both options (throwing and non-throwing) would be the best. but the priority would be for non-throwing as users will have more control.
A simple UTF8NoBOM
property should be non-throwing for consistency with the other properties in the Encoding
class (UTF8, Unicode, BigEndianUnicode, etc...) which are all non-throwing.
My understanding is that a non-throwing instance is always used for reading (relaxed), whilst a throwing instance is always used for writing (strict), so it would be great if both options were available.
a non-throwing instance is always used for reading
That is not correct. For example, the new JsonReader uses throwing to reject invalid UTF8 data: https://github.com/dotnet/runtime/blob/master/src/libraries/System.Text.Json/src/System/Text/Json/Reader/JsonReaderHelper.Unescaping.cs#L39
From security point of view, it is always better to use throwing variant because of it prevents invalid data from getting through the system.
I won't disagree if the simple property is non-throwing for consistency with the other properties.
the new JsonReader uses throwing to reject invalid UTF8 data
Interesting, thanks. I read the comment "The high level goal is to be tolerant of encoding errors when we read and very strict when we write" in different files in the codebase (StreamReader/StreamWriter) and assumed it was a general principle.
Repathing to future because we're still not sure if including this is the right thing for the product. The workaround is really simple: instantiate the class yourself with whatever arguments you require. The only real place we'd be able to add significant value is if we were to take advantage of JIT devirtualization or some other perf optimization that folks outside the runtime couldn't gain.
Keep open please, I keep adding this property to every lib/project.
Seems like a common practice https://github.com/search?q=Utf8NoBom+language%3AC%23&type=code&l=C%23
The
System.Text.Encoding.UT8
encodes using BOM which is generally not recommended.Can we have a
UTF8NoBom
property for convenience ?