The vanilla client will allow non-utf8 strings

C4K3 commented 5 years ago

I'm dealing with a player .dat file containing a book with non-utf8 content, generated by the vanilla client.

I don't know when it's possible, but apparently the vanilla client may allow non-utf8 data.

It'd be nice to somehow allow parsing this anyway.

atheriel commented 5 years ago

That is... an extremely annoying feature. Is it possible that you could parse it as a byte array? Or does it need to masquerade as a string?

DrHenchman commented 5 years ago

This is because NBT uses Modifed UTF8, not standard UTF-8 for encoding strings. You can find details of it on the following links:

You can think of this as encoding a string using UTF16 and then each u16 is then encoded individually into UTF8. On top of this, null characters (U+0000) are encoded as two bytes (C0 80). This results in no null bytes in the encoded form.

There are some crates out there for modified UTF8, however I can't speak for the quality of them. I would also be happy to provide a functioning implementation directly to this crate however I am fairly new to Rust so it may not be the most efficient implementation.

atheriel commented 5 years ago

Thank you for an excellent summary, @DrHenchman. In my view, there a couple ways we could try to fix this, some of which you are referring to indirectly:

Turn all strings in the API into byte vectors and require that users translate them into UTF-8 String types manually.
Turn all strings in the API into a new, modified-UTF-8-supporting type; or
Modify deserialization internally to handle the two-byte null encoding and parse the remainder into valid Rust strings (which must be UTF-8).

Of these, (3) is the only one that does not make the ergonomics of using the package significantly more painful. It would also be difficult to actually implement and may have fairly large performance implications.

But I do not think we should go down this path, for a few reasons.

The actual NBT specification explicitly gives the encoding of strings as "UTF-8", so it is Minecraft itself breaking the rules here. Of course, Minecraft is the implementation most users care about, but I have a strong suspicion that other existing NBT parsers/tools follow the spec and do not handle this case correctly, either. I looked at a few other popular parsers to confirm this.

Plus, I'm not sure we actually want to handle "string" data that explicitly embeds a null byte anyway. I cannot think of a legitimate reason player data would include it, and suspect that it is due to corruption, bizarre input error, or malice. Surely the correct thing to in @C4K3's case is remove the null byte from the input file.

As a result, I am going to close this as "won't fix" for now, although I am not 100% opposed to reversing this position if we encounter related problems in the future.

DrHenchman commented 5 years ago

Null bytes is one example of something which doesn't work and I totally get your justification for not handling it. However, there are plenty of other characters which will encode differently. If you don't modify the handling of null bytes, but still wish to be compatible with Minecraft https://en.wikipedia.org/wiki/CESU-8 is essentially what this is. The wiki page gives a great example of how this works.

Under the example section on this wiki page it actually gives a good example of how this differs. This is essentially for unicode characters which would normally be encoded as 4-bytes long in modified UTF-8 are encoded as 6-bytes.

Perhaps I misunderstand what this library is for. You indicated that it is a strict implementation of an NBT specification. I was under the impression it was created to decode Minecraft save files so if in doubt, between that Mojang has documented as the specification (I wouldn't be surprised if Mojang documented it incorrectly) and what Minecraft does, what Minecraft does is correct.

Of course, you are more than welcome to do whatever you wish with your crate.

atheriel commented 5 years ago

I completely misread the wiki article on modified UTF-8. I had thought it only treated null bytes as a special case, but I see now that it seems to treat all code points above 0xFFFF (the "basic multilingual plane") differently as well.

Incidentally, this means the real cause of @C4K3's strange input file is likely legitimate text outside the basic multilingual plane (perhaps in an obscure ancient African writing system), and not random null bytes as I had assumed.

Perhaps I misunderstand what this library is for. You indicated that it is a strict implementation of an NBT specification. I was under the impression it was created to decode Minecraft save files

To clarify: I was mostly complaining about the fact that they violated the original spec, not offering it up as a reason to avoiding fixing these problems. (And I remain extremely annoyed that Minecraft does this. The Unicode Consortium itself strongly discourages the use of CESU-8/MUTF-8, and yet Java uses it anyway.) But you are right to argue that making an effort to address this edge case is the correct thing to do. It looks like there is a nice crate for handling modified UTF-8 already. We should be able to incorporate it into the internal parser without too much trouble.

Honestly, I'm a bit surprised this hasn't surfaced more often with other popular NBT tools (including nbted, written by @C4K3), none of which seem to handle anything other than pure UTF-8.

MightyPork commented 5 years ago

I looked at the cesu8 crate and tested it a bit, and it seems to be reasonably well made. There's some interesting cases though:

ASCII and low UTF-8 are parsed efficiently using the std::str::from_utf8, with CESU-8 as a fallback.

As a consequence:

NUL byte is accepted in the input, alongside C0 80, both becoming \x00
Proper UTF-8 is accepted normally, alongside the CESU-8 surrogate encoding

This seems reasonable for NBT, as it will accept even improperly encoded strings from other tools (or NBT encoded by an older version of this crate)

Encoding always goes through the CESU-8 encoder, so \x00 becomes C0 80 and supplementary codepoints use the CESU-8 surrogate encoding. This should be compatible with Minecraft, but it's a change from the previous version of this crate.

If this is what you want and no work started yet, i'd be willing to try to make a PR for this.

atheriel commented 5 years ago

Yes, I think it makes sense to be permissive on input and correct on output. I've implemented the change in a small patch, 6da900b, and it seems to work correctly. Thanks for all of your contributions to this discussion.

@C4K3 Does your file decode correctly with the master branch?

C4K3 commented 5 years ago

Yep, seems to have done it. Thanks!

PistonDevelopers / hematite_nbt

The vanilla client will allow non-utf8 strings #39