Unicode - Githubissues

vkoylazov commented 7 years ago

Hello,

How are Unicode strings handled in Cryptomatte? Are object names assumed to be in any particular encoding, such as UTF-8? The specification doesn't mention anything in that regard.

Best regards, Vlado

jonahfriedman commented 7 years ago

Hi Vlado,

We've been talking about it internally and checking the tools. While we think that supporting UTF-8 is the obvious right thing to do, for the moment we can only really say that ASCII characters work in the tools. The Nuke implementation almost works with UTF-8: it works if you type in the name, but not if you key it with a color picker.

vkoylazov commented 7 years ago

OK, thanks. 3ds Max is fully Unicode and people do tend to use non-ASCII characters, so I need to know how to handle this case. For the moment I convert everything to UTF-8.

I have to check with Nuke though - maybe MBCS on Windows would work better. However MBCS would make the generated files non-portable...

Best regards, Vlado
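To illustrate the portability concern, here is a minimal sketch. It assumes Windows-1251 as a stand-in for the MBCS code page on a Cyrillic Windows system; the thread does not name a specific code page. The same name yields different bytes under each encoding, and the code-page bytes are misread on a machine with a different code page.

```python
# Assumption: cp1251 stands in for "MBCS" on a Cyrillic Windows install,
# and cp1252 for a Western European one. Neither is named in the thread.
name = "равнина"

utf8_bytes = name.encode("utf-8")    # identical on every machine
mbcs_bytes = name.encode("cp1251")   # depends on the active code page

print(utf8_bytes.hex(" "))
print(mbcs_bytes.hex(" "))

# The same code-page bytes read back under a different code page
# no longer spell the original name.
misread = mbcs_bytes.decode("cp1252")
print(misread)
```

Since the hash is computed over bytes, the two encodings would also produce different IDs for the same object name.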

From: jonahfriedman notifications@github.com Sent: Thursday, June 22, 2017 11:05:53 PM To: Psyop/Cryptomatte Cc: Vladimir Koylazov; Author Subject: Re: [Psyop/Cryptomatte] Unicode (#16)


jonahfriedman commented 7 years ago

Hi Vlado,

I have made a very minor change to the Nuke plugins that allows them to support UTF-8-encoded strings. It's in the sidecar_manifests branch. I'm also going to raise this point with the committee. Please feel free to test with that. The change is really minor and as far as I can tell it does the right thing.

I'm on vacation until July 6 so it's unlikely to be in an official release until at least then.

vkoylazov commented 7 years ago

It doesn't seem to work quite as expected. Attached is a file where the teapot and the plane have Unicode names. If I pick an object, it is correctly listed in the "Matte List" knob, but no mask is created. There is no problem if I pick the box, which has only ASCII characters.

[EDIT] I can't seem to be able to attach the file, so I uploaded it here: https://ftp.chaosgroup.com/vlado/crypto_utf8_test.0000.exr

acjones commented 7 years ago

Thanks for sending the image.

From what I can tell, the problem seems to be that the hashes computed by V-Ray aren't the same as the ones produced in the Nuke plugin. On the Nuke side, if a numerical ID is found in the manifest, the value stored in the matte list is the string, not the ID. When the keying expression is generated, it re-hashes the strings in the matte list.

The floating point value we're getting in the Nuke plugin for "равнина" is: -1.31926312124e-25

However, the ID value in the example image for the plane is: 5.55322013235e-17

The fact that "Box001" hashes the same way in both places suggests we must be doing the Unicode conversion differently, or hashing the resulting bytes differently. On my end the Unicode code points in the string are showing up as:

U+0440 U+0430 U+0432 U+043d U+0438 U+043d U+0430

In decimal: 1088 1072 1074 1085 1080 1085 1072

As far as I can tell, these are the intended code points, but that would be something to check.

The corresponding byte array I'm then sending to mmh3 is: d1 80 d0 b0 d0 b2 d0 bd d0 b8 d0 bd d0 b0

If I paste those into a UTF-8 decoder it also returns the string равнина.
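The same check can be reproduced in a few lines of Python 3. This is a small sketch and not part of the plugin code; the variable names are illustrative.

```python
name = "равнина"

# Code points, in the hex and decimal forms listed above.
code_points = [f"U+{ord(ch):04x}" for ch in name]
decimal_points = [ord(ch) for ch in name]

# UTF-8 bytes that would be handed to the hash function.
utf8_hex = name.encode("utf-8").hex(" ")

# Round trip: decoding the quoted byte array should give the name back.
decoded = bytes.fromhex("d1 80 d0 b0 d0 b2 d0 bd d0 b8 d0 bd d0 b0").decode("utf-8")

print(" ".join(code_points))
print(decimal_points)
print(utf8_hex)
print(decoded == name)
```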

In our case, we're using the pymmh3 pathway in the plugin, and then running it through the mm3hash_float function, returning the value -1.31926312124e-25. The integer result from pymmh3.hash is -1776070370.

Hopefully these intermediate values provide some clues as to why we're getting different results.
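For comparing the last step in isolation, the following sketch reinterprets a 32-bit integer hash as a single-precision float. The function name `hash_to_float` is hypothetical. The exponent adjustment is an assumption modeled on the Cryptomatte reference code, not a verbatim copy of it.

```python
import struct


def hash_to_float(hash_32: int) -> float:
    """Reinterpret a 32-bit hash as an IEEE 754 single-precision float."""
    bits = hash_32 & 0xFFFFFFFF        # signed result -> unsigned bit pattern
    exponent = (bits >> 23) & 0xFF
    if exponent in (0, 255):
        # Assumed safeguard: flip one exponent bit so the value is never
        # a denormal, an infinity, or NaN.
        bits ^= 1 << 23
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Integer hash quoted above for "равнина".
value = hash_to_float(-1776070370)
print(f"{value:.11e}")
```

If the printed value agrees with the float quoted above, any remaining mismatch must come from the encoding or hashing steps and not from the float conversion.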

vkoylazov commented 7 years ago

Thanks for verifying this. It turns out I had a bug in the hash code: signed char values were not handled correctly. After I fixed this, everything seems to work fine. The new OpenEXR file seems to work correctly: https://ftp.chaosgroup.com/vlado/crypto_utf8_testA.0000.exr
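The thread does not say where in the hash code the signed char values were mishandled. As one plausible illustration only, the sketch below is a pure-Python MurmurHash3 (x86, 32-bit) with a hypothetical `signed_char_bug` flag. The flag sign-extends the trailing bytes, as would happen in C if the tail were read through a `char*` and not an unsigned type. It shows why an ASCII name such as "Box001" is unaffected while a UTF-8 name with bytes of 0x80 or above is not.

```python
def _rotl32(x: int, r: int) -> int:
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32(data: bytes, seed: int = 0, signed_char_bug: bool = False) -> int:
    """MurmurHash3 x86 32-bit, returned as a signed 32-bit integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    nblocks = len(data) // 4

    def mix_k(k: int) -> int:
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        return (k * c2) & 0xFFFFFFFF

    # Body: full 4-byte little-endian blocks.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        h ^= mix_k(k)
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF

    # Tail: the remaining 1 to 3 bytes.
    tail = data[4 * nblocks:]
    if tail:
        k = 0
        for i, byte in enumerate(tail):
            if signed_char_bug and byte >= 0x80:
                byte -= 0x100          # hypothetical bug: sign extension
            k ^= (byte << (8 * i)) & 0xFFFFFFFF
        h ^= mix_k(k)

    # Finalization.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h - (1 << 32) if h & 0x80000000 else h


for name in ("Box001", "равнина"):
    raw = name.encode("utf-8")
    print(name, murmur3_32(raw), murmur3_32(raw, signed_char_bug=True))
```

With the flag off, the result for "равнина" can be compared against the integer hash -1776070370 reported earlier in the thread.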

acjones commented 7 years ago

Ah, glad to hear it's sorted. I think it'll probably be helpful to include a few examples of this kind in the specification, or at least as an addendum, to help developers confirm each step of the hash is working as intended. It seems especially important after introducing Unicode into the mix.

vkoylazov commented 7 years ago

Yes, some examples like this could be helpful. For example, in the beginning another issue that I had was with the Murmur hash implementation itself. We do have an existing implementation in the V-Ray SDK, but it seems different from the one you use for Cryptomatte, and it took me a while to figure it out. I also had to look at the alShaders code to figure out some of the details of the hash generation. It would be nice if all the information were available in the specification, preferably in the form of sample code.

jonahfriedman commented 7 years ago

Interesting that your MurmurHash was different. Was it a different version, or was it a different implementation of MurmurHash3?

Agreed about adding more information to the spec. Is this an exhaustive list of information to include?

- Exact version of MurmurHash3 used
- Example code calling the MurmurHash3 code from C++ with a char*
- Specify that chars are unsigned
- Example string, integer hash result, floating point hash result
  - With and without Unicode characters

jonahfriedman commented 7 years ago

UTF-8 is in the Nuke implementation and the specification. Also in the spec, the code examples have been modified to use unsigned char*, and a couple of hashing examples have been added to allow double checking results.

Psyop / Cryptomatte

Unicode #16