gchq / CyberChef

The Cyber Swiss Army Knife - a web app for encryption, encoding, compression and data analysis
https://gchq.github.io/CyberChef
Apache License 2.0

Misc: Input text encoding question #322

Open leafcutterant opened 6 years ago

leafcutterant commented 6 years ago

This is not really an issue, so apologies in advance.

When I enter text manually into the Input field and all I have is a hashing operation in the Recipe, what text encoding is used to interpret the text? And is a byte order mark used or not?

Also, is the encoding the same with other operations as well?

n1474335 commented 6 years ago

Hi @leafcutterant, good question!

The hashing operations all treat the input as UTF-8. The following recipe demonstrates this. Try disabling and enabling the 'Encode text' operation and you'll see that the hash output doesn't change.

https://gchq.github.io/CyberChef/#recipe=Encode_text('UTF-8%20(65001)')MD5()&input=2KfYrtiq2KjYp9ixINin2K7Yqtio2KfYsSDZhdix2K3YqNinINin2YTYudin2YTZhQ

The UTF BOM is not included.

UTF-8 is used by default for all CyberChef operations. There may be some edge cases where it is deliberately not used, but there are normally good reasons for that. Certainly for the hashing and encryption operations, UTF-8 should be assumed.
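To illustrate why the encoding and the BOM matter for a hash at all, here is a minimal Python sketch, independent of CyberChef. It hashes the same one-character string (a hypothetical sample, not taken from the recipe above) under three byte representations; the helper name `md5_hex` is made up for this example.

```python
import codecs
import hashlib

# Hypothetical sample input: a single non-ASCII character.
text = "\u00e1"  # a-acute

# Three byte representations of the same text.
utf8_bytes = text.encode("utf-8")            # two bytes, no BOM
latin1_bytes = text.encode("latin-1")        # one byte
utf8_with_bom = codecs.BOM_UTF8 + utf8_bytes  # BOM prepended

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of raw bytes as a hex string."""
    return hashlib.md5(data).hexdigest()

digests = {
    "utf-8": md5_hex(utf8_bytes),
    "latin-1": md5_hex(latin1_bytes),
    "utf-8 + BOM": md5_hex(utf8_with_bom),
}

for label, digest in digests.items():
    print(f"{label:12} {digest}")
```

A hash function only sees bytes, so the three digests differ even though the text is the same. Comparing a tool's output against digests like these is one way to check which representation it actually hashed.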

leafcutterant commented 6 years ago

@n1474335, thanks for the answer!

You gave me an idea, so I ran some tests, and I'm not sure it's UTF-8.

For a simplistic reference, I took the lowercase letter á (a-acute).

https://gchq.github.io/CyberChef/#recipe=Encode_text(%27UTF-8%20(65001)%27)MD5()&input=4Q

Encoding it to UTF-8 gives a different hash.

Also, encoding it to hex with CyberChef gives e1, which, according to Wikipedia, is the representation in Unicode, NCR and ISO 8859-1/2/3/4/9/10/14/15/16.

On the other hand, encoding the (right-to-left) first letter of your text (ا, Arabic alif) to hex gives d8 a7, which is the UTF-8 representation of the letter.

اá (Arabic alif + a-acute) gives d8 a7 c3 a1. The last two octets c3 a1 are UTF-8 for á, so it seems the default encoding is one of Unicode / NCR / ISO 8859-1/2/3/4/9/10/14/15/16, and that it switches to UTF-8 when the input contains a character that falls outside the default's encoding space.

I experienced the same with わá (Japanese hiragana wa + a-acute).

Could you confirm this?
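The observations above can be reproduced outside CyberChef with a short Python sketch. The function `encode_with_fallback` is a hypothetical model of the behavior described in this comment (single-byte encoding when every character fits, UTF-8 otherwise); it is not taken from CyberChef's source, and Latin-1 is an assumed stand-in for the ISO 8859 family.

```python
def hex_octets(data: bytes) -> str:
    """Format bytes as space-separated lowercase hex octets."""
    return " ".join(f"{b:02x}" for b in data)

def encode_with_fallback(text: str) -> bytes:
    """Hypothetical model: use Latin-1 if every character fits, else UTF-8."""
    try:
        return text.encode("latin-1")
    except UnicodeEncodeError:
        return text.encode("utf-8")

a_acute = "\u00e1"  # a-acute
alif = "\u0627"     # Arabic alif
wa = "\u308f"       # Japanese hiragana wa

print(hex_octets(a_acute.encode("latin-1")))           # e1
print(hex_octets(a_acute.encode("utf-8")))             # c3 a1
print(hex_octets(alif.encode("utf-8")))                # d8 a7

# The fallback model matches the three reported results.
print(hex_octets(encode_with_fallback(a_acute)))        # e1
print(hex_octets(encode_with_fallback(alif + a_acute)))  # d8 a7 c3 a1
print(hex_octets(encode_with_fallback(wa + a_acute)))    # e3 82 8f c3 a1
```

Under this model the byte sequence for á depends on what else is in the input, which would explain why hashing á alone differs from hashing its UTF-8 encoding.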

leafcutterant commented 6 years ago

Hey @n1474335, I believe this is an important aspect of CyberChef as a tool. Have you had time to investigate this behavior?

haavardw commented 3 years ago

Any progress or update on this?