Hashing recipes use Latin-1 as default encoding

leader0001 commented 11 months ago

Hashing recipes like MD5 use Latin-1 as the default encoding. As a result, CyberChef produces different outcomes for certain inputs than most other hash generators, which use the UTF-8 standard.

Example:
echo -n "ä" | md5sum ---> 8419b71c87a225a2c70b50486fbee545
result with CyberChef ---> c15bcc5577f9fade4b4a3256190a59b0
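A minimal Python sketch of the discrepancy. Here hashlib is only a stand-in for md5sum and for CyberChef's MD5 operation (it is not CyberChef code), and the variable names are illustrative. It shows that the same character hashes differently depending on which encoding turns it into bytes first:

```python
import hashlib

text = "ä"

# UTF-8 encodes "ä" as two bytes; a UTF-8 terminal pipes these to md5sum.
utf8_digest = hashlib.md5(text.encode("utf-8")).hexdigest()

# Latin-1 encodes "ä" as one byte; this mirrors the behaviour reported here.
latin1_digest = hashlib.md5(text.encode("latin-1")).hexdigest()

print("UTF-8  :", utf8_digest)
print("Latin-1:", latin1_digest)
print("equal? :", utf8_digest == latin1_digest)
```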

Using UTF-8 as the default encoding is crucial due to its adoption as a standard. It ensures compatibility across various systems and supports a wide range of characters, promoting seamless communication and data consistency.

Dhruva21 commented 10 months ago

@leader0001, I have reproduced the issue and tried to dig deeper into it.

As you mentioned, running echo -n "ä" | md5sum gives 8419b71c87a225a2c70b50486fbee545.

From the terminal I checked the hexdump of the Latin character ä (and of other characters such as "你好"): echo -n "ä" | xxd -p gives c3a4, so the resulting byte array is [195, 164].

I added logging of the byte array just before the MD5 hash is calculated and found that the computed byte array is [228].
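The two byte arrays can be reproduced outside CyberChef. This is a small Python sketch with illustrative variable names, not the logging that was added to CyberChef:

```python
ch = "ä"

utf8_bytes = list(ch.encode("utf-8"))      # what xxd sees in a UTF-8 terminal
latin1_bytes = list(ch.encode("latin-1"))  # one byte, equal to the code point

print(utf8_bytes)            # [195, 164]
print(latin1_bytes)          # [228]
print(ord(ch), hex(ord(ch))) # 228 0xe4
```

So [228] is simply the Unicode code point of ä used directly as a byte value, while [195, 164] is its UTF-8 encoding.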

Using this site (https://www.cogsci.ed.ac.uk/~richard/utf-8.cgi) I looked up the hex dump and the decimal code point.

Below is the screenshot of that:

I also logged other characters whose ASCII values are in the range [0, 128) and compared the terminal output with the CyberChef log; both look okay to me.

Is my observation above correct? If so, can I proceed to add a fix for the resulting byte array, or is this behaviour expected?

leader0001 commented 10 months ago

@Dhruva21, thank you for your interest.

The issue lies in the encoding used when hashing. If, for example, we hash "ä" from its hexdump (which, as you mentioned, is "c3a4") with CyberChef using 'Hex' as the encoding, we get the same result as with 'echo -n "ä" | md5sum'...
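The same point as a short Python sketch, with hashlib as a stand-in for both tools: hashing the bytes given by the hexdump "c3a4" is the same as hashing the UTF-8 encoding of "ä".

```python
import hashlib

# Hash the bytes spelled out by the hexdump.
from_hex = hashlib.md5(bytes.fromhex("c3a4")).hexdigest()

# Hash the UTF-8 encoding of the character itself.
from_text = hashlib.md5("ä".encode("utf-8")).hexdigest()

print(from_hex == from_text)
```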

I'm attaching a screenshot for your reference. [Screenshot: 2024-01-08 112827]

leader0001 commented 10 months ago

Obviously, if we had obtained the hexdump using CyberChef, we would have a different result, "e4" instead of "c3a4," due to the encoding issue I mentioned.

[Screenshot: 2024-01-08 113813]

zb3 commented 7 months ago

@leader0001, you need to set the input encoding to UTF-8 instead of "Raw Bytes"

leader0001 commented 7 months ago

> @leader0001, you need to set the input encoding to UTF-8 instead of "Raw Bytes"

I know that with that change you get the same result as other MD5 generators. The point is that UTF-8 is the standard encoding: you should get the same result with the 'Raw Bytes' option as with the 'UTF-8' option. Right now, CyberChef is applying Latin-1 to the 'Raw Bytes' option for MD5 and other algorithms, and that is what is inconsistent.

a3957273 commented 7 months ago

It's not unreasonable to assume that the default 'raw bytes' encoding would use UTF-8. Can anyone find out where in the code base we're converting the input string to bytes? I can't seem to track it down.

zb3 commented 7 months ago

@a3957273, updateInputValue does the encoding, and the charset is stored in the InputWaiter. I believe changing the default charset there could also affect the default charset used when reading from a URL that has no "ienc" parameter.

https://github.com/gchq/CyberChef/blob/944810614a01849c8158dfeadb9139525bb6ad39/src/web/waiters/InputWaiter.mjs#L772

EDIT: to expand on that, values in the url are stored as base64[?] bytes, so when loading they're decoded here: https://github.com/gchq/CyberChef/blob/944810614a01849c8158dfeadb9139525bb6ad39/src/web/App.mjs#L530
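A rough Python sketch of that loading step. It assumes the fragment layout seen in the links in this thread (input=<base64>, optional ienc=<code page>), and the helper name decode_input_from_fragment is hypothetical. This is not the App.mjs code, only an illustration of why a missing "ienc" matters:

```python
import base64
from urllib.parse import parse_qs


def decode_input_from_fragment(fragment: str) -> tuple[bytes, int]:
    """Return (input bytes, ienc) from a CyberChef-style URL fragment."""
    params = parse_qs(fragment)  # also undoes percent-escapes such as %2B
    b64 = params.get("input", [""])[0]
    # Shared links may omit the "=" padding, so restore it before decoding.
    b64 += "=" * (-len(b64) % 4)
    raw = base64.b64decode(b64)
    # Assumption: a missing "ienc" falls back to the default charset (0).
    ienc = int(params.get("ienc", ["0"])[0])
    return raw, ienc


raw, ienc = decode_input_from_fragment("recipe=MD5()&input=w6Q&ienc=65001")
print(raw, ienc)
```

Here "w6Q" is the base64 form of the two bytes c3 a4, and 65001 is the UTF-8 code page.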

a3957273 commented 7 months ago

The tracked encoding is stored in inputChrEnc:

https://github.com/gchq/CyberChef/blob/944810614a01849c8158dfeadb9139525bb6ad39/src/web/waiters/InputWaiter.mjs#L64

0 is the default; any higher number appears to specify the code page to use. If 0 is set, we don't know the code page and so run strToArrayBuffer:

https://github.com/gchq/CyberChef/blob/944810614a01849c8158dfeadb9139525bb6ad39/src/core/Utils.mjs#L476-L489

By default this uses charCodeAt(). If any char code is higher than 255 we instead treat it as UTF-8 and call strToUtf8ArrayBuffer:

https://github.com/gchq/CyberChef/blob/944810614a01849c8158dfeadb9139525bb6ad39/src/core/Utils.mjs#L505-L520

Posting binary data, everything is fine. Posting UTF-8 data where every code point is below 128, or where at least one is above 255, everything is fine. Posting UTF-16 where all code points are below 256, everything is fine. The only flaw is pasting UTF-8 text in which all code points are below 256 and at least one is 128 or above.
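The heuristic described above can be sketched in Python as follows. The function name str_to_bytes_default is hypothetical, and Python's ord() works on code points whereas charCodeAt() returns UTF-16 code units, so this only approximates the linked JavaScript:

```python
def str_to_bytes_default(text: str) -> bytes:
    """Approximation of the default (charset 0) path described above."""
    codes = [ord(ch) for ch in text]
    if any(code > 255 for code in codes):
        # At least one character does not fit in a byte: encode as UTF-8.
        return text.encode("utf-8")
    # Every character fits in one byte: use the char codes directly.
    return bytes(codes)


print(str_to_bytes_default("abc"))   # plain ASCII, same in either encoding
print(str_to_bytes_default("你好"))  # code points above 255, so UTF-8
print(str_to_bytes_default("ä"))     # the flawed case: one byte, not UTF-8
```

The last call is the case from this issue: "ä" stays below 256, so it is never UTF-8 encoded.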

We can't realistically switch the default charset to UTF-8 because that breaks binary data. Maybe we just need to throw an explicit error in this case and say "Please select whether this is binary data or some other charset?".

zb3 commented 7 months ago

> If 0 is set, we don't know the code page and so run strToArrayBuffer

Oh, this was not what I expected when I saw the "Raw bytes" label.. I'd assume that "Raw bytes" was a choice, but what you are describing sounds more like "Autodetect"

EDIT: And I'd also assume that if one pasted data with codepoint >=256 while the "Raw bytes" option was on then an error would appear

a3957273 commented 7 months ago

Here's going to be my proposal:

Rename the default encoding to 'Autodetect'
Add in a 'raw bytes' encoding that is just that, raw bytes.

Bonus points if 'autodetect' includes what it's autodetected to (UTF-8 or raw bytes). Hopefully a user will then see that their ä is being parsed as 'raw bytes' and correct it if that wasn't their intention.
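One way the two proposed options could behave, as a hedged Python sketch. The names encode_raw_bytes and encode_autodetect and the label strings are made up for illustration, and the strict error follows the expectation raised earlier in the thread rather than any existing CyberChef behaviour:

```python
def encode_raw_bytes(text: str) -> bytes:
    """Strict 'raw bytes': every character must already be a byte value."""
    try:
        return text.encode("latin-1")  # maps code points 0..255 to bytes 1:1
    except UnicodeEncodeError as exc:
        raise ValueError("input has characters above 255; pick a charset") from exc


def encode_autodetect(text: str) -> tuple[bytes, str]:
    """Return the encoded bytes plus a label saying what was detected."""
    try:
        return encode_raw_bytes(text), "raw bytes"
    except ValueError:
        return text.encode("utf-8"), "UTF-8"


print(encode_autodetect("ä"))     # label tells the user it was read as raw bytes
print(encode_autodetect("你好"))  # label tells the user it was read as UTF-8
```

Surfacing the label is what would let a user notice that their ä was treated as a raw byte.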

zb3 commented 7 months ago

I think there's a deeper problem and so "we're not there yet".. (I'm confused so hopefully my understanding of the architecture is incorrect) This is related to my comments on #1735 where I mentioned that some operations use string as their input even when what they want is not codepoints, but bytes. On the other hand some operations rightfully use string as their input, for example the "to lowercase" operation naturally wants to operate on codepoints not on bytes.

And that's where the problem starts.. I don't understand how that's supposed to work except for handling the special case where the text pasted is already... utf8 encoded, which means the... "input" encoding is set to "Latin1" (what I'd imagine "raw bytes" to be).

Let me explain.. the way I see is that we always read codepoints from the input (as opposed to bytes). I know that if the (new, after changes) encoding is "raw bytes" then these would be used directly as byte values - I can understand this. I also know that if the encoding is set to utf8, those are first utf8 encoded so that bytes represent these codepoints we got from the input value (the difference visible if we use the "To Hex" operation).

But CyberChef supports more, right? So let me make a file with the string "ĄĄĄ" encoded in windows-1250 encoding and then drag it to CyberChef.

Sadly that - for me - doesn't work (CodeMirror calls readAsText, which assumes utf8 encoding, so characters are not read properly at all). After changing to windows-1250 I still see garbled text in the input, the only change being that it is then encoded so that this exact garbled text is represented in the windows-1250 encoding.. However, this is of course impossible: because windows-1250 is not able to represent those characters, the conversion fails - but silently - and NULL bytes are inserted instead.
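The failure can be imitated in Python, assuming the file really holds windows-1250 bytes and is read as UTF-8 text. Python raises an error where the comment above reports silently inserted NULL bytes, so this only illustrates the information loss, not CyberChef's exact output:

```python
original = "ĄĄĄ"
file_bytes = original.encode("cp1250")  # the bytes as stored in the file

# Reading the file as UTF-8 text cannot recover the characters.
garbled = file_bytes.decode("utf-8", errors="replace")

# The garbled text cannot be represented in windows-1250 either.
try:
    garbled.encode("cp1250")
    round_trip_ok = True
except UnicodeEncodeError:
    round_trip_ok = False

# Decoding the original bytes with the right charset does work.
recovered = file_bytes.decode("cp1250")

print(file_bytes, repr(garbled), round_trip_ok, recovered)
```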

But even assuming that file dropping would work correctly, then at best I'd not see the garbled text in the input anymore - I'd see my string "ĄĄĄ", BUT it'd then be encoded using the input encoding so when using "To Hex" we'd get the exact bytes that were in the file.. And now finally my main point/problem - the "to lowercase" operation is now unable to work, because the information about codepoints has been lost!

Internally I see that the data is represented as bytes - if a string inputType is expected, bytes are utf8 decoded and the assumption that these are either bytes or utf8 encoded text makes sense to me.
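A small Python sketch of why a string operation breaks once the bytes are no longer UTF-8. The function to_lower_case is a hypothetical stand-in for an operation that expects a string input and receives bytes that are decoded as UTF-8 first:

```python
def to_lower_case(data: bytes) -> str:
    """Hypothetical string operation: decode bytes as UTF-8, then lowercase."""
    return data.decode("utf-8", errors="replace").lower()


text = "ĄĄĄ"

from_utf8 = to_lower_case(text.encode("utf-8"))     # code points survive
from_cp1250 = to_lower_case(text.encode("cp1250"))  # code points are lost

print(from_utf8)
print(repr(from_cp1250))
```

With UTF-8 bytes the operation still sees the letters. With windows-1250 bytes it only sees undecodable data.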

So given that we get codepoints on the input, I see the following use cases for "input encoding":

when dragging a text file that is encoded in a specific encoding, input encoding could be set to that encoding but.. in this case we should "get" bytes from the file or at least latin1 string yet currently it completely fails because CodeMirror calls readAsText which assumes utf8 encoding
when pasting already utf8-encoded bytes, raw bytes makes sense, but the actual "input encoding" is utf8 (!!)
when pasting codepoints (the text displayed correctly), the actual "input encoding" would be... identity (in theory) or more practically utf16 since that's what charCodeAt would read

So I'm quite lost here.. but I'm still working on a pull request related to varints/protobuf, I guess I'll have more luck there :)

leader0001 commented 7 months ago

> We can't realistically switch the default charset to UTF-8 because that breaks binary data. Maybe we just need to throw an explicit error in this case and say "Please select whether this is binary data or some other charset?".

@a3957273 Could you give me an example of an error that would occur if you set 'UTF-8' as the default option? I've been doing tests, and when given a byte sequence that cannot be encoded in UTF-8, it simply encodes it as it would with the 'Raw Bytes' option.

[Screenshot: 2024-04-03 135506]

a3957273 commented 7 months ago

@zb3

> when dragging a text file that is encoded in a specific encoding, input encoding could be set to that encoding

I don't believe there's a way to get the encoding from a dropped file. The operating system has no concept of the 'encoding' of a file. The only way to determine it is via heuristics.

@leader0001 I'll be honest, my head is hurting with this issue. I could absolutely be wrong. I made a random 1024 byte file (not a valid ZIP):

random-file.zip

Treating 'md5sum' as the 'source of truth', I think the hash of this file is:

$ md5sum random-file 
805f52999697643dca04cd596f62f95d  random-file

If I encode it in CyberChef as 'Raw Bytes', I get the correct answer ([Good CyberChef](https://gchq.github.io/CyberChef/#recipe=MD5()&input=D1RivjXr4gr3g4lOlieeXyjfZSWQlHvw3yb/NwVMSXVd8LrwI454w40rdgsa8cD3stiyLE3qo5c4ZAP3E83AD9EkdP6DEaJ5vu8xaTAJpGqowu0CDZT80wu%2Bq0GAnRXSHIFNK45H0x1L6RKJf2dvvB12iTBBwRUhKzEg1c7XAQFXP7Mj9vRxVH5DyNnXLDXSFzETvzzGqZzb/hX/ghZ1h4dECZaDoUauZyGl5K16aTJAkaKuo2BdtlpkQuE59AdZajup8zJuMD3MzeflOk0lJ5lSQneQTYWxICeKxN5sTJN6NWbzSG4m/0O/2YXAHkqzHtO4DOg%2BX096pSryp0uxLIT5/u1gktNDUhReOnz8Qj4cuR2%2BtqzQSb8Ic74u4dEY2GS18dP0%2Bdzc/PeifXzIzSdUSrgdxyIOexhphpmQ0PlzIEZbTgc/AJ/mCdTz2OLAeyQjwifadz442s%2B5otzjCw2pEsGkktiZs9zR%2BaeaCbmnxco8ouiEmki0gIpPWAiA4Hj0kJ9mSYsdSo2Egw102VfmVBEANaa1%2B1VklCCg4R6bHhBbIjlteNyulp2vWOmnAnn9%2B9EBorGtaQxNH9EJ7QZFtcRoDcXQhWvH7S4rbYzrBCs0RInkf2odTeB5S/CxuGFxGWlz0s4%2BYwEmCE6NK7sJGKqAc5v8%2BP9v5ptoms4cNXP9fdop6hJSkOhC5%2BDCITZh7qBEY/JLPLNgG/KsbnjbPghuV8JDgM5T1uf24tr2xg9MOoxT9nUR41wuo/aFLLwXpS8HX907/74gJ696hm7/3EtdLUDkv15Uh3WGT1Kwfgi5cDrNyvqWftSi2uG8XlUXVES/jrwAbFNxKvbzsTYk6Xi5TkZwTLbLHchIwpUg1DMsCQbpTsmGofSDw3SLI/EHt7u7X64xKi4ytdHQ8BwzkEx7AmXmOx7zK/oPFG7MWW9ANCLjtY8xcy2KdW4wtXkR0eqQDDHXJdImn4M0mGASgNvo6st3VFV5NKKwJYbiIu6480SRE59g44cccswo1Gon1ZIymGm4Yfjbmo0f796KfuMD%2BoI2E5Z6o0HkqLHBj9JeJNAtS0GYtiJijfEyjRw9TRbS/fxwIy7JOwS4KBNiXeNfZaflvHsxnWl5jd%2BwMOcHHNOS6hpSAPlSwaBbsni0qcaHeOY3NQd2U4aDH0UlfkJD8k5QNUtFn19AW%2BKDgV6zdVIufSJrHD9it0ezTmfglpNIHJ4d4U4%2BWMIqL7uHP8hi2ZE2sVPzjfmxK9QjOeN1Nx1qf%2BTuNpz39vPRjo3lvS05vUTFHywiz%2B2OREJxUTKx%2B02k5i8/%2BF%2B19SKiB7bveQUoy7IdnhkUqR31O4NV5YA/B/4JXBeUcUGWEA)). 
If I encode it in CyberChef as 'UTF-8', I get the wrong answer ([Bad CyberChef](https://gchq.github.io/CyberChef/#recipe=MD5()&input=D1Riwr41w6vDogrDt8KDwolOwpYnwp5fKMOfZSXCkMKUe8Oww58mw783BUxJdV3DsMK6w7Ajwo54w4PCjSt2CxrDscOAw7fCssOYwrIsTcOqwqPClzhkA8O3E8ONw4APw5EkdMO%2BwoMRwqJ5wr7DrzFpMAnCpGrCqMOCw60CDcKUw7zDkwvCvsKrQcKAwp0Vw5IcwoFNK8KOR8OTHUvDqRLCiX9nb8K8HXbCiTBBw4EVISsxIMOVw47DlwEBVz/CsyPDtsO0cVR%2BQ8OIw5nDlyw1w5IXMRPCvzzDhsKpwpzDm8O%2BFcO/woIWdcKHwodECcKWwoPCoUbCrmchwqXDpMKtemkyQMKRwqLCrsKjYF3CtlpkQsOhOcO0B1lqO8Kpw7MybjA9w4zDjcOnw6U6TSUnwplSQnfCkE3ChcKxICfCisOEw55sTMKTejVmw7NIbibDv0PCv8OZwoXDgB5KwrMew5PCuAzDqD5fT3rCpSrDssKnS8KxLMKEw7nDvsOtYMKSw5NDUhReOnzDvEI%2BHMK5HcK%2BwrbCrMOQScK/CHPCvi7DocORGMOYZMK1w7HDk8O0w7nDnMOcw7zDt8KifXzDiMONJ1RKwrgdw4ciDnsYacKGwpnCkMOQw7lzIEZbTgc/AMKfw6YJw5TDs8OYw6LDgHskI8OCJ8Oadz44w5rDj8K5wqLDnMOjCw3CqRLDgcKkwpLDmMKZwrPDnMORw7nCp8KaCcK5wqfDhcOKPMKiw6jChMKaSMK0woDCik9YCMKAw6B4w7TCkMKfZknCix1Kwo3ChMKDDXTDmVfDplQRADXCpsK1w7tVZMKUIMKgw6EewpseEFsiOW14w5zCrsKWwp3Cr1jDqcKnAnnDvcO7w5EBwqLCscKtaQxNH8ORCcOtBkXCtcOEaA3DhcOQwoVrw4fDrS4rbcKMw6sEKzREwonDpH9qHU3DoHlLw7DCscK4YXEZaXPDksOOPmMBJghOwo0rwrsJGMKqwoBzwpvDvMO4w79vw6bCm2jCmsOOHDVzw719w5opw6oSUsKQw6hCw6fDoMOCITZhw67CoERjw7JLPMKzYBvDssKsbnjDmz4IblfDgkPCgMOOU8OWw6fDtsOiw5rDtsOGD0w6woxTw7Z1EcOjXC7Co8O2woUswrwXwqUvB1/DnTvDv8K%2BICfCr3rChm7Dv8OcS10tQMOkwr9eVMKHdcKGT1LCsH4IwrlwOsONw4rDusKWfsOUwqLDmsOhwrxeVRdURMK/wo7CvABsU3Eqw7bDs8KxNiTDqXjCuU5GcEzCtsOLHcOISMOCwpUgw5QzLAkGw6lOw4nChsKhw7TCg8ODdMKLI8OxB8K3wrvCu1/CrjEqLjLCtcORw5DDsBwzwpBMewJlw6Y7HsOzK8O6DxRuw4xZb0A0IsOjwrXCjzFzLcKKdW4wwrV5EcORw6rCkAwxw5clw5Imwp/CgzTCmGASwoDDm8Oow6rDi3dUVXk0wqLCsCXChsOiIsOuwrjDs0TCkRPCn2DDo8KHHHLDjCjDlGonw5XCkjLCmGnCuGHDuMObwprCjR/Dr8Oewop%2Bw6MDw7rCgjYTwpZ6wqNBw6TCqMKxw4HCj8OSXiTDkC1LQcKYwrYiYsKNw7Eywo0cPU0Ww5LDvcO8cCMuw4k7BMK4KBNiXcOjX2XCp8Olwrx7McKdaXnCjcOfwrAww6cHHMOTwpLDqhpSAMO5UsOBwqBbwrJ4wrTCqcOGwod4w6Y3NQd2U8KGwoMfRSV%2BQkPDsk5QNUtFwp9fQFvDosKDwoFewrN1Ui59ImscP2LCt0fCs05nw6DClsKTSBzCnh3DoU4%2BWMOCKi/Cu8KHP8OIYsOZw
pE2wrFTw7PCjcO5wrErw5QjOcOjdTcdan/DpMOuNsKcw7fDtsOzw5HCjsKNw6XCvS05wr1Ew4UfLCLDj8Otwo5EQnFRMsKxw7tNwqTDpi8/w7hfwrXDtSLCogfCtsOveQUow4vCsh3CnhkUwqkdw7U7woNVw6XCgD8Hw74JXBfClHFBwpYQ&ienc=65001)).

Based off of this, I'm fairly confident we can't just default to 'UTF-8' in all cases.
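The effect can be reproduced without the random file. This Python sketch uses every byte value once as stand-in binary data and treats the input box as one character per byte, which is an assumption about how the data ends up in the input:

```python
import hashlib

data = bytes(range(256))          # stand-in for arbitrary binary input
as_text = data.decode("latin-1")  # one character per byte

raw_again = as_text.encode("latin-1")  # 'Raw Bytes': round-trips exactly
as_utf8 = as_text.encode("utf-8")      # 'UTF-8': bytes >= 0x80 become two bytes

reference = hashlib.md5(data).hexdigest()
print(len(raw_again), hashlib.md5(raw_again).hexdigest() == reference)
print(len(as_utf8), hashlib.md5(as_utf8).hexdigest() == reference)
```

Any byte of 0x80 or above grows to two bytes under UTF-8, so the hash no longer matches the hash of the original data.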

zb3 commented 7 months ago

> I don't believe there's a way to get the encoding from a dropped file. The operating system has no concept of the 'encoding' of a file. The only way to determine is via heuristics.

@a3957273, Ah, my bad. I probably meant adjusting the encoding manually, but I confused many things there, let's ignore that message...

> If I encode it in CyberChef as 'Raw Bytes', I get the correct answer (Good CyberChef). If I encode it in CyberChef as 'UTF-8', I get the wrong answer (Bad CyberChef).

But wait.. are you talking about the case where you drop the file? Dropping files is a different story: you read bytes from the file as opposed to codepoints from the input! (The file is read using readAsArrayBuffer by LoaderWorker [unless CodeMirror reads it as text like I wrote in my last message].) Which means it's reasonable to assume that if the input is populated from file bytes then it contains raw bytes in that case. However, when pasting text it's probably not reasonable to assume they're raw bytes by default.. unless I'm missing some important use-case

a3957273 commented 7 months ago

No, I don't think you're missing anything @zb3. Can anyone see a problem with the following?

From a file? Raw bytes.
From drag and drop? Raw bytes.
From text? UTF-8.
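The rule above as a minimal Python sketch. InputSource and default_encoding are hypothetical names used only to spell out the proposal; they are not part of CyberChef:

```python
from enum import Enum


class InputSource(Enum):
    FILE = "file"
    DRAG_AND_DROP = "drag and drop"
    TEXT = "text"


def default_encoding(source: InputSource) -> str:
    """Bytes that come from a file stay raw; typed or pasted text is UTF-8."""
    if source in (InputSource.FILE, InputSource.DRAG_AND_DROP):
        return "raw bytes"
    return "UTF-8"


for source in InputSource:
    print(source.value, "->", default_encoding(source))
```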

leader0001 commented 7 months ago

> No, I don't think you're missing anything @zb3. Can anyone see a problem with the following?
>
> From a file? Raw bytes.
> From drag and drop? Raw bytes.
> From text? UTF-8.

My opinion is that it's a nice solution.

gchq / CyberChef

Hashing recipes use Latin-1 as default encoding #1669