[Feature] Extend character encoding support (Data Inspector, Patterns)

arcusmaximus commented 1 year ago

What feature would you like to see?

ImHex can already load custom encodings for use in the hex editor. It would be nice to see this support extended to the other areas of the tool:

- Data Inspector: use the custom encoding for the "String" field.
- Pattern engine: use the custom encoding for displaying and editing char[] arrays (and converting them to str when returned from functions and such).
As a potential extra, add support for specifying encodings right in the pattern file, so people who download and use the pattern don't have to load a custom encoding by hand. This could be stuff like:
- A #pragma to set a default encoding
- An attribute for setting encodings on individual fields
- An overload of std::mem::read_string() that accepts an encoding as the third argument
- std::string functions for converting between char[] and str using specific encodings

I'm attaching a sample (a binary UI definition file that uses Shift-JIS) along with two pattern files, one for Kaitai Struct with SJIS support (can be executed on https://ide.kaitai.io/) and one for ImHex. spm.zip

How will this feature be useful to you and others?

Japanese PC games are notorious for using Shift-JIS - not just in their .exe's but also in their many custom file formats. Having wider encoding support would certainly help with analyzing/modding these.

Request Type

[ ] I can provide a PoC for this feature or am willing to work on it myself and submit a PR

Additional context?

No response

github-actions[bot] commented 3 months ago

This issue is marked stale as it has been open for 11 months without activity. Please try the latest ImHex version. (Available here: https://imhex.download/ for release and https://imhex.download/#nightly for development version) If the issue persists on the latest version, please make a comment on this issue again.

Without response, this issue will be closed in one month.

arcusmaximus commented 3 months ago

Still applies.

paxcut commented 3 months ago

Please note that the support for variable length encodings is iffy at best. To implement a fully functional variable length decoder one needs more than just a table that maps sets of bytes. In the naive implementation ImHex currently uses it must assume that: 1) The input file is correctly encoded. 2) The input file starts at a valid encoded set of bytes.

Any decoding of any part of the file must start at this beginning set of bytes. So if you open a biggish file and scroll to some high address, ImHex has to decode the entire file from the start before it can decode the part that was scrolled to. This is a limitation of using a table as a means of describing the encoding, not a limitation of the many variable length encodings that exist. To implement that correctly, the decoding scheme must include error detection, handling and recovery and, most importantly, the ability to decode partial fragments even if they don't start at a valid set of bytes. Adding anything other than single byte decoding to other parts of ImHex requires full implementations of decoding algorithms for selected encodings, as it is impossible to write universal variable length decoders for obvious reasons.

arcusmaximus commented 3 months ago

To be clear: this ticket is not about text files, but about binary files that contain Shift-JIS strings here and there. Decoding the entire file as one big string is impossible and unnecessary.

ImHex can already decode a 16-byte string at each 16-byte-aligned position (in the main "Hex editor" panel). The request here is for the Data Inspector and pattern engine to decode arbitrary-length strings at user-specified positions.

paxcut commented 3 months ago

The decoding column shows a character for every encoded set it finds, starting from the start of the file. It is the only way to use custom encodings. The 16-bit decoder is a fixed-length decoder, but it will still decode every two bytes of the input file. With fixed-length encodings you can scroll to anywhere in the file and it will start decoding from there. With variable-length encodings (where 1, 2, 3, ... bytes can encode one char), like Shift-JIS, it always needs to start decoding from the beginning of the file because it has no other way to know where the next char is going to be.
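To make the fixed-length versus variable-length point concrete, here is a minimal sketch using Python's built-in `utf-16-le` and `shift_jis` codecs. It is not ImHex code, and the sample string is only an illustration. It decodes the same text from the start of a character and from one byte into a two-byte character.

```python
# Sketch: decoding from an arbitrary offset, fixed width vs. variable width.
# The sample text is an assumption chosen for illustration.
text = "ハローABC"

utf16 = text.encode("utf-16-le")  # every character here takes exactly 2 bytes
sjis = text.encode("shift_jis")   # katakana take 2 bytes, ASCII letters take 1

# Fixed width: every offset that is a multiple of 2 is a character boundary.
fixed_from_boundary = utf16[2:].decode("utf-16-le")

# Variable width: offset 2 is a boundary only because the first character
# happens to be 2 bytes long.
aligned = sjis[2:].decode("shift_jis")

# Offset 1 lands on the second byte of the first character, so the decoder
# interprets that trail byte as if it started a new character.
misaligned = sjis[1:].decode("shift_jis", errors="replace")

print(repr(aligned))
print(repr(misaligned))
```

The decoder has no way of telling from the bytes alone that offset 1 was a bad starting point. Something outside the decoder (a pattern, a cursor position, or a scan from a known boundary) has to supply the start.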

arcusmaximus commented 3 months ago

Please read my previous comment again and take a look at the sample file with Kaitai.

paxcut commented 3 months ago

I did as you asked and gave up on Kaitai after 30 minutes of trying to see what it could be doing. I loaded your file in ImHex and ran the pattern. I can see the strings in what I assume is valid Japanese (not a speaker) using the Shift-JIS encoding table. The encoding is robust enough that trying to break it randomly was not possible, so I turned to Wikipedia to find exactly what I wanted to convey, using the encoding in question as an example:

begin quote

Shift JIS can be used in string literals in programming languages such as C, but a few things must be taken into consideration. Firstly, that the escape character 0x5C, normally backslash, is the half-width yen sign (¥) in Shift JIS. If the programmer is aware of this, it would be possible to use printf("ハローワールド¥n"); (where ハローワールド is Hello, world and ¥n is an escape sequence), assuming the I/O system supports Shift JIS output. Secondly, the 0x5C byte will cause problems when it appears as second byte of a two-byte character, because it will be interpreted as an escape sequence, which will mess up the interpretation, unless followed by another 0x5C.

end quote

As explained there, more than just a table is needed to decode arbitrary sets of bytes that may or may not be valid encoded Shift-JIS strings. Similar special handling cases would need to be found for each variable-size encoding that is to be added to the pattern language or the Data Inspector. The reason it always seems to work in the decoding column is that ImHex is processing the file as a whole and decoding the entire file, which is designed to work as long as there are no omissions or insertions. For Shift-JIS it may be good enough to just wing it, but other encodings are not so forgiving, and the rules for verification and self-correction can be very complex. I have looked into this in the past several times and I am not the only one. The current consensus is that variable-length encodings are not in what one may call a supported state. As a variable-length encoding, Shift-JIS is at the lower end of complexity (1 or 2 bytes), so it is not a good example for demonstrating how a generalized variable-length encoding needs to be handled.

arcusmaximus commented 3 months ago

Yes, decoding a string starting from the middle of a multibyte character results in garbage... But I still don't see how that's a problem for this ticket.

With the Data Inspector and char[] patterns, it's up to the user to specify the starting location of a valid string. They should start decoding at this exact location, independent of what the decode column is showing. If the starting location is in the middle of a character, I don't *want* ImHex to try to automatically fix it - I want to see garbage so that I know there's a problem.

The same thing goes for corrupted strings: if ImHex were to automatically recover here, it would only be hiding problems that I want to see. It should behave in the same way as the application for which the file is intended: a straightforward decode without any fancy heuristics. Garbage in, garbage out.
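As a rough illustration of what "a straightforward decode at a user-specified location" could look like with nothing but a lookup table, here is a sketch in Python. It is not how ImHex implements its decode column. `TABLE`, `decode_at`, and the sample blob are hypothetical, and the table holds only a handful of Shift-JIS entries instead of a full encoding.

```python
# Hypothetical miniature encoding table: byte sequence -> character.
# A real table file would cover the whole encoding.
TABLE = {
    b"\x83\x6e": "ハ",
    b"\x83\x8d": "ロ",
    b"\x81\x5b": "ー",
    b"A": "A",
}
MAX_KEY_LEN = max(len(key) for key in TABLE)


def decode_at(data: bytes, offset: int, length: int, unknown: str = "\ufffd") -> str:
    """Decode data[offset:offset+length] by longest-match table lookup.

    Decoding starts exactly at `offset`. Nothing before it is inspected,
    and no recovery is attempted: unmatched bytes become `unknown`.
    """
    end = min(offset + length, len(data))
    pos = offset
    out = []
    while pos < end:
        # Try the longest candidate sequence first, then shorter ones.
        for size in range(min(MAX_KEY_LEN, end - pos), 0, -1):
            ch = TABLE.get(data[pos:pos + size])
            if ch is not None:
                out.append(ch)
                pos += size
                break
        else:
            # No table entry matched: emit a marker and skip one byte.
            out.append(unknown)
            pos += 1
    return "".join(out)


# A binary blob with a 6-byte string embedded at offset 3.
blob = b"\x00\x01\x02" + b"\x83\x6e\x83\x8d\x81\x5b" + b"\xff"
print(decode_at(blob, 3, 6))  # starts on a character boundary
print(decode_at(blob, 4, 5))  # starts mid-character: garbage in, garbage out
```

With the correct offset the string decodes cleanly. With an offset in the middle of a character the first byte has no table entry and shows up as a replacement marker, which is the visible "something is wrong" signal described above.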

The decode column is out of scope here. I only brought it up to illustrate that ImHex already has decoding capabilities.

(Finally, here's a Kaitai screenshot for reference, showing a correctly decoded string inside the pattern tree. No need to jump to the location in the hex view and see a potentially incorrect decoding there:)

paxcut commented 3 months ago

You have misunderstood my comment. I am simply explaining that it is not a straightforward addition. New code needs to be added, and it is not clear what that code would be for arbitrary encodings, which is what the feature request is asking for. As encodings go, Shift-JIS is one of the simplest ones, so using it as an example of how to handle any encoding does not really help. Even for Shift-JIS it isn't enough to use a table, because that doesn't cover the special cases. So the extension would have to be done only for specific encodings that are understood and not for any encoding that has a table.

arcusmaximus commented 3 months ago

What I don't understand is why a simple lookup table would *not* be sufficient if we assume the string is valid.

The "special case" you pasted for Shift-JIS doesn't apply here. It's a bug you get with old C compilers that treat everything as ASCII. When doing things correctly, there's nothing noteworthy about the byte sequence 83 5C - it represents the highly common character ソ. ImHex's current table approach handles it just fine.

Of course, if we start talking about invalid sequences such as 83 31, then you would indeed need to have intrinsic knowledge about the encoding. A Shift-JIS-aware application would know that 83 marks the start of a multibyte sequence and read this as a single invalid character - while ImHex wouldn't find an entry for 83 31 in its table and instead read this as two characters (an invalid one followed by 1).
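The two byte sequences mentioned here can be checked against an encoding-aware decoder. The sketch below uses Python's built-in `shift_jis` codec as a stand-in for "a Shift-JIS-aware application". It says nothing about ImHex's own table lookup, and `is_valid_sjis` is a helper name made up for this example.

```python
# 83 5C: a regular two-byte character whose second byte happens to be 0x5C.
so = bytes([0x83, 0x5C]).decode("shift_jis")


def is_valid_sjis(raw: bytes) -> bool:
    """Return True if `raw` decodes as Shift-JIS under Python's strict codec."""
    try:
        raw.decode("shift_jis")
        return True
    except UnicodeDecodeError:
        return False


print(so)
print(is_valid_sjis(bytes([0x83, 0x5C])))  # lead byte + valid trail byte
print(is_valid_sjis(bytes([0x83, 0x31])))  # 0x31 is not a valid trail byte
```

An encoding-aware codec rejects 83 31 because it knows 0x83 must be followed by a trail byte in a specific range. A pure table lookup has no such rule and can only report that no entry matched.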

My point here is that I don't care about such invalid sequences. It doesn't matter if ImHex trips over 83 31 and garbles the whole remainder of the string, because in practice, you'll never encounter this sequence. If you want to make it more robust, sure, be my guest - but it's not necessary for this ticket.

paxcut commented 3 months ago

That works for Shift-JIS. The feature is requesting the addition of arbitrary encodings that have tables. How would you handle invalid/incomplete strings for them?

arcusmaximus commented 3 months ago

I already wrote two times that I don't care about how ImHex handles invalid strings.

The request here is only for the existing decoding functionality, as currently used in the hex view's decode column, to be made available in the Data Inspector and the pattern language. That's it.

HugoCortell commented 1 month ago

I, too, want imhex to support SJIS. Hex editing is messy, so who cares how it handles invalid strings, as long as I can get an idea of what exactly might be in the file...

Considering the logo for this tool, I am really surprised that it does not support SJIS...

kbugstar commented 1 month ago

Yep, it took me a long time to find out how to make ImHex show wide chars for the char16 data type in pattern data. As for searching for wide chars, it just can't... So if you want to find a wide char string outside the ASCII range (e.g. UTF-16), you need to use another tool to find the hex of the string first!!
