LiMinggang / madedit-mod

MadEdit-Mod is a cross-platform Text/Hex editor (based on the madedit project @ sourceforge)
GNU General Public License v3.0

EBCDIC #321

Closed Johan-Ekdahl closed 2 years ago

Johan-Ekdahl commented 2 years ago

Please provide the following information

Madedit-Mod version (or branch): 0.5.0a2
platform/architecture: WinX64
compiler and compiler version:

please describe what symptom you see, what you would expect to see instead and how to reproduce it.

Is there any chance of getting support for EBCDIC encoding (EBCDIC 278), or some kind of feature where you can load/use a user-defined code page?

LiMinggang commented 2 years ago

Not sure. The encoding support in MadEdit mainly comes from the OS it runs on, or is backported from other encoding packages such as libiconv.

Johan-Ekdahl commented 2 years ago

Ok, but it would be nice if you could take it into consideration. Another feature that would be really nice is full support for big files in text mode; then I could use MadEdit instead of the obsolete application (Vedit) that I'm using today.

LiMinggang commented 2 years ago

"full support for big files" is on the way. Browsing and jumping to a line/position are already supported in the alpha release. The only thing left is finding text in the file, but you can use find hex instead.

I did not find EBCDIC support in libiconv. Please let me know if you find any open source solution for it.

Johan-Ekdahl commented 2 years ago

I googled for some kind of encoding package with EBCDIC support and found this: https://github.com/limes-datentechnik-gmbh/libiconv, I don't know if this is something you could use?

LiMinggang commented 2 years ago

Wow. Looks promising. Actually, the gb18030 support in MadEdit is from libiconv, so it's possible to backport more encodings from it. BTW, why do you need support for such an old encoding? And I need a file to test.

LiMinggang commented 2 years ago

EBCDIC 278 is IBM 1143 in the libiconv you mentioned. IBM code page 278 (CCSID 278) is an EBCDIC code page with the full Latin-1 charset, used in IBM mainframes in Finland and Sweden. Code page 1143 (CCSID 1143) is the euro currency update of code page/CCSID 278. In that code page, byte 5A holds € instead of ¤....
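The relationship between a base code page and its euro update can be checked byte by byte. Python's standard library does not ship codecs for code pages 278 or 1143, so the sketch below uses the analogous pair cp037 and cp1140 as a stand-in (an assumption for illustration only, not the pair discussed here). In that pair the euro sign replaces ¤ at byte 0x9F rather than 0x5A. The helper name `differing_bytes` is hypothetical.

```python
def differing_bytes(codec_a: str, codec_b: str) -> dict:
    """Map each byte value to (char_a, char_b) where two single-byte codecs disagree."""
    diff = {}
    for value in range(256):
        raw = bytes([value])
        char_a = raw.decode(codec_a, errors="replace")
        char_b = raw.decode(codec_b, errors="replace")
        if char_a != char_b:
            diff[value] = (char_a, char_b)
    return diff


# Stand-in pair: cp1140 is the euro update of cp037 (NOT code pages 278/1143).
diff = differing_bytes("cp037", "cp1140")
for value, (old, new) in sorted(diff.items()):
    print(f"0x{value:02X}: {old!r} -> {new!r}")
```

The same comparison would apply to 278 and 1143 given decoding tables for both, for example from the libiconv fork linked above.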

Johan-Ekdahl commented 2 years ago

Thank you for adding EBCDIC, I have attached a small file for testing. I need EBCDIC support because I sometimes have to handle information from old mainframes.

small_file_EBCDIC.txt

LiMinggang commented 2 years ago

Hi @Johan-Ekdahl, the feature to support EBCDIC is almost done. What remains is related to MadEdit itself, e.g. how to detect the exact encoding among those IBM code pages; there are 30 encodings.

I need some unique chars that only exist in one encoding so that the app could choose it for you. E.g., if it finds [0x81~0xFE][0x30~0x39] in the first part of the file, it will use GB18030 as the encoding.
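The GB18030 rule described above can be sketched as a scan over the start of the file for a byte in 0x81..0xFE followed by a byte in 0x30..0x39. This is a naive illustration and not MadEdit's actual detector: the function name `looks_like_gb18030` and the `probe_size` default are assumptions, and the scan does not track where multi-byte sequences begin.

```python
def looks_like_gb18030(data: bytes, probe_size: int = 4096) -> bool:
    """Return True if the first probe_size bytes contain [0x81-0xFE][0x30-0x39]."""
    head = data[:probe_size]
    # Walk over adjacent byte pairs; iterating over bytes yields ints.
    for lead, second in zip(head, head[1:]):
        if 0x81 <= lead <= 0xFE and 0x30 <= second <= 0x39:
            return True
    return False


print(looks_like_gb18030(bytes([0x81, 0x30, 0x81, 0x30])))  # pair present
print(looks_like_gb18030(b"plain ASCII 123"))               # no lead byte >= 0x81
```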

Do you have any idea or suggestion?

LiMinggang commented 2 years ago

Is this correct? image

Johan-Ekdahl commented 2 years ago

Could you upload that file again? I uploaded two different files with the same name. I have deleted the one shown in the screenshot from my local computer, and it is not the same as the one I can download from here.

Johan-Ekdahl commented 2 years ago

When it comes to unique characters in the encodings, I think it is not as simple as with GB18030. All EBCDIC encodings use the same hex span, from 00 to FF, and the only difference is the placement of local characters. An A is almost always coded as hex C1, but the Nordic letter Ä, for example, is coded as hex 7B in EBCDIC 278 (Finland/Sweden) and as hex 63 in EBCDIC 277 (Denmark/Norway).
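A small sketch of why this defeats byte-pattern detection: the same bytes decode without error under either code page, only to different characters. The tables below are deliberately partial and contain only the byte values quoted in this thread; the names `PARTIAL_TABLES` and `decode_partial` are hypothetical and are not MadEdit's data structures.

```python
# Partial tables holding only the mappings mentioned in this thread.
# A real EBCDIC table assigns a character to every byte from 0x00 to 0xFF.
PARTIAL_TABLES = {
    "EBCDIC 278 (Finland/Sweden)": {0xC1: "A", 0x7B: "Ä"},
    "EBCDIC 277 (Denmark/Norway)": {0xC1: "A", 0x63: "Ä"},
}


def decode_partial(data: bytes, table: dict) -> str:
    """Decode with a partial table, showing '?' for bytes the sketch does not cover."""
    return "".join(table.get(value, "?") for value in data)


sample = bytes([0xC1, 0x7B, 0x63])
for name, table in PARTIAL_TABLES.items():
    print(name, "->", decode_partial(sample, table))
```

With these tables the sample reads as "AÄ?" under 278 and "A?Ä" under 277. In the full code pages the "?" positions hold other valid characters, so no byte value rules either code page out.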

LiMinggang commented 2 years ago

image

LiMinggang commented 2 years ago

Not sure why there are lots of '{' in the text. BTW, all EBCDIC characters are single-byte characters, right? I saw this image which I don't understand.

LiMinggang commented 2 years ago

And it looks like auto-detection will be hard because the library doesn't support it. https://gitlab.freedesktop.org/uchardet/uchardet

Johan-Ekdahl commented 2 years ago

Here is the EBCDIC file, opened in my current editor. [image: EBCDIC]

Johan-Ekdahl commented 2 years ago

Auto-detection is not a key feature for me, so it's ok. And yes, EBCDIC has only single-byte characters.

LiMinggang commented 2 years ago

Actually, this is why I think auto-detection is good. :) Otherwise, you have to know the exact encoding. image

LiMinggang commented 2 years ago

You may try 0.5.0a3

Johan-Ekdahl commented 2 years ago

Thank you, it seems to work perfectly. I have played around a bit and it works like a charm.

LiMinggang commented 2 years ago

I googled whether EBCDIC can be auto-detected but failed to find a solution, so I will let it be. And let me know if you have a mapping from IBM???? to EBCDIC-???. I can change the names.

Johan-Ekdahl commented 2 years ago

I made you a list of the names of the IBM encodings; I had to use multiple sources to find them all. I hope you can use it.

IBM_EBCDIC.csv

LiMinggang commented 2 years ago

Do you think something like "IBM1025/Cyrillic" is better than "IBM1025"? Or anything else if you think it's better?

LiMinggang commented 2 years ago

IBM1025/Cyrillic
IBM1047/Latin-1
IBM1097/Farsi Bilingual
IBM1112/Baltic
IBM1122/Estonia
IBM1123/Cyrillic Ukraine
IBM1130/Vietnamese
IBM1132/Lao
IBM1137/Devanagari
IBM1140/USA
IBM1141/Austria
IBM1142/Denmark
IBM1143/Finland
IBM1144/Italy
IBM1145/Spain
IBM1146/UK
IBM1147/France
IBM1148/International
IBM1149/Icelandic
IBM1153/Latin-2
IBM1154/Cyrillic Multilingual
IBM1155/Turkey
IBM1156/Baltic
IBM1157/Estonia
IBM1158/Cyrillic Ukraine
IBM1160/Thai
IBM1164/Vietnamese
IBM1166/Cyrillic Kazakh
IBM12712/Hebrew
IBM16804/Arabic

Johan-Ekdahl commented 2 years ago

I think "IBM1025/Cyrillic" is way better than just "IBM1025". The list above with the proposed names is good enough, I gather it's anyhow impossible to squeeze in the full names of character encodings in the dropdown list.

LiMinggang commented 2 years ago

Yeah. The final solution for this dilemma is that I'll keep IBM1025 in the dropdown because it's shorter and use the long name in the menu items. Users get good grouping and naming in the menu list, and a shortcut in the dropdown if they know the details.
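The split between the short dropdown name and the long menu name could be sketched as follows. This is a hypothetical illustration built from three of the labels proposed above; `LABELS` and `split_label` are assumed names, not MadEdit's implementation.

```python
# A few of the proposed "code page/description" labels from this thread.
LABELS = ["IBM1025/Cyrillic", "IBM1047/Latin-1", "IBM1143/Finland"]


def split_label(label: str) -> tuple:
    """Split 'IBM1025/Cyrillic' into the short name and its description."""
    short, _, description = label.partition("/")
    return short, description


# Short names for the dropdown, full labels for the menu items.
dropdown_items = [split_label(label)[0] for label in LABELS]
menu_items = list(LABELS)

print(dropdown_items)
print(menu_items)
```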