Feature request: add option to set file encoding detect list defined by user

kenifanying commented 1 year ago

Hi,

I often work with files in different encodings, e.g. UTF-8, GBK, etc. I currently use UTF-8 as the default encoding, but I have to reload the file every time I open one that uses a different encoding. It would be more efficient if there were an option to set a user-defined list of encodings to try during detection.

Thanks

Alexey-T commented 1 year ago

You can set file encoding per file using

~~plugin Modeline (see its readme.txt file)~~ sorry, not this one
plugin File Type Profile (see readme.txt too)

kenifanying commented 1 year ago

> You can set file encoding per file using
>
> * ~plugin Modeline (see its readme.txt file)~ sorry, not this one
> * plugin File Type Profile (see readme.txt too)

[Header]
Version=1.0

[BatchScript]
FileExts=.cmd;.bat;.nt
Encoding=

[ShellScript]
FileExts=.sh
EolFormat=lf

Is it possible to set the Encoding= option to multiple values and the FileExts= option to * to match all files?

bogen85 commented 1 year ago

> You can set file encoding per file using
>
> * ~plugin Modeline (see its readme.txt file)~ sorry, not this one

I could easily add it if someone filed an issue requesting it.

https://github.com/bogen85/CudaText_modeline_plugin/issues

Alexey-T commented 1 year ago

> Is it possible to set the Encoding= option to multiple values and the FileExts= option to * to match all files?

I don't know, sorry. Please ask on the plugin's page: https://github.com/dinkumoil/cuda_file_type_profile

kenifanying commented 1 year ago

> > Is it possible to set the Encoding= option to multiple values and the FileExts= option to * to match all files?
>
> I don't know, sorry. Please ask on the plugin's page: https://github.com/dinkumoil/cuda_file_type_profile

It seems that this plugin is not what I want. CudaText's file encoding detection is not as good as that of some other editors, e.g. Notepad++.

I think adding an option for a user-defined encoding detection list would be a good start.

Alexey-T commented 1 year ago

> CudaText's file encoding detection is not as good as that of some other editors, e.g. Notepad++.

Please explain why the plugin is not enough. We can ask the author, or we can change it ourselves.

kenifanying commented 1 year ago

> > CudaText's file encoding detection is not as good as that of some other editors, e.g. Notepad++.
>
> Please explain why the plugin is not enough. We can ask the author, or we can change it ourselves.

https://wiki.freepascal.org/CudaText#Encoding_detection

In China we may encounter these encodings: ucs-bom, utf-8, gb2312, gbk (cp936), gb18030, big5, euc-jp, euc-kr, etc.

Generally, we set CudaText's default encoding to UTF-8 on Linux, but we always have to reload the file with the correct encoding, because detection fails when opening files encoded in GBK, Big5, etc.

Alexey-T commented 1 year ago

Maybe the File Type Profile plugin can improve this. The encodings that CudaText misses (Asian ones) are a problem, but the logic can be moved to a plugin.

@dinkumoil I don't understand what the user is suggesting. Maybe you have an idea and know how to change the plugin.

dinkumoil commented 1 year ago

I do not understand this request either. With my plugin it is possible to configure one character encoding that CudaText should use when opening files with certain filename extensions (e.g. .bat or .cmd). If there is one encoding per filename extension, this can be automated (for example by my plugin). If there is a list of encodings, user interaction is required to select one of the encodings in the list, i.e. automation is not possible. So, @kenifanying, please be more specific about what you want to achieve.

kenifanying commented 1 year ago

> I do not understand this request either. With my plugin it is possible to configure one character encoding that CudaText should use when opening files with certain filename extensions (e.g. .bat or .cmd). If there is one encoding per filename extension, this can be automated (for example by my plugin). If there is a list of encodings, user interaction is required to select one of the encodings in the list, i.e. automation is not possible. So, @kenifanying, please be more specific about what you want to achieve.

Actually, I often need to open files that share a filename extension but use different encodings, such as .txt files created by Windows users. Currently, CudaText gets the encoding wrong too often when working with non-UTF-8 files, especially in a CJK environment.

So what I want is for you to improve the encoding detection algorithm used when opening a file: add an option to set a user-defined encoding detection list, and, if this option is enabled, have CudaText try the encodings in that list in order to guess the file encoding.

Thanks.

kenifanying commented 1 year ago

For reference, Sublime Text has a "fallback_encoding" option that sets the encoding to use if detection fails.

Vim has the fileencodings option, a user-defined list of character encodings considered when starting to edit an existing file.

gedit has the candidate-encodings option in its dconf settings.

Alexey-T commented 1 year ago

So which of the 2 choices should we use?

1. A Sublime Text-like option 'fallback encoding', whose value is ONE encoding name.
2. A gedit-like array. An array is strange: which encoding from the array does the app choose when it cannot detect the file's encoding?

kenifanying commented 1 year ago

> So which of the 2 choices should we use?

I think the approach that gedit or Vim uses is more reasonable.

> A Sublime Text-like option 'fallback encoding', whose value is ONE encoding name.
>
> A gedit-like array. An array is strange: which encoding from the array does the app choose when it cannot detect the file's encoding?

Maybe use the last one, or use UTF-8 like Vim, if every encoding in the list fails.

Alexey-T commented 1 year ago

Good idea. Instead of using the 'ANSI' encoding, we could provide an option "candidate_encodings" holding an array of names. The first name that doesn't produce encoding errors will be used.
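A minimal Python sketch of the "first name that produces no encoding errors" rule, for illustration only. The function name `first_clean_encoding` and the `last_resort` parameter are hypothetical, the encoding names are Python codec names rather than CudaText's, and this is not CudaText's actual implementation.

```python
def first_clean_encoding(data, candidates, last_resort="latin-1"):
    """Return the first candidate that decodes `data` without errors.

    `last_resort` is an assumption of this sketch: Latin-1 maps every
    byte to a character, so decoding with it can never fail.
    """
    for name in candidates:
        try:
            data.decode(name)  # strict mode: raises on any invalid sequence
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: bytes are not valid in this encoding.
            # LookupError: the codec name is unknown to Python.
            continue
        return name
    return last_resort


gbk_bytes = "简体中文".encode("gbk")
chosen = first_clean_encoding(gbk_bytes, ["utf-8", "gbk", "big5"])
print(chosen)
```

Here UTF-8 is rejected because the GBK bytes are not valid UTF-8, so the next candidate in the list is taken.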

kenifanying commented 1 year ago

> Good idea. Instead of using the 'ANSI' encoding, we could provide an option "candidate_encodings" holding an array of names. The first name that doesn't produce encoding errors will be used.

Vim's default for its equivalent option, fileencodings, is "ucs-bom,utf-8,default,latin1"; the default depends on the current locale.

Some text editors, such as Notepad++ and Notepad2 (zufuliu edition), are smart enough to guess most encodings correctly, but that may need more work.

Alexey-T commented 1 year ago

According to wiki, UTF8 is detected by separate function:

  detect = file_detect_utf8_content
  // it can get 3 values: 
  //     UTF8_Unknown: only ASCII chars present
  //     UTF8_ok: correct UTF8, non-ASCII, chars present
  //     UTF8_broken: broken UTF8 chars present
  if detect == UTF8_ok then
    return(UTF8)
  if detect == UTF8_broken then
    enc = ANSI

So putting UTF-8 into candidate_encodings makes no sense!

Putting UTF16-BOM (in Vim it is ucs-bom, yes?) also makes no sense: because of the '-BOM', only text with a BOM can be detected, and text with a BOM is detected separately in CudaText.

What makes sense in candidate_encodings? Simple 1-byte encodings plus Asian multibyte encodings. But the first such encoding will always be used, because any 1-byte or multi-byte encoding is valid for any content. So candidate_encodings needs only one item!!!

Alexey-T commented 1 year ago

Added the option.

Windows beta (exe only). Please test: http://uvviewsoft.com/c/

Add the new option to user.json by hand.

  //Encoding to use when auto-detection fails.
  //One of supported encoding names, or one of special values "ansi", "oem".
  //Value "ansi" means OS-dependant ANSI encoding: cp1250, cp1251, cp1252, cp1253, cp1254, cp1255,
  //cp1256, cp1257, cp1258, cp874, cp932, cp936, cp949, cp950.
  //Value "oem" means OS-dependant OEM encoding: cp437, cp850, cp852, cp866, cp874,
  //cp932, cp936, cp949, cp950.
  //UTF-8 / UTF-16 / UTF-32 variants are not allowed here.
  "fallback_encoding": "ansi",
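The detection order described in this thread (BOM check first, then the three-state UTF-8 content check, then the fallback) can be sketched in Python as follows. This is an illustration under stated assumptions, not CudaText's code: `classify_utf8` and `choose_encoding` are hypothetical names, the fallback is given as a Python codec name (the special values "ansi" and "oem" are not modeled), and treating ASCII-only content as UTF-8 is an assumption, since the thread does not say what happens in that case.

```python
import codecs

# Longer BOMs come first so that UTF-32 LE (FF FE 00 00) is not
# mistaken for UTF-16 LE (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def classify_utf8(data):
    """Mirror the three states from the wiki pseudocode."""
    if data.isascii():
        return "unknown"   # only ASCII bytes present
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "broken"    # non-ASCII bytes that are not valid UTF-8
    return "ok"            # valid UTF-8 with non-ASCII characters


def choose_encoding(data, fallback="cp1252"):
    """Pick a Python codec name for `data`."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    if classify_utf8(data) == "broken":
        return fallback
    # "ok" is UTF-8; for "unknown" (ASCII only) this sketch assumes
    # UTF-8 as well, because ASCII is a subset of it.
    return "utf-8"
```

With `fallback="gbk"`, a GBK-encoded file is classified as broken UTF-8 and opened as GBK, which matches the behaviour the option is meant to provide.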

Alexey-T commented 1 year ago

Updated the beta files and changed the allowed option values. The comment above has been updated.

kenifanying commented 1 year ago

> According to wiki, UTF8 is detected by separate function:
>
>   detect = file_detect_utf8_content
>   // it can get 3 values:
>   //     UTF8_Unknown: only ASCII chars present
>   //     UTF8_ok: correct UTF8, non-ASCII, chars present
>   //     UTF8_broken: broken UTF8 chars present
>   if detect == UTF8_ok then
>     return(UTF8)
>   if detect == UTF8_broken then
>     enc = ANSI
>
> So putting UTF-8 into candidate_encodings makes no sense!

Actually, what I want is for CudaText to let users fully define their own encoding detection list, so that it can meet different users' needs.

> Putting UTF16-BOM (in Vim it is ucs-bom, yes?) also makes no sense: because of the '-BOM', only text with a BOM can be detected, and text with a BOM is detected separately in CudaText.

Same reason as above.

> What makes sense in candidate_encodings? Simple 1-byte encodings plus Asian multibyte encodings. But the first such encoding will always be used, because any 1-byte or multi-byte encoding is valid for any content. So candidate_encodings needs only one item!!!

We need more than one item. For example:

As a Chinese user, I most often edit GBK files, then Shift-JIS files. A Japanese user, however, may need Shift-JIS before GBK to avoid wrong encoding detection. Japanese uses some Chinese characters too!
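A small Python illustration of why the order matters: the same bytes can decode without errors in both Shift-JIS and GBK, so an error-based check cannot tell them apart and simply accepts whichever is tried first. The sample string is an arbitrary choice for this sketch.

```python
# Japanese text stored as Shift-JIS bytes.
sample = "日本語".encode("shift_jis")

# Decoding with the right codec restores the original text.
as_sjis = sample.decode("shift_jis")

# Decoding the same bytes as GBK also raises no error, because every
# byte pair here happens to be a valid GBK sequence. The result is
# three unrelated Chinese characters, i.e. wrong text with no error.
as_gbk = sample.decode("gbk")

print(as_sjis)
print(as_gbk)
```

A user who lists GBK first would see the second, garbled result for this file, while a user who lists Shift-JIS first would see the correct text.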

kenifanying commented 1 year ago

> //UTF-8 / UTF-16 / UTF-32 variants are not allowed here.
> "fallback_encoding": "ansi",

So, you guys decided to use a Sublime Text-like option?

Alexey-T commented 1 year ago

> I most often edit GBK files, then Shift-JIS files. A Japanese user, however, may need Shift-JIS before GBK to avoid wrong encoding detection. Japanese uses some Chinese characters too!

So a CHS user can have "fallback_encoding": "....code for GBK..." and a JP user can have "fallback_encoding": "...code for shift-jis...". Is that OK?

Alexey-T commented 1 year ago

> decided to use a Sublime Text-like option?

Yes, because one value seems to be enough to me.

kenifanying commented 1 year ago

> > I most often edit GBK files, then Shift-JIS files. A Japanese user, however, may need Shift-JIS before GBK to avoid wrong encoding detection. Japanese uses some Chinese characters too!
>
> So a CHS user can have "fallback_encoding": "....code for GBK..." and a JP user can have "fallback_encoding": "...code for shift-jis...". Is that OK?

No, detection still fails when a CHS user sets "fallback_encoding": "....code for GBK..." and opens a Shift-JIS file. But it is better than having no "fallback_encoding" option.

Alexey-T commented 1 year ago

Proper detection for all Unicode encodings is NOT done yet, so any file with 'broken UTF-8' will be opened with the fallback encoding.

Alexey-T commented 1 year ago

If the new option works (please test it), we can close this.

kenifanying commented 1 year ago

> Added the option.
>
> Windows beta (exe only). Please test: http://uvviewsoft.com/c/

I have tested this beta build on Windows 11. It works, thanks.

Alexey-T / CudaText

Feature request: add option to set file encoding detect list defined by user #4693