The charset is wrongly detected

wanyancan commented 5 years ago

Hi,

Is there any way to manually set the charset for opened files? If not, how could I change the auto-detection source code to allow manual selection?

Thank you!

HouQiming commented 5 years ago

I haven't implemented encoding selection.

Right now, the JS function EDLoader_Open calls the JC function DetectEncoding to detect the encoding when reading the first chunk of a file. There is no easy data path from the UI to that place, but you can always add a hard-coded rule based on the file name.
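To illustrate the file-name rule idea, here is a minimal Python sketch (not qpad code). The `FILENAME_RULES` table, its patterns, and `pick_encoding` are hypothetical names made up for this example; in qpad the equivalent check would have to live near the DetectEncoding call.

```python
import fnmatch
import os

# Hypothetical override table: glob pattern -> encoding name.
# The patterns are illustrative only.
FILENAME_RULES = [
    ("*.pos.txt", "cp936"),
    ("*.sjis.txt", "cp932"),
]

def pick_encoding(path, detected):
    """Return a hard-coded encoding if the file name matches a rule,
    otherwise fall back to whatever the detector reported."""
    name = os.path.basename(path).lower()
    for pattern, encoding in FILENAME_RULES:
        if fnmatch.fnmatch(name, pattern):
            return encoding
    return detected

print(pick_encoding("/tmp/Board.POS.txt", "cp932"))
print(pick_encoding("notes.md", "utf-8"))
```

The rule only overrides the detector when a pattern matches, so files outside the table keep the auto-detected result.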

A likely cause of mis-detection is MAX_ENCODING_DETECTION_LENGTH in encoding.jc. Right now qpad only checks the first 8 KB of a file for encoding and will assume UTF-8 if that part is all ASCII. Maybe you can try increasing that?
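A toy Python sketch of why the detection window matters. This is not the actual DetectEncoding logic: `guess_from_first_chunk` and the `"needs-model"` placeholder are assumptions for illustration, and only the 8 KB limit and the ASCII-means-UTF-8 fallback come from the description above.

```python
MAX_DETECTION_LENGTH = 8 * 1024  # stand-in for MAX_ENCODING_DETECTION_LENGTH

def guess_from_first_chunk(data, limit=MAX_DETECTION_LENGTH):
    """Toy detector that only inspects the first `limit` bytes."""
    chunk = data[:limit]
    if all(b < 0x80 for b in chunk):
        return "utf-8"      # pure ASCII prefix: assume UTF-8
    return "needs-model"    # a real detector would score candidate encodings here

# 9 KB of ASCII followed by a little CP936-encoded text.
sample = b"x" * (9 * 1024) + "10K\u03a9 \u00b11%".encode("cp936")

print(guess_from_first_chunk(sample))             # window ends before the non-ASCII bytes
print(guess_from_first_chunk(sample, 16 * 1024))  # larger window reaches them
```

With the default window the non-ASCII tail is never seen, which is the failure mode a larger limit would avoid.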

Also, can you share some detail about your file's content? Maybe I could improve the model.

wanyancan commented 5 years ago

The file contains some engineering symbols in CP936:

Designator  Footprint  Mid X       Mid Y         Ref X       Ref Y         Pad X       Pad Y        TB  Rotation  Comment
R1          0402_R     817.716mil  -5537.402mil  817.716mil  -5537.401mil  829.548mil  -5525.57mil  T   225.00    10KΩ (1002) ±1%

Ω (A6 B8) is decoded as the two separate characters ｦ (A6) ｸ (B8), and ± (A1 C0) as ｡ (A1) ﾀ (C0).

In CP932, every byte from A1 to DF is a single-byte character, but in CP936 two such bytes can combine into one double-byte character.
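The ambiguity can be reproduced with Python's built-in codecs (a sketch using Python's `cp936` and `cp932` codecs, which are independent of qpad's own conversion tables):

```python
# The four bytes quoted above: A6 B8 (Ω in CP936) and A1 C0 (± in CP936).
raw = bytes([0xA6, 0xB8, 0xA1, 0xC0])

as_cp936 = raw.decode("cp936")  # two double-byte characters
as_cp932 = raw.decode("cp932")  # four single-byte half-width characters

print(as_cp936, len(as_cp936))
print(as_cp932, len(as_cp932))
```

Both decodes succeed without error, so validity alone cannot tell the two encodings apart for these bytes.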

I'm not sure how the model can be updated. Maybe use a two-token score with a penalty on consecutive bytes in the range A1 to DF?

I believe manual selection from a menu would be the most convenient. Can I call ConvertToUTF8(encoding, s) directly?

HouQiming commented 5 years ago

I see. It's indeed possible to call ConvertToUTF8 directly.
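For readers unfamiliar with the idea, a manual conversion with a user-chosen encoding looks roughly like this in Python. This is only an analogue: `convert_to_utf8` here is a hypothetical Python helper, and the real qpad function's signature and behavior are not shown in this thread beyond the call quoted above.

```python
def convert_to_utf8(encoding, data):
    """Decode `data` with an explicitly chosen encoding and re-encode as UTF-8."""
    return data.decode(encoding).encode("utf-8")

# Bytes as they would appear in a CP936 file containing "10KΩ (1002) ±1%".
raw = "10K\u03a9 (1002) \u00b11%".encode("cp936")

utf8 = convert_to_utf8("cp936", raw)
print(utf8.decode("utf-8"))
```

Because the encoding is passed in explicitly, no detection step is involved and the CP936 symbols survive the round trip.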

Here I can't really improve the model, since I also need to detect half-width katakana documents, which have the same structure but replace Ω with things like ｵﾒｶﾞ.
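A small Python sketch of that conflict, again using Python's own codecs rather than qpad's model: the CP932 bytes for half-width ｵﾒｶﾞ are also a valid CP936 byte sequence.

```python
katakana = "\uff75\uff92\uff76\uff9e"  # half-width ｵﾒｶﾞ

raw = katakana.encode("cp932")   # four single bytes, all in the A1..DF range
print(raw.hex())

as_cp936 = raw.decode("cp936")   # decodes cleanly as two double-byte characters
print(len(as_cp936))
```

Since the same bytes are well-formed in both encodings, a byte-structure rule that favors CP936 for Ω would also misread this kind of katakana text.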

In any case, I recommend UTF-8 and I'll add manual encoding selection to the to-do list.

Qiming

HouQiming / qpad

The charset is wrongly detected #4