Different encodings in Jacy files

goodmami commented 6 years ago

There is a mix of encodings in Jacy's files:

~/grammars/jacy$ find . -type f -exec file -b --mime-encoding {} \; | sort | uniq -c
    206 binary
     20 iso-8859-1
    289 us-ascii
    432 utf-8

The iso-8859-1 ones are probably EUC-JP and not Latin-1. The nkf utility (probably not installed by default: apt install nkf) can guess this for us. Now, just looking at TDL files:

~/grammars/jacy$ find . -name \*.tdl -exec nkf -g {} \; | sort | uniq -c
     19 ASCII
     18 EUC-JP
      1 Shift_JIS
     16 UTF-8

There's even a Shift-JIS one in there. Here are the files that are neither UTF-8 nor ASCII:

~/grammars/jacy$ find . -name \*.tdl -exec echo -en {} "\t" \; -exec nkf -g {} \; | sort -k2 | grep -v 'ASCII$\|UTF-8$'
./lex/adjadv-lex.tdl            EUC-JP
./lex/ambiguous-lex.tdl         EUC-JP
./lex/aux-stem-lex.tdl          EUC-JP
./lex/funct-lex.tdl             EUC-JP
./lex/idiom-kanyouku-lex.tdl    EUC-JP
./lex/idiom-lex.tdl             EUC-JP
./lex/light-verbs-lex.tdl       EUC-JP
./lex/noun-lex.tdl              EUC-JP
./lex/numbers-lex.tdl           EUC-JP
./lex/oldlexicon.tdl            EUC-JP
./lex/p-lex.tdl                 EUC-JP
./lex/pn-lex.tdl                EUC-JP
./lex/v-ends-lex.tdl            EUC-JP
./lex/verbstem-lex.tdl          EUC-JP
./lex/vn-lex.tdl                EUC-JP
./tmr/class.tdl                 EUC-JP
./tmr/ne2.tdl                   EUC-JP
./tmt.tdl                       EUC-JP
./tmr/ne1.tdl                   Shift_JIS

We should convert all of these to UTF-8 (ASCII is fine if a file contains no Japanese or other special characters).
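A minimal sketch of such a conversion in Python, assuming the source encoding of each file is already known (for example from the nkf listing). The function name `reencode_to_utf8` is hypothetical and not part of Jacy or any DELPH-IN tool. Decoding is strict, so a wrong guess about the source encoding raises an error instead of silently writing garbage.

```python
from pathlib import Path


def reencode_to_utf8(path, source_encoding="euc_jp"):
    """Rewrite the file at `path` in place as UTF-8.

    `source_encoding` is an assumption supplied by the caller; the default
    "euc_jp" is Python's codec name for EUC-JP.
    """
    raw = Path(path).read_bytes()
    # Strict decoding: raises UnicodeDecodeError if the bytes are not
    # valid in the assumed source encoding.
    text = raw.decode(source_encoding)
    # Write bytes directly so that line endings are left untouched.
    Path(path).write_bytes(text.encode("utf-8"))
    return text
```

It would be called once per file, e.g. `reencode_to_utf8("lex/noun-lex.tdl")`. Running it on a copy or on a clean git checkout keeps the change easy to review and revert.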

goodmami commented 6 years ago

Of the files listed above, it looks like only tmt.tdl is actually being used in the grammar; the rest are commented out. If they are unused, maybe we could remove them, but if they still have value they should be re-encoded.

goodmami commented 6 years ago

And actually tmt.tdl was miscategorized by nkf. It appears to be UTF-8 or even ASCII. Same with the Shift-JIS one, tmr/ne1.tdl. Many of the lex/*.tdl files are in fact EUC-JP, though; their mode lines even specify them as such.
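One way to double-check a guess like this is to try decoding the raw bytes strictly under each candidate encoding and see which ones succeed. The following is only a sketch: the helper names are hypothetical, and a successful decode shows the bytes are valid in that encoding, not that it is the intended one (pure ASCII bytes, for instance, decode under every candidate).

```python
def decodes_as(data, encoding):
    """Return True if `data` (bytes) is valid in `encoding` under strict decoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def plausible_encodings(data, candidates=("ascii", "utf-8", "euc_jp", "shift_jis")):
    """Return the candidate encodings under which `data` decodes without error.

    The candidate list is an assumption based on the encodings reported
    in this thread; the order of `candidates` is preserved in the result.
    """
    return [enc for enc in candidates if decodes_as(data, enc)]
```

For a file on disk, something like `plausible_encodings(open("tmt.tdl", "rb").read())` would list every candidate that fits. If "ascii" is in the result, the file has no multi-byte characters and needs no conversion.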

fcbond commented 6 years ago

I think we are not loading any of the lex/*.tdl files, so rather than converting them we should probably delete them.

goodmami commented 6 years ago

We are still importing tanaka-unknowns.tdl, which has almost 7k lexical entries. It is UTF-8.

;;; from japanese.tdl:
:begin :instance :status lex-entry.
   :include "lexicon.tdl".
   :include "lex/tanaka-unknowns.tdl".  ; <-- here
;  :include "lex/adjadv-lex.tdl".
;  :include "lex/aux-stem-lex.tdl".
;  :include "lex/funct-lex.tdl".
;  :include "lex/idiom-lex.tdl".
;  :include "lex/light-verbs-lex.tdl".
;  :include "lex/noun-lex.tdl".
;  :include "lex/numbers-lex.tdl".
;  :include "lex/p-lex.tdl".
;  :include "lex/pn-lex.tdl".
;  :include "lex/verbstem-lex.tdl".
;  :include "lex/vn-lex.tdl".
;  :include "lex/v-ends-lex.tdl".
;  :include "lex/ambiguous-lex.tdl".
:end :instance.
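To see at a glance which files a TDL file really loads, the active `:include` lines can be separated from the commented-out ones. Below is a rough sketch with a hypothetical helper, `active_includes`. It assumes one `:include` statement per line and only handles `;` line comments, so includes inside `#| ... |#` block comments would be reported as active.

```python
import re

# An :include statement at the start of a line (after optional whitespace).
# Lines starting with ';' are comments and therefore never match.
INCLUDE_RE = re.compile(r'^\s*:include\s+"([^"]+)"\s*\.')


def active_includes(tdl_text):
    """Return the paths of :include statements that are not commented out."""
    paths = []
    for line in tdl_text.splitlines():
        match = INCLUDE_RE.match(line)
        if match:
            paths.append(match.group(1))
    return paths
```

Applied to the excerpt above, this would return only `lexicon.tdl` and `lex/tanaka-unknowns.tdl`. Comparing that result against the contents of `lex/` gives a list of candidates for deletion.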

There are also .rev, .blacklist, and a few other file types. I'll let you deal with deleting the files since I'm not sure what is valuable to keep.

delph-in / jacy

Different encodings in Jacy files #57