Open goodmami opened 6 years ago
It looks like only tmt.tdl is actually being used in the grammar. The rest are commented out. If they are unused, maybe we could remove them. But if they still have value they should be re-encoded.
And actually tmt.tdl was miscategorized by nkf. It appears to be UTF-8 or even ASCII. Same with the Shift-JIS one, tmr/ne1.tdl. Many of the lex/*.tdl files are in fact EUC-JP, though; their mode lines even specify them as such.
I think we are not loading any of the lex/*.tdl files, so rather than converting them we should probably delete them.
-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University
We are still importing tanaka-unknowns.tdl, which has almost 7k lexical entries. It is UTF-8.
;;; from japanese.tdl:
:begin :instance :status lex-entry.
:include "lexicon.tdl".
:include "lex/tanaka-unknowns.tdl". ; <-- here
; :include "lex/adjadv-lex.tdl".
; :include "lex/aux-stem-lex.tdl".
; :include "lex/funct-lex.tdl".
; :include "lex/idiom-lex.tdl".
; :include "lex/light-verbs-lex.tdl".
; :include "lex/noun-lex.tdl".
; :include "lex/numbers-lex.tdl".
; :include "lex/p-lex.tdl".
; :include "lex/pn-lex.tdl".
; :include "lex/verbstem-lex.tdl".
; :include "lex/vn-lex.tdl".
; :include "lex/v-ends-lex.tdl".
; :include "lex/ambiguous-lex.tdl".
:end :instance.
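To check which lexicon files are still loaded without reading the include list by eye, something like the following could work. It is a minimal sketch, not part of Jacy: the function name `classify_includes` and the regular expression are assumptions for illustration, and it only handles the simple one-include-per-line layout shown above, where a leading `;` comments the line out.

```python
import re

# Matches an :include line, optionally preceded by one or more ';'
# (TDL's line-comment character). Group 1 is the comment prefix, if any;
# group 2 is the quoted file name.
INCLUDE_RE = re.compile(r'^\s*(;+)?\s*:include\s+"([^"]+)"')


def classify_includes(text):
    """Split :include targets into (active, commented_out) lists."""
    active, commented = [], []
    for line in text.splitlines():
        m = INCLUDE_RE.match(line)
        if not m:
            continue  # not an include line (e.g. ':begin', ':end', plain comments)
        if m.group(1):
            commented.append(m.group(2))
        else:
            active.append(m.group(2))
    return active, commented
```

Calling it on the contents of japanese.tdl would give one list of files that are actually loaded and one list of candidates for deletion or re-encoding.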
There are also .rev, .blacklist, and a few other file types. I'll let you deal with deleting the files since I'm not sure what is valuable to keep.
There is a mix of encodings in Jacy's files:
The iso-8859-1 ones are probably EUC-JP and not Latin-1. The nkf utility (probably not installed by default: apt install nkf) can guess this for us. Now, just looking at TDL files:
There's even a Shift-JIS one in there. Here are the non-UTF-8 and non-ASCII files:
We should make these all UTF-8 (or ASCII is fine if there are no Japanese or special characters).
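As a cross-check on nkf's guesses, a strict trial decode is one option. The sketch below is an assumption-laden illustration rather than a tested migration script: the names `guess_encoding` and `convert_to_utf8` and the candidate list are hypothetical, and the first encoding that decodes cleanly is only a guess, since some byte sequences are valid in both EUC-JP and Shift-JIS.

```python
from pathlib import Path
from typing import Optional

# Candidates are tried in order. ASCII goes first because it is a subset of
# the others; strict UTF-8 goes next because it rarely accepts legacy
# Japanese encodings by accident.
CANDIDATES = ("ascii", "utf-8", "euc_jp", "shift_jis")


def guess_encoding(data: bytes) -> Optional[str]:
    """Return the first candidate that decodes data without error, else None."""
    for enc in CANDIDATES:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None


def convert_to_utf8(path: Path) -> Optional[str]:
    """Rewrite path as UTF-8 if it looks like EUC-JP or Shift-JIS.

    Returns the guessed source encoding; ASCII and UTF-8 files are left alone.
    """
    data = path.read_bytes()
    enc = guess_encoding(data)
    if enc in ("euc_jp", "shift_jis"):
        path.write_bytes(data.decode(enc).encode("utf-8"))
    return enc
```

After converting, any Emacs mode line that still declares the old coding system (as the lex/*.tdl files reportedly do) would also need updating by hand, and it is worth eyeballing a diff of each converted file before committing.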