This is a TypeScript/JavaScript library for Node.js (not the browser) that weaves together a Japanese NLP (natural language processing) pipeline aimed at Japanese language learners. It builds on several existing technologies, including MeCab with Unidic, J.DepP, JMdict-Simplified, JMdict-Furigana, and Kanjidic.
In practical terms, this library will take a sentence like this:
へましたらリーダーに切られるだけ
and give you the following:
- bunsetsu boundaries and dependencies:

      へましたら━━┓
      リーダーに━━┫
      切られるだけ

- conjugations: the `Tara` form, `ReruRareru` + `Dictionary` form, and `Dictionary` form;

All of the above information is returned as a JavaScript object or in JSON (if accessed by the built-in web server).
As you can tell from the above, Curtiz gives you a lot of information that might be related to your text but might not be. There are two reasons for this:
… `たら` … aren't sensible because they're either for `た` …, which brings us to …

First, make sure you have Git and Node.js installed (any recent version is fine).
Then install MeCab, Unidic, and J.DepP. MeCab and Unidic are easy to install on macOS via Homebrew, but J.DepP is a "normal" old-school Unix C++ build (`./configure --with-mecab-dict=UNI && make` …; `./configure --help` is useful and explains what `with-mecab-dict` is doing). If you've never built such a project before, do your best to follow the instructions and open an issue if you need help.
Then, download the following required files (TODO: automatically download these!):

- `jmdict-eng-*.json` from JMdict-Simplified
- `JmdictFurigana.json` from JMdict-Furigana
- `kanjidic2.xml.gz` from Kanjidic

If you already have your own Node.js project, install Curtiz as a dependency:
npm i https://github.com/fasiha/curtiz-japanese-nlp
Drop the three dependency files above into your project and skip to the "API" section below.
If you plan to interact with Curtiz only through a JSON web server, the easiest approach is to set up a small Node.js package that spins up the server:
1. `mkdir CURTIZ` to make a new directory (it's named `CURTIZ` here, but please change this)
2. `cd CURTIZ` to enter the new directory
3. `npm init -y` will initialize an empty Node.js package
4. `npm i https://github.com/fasiha/curtiz-japanese-nlp` will install Curtiz as a dependency
5. `npx curtiz-annotate-server` will start the webserver on http://127.0.0.1:8133 (you can pick another port, for example 8888, via `PORT=8888 npx curtiz-annotate-server`)
    - (for `jmdict-simplified`, you can specify the file to use with the environment variable `JMDICT_SIMPLIFIED_JSON`, e.g., `JMDICT_SIMPLIFIED_JSON=./jmdict-eng-3.5.0.json npx curtiz-annotate-server`. Environment variables stack, so you can provide both this and the port: `PORT=8888 JMDICT_SIMPLIFIED_JSON=./jmdict-eng-3.5.0.json npx curtiz-annotate-server`)

The first time you run this, it will take several seconds while it builds a Leveldb cache of JMdict.
Now you're ready to hit a REST endpoint. The following will ask `curl` to POST a Japanese sentence in a specific JSON structure to the appropriate endpoint, and save the result to `curtiz.json`:
curl --data '{"sentence": "へましたらリーダーに切られるだけ"}' \
-X POST \
-H "Content-Type: application/json" \
-o curtiz.json \
http://127.0.0.1:8133/api/v1/sentence
As described below, I need to formally describe the structure of this data. In the meantime, please check the tests and the TypeScript interfaces, especially the `v1ResSentenceAnalyzed` type, to see what data is where.
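If you would rather call the endpoint from Node.js than from `curl`, the same request can be sketched in TypeScript. This is only a minimal sketch: it assumes Node.js 18 or later (for the global `fetch`) and the default address shown above. `buildSentenceRequest` and `annotateSentence` are hypothetical helper names, not part of Curtiz, and the response is left as `unknown` because its structure isn't formally documented yet.

```typescript
// Hypothetical helpers (not part of Curtiz) mirroring the curl command above.
// Assumes Node.js 18+ so that `fetch` is available globally.

interface SentenceRequest {
  url: string;
  init: {method: 'POST'; headers: Record<string, string>; body: string};
}

// Build the same POST request that the curl command sends.
function buildSentenceRequest(sentence: string, port: number = 8133): SentenceRequest {
  return {
    url: `http://127.0.0.1:${port}/api/v1/sentence`,
    init: {
      method: 'POST',
      headers: {'Content-Type': 'application/json'},
      body: JSON.stringify({sentence}),
    },
  };
}

// Send the request and return the parsed JSON. The result is typed as
// `unknown` here; check the `v1ResSentenceAnalyzed` type for the real shape.
async function annotateSentence(sentence: string, port: number = 8133): Promise<unknown> {
  const {url, init} = buildSentenceRequest(sentence, port);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Server responded with HTTP ${response.status}`);
  }
  return response.json();
}
```

With the server running, `annotateSentence('へましたらリーダーに切られるだけ')` should resolve to the same data that `curl` saves to `curtiz.json`. Pass a second argument (for example `8888`) if you started the server on another port.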
In your Node project, create a new file (either TypeScript `demo.ts` or ESM `demo.mjs`). Put the following code into it to import and exercise the package:
// TypeScript or ESM (e.g., `demo.ts` or `demo.mjs`)
import * as curtiz from 'curtiz-japanese-nlp';
curtiz.handleSentence('それは昨日のことちゃった').then(result => console.dir(result, {depth: null}));
(If you're using TypeScript, (1) make sure you compile this, e.g., `npx tsc -p .`, and run the resulting `demo.js`. Also, (2) you may need to set `"target": "es2019"` or later in your `tsconfig.json`.)

Make sure you have the three dependency files above (JMdict-Furigana, JMdict-Simplified, and Kanjidic) at the top level of your project. The first time you run this, Curtiz will spend several seconds building a Leveldb cache for JMdict and will log its progress.
Note that because a Leveldb database can be opened by only one process at a time, you can't run this if you're also running the web server above 😒. If you see an error like `Error [OpenError]: IO error: lock jmdict-simplified/LOCK: Resource temporarily unavailable`, this is Leveldb complaining that some other process has a lock on the database. I should fix this…
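Until that's fixed, one option in your own code is to recognize the lock error by its message and print a more helpful hint. The following is a minimal sketch that assumes the error message looks like the one quoted above. `isLevelLockError` and `explainError` are hypothetical helpers, not part of Curtiz or Leveldb, and matching on message text is only a heuristic.

```typescript
// Hypothetical helper (not part of Curtiz): guess whether an error is the
// Leveldb "database is locked by another process" error quoted above.
// It only inspects the message text, so treat it as a heuristic.
function isLevelLockError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return /IO error: lock .*LOCK/.test(message);
}

// Turn the lock error into a friendlier hint; pass other messages through.
function explainError(err: unknown): string {
  if (isLevelLockError(err)) {
    return 'The JMdict cache is locked by another process (is the web server running?).';
  }
  return err instanceof Error ? err.message : String(err);
}
```

You could call `explainError` in a `.catch` handler attached to `curtiz.handleSentence(...)` and log its return value.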
This will print out a lot of text, but it will show you everything that Curtiz has done with the sentence.
More details forthcoming, but please check the tests and the TypeScript interfaces, especially the `v1ResSentenceAnalyzed` type, to see what data is where.
Often it can be very helpful to inspect the output of MeCab-Unidic to understand what this module is doing. This library includes a thin wrapper that translates Unidic parts-of-speech, conjugations, inflections, etc., into English (via tables 1, 2, 3, published by GitHub user @masayu-a citing the work of Dr Irena Srdanovic), and exposes a command-line interface: simply pipe multi-line input into `mecabUnidic.js`, for example `cat text | ./mecabUnidic.js` or equivalently `cat text | node mecabUnidic.js`. A simple example on the command-line:
cat <<EOF | ./mecabUnidic.js
「ほら、
あれが小学校だよ。」
EOF
This will print out the following Markdown table:
Literal | Pron. | Lemma Read. | Lemma | PoS | Infl. Type | Infl. |
---|---|---|---|---|---|---|
ほら | ホラ | ホラ | ほら | interjection-general | | |
、 | | | 、 | supplementary_symbol-comma | | |
あれ | アレ | アレ | 彼れ | pronoun | | |
が | ガ | ガ | が | particle-case | | |
小 | ショー | ショウ | 小 | prefix | | |
学校 | ガッコー | ガッコウ | 学校 | noun-common-general | | |
だ | ダ | ダ | だ | auxiliary_verb | auxiliary-da | conclusive-general |
よ | ヨ | ヨ | よ | particle-phrase_final | | |
。 | | | 。 | supplementary_symbol-period | | |
」 | | | 」 | supplementary_symbol-bracket_close | | |
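
If you want to consume this output programmatically rather than read it, the Markdown table can be parsed back into objects. The following is a minimal sketch, not part of this library: `parseMorphemeTable`, `splitCells`, and the `MorphemeRow` field names are hypothetical, and it assumes every data row has seven pipe-separated cells in the column order of the header above, with empty cells kept in place.

```typescript
// Hypothetical types and parser (not part of this library).
interface MorphemeRow {
  literal: string;
  pronunciation: string;
  lemmaReading: string;
  lemma: string;
  partOfSpeech: string;
  inflectionType: string;
  inflection: string;
}

// Split one table line into trimmed cells, ignoring optional outer pipes.
function splitCells(line: string): string[] {
  return line.trim().replace(/^\|/, '').replace(/\|$/, '').split('|').map(cell => cell.trim());
}

// Parse a whole table: skip the header and the `---|---` separator, and
// ignore any line that does not have exactly seven cells.
function parseMorphemeTable(markdown: string): MorphemeRow[] {
  const rows: MorphemeRow[] = [];
  const lines = markdown.split('\n').filter(line => line.trim() !== '');
  for (const line of lines.slice(2)) {
    const cells = splitCells(line);
    if (cells.length !== 7) {
      continue;
    }
    const [literal, pronunciation, lemmaReading, lemma, partOfSpeech, inflectionType, inflection] = cells;
    rows.push({literal, pronunciation, lemmaReading, lemma, partOfSpeech, inflectionType, inflection});
  }
  return rows;
}
```

Silently skipping malformed rows keeps the sketch short; a stricter version might throw instead so that unexpected output is noticed.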