Kimtaro / ve

A linguistic framework that's easy to use.
MIT License

First element of `token` array in a multi-morpheme lemma contains the reading for the whole lemma #26

Open fasiha opened 9 years ago

fasiha commented 9 years ago

Example: MeCab splits "休ませたら" into three morphemes, which Ve combines into one lemma. Ve's JSON-ified output contains the following:

...
    "tokens": [
      {
        "raw": "休ま\t動詞,自立,*,*,五段・マ行,未然形,休む,ヤスマ,ヤスマ",
        "type": "parsed",
        "literal": "休ま",
        "pos": "動詞",
        "pos2": "自立",
        "pos3": "*",
        "pos4": "*",
        "inflection_type": "五段・マ行",
        "inflection_form": "未然形",
        "lemma": "休む",
        "reading": "ヤスマセタラ",
        "hatsuon": "ヤスマセタラ",
        "characters": "0..1"
      },
...

Note how the reading and hatsuon fields of the first element of tokens contain the reading for the entire lemma, i.e., "ヤスマセタラ", instead of just "ヤスマ", which is what MeCab reports for that morpheme (see the raw field).

I'm working around this oddity to build furigana, but I'm wondering: is it intentional?
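
For anyone else hitting this, here is a minimal sketch of one possible workaround: recover the per-morpheme reading from the raw field rather than trusting reading and hatsuon. It assumes raw keeps the layout shown above (surface form, a tab, then nine comma-separated features with the reading at index 7 and the pronunciation at index 8). The function name reading_from_raw is hypothetical and not part of Ve.

```python
def reading_from_raw(raw):
    """Return (reading, hatsuon) parsed from a MeCab-style raw line.

    Assumes the feature layout seen in the token above:
    pos, pos2, pos3, pos4, inflection type, inflection form,
    lemma, reading, pronunciation.
    """
    # Split the surface form from the feature string at the first tab.
    _surface, _tab, features = raw.partition("\t")
    fields = features.split(",")
    # Entries with fewer than nine features carry no reading here.
    if len(fields) < 9:
        return None, None
    return fields[7], fields[8]


# The first token from the output above, trimmed to the relevant keys.
token = {
    "raw": "休ま\t動詞,自立,*,*,五段・マ行,未然形,休む,ヤスマ,ヤスマ",
    "reading": "ヤスマセタラ",
    "hatsuon": "ヤスマセタラ",
}

reading, hatsuon = reading_from_raw(token["raw"])
print(reading, hatsuon)
```

For entries with fewer than nine features the function returns (None, None), so the caller needs its own fallback in that case.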

Kimtaro commented 9 years ago

Hmm, no, this does indeed seem like a bug :)