SamuraiT / mecab-python3

mecab-python. You can find the original version here: http://taku910.github.io/mecab/
https://pypi.python.org/pypi/mecab-python3

Extra `\n` with `-O wakati` option #2

Closed chezou closed 6 years ago

chezou commented 8 years ago

When using mecab-python3 with the `-O wakati` option, the parsed result ends with an extra newline.

actual:

In [6]: m = MeCab.Tagger('-O wakati')
In [7]: m.parse("こんにちは、Python")
Out[7]: 'こんにちは 、 Python \n'

expected:

In [6]: m = MeCab.Tagger('-O wakati')
In [7]: m.parse("こんにちは、Python")
Out[7]: 'こんにちは 、 Python'

SamuraiT commented 8 years ago

This is actually expected; it is how MeCab itself works. You can check MeCab's behavior in your terminal:

echo "こんにちは、Python" | mecab -O wakati > file

the result should be

こんにちは 、 Python \n

chezou commented 8 years ago

No, I don't think it is the author's intention to append an extra line break to the end of every line.

echo "こんにちは、Python\n今日の天気は晴れです" | mecab -O wakati
こんにちは 、 Python
今日 の 天気 は 晴れ です

The line break seems to be there only to terminate each line when displaying output, so I think it should be stripped from the parsed result.
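Stripping the terminator on the caller's side can be sketched as follows. This is a minimal sketch, not part of mecab-python3: the helper names `strip_wakati` and `wakati_tokens` are hypothetical, and the input string is hard-coded from the `Out[7]` value above so the snippet runs without MeCab installed.

```python
from typing import List


def strip_wakati(parsed: str) -> str:
    """Drop the trailing space and newline that wakati output ends with.

    Hypothetical helper; only trailing spaces/newlines are removed.
    """
    return parsed.rstrip(" \n")


def wakati_tokens(parsed: str) -> List[str]:
    """Split wakati output on whitespace, which also discards the terminator."""
    return parsed.split()


# Value copied from the report; in real use it would come from m.parse(...).
raw = "こんにちは 、 Python \n"

print(repr(strip_wakati(raw)))
print(wakati_tokens(raw))
```

If the tokens are what you need, `wakati_tokens` avoids the question entirely, because `str.split()` with no arguments ignores leading and trailing whitespace.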

zackw commented 6 years ago

This behavior comes from the core MeCab C library. I converted your test program to its C equivalent:

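#include <mecab.h>
#include <stdio.h>

int main(void)
{
  /* Same option as MeCab.Tagger('-O wakati') in the Python report. */
  mecab_t *tagger = mecab_new2("-Owakati");
  if (!tagger) return 1;

  /* The returned string is owned by the tagger; do not free it. */
  const char *out = mecab_sparse_tostr(tagger, "こんにちは、Python");
  if (!out) {
    mecab_destroy(tagger);
    return 1;
  }
  fputs(out, stdout);
  mecab_destroy(tagger);
  return 0;
}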

and I see a newline at the end of the string it prints:

$ gcc -O2 -Wall -g test.c -lmecab
$ ./a.out | hd 
00000000  e3 81 93 e3 82 93 e3 81  ab e3 81 a1 e3 81 af 20  |............... |
00000010  e3 80 81 20 50 79 74 68  6f 6e 20 0a              |... Python .|
0000001c
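To read the dump without decoding it by hand, the bytes can be turned back into text. This is a small sketch that only uses the hex values shown above; the names `HEX_DUMP`, `data`, and `text` are illustrative.

```python
# Hex bytes copied from the two rows of the dump above.
HEX_DUMP = (
    "e3 81 93 e3 82 93 e3 81 ab e3 81 a1 e3 81 af 20 "
    "e3 80 81 20 50 79 74 68 6f 6e 20 0a"
)

data = bytes.fromhex(HEX_DUMP)  # fromhex() skips the spaces between byte pairs
text = data.decode("utf-8")

print(len(data))   # byte count, to compare with the final offset 0x1c
print(repr(text))  # repr() makes the trailing space and newline visible
```

The last two bytes, `20 0a`, are the space and newline that the Python binding reports as `' \n'`.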

I do not think we should make the Python module behave differently from the C interface it wraps. Please take this up with the developers of MeCab itself.