atilika / kuromoji

Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
Apache License 2.0

Configuring with Maven #132

Closed Zurdge closed 4 years ago

Zurdge commented 4 years ago

Hi folks,

I'm aware this isn't strictly specific to Kuromoji, but it is related to getting Kuromoji working, so I hope the question is OK here.

The Setup

I'm not 100% familiar with Java and have followed some guides to spin up a quick Maven project.

I'm using simple CMD commands to get started, as described in the Maven getting-started guide.

I've added the dependencies to the pom.xml but noticed that no Kuromoji-related files were inside the packaged .jar. After some digging, I learnt about maven-assembly-plugin, which now seems to be collecting all the needed bits and pieces.

Issue

This is the output I get in CMD from the example code posted in the README.md:

?       ???,????,*,*,*,*,?,?,?
??      ??,??,*,*,*,*,??,??,??
?       ??,???,??,*,*,*,?,?,?
??      ??,??,*,*,??,???,???,??,??
??      ???,*,*,*,?????,???,??,??,??
?       ??,??,*,*,*,*,?,?,?

Has anyone come across this kind of thing before?

I think this issue is down to me not being very familiar with Java. My next step is to move away from using CMD and set up the project in Eclipse (although their website is currently down 😦 )

Zurdge commented 4 years ago

Re-opening 🙈

I thought I'd solved my issue but sadly not.

Running the App in Eclipse outputs as expected.

The issue occurs after building, so perhaps this is a Maven thing with encoding...

JhonnySalles commented 4 years ago

I used the same Maven file, but moved it. When I ran it I had no problem; it ran normally.

Your problem seems to me to be caused by encoding: text in an encoding that is not recognized is output as the character "?". I remember that I had to change my project to use UTF-8. At that time I wasn't using Kuromoji, but I had the same problem with strings. I don't remember exactly what I did to fix it, but try to find a solution for text encoding, or something related to UTF-8.

Try putting some kanji in a string inside the project and check the console output, to test whether that is the case.
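A minimal sketch of that test, assuming nothing about the project itself (the class name EncodingCheck and the helper toCodePoints are made up for illustration). It writes the kanji as Unicode escapes so the encoding of the source file does not matter, and prints the code points next to the raw text. If the code points are correct but the raw text shows "?", the string is fine and the damage most likely happens when Java encodes it for the console.

```java
import java.nio.charset.Charset;

public class EncodingCheck {

    // Hypothetical helper: renders each code point as U+XXXX, which stays
    // readable even when the console cannot display the characters.
    static String toCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two kanji of "sushi", written as escapes on purpose.
        String text = "\u5BFF\u53F8";

        // The charset Java falls back to when none is given explicitly.
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("Code points:     " + toCodePoints(text));
        System.out.println("Raw text:        " + text);
    }
}
```

Running this once from Eclipse and once from the built .jar in CMD should show whether the default charset differs between the two environments.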

Zurdge commented 4 years ago

Encoding was the correct way to go.

Setting UTF-8 in Maven didn't seem to work; however, adding

JAVA_TOOL_OPTIONS = -Dfile.encoding=UTF8

as an environment variable allowed Java to know exactly what I wanted it to do.

For the moment this solution works; I'll update this post if I find a better one.
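One possible alternative to the environment variable, offered only as a sketch and not tested against this project: wrap System.out in a PrintStream with an explicit UTF-8 encoding, so the output no longer depends on the JVM's default charset. The class name Utf8Console and the helper utf8 are hypothetical. Note that the console still has to be able to display UTF-8; on Windows CMD that may also depend on the active code page.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {

    // Hypothetical helper: returns an auto-flushing PrintStream that always
    // encodes text as UTF-8, whatever the JVM default charset is.
    static PrintStream utf8(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream console = utf8(System.out);
        // Same kanji as above, written as escapes.
        console.println("\u5BFF\u53F8");
    }
}
```

The token output from the README example could be printed through such a stream instead of System.out directly.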