apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Kuromoji code donation - a new Japanese morphological analyzer [LUCENE-3305] #4378

Closed asfimport closed 12 years ago

asfimport commented 13 years ago

Atilika Inc. (アティリカ株式会社) would like to donate the Kuromoji Japanese morphological analyzer to the Apache Software Foundation in the hope that it will be useful to Lucene and Solr users in Japan and elsewhere.

The project was started in 2010 because we couldn't find any high-quality, actively maintained and easy-to-use Java-based Japanese morphological analyzers, and these became many of our design goals for Kuromoji.

Kuromoji also has a segmentation mode that is particularly useful for search, which we hope will interest Lucene and Solr users. Compound nouns, such as 関西国際空港 (Kansai International Airport) and 日本経済新聞 (Nikkei Newspaper), are segmented as one token with most analyzers. As a result, a search for 空港 (airport) or 新聞 (newspaper) will not give you a hit for these words. Kuromoji can segment these words into 関西 国際 空港 and 日本 経済 新聞, which is generally what you would want for search, and you'll get a hit.
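To make the effect concrete, here is a minimal sketch using a toy inverted index. The token lists are copied from the examples above; the `build_index` and `search` helpers are hypothetical and only stand in for what a search engine does with analyzer output. They are not part of Kuromoji or Lucene.

```python
from collections import defaultdict


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def search(index, term):
    """Return the sorted ids of documents containing the exact term."""
    return sorted(index.get(term, set()))


# Compound nouns kept as single tokens, as most analyzers would emit them.
compound_docs = {1: ["関西国際空港"], 2: ["日本経済新聞"]}

# The same documents segmented the way the search mode segments them.
segmented_docs = {1: ["関西", "国際", "空港"], 2: ["日本", "経済", "新聞"]}

compound_index = build_index(compound_docs)
segmented_index = build_index(segmented_docs)

print(search(compound_index, "空港"))   # no match: the compound is one token
print(search(segmented_index, "空港"))  # matches document 1
```

With single-token compounds the term 空港 never appears in the index on its own, so an exact-term lookup cannot find it; once the compound is split, the same lookup succeeds.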

We also wanted to make sure the technology has a license that makes it compatible with other Apache Software Foundation software to maximize its usefulness. Kuromoji has an Apache License 2.0 and all code is currently owned by Atilika Inc. The software has been developed by my good friend and ex-colleague Masaru Hasegawa and myself.

Kuromoji uses the so-called IPADIC for its dictionary/statistical model and its license terms are described in NOTICE.txt.

I'll upload code distributions and their corresponding hashes and I'd very much like to start the code grant process. I'm also happy to provide patches to integrate Kuromoji into the codebase, if you prefer that.

Please advise on how you'd like me to proceed with this. Thank you.


Migrated from LUCENE-3305 by Christian Moen (@cmoen), 6 votes, resolved Jan 14 2012 Attachments: ip-clearance-Kuromoji.xml (versions: 2), kuromoji-0.7.6.tar.gz, kuromoji-0.7.6-asf.tar.gz, Kuromoji short overview .pdf, kuromoji-solr-0.5.3.tar.gz, kuromoji-solr-0.5.3-asf.tar.gz, LUCENE-3305.patch (versions: 2), wordid0.patch

asfimport commented 13 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Kuromoji - a Japanese morphological analyzer

asfimport commented 13 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Kuromoji Solr integration

asfimport commented 13 years ago

Christian Moen (@cmoen) (migrated from JIRA)

MD5 hashes for the attachments are as follows:

MD5 (kuromoji-0.7.6.tar.gz) = 70d3d2f69f0511b86ebe11484cbe1313
MD5 (kuromoji-solr-0.5.3.tar.gz) = b9a54698c9aebc264845e64d3904642d
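For anyone who wants to check a downloaded attachment against the hashes above, a minimal sketch follows. The helper names `md5_of_file` and `matches_published_hash` are made up for this example, and it assumes the tarball has been saved locally.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's MD5 with a published value, ignoring case."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("kuromoji-0.7.6.tar.gz", "70d3d2f69f0511b86ebe11484cbe1313")` should return `True` for an intact download saved in the current directory.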

asfimport commented 13 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Attaching a brief presentation

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

WOW this is awesome. It seems we need to file some IP clearance here since this is a substantial contribution not developed in the ASF source control or on the mailing list. I will figure out the process here.

I looked briefly at the sources here and I think we need to put this into a patch rather than into a tar.gz. Some of the files don't have an Apache header and some of the files state a copyright in the ASL 2 header. Basically, for the code grant you need to put "our" ASL header into each file. We also need to apply these sources to our source tree, so it is very likely that this goes under /modules/analysis/common. Can you try to create a patch against trunk? If it is too much of a hassle, you can also move the Solr integration to a different issue.

thanks simon

asfimport commented 13 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Thanks a lot, Simon. I wasn't sure when we'd update the headers as part of the process, so thanks for clarifying that, too.

Kuromoji downloads IPADIC as part of its build (from our server in Japan) to make its data structures, which it bundles into its jar file (which becomes 11M, but can be made a lot smaller). Building also requires more than the default heap space, so its build is a little convoluted and different from the other code in /modules/analysis/common.

Kuromoji is also usable independently from search, although search is perhaps its most important application. Would it be a good idea for me to make a patch that puts it in /modules/analysis/kuromoji for now, and that we take things from there?

The quickest way to get Kuromoji in there would be to check the jar file into /modules/analysis/kuromoji/lib, but I'm not sure that's a good way to go.

I'll follow up in whatever way you prefer. Thanks again! :)

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I looked briefly at the sources here and I think we need to put this into a patch rather than into a tar.gz. Some of the files don't have an apache header and some of the files state a copyright in the ASL 2 header. Basically for the code grant you need to put "our" ASL header into each file.

But these things are separate, right? Can't he just fix the license headers and upload a new .tar.gz?

I don't see anywhere that says a code grant should be a patch. This puts a burden on Christian to do all the work, and our trunk moves too fast. Let's defer creating a patch until the code grant stuff is over... anyone could then turn it into a patch.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

But these things are separate, right?

Right - looks like all we need is the ASF copyright in the files. The rest can easily be handled after the grant goes through.

asfimport commented 13 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Thanks, Robert and Mark.

I'll upload new tarballs where the standard ASF license notice is being used in all Java source files and I've also removed author tags to comply better with code standards. I've removed any Atilika Inc. copyrights from NOTICE.txt in both tarballs.

asfimport commented 13 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Now uses standard ASF license notice in all Java source files.

MD5 (kuromoji-0.7.6-asf.tar.gz) = a84f016bd5162e57423a1da181c25f36

asfimport commented 13 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Now uses standard ASF license notice in all Java source files.

MD5 (kuromoji-solr-0.5.3-asf.tar.gz) = a3e7d5afba64ec0843be6d4dbb95be1c

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Code looks cool. I think we should first do the legal stuff and then produce patches. Robert is currently developing another morphological analyzer (Lucene-Gosen, https://code.google.com/p/lucene-gosen/), but this one uses an LGPL library that cannot be included with Lucene/Solr. The Lucene part has lots of cool attributes and additional TokenFilters, so maybe we combine lucene-gosen with this one (your Apache-2.0 code and his TokenFilters+Attributes)? That would be really cool.

asfimport commented 13 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Thanks, Uwe!

I think we definitely should work together and combine the great work that Robert, Koji & co. have been doing on Lucene-GoSen with Kuromoji to make a highly attractive Japanese linguistics offering that is also an integrated part of Lucene/Solr.

The attributes do indeed look very nice – excellent job! I have several improvements in mind for Kuromoji (and other Japanese related code) and I'm looking forward to working with you to improve some of these things.

In addition to its license, an issue with GoSen (and Sen) used to be its segmentation quality. To my knowledge, these analyzers still don't support so-called "unknown words", which means that words that are not in the dictionaries are treated as second-rate, which has a negative impact on segmentation quality.

asfimport commented 13 years ago

Koji Sekiguchi (@kojisekig) (migrated from JIRA)

Hi Christian, it's been a long time. Contribution of Kuromoji to Lucene/Solr sounds really nice! As Uwe already mentioned, lucene-gosen has really good TokenFilters, which are in org.apache packages and under the Apache License. It will be nice if this Japanese tokenizer uses them. Plus, lucene-gosen can use not only IPADIC, but also NAIST JDIC. I'd like the tokenizer to be able to choose the dictionary in a future release.

asfimport commented 13 years ago

Christian Moen (@cmoen) (migrated from JIRA)

It's been a long time, hasn't it? Thanks a lot, Koji. :)

I completely agree. If we can get Kuromoji into the codebase, I'm more than happy to submit patches for your filters so that they will work with Kuromoji.

Kuromoji has preliminary support for UniDic and it sounds like a good idea to join efforts on this as well. We could support them all: IPADIC, NAIST JDIC and UniDic.

asfimport commented 13 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Please let me know if you need paperwork from me to follow up on this. Thanks again.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Hey Christian, I attached the IP-Clearance form for this code donation. What we need to wrap up this process is

The CLA should go to the secretary, I still need to figure out where the code grant needs to go.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

koji, I took the issue until the code grant is due etc.

asfimport commented 13 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Thanks, Simon. Please let me know where I should send the code grant and I'll file the paperwork.

asfimport commented 13 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Hello again, Simon. Has there been any update as to where I should send the code grant? Many thanks.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Christian, apparently we just handle this as the CLA. You fill it out, scan it and send it to secretary@apache.org. Make sure you use the ICLA details when you file it.

Let me know once those are sent.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I am going to be away for 2 weeks. If somebody wants to continue driving this code grant, please do. Otherwise, @christian, sorry for the break; I will continue once I am back, or here and there if I find a computer :)

simon

asfimport commented 13 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Hello Simon. I'll file the paperwork over the next couple of days by email and copy you. Have a brilliant vacation! :)

asfimport commented 13 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Hello again, Simon. I've filed the paperwork and copied you on email. Hope you're enjoying your vacation!

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Christian, thanks for filing the paperwork. I just called a vote on dev@l.a.o and hope to get this done soon!

simon

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Christian, I see a couple of files in the resource folders that don't have a license header. We need to make sure that all files have an ASL 2 license header before we can finish the IP clearance process. Also, I don't know much about this segmenter, but I guess it works based on a dictionary, no? If so, where are the dictionary files? I only see resource files in the test folder, but maybe I'm missing something.

simon

asfimport commented 13 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Please see NOTICE.txt for information on the dictionaries.

Kindly let me know which files require a license header and how I should proceed to provide a revised version. Do you prefer a complete tarball or can I attach the files individually to this JIRA?

Thanks!

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Please see NOTICE.txt for information on the dictionaries.

So those dictionaries are not ASL licensed, right? I need to check with legal whether we can include them in our distribution at all, so we need to figure that out first.

asfimport commented 13 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Correct. You should definitely check this with legal. I've tried to point this out in the description and in my email with the secretary as well. If there are questions or concerns my legal counsel can possibly assist, but I guess this is something the ASF has to consider by itself.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

FYI - I created an issue on legal to categorize the IPADIC license LEGAL-97

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Now that we have some feedback on LEGAL-97, what is the next step we need to do to move forward with this feature?

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

According to LEGAL-97 we can include the dict files. That means we can finish this code donation and get everything in shape for a commit. I will finish the paperwork once I am back from traveling.

asfimport commented 13 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Thanks for the follow-up, Robert and Simon. I've started working on a patch.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Here is an updated ip-clearance file. Since this is the first time I am doing this, I would appreciate some feedback or help from others with more experience here. Grant, does that look fine to you?

I think if we are ok with this we can go ahead and call the vote on incubator.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Just a ping... what's our next step?

asfimport commented 13 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

File looks good to me. You need to check in the file to https://svn.apache.org/repos/asf/incubator/public/trunk/site-author/ip-clearance and then call a vote on general@incubator.apache.org (there should be examples of this in the archives for that list). Vote is lazy consensus, so don't expect too much feedback. Once that vote passes, then the code can be committed.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I committed the file to the incubator ip-clearance in revision 1199470. I will go ahead and call an incubator vote now. Thanks, Grant.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I sent the vote to general@incubator... we will see in 72h! Thanks, folks.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Here is an initial patch. Nothing special, just basic integration into the modules/analysis tree. I added a task that downloads the dicts and puts them in place so I could run the tests. All passing for me... still lots of work, but it's a start.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

looks like we want to add the Lucene analyzer/tokenizer and solr factories from kuromoji-solr-0.5.3-asf.tar.gz

I'd say once we get stuff going, maybe just download the dictionary, build it, and when committing commit the built dictionary under resources/ folder (this is where the script puts it).

I think for this kind of feature it might be hard to iterate with patches, we should maybe try to get it in SVN (trunk) initially and iterate with smaller issues. The code looks pretty clean to me already.

The produced jar file is somewhat large, but I think it's still reasonable, so I think we should look past this for now? Having worked with Sen before, I know some ways we can shrink this a lot, but that would be best on a future issue.

Some Java 6 APIs are used here (e.g. Unicode normalization). Christian, can you confirm this is only for the dictionary-build stage? It looked to me like it's only needed for ipadic/unidic parsing, but not custom dictionary support.

If it's only for the build stage, personally I think that's fine for 3.x too, because I'm suggesting we commit a 'built' dictionary and tell people that if they want to compile the dictionary themselves they need Java 6. We could put the dictionary-building under a tools/ directory that's Java 6-only, or we could depend on ICU for just the tools/ piece (I think we already have such hacks for generating jflex rules for StandardTokenizer) and be fine on Java 5.

+1 for the GraphVizFormatter...

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

+1 to all your comments. For 3.x let's figure this out somewhere else... first iterate on trunk, and when we have it at a reasonable stage we backport it to 3.x. The vote succeeded so we are good to go!

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Marking fix version 4.0. Let's open a new issue for backporting...

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Thanks a lot, Simon!

Robert, I agree completely with your comments. The Unicode normalization is only done at dictionary build time. Simon has turned it on by default – its previous default was off. Perhaps it makes sense to have it on in Lucene's case...

Simon, the TokenizerRunner class doesn't seem to be included in the patch, which might be fine. It's not strictly necessary for Lucene, but I think it's useful to keep it there so the analyzer can easily be run from the command line. The DebugTokenizer and GraphvizFormatter are there already, which aren't strictly necessary either, but sometimes quite useful, so I think we should add the TokenizerRunner as well – at least for now.

Tests didn't pass in my case, but I'll look more into this soon. My tomorrow is very busy, but I'll have time for this on Wednesday.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I created a branch here (https://svn.apache.org/repos/asf/lucene/dev/branches/lucene3305) with an initial import of this code, only minor tweaks to get things working in the build so far.

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Thanks, Robert.

I've built the branch. I needed to do ant test -Dargs="-Dfile.encoding=UTF-8" in order to make all the Kuromoji tests pass as some of them assume UTF-8 file encoding. (MacRoman is default on my system.)
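A minimal sketch of why the default encoding matters, assuming a test reads UTF-8 data without naming the charset. Python's `mac_roman` codec stands in for the MacRoman default here, and the sample string is illustrative rather than taken from the Kuromoji tests.

```python
text = "関西国際空港"
raw = text.encode("utf-8")  # bytes as stored in a UTF-8 test resource

# Decoding with the right charset round-trips the text.
as_utf8 = raw.decode("utf-8")

# Decoding the same bytes as MacRoman yields one wrong character per byte.
as_mac_roman = raw.decode("mac_roman")

print(as_utf8 == text)
print(as_mac_roman == text)
print(len(as_utf8), len(as_mac_roman))
```

The second decode does not fail; it silently produces different characters, so a comparison against the expected Japanese tokens can no longer match.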

I really appreciate the efforts you and Simon have put in. I also hope to make some meaningful contributions to make sure Kuromoji integrates and works well with Solr and Lucene.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I've built the branch. I needed to do ant test -Dargs="-Dfile.encoding=UTF-8" in order to make all the Kuromoji tests pass as some of them assume UTF-8 file encoding. (MacRoman is default on my system.)

This sounds like a bug in the build; you shouldn't have to do that (it should be set already). However, my default encoding is UTF-8, so that's why I didn't catch it. I'll look into this.

asfimport commented 12 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Patch to fix zero wordid issue. Backport of fix from kuromoji 0.7.7-SNAPSHOT on Github.

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Christian, thanks for the fix. I will apply the patch to the branch. The tests testYabottai() and testTsukitosha() are not hurting, but have no meaning for our variant, because wordid=0 and the last wordid have different words (because we presort the whole dictionary for the FST). To make the tests really use wordid=0, I should look up the actual dictionary entries of the first and last word.
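A minimal sketch of the effect described above, under the simplifying assumption that a word ID is just the position of an entry in the dictionary. The surface forms and the `assign_word_ids` helper are hypothetical, not the branch's actual code.

```python
def assign_word_ids(surfaces):
    """Give each surface form an id equal to its position in the list."""
    return {word_id: surface for word_id, surface in enumerate(surfaces)}


# Hypothetical surface forms in the order a source dictionary lists them.
source_order = ["です", "が", "私", "ゲーム"]

ids_unsorted = assign_word_ids(source_order)
ids_sorted = assign_word_ids(sorted(source_order))  # presorted, as for an FST

last = len(source_order) - 1
print(ids_unsorted[0], ids_unsorted[last])
print(ids_sorted[0], ids_sorted[last])
```

After sorting, ID 0 and the last ID belong to different words than before, so a test written against the words that had those IDs in the unsorted dictionary no longer exercises the boundary IDs.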

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Committed development branch revision: 1229948 Thanks Christian!

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thank you for fixing that bug!

By the way, I've been reviewing the differences between mecab and kuromoji. In general the differences seem fine to me, actually in Kuromoji's favor (at least for search). Most revolve around middle-dot:

sentence: 私がエドガー・ドガです。
mecab: [私, が, エドガー・ドガ, です]
kuromoji: [私, が, エドガー, ドガ, です]

So I think these are improvements, at least for search (e.g. Kuromoji splits the first/last name here).

But there is often funkiness caused by the normalizeEntries option: if an entry is not NFKC-normalized, it adds an NFKC-normalized entry with the same costs etc.

However, I think in some cases this skews the costs because e.g. half-width and full-width numbers have different costs. So by adding normalized entries with the full-width cost, we sometimes get worse tokenization.

sentence: Windows95対応のゲームを動かしたいのです。
mecab: [Windows, 95, 対応, の, ゲーム, を, 動かし, たい, の, です]
kuromoji: [Windows, 9, 5, 対応, の, ゲーム, を, 動かし, たい, の, です]

I changed the default locally of 'normalizeEntries' to false and it seemed to totally fix this, and all the differences vs. mecab then seemed positive.

I think we should disable normalizeEntries by default so that no costs are potentially skewed... opinions?
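To illustrate the normalization step described above, here is a minimal sketch. The `add_normalized_entries` helper and the cost values are made up for this example and are not Kuromoji's actual dictionary builder; only the NFKC behavior of `unicodedata.normalize` is real.

```python
import unicodedata


def add_normalized_entries(entries):
    """Return entries plus an NFKC-normalized copy of each non-normalized one.

    Each entry is a (surface, cost) pair; the copy keeps the original cost.
    """
    expanded = list(entries)
    for surface, cost in entries:
        normalized = unicodedata.normalize("NFKC", surface)
        if normalized != surface:
            expanded.append((normalized, cost))
    return expanded


# Hypothetical costs: full-width digits present as dictionary entries.
entries = [("９", 3000), ("５", 3000), ("対応", 2500)]

expanded = add_normalized_entries(entries)
for surface, cost in expanded:
    print(surface, cost)
```

The half-width forms of the two digits now exist as dictionary words carrying the full-width costs, which is the kind of skew suspected above. The sketch does not model the lattice search that picks the final segmentation.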