Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.apertium.org) [LUCENE-1284]

asfimport commented 16 years ago

Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.apertium.org). Morphological information is used to index new documents and to process smarter queries in which morphological attributes can be used to specify query terms.

The tool makes use of morphological analyzers and dictionaries developed for the open-source machine translation platform Apertium (http://apertium.org) and, optionally, the part-of-speech taggers developed for it. Currently there are morphological dictionaries available for Spanish, Catalan, Galician, Portuguese, Aranese, Romanian, French and English. In addition, new dictionaries are being developed for Esperanto, Occitan, Basque, Swedish, Danish, Welsh, Polish and Italian, among others; we hope more language pairs will be added to the Apertium machine translation platform in the near future.

Migrated from LUCENE-1284 by Felipe Sánchez Martínez, 2 votes, updated May 16 2011

Environment:

New feature developed under GNU/Linux, but it should work on any other Java-compliant platform.

Attachments: apertium-morph.0.9.0.tgz

asfimport commented 16 years ago

Felipe Sánchez Martínez (migrated from JIRA)

Patch file containing all the new classes created. The patch will create a new folder in contrib. No existing code is modified.

asfimport commented 16 years ago

Felipe Sánchez Martínez (migrated from JIRA)

All the files compressed together. Decompress in the Lucene trunk folder.

asfimport commented 16 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

This sounds very promising. I unpacked the .tgz file and tried running 'ant compile' within contrib/apertium-morph, but got compilation errors.... I tried fixing build.xml, but don't actually see the problem there.

I see a typo in a package name: src/java/org/apache/lucene/benckmark/ (should be benchmark)

I'd love to try this, so if you can fix build.xml or help me figure out how to fix it, that would be great.

asfimport commented 16 years ago

Felipe Sánchez Martínez (migrated from JIRA)

Typo in a package name: src/java/org/apache/lucene/benckmark/ (should be benchmark) solved.

build.xml fixed. I have tried on a clean SVN version and it compiles without errors. Using sun-java-6.

Forget the previous attachments.

– Felipe.

asfimport commented 16 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Thanks, I'll have a look later this week. Note that if you always use the same file name for attachments, JIRA will manage them for you and you won't have to delete old ones. Use a name such as LUCENE-1284.patch or LUCENE-1284.tgz or some such.

asfimport commented 15 years ago

Felipe Sánchez Martínez (migrated from JIRA)

Kind reminder

Otis,

could you check if everything is OK with the last attachment (from May 2008)?

Thanks a lot – Felipe.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Hadn't seen this before. Thanks Felipe! This looks like a high quality contribution.

I've expanded the attached file into contrib and built and ran the tests. Everything went smoothly.

I've only begun to look at the code myself, but a couple of initial comments:

Could you remove the @author tags? The Lucene project has decided it's best to leave them out (you can search the mailing list if you are interested in the discussion).

How about renaming overview.html to package.html and expanding what you have there? This looks like a very useful addition, but it's complicated enough to merit a more thorough overview and/or examples of how to get started. Not everyone wades into the contrib packages that often - let's hook those that do by providing a very clear: "This is what this is, this is what you can do with it, and here is how you do it". Nothing too intense, but enough to understand its usefulness quickly (and allow you to gauge the effort required for use).

As an example of seemingly missing info I am wondering about: where do I get the data files? I see a link to http://www.apertium.org, but digging a bit does not immediately show me what I am looking for. Clear instructions on how to get going with your preferred morphological data files would be great (as well as clear instructions on where and how to obtain those files).

Thanks for donating this code! It's something I have been interested in seeing added to Lucene for some time.

Mark

asfimport commented 15 years ago

Felipe Sánchez Martínez (migrated from JIRA)

I have uploaded the package as it was released as part of the Apertium project (http://www.apertium.org). It contains a brief README file and an example of use in the "example" folder.

To benefit from this package, the texts to be indexed need to be preprocessed using some Apertium tools. These tools can be downloaded from the Apertium web page at SourceForge (http://sourceforge.net/projects/apertium/). You need to install the following packages: lttoolbox, apertium, and the linguistic package you are interested in (with the name apertium-xx-yy).

Mark, could you point me to the discussion about the @author tag?

– Felipe.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I think there may have been more than one thread on the subject. You should be able to dig them up with one of the searchable archives: http://www.lucidimagination.com/search/p:lucene/s:email/l:dev?q=author

I'm not sure if the removal of all current @author tags has been completed yet, but it will be (work on that issue pops up here and there and I am unsure if it's completed). My current stance is that I would remove @author tags before committing code myself.

There are a variety of reasons, but to boil down my take: recognition for contributions is listed in CHANGES and JIRA, and donated code often ends up having multiple authors, something that has not been tracked well by the @author tags in the past. Other reasons can probably be gleaned from the discussions.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

One other quick note: the copyright in the @author tag is not allowed in any case if the code is to be committed. There can be only one copyright line specifying The Apache Software Foundation.

Because this is a complete contrib package you are contributing, it is permitted to put something along the lines of "originally written by ...". This should go after the copyright and license header.

Like I said though, the preferred Apache method for credit is the CHANGES file.

That's the info I've been able to dig up anyway.

Mark

asfimport commented 15 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Felipe - I'll have a look at this next week, thanks for the reminder!

asfimport commented 15 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Felipe: I took another look at this. I spotted mentions of GPL, but it's not clear to me what's GPLed. We can't have GPL software in Apache, unfortunately. Could you please explain which pieces are GPLed and tell us if this is something that could be changed to ASL? Thanks.

asfimport commented 15 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

One more for Felipe. Is there a page on http://wiki.apertium.org/ that gives the definitive, up-to-date list of supported languages and perhaps some kind of indicator of status (e.g. whether anyone is actively working on the language) and level of support?

I see http://wiki.apertium.org/wiki/List_of_language_pairs and http://wiki.apertium.org/wiki/Language_and_pair_maintainer

...but I can't quite translate (no pun intended) those numbers into the level of support for a language. Could you please shed some light on this?

asfimport commented 15 years ago

Felipe Sánchez Martínez (migrated from JIRA)

Hi Otis,

The package I submitted to Lucene has a dual license, so it is both GPL v2.0 and ASL at the same time. Is this a problem? Apertium is GPL v2.

There is a huge community around Apertium developing language pairs for it. Actually, this year Apertium is in the Google Summer of Code. The language pairs mentioned in http://wiki.apertium.org/wiki/List_of_language_pairs are those under development; the language pairs you can download from SourceForge (http://sourceforge.net/projects/apertium/ ; packages with name apertium-xx-yy) are the ones that have been released; anyway, they are updated from time to time with further improvements. Their version numbers will help you get an idea of the state of development and the translation quality you can expect.

Hope this helps – Felipe.

asfimport commented 15 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Hi Felipe,

OK, I looked at this some more. So the Java code you contributed is ASL and Apertium's tools (and data?) are GPL v2?

The thing that puzzles me is the language pairs themselves. Why are they in pairs? Is that simply for the translation part of Apertium, and something that's ignored when you use the pair for Lucene and morphological analysis?

If I'm interested in, say, a French morphological analyzer, why do I need any other language? For French, I see:

br-fr
en-fr
fr-ca
fr-es

If I'm interested in French, which of the 4 above is the right one to use? The one with the highest number of lemmata?

I had a look at the Indexer and Searcher to get an idea about the usage. Those classes are really just for demonstration, right? Still, do you mind replacing the deprecated Hits object in the Searcher class?

In the README you mention this:

2. The Spanish morphological dictionary must be preprocessed in advance to remove multiword expressions:

$ java -classpath lucene-apertium-morph-2.4-dev.jar org.apache.lucene.apertium.tools.RemoveMultiWordsFromDix --dix apertium-es-ca.es.dix > apertium-es-ca.es-nomw.dix

Could you explain why the removal of multiword expressions is needed? Is that Spanish-specific or something one needs to do regardless of the language?

Also:

4. Each file to be indexed must be preprocessed using the Apertium tools:

$ cat file.txt | apertium-destxt | lt-proc -a es-ca-nomw.automorf.bin | apertium-tagger -g -f es-ca.prob > file.pos.txt

So these are a few command-line tools that end up marking up the input text with POS? (I seem to be missing some libraries and can't compile Apertium locally to check what this marked-up file looks like.) But my main question here is whether there are Java equivalents of these command-line tools, so that one can easily use them from Java. Or is one forced to use Runtime.exec(...)?

Thanks.

asfimport commented 15 years ago

Felipe Sánchez Martínez (migrated from JIRA)

Hi Otis,

The Java code I contributed is ASL and GPLv2 (dual license). Apertium tools and data are GPL v2.

> Why are they in pairs? Is that simply for the translation part of Apertium, and something that's ignored when you use the pair for Lucene and morphological analysis?

Yes, they are language pairs because of the translation. If you are not interested in translation (as is our case) you can use whichever language pair contains the language you are interested in; choose the language pair with the highest number of lemmata, probably the one with the highest version number.

> Do you mind replacing the deprecated Hits object in the Searcher class?

Which is the new class I should use?

> Could you explain why the removal of multiword expressions is needed?

Multiword units need to be removed from the dictionary mainly because they are there to facilitate the correct translation of some expressions to the target language. This is not Spanish specific and should be done in all cases.
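As a rough illustration of what such a preprocessing step might look like, the sketch below drops dictionary entries that contain a blank. It is not the code of RemoveMultiWordsFromDix, whose actual criteria are not described in this thread. It assumes that a multiword entry is an `<e>` element containing a `<b/>` (blank) element, which is how Apertium dictionaries usually encode a space inside an entry. The class name `MultiwordFilter`, the toy dictionary and its paradigm names are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: removes entries with a blank (assumed to mark multiwords) from a .dix document. */
final class MultiwordFilter {

    /** Detaches every <e> element that has a <b/> descendant and returns how many were removed. */
    static int removeMultiwords(Document dix) {
        NodeList entries = dix.getElementsByTagName("e");
        // Collect first: the NodeList is live and would shrink while we iterate.
        List<Element> toRemove = new ArrayList<>();
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            if (entry.getElementsByTagName("b").getLength() > 0) {
                toRemove.add(entry);
            }
        }
        for (Element entry : toRemove) {
            entry.getParentNode().removeChild(entry);
        }
        return toRemove.size();
    }

    /** Parses an XML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse dictionary XML", e);
        }
    }

    public static void main(String[] args) {
        // Toy dictionary; entries and paradigm names are made up for the example.
        String xml = "<dictionary><section id=\"main\" type=\"standard\">"
                + "<e lm=\"casa\"><i>cas</i><par n=\"cas/a__n\"/></e>"
                + "<e lm=\"a pesar de\"><i>a<b/>pesar<b/>de</i><par n=\"x__pr\"/></e>"
                + "</section></dictionary>";
        Document dix = parse(xml);
        System.out.println("removed " + removeMultiwords(dix) + " multiword entries");
    }
}
```

A real dictionary would be read from and written back to a file; the in-memory string only keeps the example self-contained.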

> So these are a few command-line tools that end up marking up the input text with POS?

Yes.

> I seem to be missing some libraries and can't compile Apertium locally to check what this marked-up file looks like.

You need to install lttoolbox; you can download it from the Apertium web page.

> But my main question here is whether there are Java equivalents of these command-line tools,

Unfortunately, no :(
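Without Java equivalents, the command-line tools have to be driven as external processes. The following is a minimal sketch of how the README pipeline (apertium-destxt, lt-proc -a, apertium-tagger -g -f) could be chained from Java using `ProcessBuilder.startPipeline` (Java 9 or later) rather than `Runtime.exec`. The class name `ApertiumPipeline` and its methods are hypothetical and not part of the contributed package; the tool names and flags are the ones quoted from the README, and the tools are assumed to be installed and on the PATH.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical wrapper around the README preprocessing pipeline. */
final class ApertiumPipeline {

    /** Builds the three stages: deformatter, morphological analyzer, POS tagger. */
    static List<ProcessBuilder> stages(String automorfBin, String probFile) {
        return Arrays.asList(
                new ProcessBuilder("apertium-destxt"),
                new ProcessBuilder("lt-proc", "-a", automorfBin),
                new ProcessBuilder("apertium-tagger", "-g", "-f", probFile));
    }

    /** Runs the pipeline on one file; returns 0 if every stage succeeded. */
    static int tag(File input, File output, String automorfBin, String probFile)
            throws IOException, InterruptedException {
        List<ProcessBuilder> stages = stages(automorfBin, probFile);
        // Only the first stdin and the last stdout may be redirected in a pipeline.
        stages.get(0).redirectInput(input);
        stages.get(stages.size() - 1).redirectOutput(output);
        for (ProcessBuilder stage : stages) {
            stage.redirectError(ProcessBuilder.Redirect.INHERIT);
        }
        int exit = 0;
        for (Process process : ProcessBuilder.startPipeline(stages)) {
            int code = process.waitFor();
            if (code != 0) {
                exit = code; // remember a failing stage
            }
        }
        return exit;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 4) {
            System.err.println("usage: ApertiumPipeline <input> <output> <automorf.bin> <tagger.prob>");
            return;
        }
        System.exit(tag(new File(args[0]), new File(args[1]), args[2], args[3]));
    }
}
```

With the README's file names this would be invoked as `ApertiumPipeline file.txt file.pos.txt es-ca-nomw.automorf.bin es-ca.prob`.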

Regards. – Felipe

asfimport commented 15 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Hm, I feel that because of these command-line non-Java and GPLed tools it may not be possible (or will be very clunky) to integrate this with Lucene.

What do others think?

Felipe, although Java equivalents of those command-line tools don't exist currently, do you think one could implement them in Java (and release them under ASL)? I don't know what exactly is in those tools and what it would take to port them to Java. Thanks.

asfimport commented 15 years ago

Felipe Sánchez Martínez (migrated from JIRA)

Hi,

I think that the fact that the tool relies on an external free/open-source package to pre-process the files to be indexed should not be an obstacle for the community to benefit from it; the world is pretty heterogeneous ;). Furthermore, the external tools are not required at search time.

> Felipe, although Java equivalents of those command-line tools don't exist currently, do you think one could implement them in Java (and release them under ASL)?

This year the Apertium project is in the Google Summer of Code. A student will port the lttoolbox package to Java. Note that the tool I contribute also uses the apertium tagger and that this tool will not be ported; fortunately the usage of the tagger is optional. The Java version of lttoolbox will be released under the GPL license; I am not sure if they will agree to give it a dual license.

– Felipe

asfimport commented 13 years ago

Kevin Brubeck Unhammer (migrated from JIRA)

A little update: The Java port of lttoolbox has been complete for some time now, and the port of apertium-tagger at least does disambiguation (training of models, the .prob files, is not supported yet, but all released pairs come with .prob files so that's a non-issue):

$ echo 'jeg' |apertium-destxt-j |lt-proc-j  nb-nn.automorf.bin | apertium-tagger-j -g nb-nn.prob -f
^jeg/jeg<prn><p1><mf><sg><nom>/jeg<n><nt><sg><ind>$^./.<sent><clb>$[][
]
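The output above has one lexical unit per `^...$` group: the surface form, then one or more analyses separated by `/`, each a lemma followed by `<tag>` symbols. A minimal sketch of reading that shape from Java is given below. The class `ApertiumStream` and its nested types are invented for illustration and belong neither to the contributed package nor to the lttoolbox port. The sketch deliberately ignores escaped characters and the text between units, so it is not a complete stream-format parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical minimal reader for the ^surface/analysis/...$ stream. */
final class ApertiumStream {

    /** One candidate analysis: a lemma plus its morphological tags. */
    static final class Analysis {
        final String lemma;
        final List<String> tags;
        Analysis(String lemma, List<String> tags) {
            this.lemma = lemma;
            this.tags = tags;
        }
    }

    /** One lexical unit: the surface form and every analysis listed for it. */
    static final class LexicalUnit {
        final String surface;
        final List<Analysis> analyses;
        LexicalUnit(String surface, List<Analysis> analyses) {
            this.surface = surface;
            this.analyses = analyses;
        }
    }

    // A lexical unit is everything between '^' and the next '$'.
    private static final Pattern UNIT = Pattern.compile("\\^([^$]*)\\$");
    // A tag is the text between '<' and '>'.
    private static final Pattern TAG = Pattern.compile("<([^>]+)>");

    static List<LexicalUnit> parse(String stream) {
        List<LexicalUnit> units = new ArrayList<>();
        Matcher unit = UNIT.matcher(stream);
        while (unit.find()) {
            // parts[0] is the surface form; the rest are analyses.
            String[] parts = unit.group(1).split("/");
            List<Analysis> analyses = new ArrayList<>();
            for (int i = 1; i < parts.length; i++) {
                int firstTag = parts[i].indexOf('<');
                String lemma = firstTag < 0 ? parts[i] : parts[i].substring(0, firstTag);
                List<String> tags = new ArrayList<>();
                Matcher tag = TAG.matcher(parts[i]);
                while (tag.find()) {
                    tags.add(tag.group(1));
                }
                analyses.add(new Analysis(lemma, tags));
            }
            units.add(new LexicalUnit(parts[0], analyses));
        }
        return units;
    }

    public static void main(String[] args) {
        String line = "^jeg/jeg<prn><p1><mf><sg><nom>/jeg<n><nt><sg><ind>$^./.<sent><clb>$[][";
        for (LexicalUnit unit : parse(line)) {
            for (Analysis analysis : unit.analyses) {
                System.out.println(unit.surface + " -> " + analysis.lemma + " " + analysis.tags);
            }
        }
    }
}
```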

The GSoC student Stephen Tigner is working at the moment on making sure they are all usable as libraries; from what I understand there is just minor cleanup work left on that.

I can't say anything on the license issue though. Other than Stephen Tigner, the most active contributor on the port is Jacob Nordfalk.
