apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.64k stars 1.03k forks source link

Java implementation (and improvement) of Levenshtein & associated lexicon automata [LUCENE-4947] #6011

Open asfimport opened 11 years ago

asfimport commented 11 years ago

I was encouraged by Mike McCandless to open an issue concerning this after I contacted him privately about it. Thanks Mike!

I'd like to submit my Java implementation of the Levenshtein Automaton as a homogenous replacement for the current heterogenous, multi-component implementation in Lucene.

Benefits of upgrading include

The code for all the components is well structured, easy to follow, and extensively commented. It has also been fully tested for correct functionality and performance.

The levenshtein automaton implementation (along with the required MDAG reference) can be found in my LevenshteinAutomaton Java library here: https://github.com/klawson88/LevenshteinAutomaton.

The minimalistic directed acyclic graph (MDAG) which the automaton code uses to store and step through word sets can be found here: https://github.com/klawson88/MDAG

*-Transpositions aren't currently implemented. I hope the comment filled, editing-friendly code combined with the fact that the section in the Mihov paper detailing transpositions is only 2 pages makes adding the functionality trivial.- Update introduces transposition inclusion in edit distance calculations!

*As a result of support for on-the-fly manipulation, the MDAG (dictionary-automaton) creation process incurs a slight speed penalty. In order to have the best of both worlds, i'd recommend the addition of a constructor which only takes sorted input. The complete, easy to follow pseudo-code for the simple procedure can be found in the first article I linked under the references section in the MDAG repository)


Migrated from LUCENE-4947 by Kevin Lawson, updated May 03 2013 Attachments: LevenshteinAutomaton-master.zip, LevenshteinAutomaton-master-update.zip, MDAG-master.zip

asfimport commented 11 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Cool but this may be a showstopper: "MDAG is licensed under the GNU General Public License (version 3)."

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Thanks a lot for wishing to submit code!

It's not possible to include your code in Lucene if it has a GPL license. Quite frankly, I don't think even think Lucene committers can even have a look at it to consider it for inclusion with a GPL license.

If you have written all the code or otherwise own all copyrights, would you mind switching to Apache License 2.0? That way, I at least think it would be possible to have a close look to see if this is a good fit for Lucene.

asfimport commented 11 years ago

Kevin Lawson (migrated from JIRA)

Ah. It slipped my mind that the result of combining GPL and Apache licensed code must be GPL.

I'm fully committed providing it to the Lucene project under the Apache License (version 2) if possible, or if it comes down to it, changing the license entirely.

If the former is possible, consider this post one that formally licenses both LevenshteinAutomaton and MDAG, to the Lucene development team under Apache License (version 2), irregardless of any licenses that may be included with either project, and any license notices contained in any of their constituent files.

If the former is not possible, I suppose I could push up versions of each project licensed appropriately to the general public.

Just tell me what action(s) need to be taken from here.

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I'm fully committed providing it to the Lucene project under the Apache License (version 2) if possible, or if it comes down to it, changing the license entirely.

do you own all the copyrights for this GPL licensed product? I haven't looked into it yet but what we need to do here in a nutshell is:

usually the PMC Chair helps here a lot but just FYI this is roughly what you need to go through.

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi, My comments: Can this tool be used to replace the Code-Generator that creates the large int[] arrays? If this is not the case, the actual implementation of the Levensthein Automaton might be smaller, but there 2 backsides:

Alternatively, can the code be ported away from MDAC to Bricks-automaton, so it can interact with Lucene's Automaton library? If this is not the case, we can no longer easily combine wildcards/prefix/fuzzy anymore.

asfimport commented 11 years ago

Kevin Lawson (migrated from JIRA)

Alternatively, can the code be ported away from MDAC to Bricks-automaton, so it can interact with Lucene's Automaton library? If this is not the case, we can no longer easily combine wildcards/prefix/fuzzy anymore.

Of course! If you take a look at the tableFuzzySearch() method here, you'll see that it takes an MDAG (which is equivalent in structure to the automatons implemented in Brics) and simply transitions it in step with the LevenshteinAutomaton. The method can be modified easily to accept a Brics automaton, which i'm assuming has methods that implement typical automaton actions (namely transitioning and accept state determination).

The main reason one might want to consider using MDAG is that typically libraries that implement the data structure (which is more widely known as a DAWG) only support creation with sorted input (and thus, do not allow for modification). I believe Brics is no exception. My MDAG library supports unsorted input and run-time modification of the structure. (The minor drawback concerning this has been addressed in the original post).

asfimport commented 11 years ago

Kevin Lawson (migrated from JIRA)

do you own all the copyrights for this GPL licensed product?

Yes. All the materials I'm submitting were created solely by me and not on behalf of any company.

I haven't looked into it yet but what we need to do here in a nutshell is...usually the PMC Chair helps here a lot but just FYI this is roughly what you need to go through.

Great. From the looks of it I'd have no problem submitting those documents. Should I wait for the PMC Chair to come in here? Or can I just submit the grant and license agreement to secretary@apache.org now?

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I haven't looked into it yet but what we need to do here in a nutshell is...usually the PMC Chair helps here a lot but just FYI this is roughly what you need to go through.

Great. From the looks of it I'd have no problem submitting those documents. Should I wait for the PMC Chair to come in here? Or can I just submit the grant and license agreement to secretary@apache.org now?

PMC Chair here - I've never shepherded one of these things before, so I need to get up to speed. I glanced through the links Simon sent (thanks Simon), nothing seems too difficult. I'll read more thoroughly and get back to you here.

One potential issue that will need to be resolved first: from my past experience, the threshold at which code grants need to be invoked seems fuzzy to me: my previous takeaway had been that the quantity of the contribution, both in number of files and in line count, is a consideration: only a couple of files, or only a couple hundred lines of code, don't warrant a code grant. I looked at the git repo you pointed to, Kevin, and it seems to have more than a couple of files, and more than a couple hundred lines of code, so I'm pretty sure Simon's right, the code grant process will have to be invoked.

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

It sounds proper to do a code grant also because the software currently has a GPL license. Thanks for following up, Steve.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Kevin,

You can license your code under multiple licenses if you like. The simplest thing to do here is to license your code with a single license: the Apache License v2.

For info on the Apache License v2, see http://www.staging.apache.org/licenses/. See also http://www.apache.org/legal/resolved.html for a list of licenses which can and cannot be included as dependencies in Apache products. AFAICT, though, once you have signed and submitted the software grant and provided a tarball of the code, your grant is under the terms of the Apache License v2, and the license headers in files committed to the Lucene codebase will be modified to include a reference to the ALv2, and exclude any other license information.

FYI, attribution for individual contributions is located in a single file: lucene/CHANGES.txt, e.g. for trunk: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/CHANGES.txt

I'll be filling out the XML version of this (HTML) IP clearance form: http://incubator.apache.org/ip-clearance/ip-clearance-template.html - I encourage you to take a look, Kevin.

Steve

asfimport commented 11 years ago

Kevin Lawson (migrated from JIRA)

Just updating the thread to notify everyone that I've just e-mailed the ICA and code grant documents (and their GPG-related files) to secretary@apache.org.

Is there anything else that needs to be done on my part?

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Is there anything else that needs to be done on my part?

Hi Kevin Lawson,

What did you send to secretary@apache.org? We need to know exactly what it is you're donating, so that we can start the vetting and header modification process (assuming you have not yet done any header modification yourself).

This is generally done by transferring a compressed tarball to a public place, to provide access to those who wants to inspect it, and those who will do the header modification. But maybe this could be done via Git hash(es)? I looked at a bunch of the existing examples of this process <http://incubator.apache.org/ip-clearance/&gt;, and all of them go the compressed tarball route.

Steve

asfimport commented 11 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Would it be possible to change the licensing directly on github and then make an import from a git revision? It's traceable after all (with parent etc) and it would make the process simpler I think (?).

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Would it be possible to change the licensing directly on github and then make an import from a git revision? It's traceable after all (with parent etc) and it would make the process simpler I think (?).

I think it's possible.

Kevin in his paperwork will have had to have referred to the git revision, but he wrote he sent GPG-related files, which doesn't seem compatible to me, since detached signatures will AFAIK have been generated against a local copy, not the Git repository, and so may not be reproducible by third parties (e.g. are Git checkouts bit-for-bit identical on Linux vs. Windows?).

I think the intent of the signature is just integrity/identity: a way to refer to the exact bits being donated. Git hash(es) should serve the same function.

asfimport commented 11 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Git checkouts may not be bit-for-bit identical (they may vary if you have custom EOL treatment in .gitattributes, for example) but a git revision hash uniquely identifies a revision in a repository (and this, to my understanding, cannot be easily forged since it includes all parent revisions). So if you git clone/export a given hash then it's essentially the same as if somebody sent you a tarball with file checksums?

I don't know anything about the legal aspects (if it makes a difference whether you're pulling from an ASL-licensed copy compared to the author taking the initiative).

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Git checkouts may not be bit-for-bit identical

Right, I mentioned this because Kevin wrote that he had sent "GPG-related files" to secretary@a.o, by which I assume he means detached signature(s), and if a checkout is not bit-for-bit identical, we will have to ignore those signatures, since they won't be reproducible, and that may have procedural implications: can't move beyond the step where you have to verify the signature.

Of course, if the Git hash(es) are sufficient (and I agree with you, Dawid, that they seem to be), then it should be fine to ignore the signature(s) Kevin sent.

asfimport commented 11 years ago

Kevin Lawson (migrated from JIRA)

Sorry for the ambiguity.

When I wrote "and GP-related files", I meant the signatures for the ICA and software grant documents. The documents made no mention of signing the tarballs to be donated (the ip-clearance form only mentions it conditionally in step 5).

I assumed you guys would simply grab the tarballs from the GitHub links I posted.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I assumed you guys would simply grab the tarballs from the GitHub links I posted.

Okay, cool, I think that will work.

Do you intend for these two projects to live on after they've been incorporated into Lucene? If so, then I'll fork them on Github and start making license header changes -> ALv2; Kevin, do you give your consent for me to change the license headers in all files to point to ALv2?

If you don't intend for these two projects to continue life separately from Lucene, then I think it will make sense for you to do the license changes in-place yourself, Kevin. Alternatively, you could grant write access to someone else to do the work. Please let us know.

I have started the IP clearance form. It's online now at http://incubator.apache.org/ip-clearance/lucene-levenshtein-automaton-mdag.html.

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Kevin,

I think it's best that you do the license change yourself and that we don't have any active role in making the change since you are the only person entitled to make the change.

This change can be done by using the below header on all the source code and other relevant text files:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

After this has been done, please make a tarball and attach it to this JIRA and indicate that this is the code you wish to grant and also inform us about the MD5 hash of the tarball. (This will go into the IP-clearance document and will be used to identify the codebase.)

It's a good idea to also use this MD5 hash as part of Exhibit A in the software-grant.txt agreement unless you have signed and submitted this already. (If you donate the code yourself by attaching it to the JIRA as described above, I believe the hashes not being part of Exhibit A is acceptable.)

Please feel free to add your comments, Steve.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I think it's best that you do the license change yourself and that we don't have any active role in making the change since you are the only person entitled to make the change.

+1

After this has been done, please make a tarball and attach it to this JIRA and indicate that this is the code you wish to grant and also inform us about the MD5 hash of the tarball. (This will go into the IP-clearance document and will be used to identify the codebase.)

I and Dawid had been advocating using Github for this, but I agree with Christian: a tarball attached to this issue by you, Kevin Lawson, will remove all ambiguity about what is being donated and by whom. Also, Github is not under ASF control, and in the future if that business goes under, the ASF will lose the history of this donation.

asfimport commented 11 years ago

Kevin Lawson (migrated from JIRA)

Hi all.

I've changed the license to the Apache License 2.0 and have modified all of the license notices. The checksums for the archives (which have been attached to this issue) are below:

LevenshteinAutomaton-master.zip MD5 checksum: 081b417edbd7d2a562085e1c0dfb0a4c

MDAG-master.zip MD5 checksum: 109e99dca700e02d1ad54306688472a5

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Just updating the thread to notify everyone that I've just e-mailed the ICA and code grant documents (and their GPG-related files) to secretary@apache.org.

I monitor commits to the ICLA and code grants record files, and neither the ICLA nor the code grant document has been recorded yet. I'll post on this issue once the code grant has been recorded.

Kevin Lawson, did you send the code grant to legal-archive@apache.org in addition to sending it to secretary@apache.org? This is mentioned as a requirement in step 3 of the process section in http://incubator.apache.org/ip-clearance/ip-clearance-template.html.

asfimport commented 11 years ago

Kevin Lawson (migrated from JIRA)

Kevin Lawson, did you send the code grant to legal-archive@apache.org in addition to sending it to secretary@apache.org? This is mentioned as a requirement in step 3 of the process section in http://incubator.apache.org/ip-clearance/ip-clearance-template.html.

Ah, I definitely overlooked that. Sent! Was done a couple of hours ago, but I figured I'd notify you here.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Ah, I definitely overlooked that. Sent! Was done a couple of hours ago, but I figured I'd notify you here.

Kevin Lawson, I'm concerned that the Lucene PMC has not yet received notification from the Apache Secretary of either your code grant or your ICLA. We got notification earlier today for SooMyung Lee's paperwork (see LUCENE-4956), which was sent to secretary@apache.org after yours.

Is it possible that you sent the paperwork to the wrong address, or that there was some other mail snafu?

asfimport commented 11 years ago

Kevin Lawson (migrated from JIRA)

The address I sent it to is indeed correct (secretary@apache.org). Maybe it's something on their end (spam filter)? I've just re-sent it.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

The Lucene PMC received notification today from the Apache Secretary that Kevin's code grant and ICLA paperwork have been received and recorded.

Now that we have Kevin's code grant and ICLA, we can start verifying headers/licensing. Since it's not clear where this software will go, I don't think it makes sense to create a branch yet. People doing the header/license verification/modification can just attach modified tarballs to this issue if necessary.

asfimport commented 11 years ago

Kevin Lawson (migrated from JIRA)

I had some free time yesterday and updated LevenshteinAutomaton to include transpositions in its edit distance determination methods. The included tests have also been updated to accommodate the modifications and ensure correct functionality.

How do I go about submitting the updated code? Push an update to my github? Simply attach the new archive here? I'm not sure where the code donation/acceptance process is at now, so I'm unsure of how to do this.

asfimport commented 11 years ago

Kevin Lawson (migrated from JIRA)

I've attached an archive of the most recent version of LevenshteinAutomaton, which includes transpositions in the edit distance calculations.

LevenshteinAutomaton-master-update.zip MD5 checksum: 8187978b438435ac2d23981bb0133d34