Simmetrics / simmetrics

Similarity or Distance Metrics, e.g. Levenshtein, for Java
Apache License 2.0

Latest develop #8

Closed mpkorstanje closed 9 years ago

mpkorstanje commented 9 years ago

Removed chapman and taglink projects. Added matching soundex as an example.

I reckon we should use the gitflow model and not merge with a master branch until we've got a release. I'm using the maven git flow plugin to help with that. It does assume that the names master, develop, feature-*, release-* and hotfix-* are used.

If you want to draw attention to the development branch rather than show the older code, you can change some settings in GitHub to display the development branch by default.

mpkorstanje commented 9 years ago

I had a good long look at Smith-Waterman and Smith-Waterman-Gotoh-Windowed-Affine. The implementation in SmithWaterman.java is actually Gotoh's version of 1982 but without the memory saving option, and SmithWatermanGotohWindowedAffine.java is actually the original by Smith and Waterman. Wtf.
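For readers who want the distinction spelled out: Gotoh's 1982 contribution was computing affine gap costs (one cost to open a gap, a smaller one to extend it) in O(nm) time with three score matrices. The following is a minimal sketch of that recurrence, not the code in this repository. The class name `GotohSketch`, the method `localScore`, and all parameter names are hypothetical, and it keeps the full matrices, i.e. no memory saving.

```java
// Hypothetical sketch of local alignment with affine gaps (Gotoh-style recurrence).
// Not the simmetrics implementation; names and parameters are illustrative only.
class GotohSketch {

    /**
     * Returns the best local alignment score of s and t.
     * gapOpen is the value of a gap of length one, gapExtend the value of each
     * additional gapped character; both are expected to be <= 0 in this sketch.
     */
    static float localScore(String s, String t, float match, float mismatch,
                            float gapOpen, float gapExtend) {
        int n = s.length();
        int m = t.length();
        float[][] h = new float[n + 1][m + 1]; // best score ending at (i, j)
        float[][] e = new float[n + 1][m + 1]; // best score ending in a gap along t
        float[][] f = new float[n + 1][m + 1]; // best score ending in a gap along s

        // No gap can end on the border, so mark those cells as unreachable.
        for (int i = 0; i <= n; i++) {
            java.util.Arrays.fill(e[i], Float.NEGATIVE_INFINITY);
            java.util.Arrays.fill(f[i], Float.NEGATIVE_INFINITY);
        }

        float best = 0f;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                // Either extend an existing gap or open a new one.
                e[i][j] = Math.max(e[i][j - 1] + gapExtend, h[i][j - 1] + gapOpen);
                f[i][j] = Math.max(f[i - 1][j] + gapExtend, h[i - 1][j] + gapOpen);

                float sub = s.charAt(i - 1) == t.charAt(j - 1) ? match : mismatch;
                float diag = h[i - 1][j - 1] + sub;

                // Local alignment: a score never drops below zero.
                h[i][j] = Math.max(0f, Math.max(diag, Math.max(e[i][j], f[i][j])));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }
}
```

With `match = 1`, `mismatch = -2`, `gapOpen = -1` and `gapExtend = -0.25`, aligning "abcxxdef" with "abcdef" scores six matches minus one opened and one extended gap. The original Smith and Waterman formulation instead allows an arbitrary gap function and looks back over every possible gap length, which is where a window on the gap length comes in.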

Gap and substitution functions now work in the same dimensions. A gap is always a negative value, an undesired substitution is a negative value, a similar substitution a positive value. Removed the SubCost 5, 3, -3. It accepted some weird character substitutions that make no sense as a default option (seriously, it's a mini-soundex), especially not for an alignment algorithm.
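To illustrate the sign convention, here is a minimal sketch of a local alignment score in which gaps and substitutions are simply added together. It is an assumption-laden example, not the project's API: `LocalAlignmentSketch`, `Substitution`, `MATCH_MISMATCH` and `smithWaterman` are hypothetical names, and the gap is a constant per-character value rather than a gap function.

```java
// Hypothetical sketch: gap and substitution values live on the same scale.
// Not the simmetrics API; all names here are illustrative only.
class LocalAlignmentSketch {

    /** Positive for a desired substitution, negative for an undesired one. */
    interface Substitution {
        float compare(char a, char b);
    }

    /** Example substitution function: +1 for equal characters, -2 otherwise. */
    static final Substitution MATCH_MISMATCH = (a, b) -> a == b ? 1f : -2f;

    /** Local alignment score with a constant, non-positive value per gapped character. */
    static float smithWaterman(String s, String t, float gapValue, Substitution sub) {
        if (gapValue > 0f) {
            throw new IllegalArgumentException("gap value must be <= 0");
        }
        float[] prev = new float[t.length() + 1]; // scores for row i - 1
        float[] curr = new float[t.length() + 1]; // scores for row i
        float best = 0f;

        for (int i = 1; i <= s.length(); i++) {
            curr[0] = 0f;
            for (int j = 1; j <= t.length(); j++) {
                // All three moves add a signed value; nothing is subtracted.
                float diag = prev[j - 1] + sub.compare(s.charAt(i - 1), t.charAt(j - 1));
                float up = prev[j] + gapValue;
                float left = curr[j - 1] + gapValue;
                curr[j] = Math.max(0f, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, curr[j]);
            }
            // Reuse the two rows instead of allocating a full matrix.
            float[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return best;
    }
}
```

Calling `LocalAlignmentSketch.smithWaterman("abcd", "abd", -0.5f, LocalAlignmentSketch.MATCH_MISMATCH)` aligns three matching characters around one gap. Because every value is added, a positive gap value would reward gaps, which is why the sketch rejects it.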

Cleaned up Levenshtein, NeedlemanWunch, Jaro and Winkler.

That should take care of Sam's last remaining work.

jokillsya commented 9 years ago

Ok - it has been merged...

Excellent stuff... Yea - been trying to catch up to some of the weird things I noticed in 2012...

Alright - as far as licensing is concerned - I think we can safely move to a more permissive license for the Java Project.

I'm looking into APL, MIT, BSD and the LGPL.

Any thoughts? I don't think just picking the APL because somebody wants to include the code in an Apache project is a good enough reason to use it. I'm also not a big fan of V3 of the GPL - haven't been for a while now...

jokillsya commented 9 years ago

There is also a separation of concerns topic that can be discussed, i.e. should a third party system be dependent for its proper compilation and function on any of these metrics?

This influences the licensing dramatically: if the code can be dynamically linked, via modules or the like (so usage as opposed to separate modification), then something like the LGPL or the MIT license makes sense.

The only real concern here is verbatim code implementation in-line in a third party system, which I don't think is a good idea given that a project like hadoop or jetty doesn't require metrics for its normal function - if that makes sense.