dkpro / dkpro-similarity

Word and text similarity measures
https://dkpro.github.io/dkpro-similarity

Add semantic interoperability layer (switch from DKPro LSR to UBY) #39

Open nicolaierbs opened 9 years ago

nicolaierbs commented 9 years ago

Many similarity metrics use a lexical semantic resource for computing similarities, e.g. WordNet or Wiktionary. These resources are loaded using DKPro LSR (https://github.com/dkpro/lsr).

We can replace DKPro LSR with UBY to be able to use more resources and potentially combine information from different resources. This requires something like a "semantic" interoperability layer.

The following text is collected from a discussion between Torsten and Judith:


With "semantic" interoperability layer, I meant that LSR was mainly designed for use in semantic relatedness computation, which is probably a different reading of "semantics" than what you had in mind. LSR makes quite strong assumptions about what counts as an entity, a relation, etc.; i.e., it sometimes redefines the semantics of, e.g., what a synonymy relation is. This is mainly done for Wikipedia, though, as the other resources are more alike. In Wikipedia, e.g., we define article redirects to be synonyms.

My proposal is to replace (for all resources where this makes sense) the current wrapper that relies on the native API with one that uses the Uby API.

> In Wikipedia, e.g., we define article redirects to be synonyms.

The converter for Uby-Wikipedia sets the redirects to RELATED:

                senseRelation.setRelName(ERelNameSemantics.RELATED);
                senseRelation.setRelType(ERelTypeSemantics.association);
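
To illustrate the mismatch discussed here (LSR reads Wikipedia redirects as synonyms, while the Uby-Wikipedia converter labels them RELATED), here is a minimal sketch of how an interoperability layer might reinterpret relation names per resource. The enums `RelName` and `LsrRelation` and the class `RelationMapper` are hypothetical stand-ins invented for this example; they are not the actual Uby or LSR types. The sketch also assumes, as a simplification, that every RELATED link in the Wikipedia resource stems from a redirect.

```java
import java.util.Set;

// Hypothetical stand-in for the relation names found in a Uby database.
enum RelName { SYNONYM, RELATED, HYPERNYM, HYPONYM }

// Hypothetical stand-in for the relation types an LSR-style consumer expects.
enum LsrRelation { SYNONYMY, HYPERNYMY, HYPONYMY, OTHER }

final class RelationMapper {
    // Assumption: only for Wikipedia should RELATED be read as synonymy,
    // because there RELATED is taken to come from article redirects.
    private static final Set<String> RELATED_AS_SYNONYM = Set.of("Wikipedia");

    private RelationMapper() { }

    // Maps a relation name to the relation type the consumer should see,
    // taking the originating resource into account.
    static LsrRelation toLsr(String resource, RelName name) {
        switch (name) {
            case SYNONYM:
                return LsrRelation.SYNONYMY;
            case HYPERNYM:
                return LsrRelation.HYPERNYMY;
            case HYPONYM:
                return LsrRelation.HYPONYMY;
            case RELATED:
                return RELATED_AS_SYNONYM.contains(resource)
                        ? LsrRelation.SYNONYMY
                        : LsrRelation.OTHER;
            default:
                return LsrRelation.OTHER;
        }
    }
}
```

With such a mapping, `RelationMapper.toLsr("Wikipedia", RelName.RELATED)` would yield `SYNONYMY`, while the same relation name coming from another resource would stay `OTHER`.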

> My proposal is to replace (for all resources where this makes sense) the current wrapper that relies on the native API with one that uses the Uby API.


I looked into WordNet, GermaNet, and Wiktionary, and this all looks feasible. Actually, this is an interesting exercise which might improve the UBY API. In most cases (not Wikipedia, GermaNet), LSR could then also use Uby databases packaged as Maven artifacts.

Even OpenThesaurus might be wrappable in the near future as Christian M. recently completed a Uby converter for that.

However, replacing the wrappers with the UBY API will take some time - also depending on who will perform the changes. I see at least 3 tasks:

Altogether, this is estimated at 2 days for an experienced Uby developer, which is a lot.

Alternatively, how about changing the wrappers one after the other, resource by resource?
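
The resource-by-resource option could be sketched roughly as follows: a small registry hands out a Uby-backed wrapper for every resource that has already been migrated and falls back to the native wrapper otherwise. The names `ResourceWrapper` and `WrapperRegistry` are assumptions made up for this illustration and do not correspond to existing LSR or Uby classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical wrapper interface; the real LSR entry point may look different.
interface ResourceWrapper {
    // Reports which backend serves this wrapper ("native" or "uby");
    // included for illustration only.
    String backend();
}

final class WrapperRegistry {
    private final Map<String, Supplier<ResourceWrapper>> nativeWrappers = new HashMap<>();
    private final Map<String, Supplier<ResourceWrapper>> ubyWrappers = new HashMap<>();

    // Registers the existing wrapper that relies on the resource's native API.
    void registerNative(String resource, Supplier<ResourceWrapper> factory) {
        nativeWrappers.put(resource, factory);
    }

    // Registers a Uby-backed wrapper once a resource has been migrated.
    void registerUby(String resource, Supplier<ResourceWrapper> factory) {
        ubyWrappers.put(resource, factory);
    }

    // Prefers the Uby-backed wrapper and falls back to the native one.
    ResourceWrapper get(String resource) {
        Supplier<ResourceWrapper> factory =
                ubyWrappers.getOrDefault(resource, nativeWrappers.get(resource));
        if (factory == null) {
            throw new IllegalArgumentException("Unknown resource: " + resource);
        }
        return factory.get();
    }
}
```

Callers would only ever ask the registry for a resource by name, so migrating, say, WordNet would amount to one additional `registerUby` call while GermaNet keeps using its native wrapper.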