dkpro / dkpro-uby

Framework for creating and accessing UBY resources – sense-linked lexical resources in standard UBY-LMF format
https://dkpro.github.io/dkpro-uby
Other
22 stars 3 forks source link

Nonsensical loop in GermaNet converter's LexicalEntryGenerator #135

Closed logological closed 9 years ago

logological commented 9 years ago

The method de.tudarmstadt.ukp.lmf.transform.germanet.LexicalEntryGenerator.getLEPOS(Set<LexUnit>) consists of a for loop over a Set, but its last statement is break (and the loop contains no continue statements), so the loop will never get executed more than once. This is either a bug, or else a very confusing way of performing an operation on a single, arbitrarily chosen element of a Set. I have no idea which of these explanations is correct.

If this loop really is intended to get a single arbitrary element of the Set, then it should be replaced with the following:

LexUnit lu = luGroup.iterator().next();

Otherwise this is a bug and someone more familiar with the logic of the class should fix it. (Maybe it's as simple as removing the break statement.)

reckart commented 9 years ago

Getting an arbitrary/random element from a doesn't sound like a good idea.

logological commented 9 years ago

It's OK if you know a priori that all the elements in the set have the same property, and just want to find out what it is. Again, I have no idea whether that's the case here.

reckart commented 9 years ago

Still... using a set means that hashcode and equals should be implemented based on some properties of the objects in the set - a set using object equality is meaningless, better use a list.

So assuming that the objects in the set only have a single property and they all share the same value, that would imply that the set contains none or only one object.

Assuming that the objects have more properties that are covered by the hashcode/equals methods, then choosing a random element from the set again appears a bad idea.

Consider you write a test case that takes an object, serializes it to a string, and compares this string to a reference string. It is typical that such serializations cover multiple properties of the object. So the test may fail randomly. Of course you could and probably should test only over the desired property. When writing a focussed unit-test, this is easily possible. However, when the random value only contributes a to a small part of a larger test, e.g. an end-to-end test of an experiment, having randomly fluctuating values in a result dump can be annoying.

logological commented 9 years ago

So assuming that the objects in the set only have a single property and they all share the same value, that would imply that the set contains none or only one object.

Elements can have more than one property. Just because a set of elements have one property in common doesn't mean that the elements are identical.

reckart commented 9 years ago

Elements can have more than one property. Just because a set of elements have one property in common doesn't mean that the elements are identical.

Right. That's what the rest of my comment was about. An API that produces random output (that is in that case the other properties of the returned object) is more difficult to test. It can also create unreproducible experiments as the user of the API may not always be aware (may just forget) that only one property is guaranteed to be of a given value while the others are essentially random.

tr;dr: random output -> bad to test, harder to work with

judithek commented 9 years ago

thanks for pointing this out, Tristan. I am afraid there is currently noone around who is more familiar than you or I with the logic of the class. If you think this is a serious bug that should be fixed, it would be great if you could take care of this. If would be good though, if you could make sure that nothing else breaks in the GermaNetConverter.

chmeyer commented 9 years ago

I had a quick look at the code: The method extracts the part of speech tag from a set of lexical units for a given lexical entry. By design, each lexical unit in this set should have the same part of speech. There is a slight chance of a data inconsistency which is probably why the author of this code planned a for loop to check the next lexical unit.

I agree that the implementation is somewhat inconsistent, and I can offer to fix this in the future. There is however no bug as far as I can see.

chmeyer commented 9 years ago

Fixed.