Corpus QA, inconsistent handling of feature value data types

GenevieveGorrell commented 5 years ago

Your key annotations have feature values that are integers. Your response annotations have those integers as strings. (Or vice versa.) In Corpus QA, the classification tab overlooks it, but the F-score tab doesn't. The F-score calculation considers them wrong because the data type is wrong. This can be confusing, as you get a normal looking output for classification but a zero for F-score.

greenwoodma commented 5 years ago

Out of interest anyone know what the Annotation Diff tool does in this instance?

deansong commented 5 years ago

Annotation diff tool seems does the same, 100% mismatching between learning framework output and Key set.

johann-petrak commented 5 years ago

Hmm we should sure be consistent, but it is a matter of taste if we consider the string representation of some value to be identical to the value here. I think we have similar ambiguities in other places, e.g. with the conditional controller, where we always check for a string, but cannot remember if anything that is not a string is considered a mismatch or gets converted to string first. I am sure there are other situations where this is an issue and maybe we should have a consistent strategy for the type-handling issue?

I think one problem here is that if a user edits a non-string value in the feature editor it will become a string. On the other other hand one could argue that for classification the target and response should always be a string anyway (but with binary classification that would mean that true and false or 0/1 would have to be strings).

greenwoodma commented 5 years ago

I think the best option is to see if the two objects are comparable (i.e. the types are assignable one way or the other) and then to compare them properly. If they are of incompatible types then we should probably do toString() on both and compare the results. But whatever we do we should do consistently, so it might be a good candidate for a new method on gate.Utils

johann-petrak commented 5 years ago

I like that idea because one often is faced with similar choices in a Groovy script, JAPE or Java script, so having it in Utils would be handy! And we could gradually move to that behaviour (and document it) wherever we find the same kind of ambiguity.

greenwoodma commented 5 years ago

Looking at this in more detail and it's not so easy to fix as I thought. When you trace the code from the CorpusQA down through the many levels to the point at which the comparison is done it turns out it's simply a call to FeatureMap.subsume() to check the values which always ends up using .equals().

I don't think changing the subsume method is the right way to go as that is used in other places where we do want to insist on strict equality (I think), such as AnnotationSet.get(String type, Set features).

One option would be to add a parameter to subsumes that allows it to test for equivalence of feature values rather than equality, or we could simply replace the call to subsume by new code, but that would be mostly duplicated code, and that always opens up the possibility of other bugs creeping in as copies of code diverges.

Anyone have any thoughts or preferences?

johann-petrak commented 5 years ago

To me the main problem is that it is not clear how GATE is expected to handle featuremap values in the first place: it seems there are no real rules or promises about how values of different types are handled which would globally apply. The same key in two annotations of the same type can have to completely different types of values. Entering or updating the value through the GUI may make it impossible to stick to a specific type. Different plugins may have different ways to store values (e.g. true as boolean or string).

It is probably impossible to get any order into this so it seems it will always remain a decision to be made in context, for a specific situation. This means it would be good to have the basic methods where comparisons happen be able to choose from the API and it may mean that in some situations the behaviour that gets chosen can be specified by the user.

For Corpus QA maybe it would be enough to just decide we always use flexible automatic conversion where possible, and document that or decide we keep things as they are and document that?

So perhaps, in summary this could mean either:

implement a public utility function for doing flexible "is the same" comparison between different types
implement a version of subsume that uses that function instead of equals (flexiblesubsume?)
make Corpus QA use the flexiblesubsume method and document that

Or:

implement a public utility function for doing flexible "is the same" comparison for those who need it elsewhere
keep Corpus QA as it is and document that it will not auto-convert for comparison
maybe add methods and PRs to help users make sure that certain features in a feature map are always converted to one of the basic types and point to this from the Corpus QA documentation

greenwoodma commented 5 years ago

Yes, I have a method that does the equivalence rather than equality test implemented (not committed yet) as it was only after writing it that I realised there wasn't really anywhere to wire it in.

I think having extra subsumes methods with a boolean param to control if we do equals or equivalence, which uses the util method, is probably the safest way to go without breaking any existing stuff. We could then update the CorpusQA (and related stuff) to use that test.

Anyone have strong objections against that approach?

ianroberts commented 5 years ago

Thinking further about this, I wonder if it isn't safer just to stick to equals comparisons, clearly document that this is how it works, and provide some example JAPE shims to stick in your pipeline to convert feature values to/from strings where it is appropriate for your specific application. There are all sorts of caveats to Mark's suggested "relaxed" comparison case where it might actually end up considering things to be not-equal when they actually are, for example if I have two annotations with List-valued features that are .equals(), but one of them is a LinkedList and the other is an ArrayList (so neither type is assignable to the other).

greenwoodma commented 5 years ago

Yes, having discussed this in some detail at the developer meeting this morning, I'm in complete agreement with Ian. What this means in practice is that the bug isn't the the F-Measure scores are wrong but that the classification results are. So any "fix" to this should look at changing the classification code so that features with different types are considered to be different.

greenwoodma commented 5 years ago

So after more discussion the plan seems to have been agreed as:

We update the documentation to state that for results of the CorpusQA or AnnDiff to be valid, you must ensure that all values for a given feature are of the same type (or can sensibly be compared by .equals()). Ino other words if you feed it garbage expect to get garbage out. @adam-funk can you do this as part of the userguide clean up?
We add a code to the classification part of CorpusQA that warns when it thinks it's seen two annotations with the same feature but where the values have different types, as it has essentially merged two columns/rows in the confusion matrix as they have the same toString() representation
See if we can reduce the chance of accidentally changing the type of a feature value in the UI, to hopefully stop the even worse case than initially reported where you do get F-measure results but some of the features have been marked as not matching because of a type change.

adam-funk commented 5 years ago

@greenwoodma I'm adding that to my userguide to-do list.

greenwoodma commented 5 years ago

I've now fixed points two and three. @adam-funk can you close the issue once you are happy with the userguide?

adam-funk commented 5 years ago

User guide updated with warnings and a little advice (Corpus QA and Annotation Diff sections). I'm closing the issue now—please re-open it if I wrote anything incorrect.

ianroberts commented 5 years ago

See if we can reduce the chance of accidentally changing the type of a feature value in the UI

Idea from Kalina: can we have the UI check the type of the previous value and try to coerce the newly entered value to the same type? At least for Long, Integer, Float and Double if nothing else? Or at least pop up a warning when someone is about to accidentally change the type to be sure that's what they really wanted to do?

greenwoodma commented 5 years ago

I've opened it as a separate issue, as I think the answer is yes, but it's not really related to the original issue, so I think would be easier to track separately (when we want to work backwards and figure out where I broke something!)

GateNLP / gate-core

Corpus QA, inconsistent handling of feature value data types #72