clulab / eidos

Machine reading system for World Modelers
Apache License 2.0

GroundingInsightExporter shows strange results #1082

Open kwalcock opened 2 years ago

kwalcock commented 2 years ago

I'm digging into compositional groundings using the GroundingInsightExporter, and I noticed that the returned "score" for a given slot grounding does not equal the "avg match" score produced by averaging the match scores over all the positive examples. In some cases, the second best grounding by "score" has a higher "avg match" score than the top grounding, and in fact is sometimes the preferred grounding.

As an example, in a sentence like "X caused population growth", the top theme grounding is "wm/concept/population_demographics/" with a score of 0.88844055 but an avg match score of 0.60294354. The second best theme grounding is "wm/concept/population_demographics/population_density/population_growth" with a score of 0.86057734 (lower than the top grounding) but an avg match of 0.7405923 (higher than the top grounding).

Any idea why these scores are different, and where they are computed? I think I tracked down where "avg match" is getting computed, but the regular "score" is buried within several layers of different grounding classes. Any help is greatly appreciated!

kwalcock commented 2 years ago

I'm looking at https://github.com/clulab/eidos/blob/9f48e7e275d3f2fec882bce96e61eda1f26c96d7/src/main/scala/org/clulab/wm/eidos/exporters/GroundingInsightExporter.scala#L85 and https://github.com/clulab/eidos/blob/9f48e7e275d3f2fec882bce96e61eda1f26c96d7/src/main/scala/org/clulab/wm/eidos/exporters/GroundingInsightExporter.scala#L143. I think you are describing output from near the second link, the values for "max match" and "avg match".

From what I understand, a mention (some cause or effect) has been grounded as to theme and has returned several results, sorted from best to worst, in advance of all this code. The top two of those groundings go through examplesDetail, and because of their order, it is expected that the values for the first will be higher than those of the second. However, they come out with average scores of 0.60294354 and 0.7405923, which are reversed.

Isn't the original grounding done in advance based on the bag of words of all examples of each node, along with definitions and descriptions and the node name, etc.? It doesn't seem like the resulting vector would be strongly related to any of the vectors of the specific examples. Adding up vectors, normalizing the sum, and then taking the dot product will produce a different answer than taking the dot product with the different examples and then averaging the result, won't it?
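
A minimal, self-contained sketch of what I mean, with made-up vectors (this is not Eidos code and these are not its real vectors): the "score"-style number compares the mention against a single normalized sum of example vectors, while the "avg match"-style number averages the per-example comparisons, and the two generally disagree.

```scala
object ScoreVsAvgMatchSketch {

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def normalize(a: Array[Double]): Array[Double] = {
    val norm = math.sqrt(dot(a, a))
    a.map(_ / norm)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical, normalized vectors for a mention and two node examples.
    val mention = normalize(Array(1.0, 0.0, 0.0))
    val examples = Seq(
      normalize(Array(1.0, 1.0, 0.0)), // fairly close to the mention
      normalize(Array(0.0, 0.0, 1.0))  // orthogonal to the mention
    )

    // "score"-like value: dot product against the normalized sum of the example vectors.
    val summed = examples.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    val scoreLike = dot(mention, normalize(summed))

    // "avg match"-like value: average of the per-example dot products.
    val avgMatchLike = examples.map(dot(mention, _)).sum / examples.size

    println(f"score-like = $scoreLike%.4f, avg match-like = $avgMatchLike%.4f")
    // The two numbers differ, so the ordering of two candidate nodes under one
    // measure need not match their ordering under the other.
  }
}
```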

It seems like you are finding that the single vector is not sufficient and that there should be a vector (or matching text) for each example, so that a couple of really good example matches could decide the winner, rather than some combined vector that summarizes too many disparate examples.
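
A rough sketch of that alternative, with hypothetical names (NodeCandidate, topKScore) and made-up per-example similarities, just to illustrate how a few strong example matches could decide the ranking:

```scala
object PerExampleRankingSketch {

  case class NodeCandidate(name: String, exampleSims: Seq[Double])

  // Average of the top k per-example similarities, so a couple of very good
  // examples can dominate rather than being diluted by many weak ones.
  def topKScore(candidate: NodeCandidate, k: Int): Double = {
    val best = candidate.exampleSims.sorted(Ordering[Double].reverse).take(k)
    best.sum / best.size
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical similarities: a broad node with many mediocre matches vs.
    // a specific node with a few strong ones.
    val broad = NodeCandidate("wm/concept/population_demographics/",
      Seq(0.55, 0.60, 0.62, 0.64))
    val specific = NodeCandidate("wm/concept/population_demographics/population_density/population_growth",
      Seq(0.88, 0.86, 0.48))

    val ranked = Seq(broad, specific).sortBy(c => -topKScore(c, k = 2))
    ranked.foreach(c => println(f"${topKScore(c, k = 2)}%.4f  ${c.name}"))
    // Here the specific node wins on its best two example matches even if the
    // broad node's combined-vector score happens to be higher.
  }
}
```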

kwalcock commented 2 years ago

FYI @zupon

MihaiSurdeanu commented 2 years ago

Thanks for looking into this @kwalcock!