boomer selects suboptimal solution in simple 3-node problem

cmungall commented 3 years ago

for text files see #157.

Given:

Pr(A properSubClassOf C) = 0.99
Pr(A equiv B) = 0.95
Pr(B equiv C) = 0.95

(in each case, the only other possibility is siblingOf)

note each class is in a separate prefix space, so there is no penalty for equivalence between any pair of classes

Solutions:

1,2,3 : incoherent
1,2 : .99 .95 (1-.95) = 0.047
1,3 : .99 .95 (1-.95) = 0.047
2,3 : .95 .95 (1-0.99) = 0.009
1 : .99 .05 .05 = 0.002475
2 : .01 .95 .05 = 0.000475
3 : .01 .95 .05 = 0.000475
{} : .01 .05 .05 = 2.5e-05
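The products above can be re-checked with a short script. This is only a sketch of the arithmetic, not of anything boomer does internally: it assumes each rejected axiom falls back to siblingOf with probability 1 - p, and it does not test coherence, so {1,2,3} is scored even though it is ruled out above. The names `axioms` and `joint_probability` are made up for illustration.

```python
from itertools import combinations
from math import prod

# Probability that each proposed axiom holds (numbering follows the list above).
# Assumption: a rejected axiom means siblingOf, with probability 1 - p.
axioms = {
    1: ("A properSubClassOf C", 0.99),
    2: ("A equiv B", 0.95),
    3: ("B equiv C", 0.95),
}

def joint_probability(selected):
    """Multiply p for each selected axiom and (1 - p) for each rejected one."""
    return prod(p if k in selected else 1 - p for k, (_, p) in axioms.items())

# Score every subset of axioms. Coherence is NOT checked here.
scores = {
    subset: joint_probability(subset)
    for size in range(len(axioms) + 1)
    for subset in combinations(axioms, size)
}

# Print subsets from most to least probable.
for subset, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(set(subset) or "{}", f"{score:.6g}")
```

The value for {1} comes out at 0.002475, which is the "Most probable" figure boomer logs below.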

boomer generally selects {1}, depending on params, but never the optimal solution

I am pretty sure I have not made a typo - I put each class in its own ID space, so it is not avoiding 2 or 3 (which would happen if A/B/C were in the same ID space)

boomer -p prefixes.yaml -w 100 -r 1000 -t ptable.tsv --ontology logical.omn 
...
2021.02.05 09:23:19:376 [zio-def...] [INFO ] org.monarchinitiative.boomer.Main.program:49 - Most probable: 0.0024750000000000015
...
$ more output.txt 
A:1 SiblingOf B:1               0.05
B:1 SiblingOf C:1               0.05
A:1 ProperSubClassOf C:1        (most probable) 0.99

cmungall commented 3 years ago

I can confirm it's not avoiding any collapses: if I reduce the ptable to omit 1

ie

A:1 B:1 0.0 0.0 0.95    0.05
B:1 C:1 0.0 0.0 0.95    0.05
A:1 C:1 0.99    0.0 0.01    0.0

then it correctly finds

B:1 EquivalentTo C:1    (most probable) 0.95
A:1 EquivalentTo B:1    (most probable) 0.95

balhoff commented 3 years ago

I think the issue here is the high number of "windows" requested (100). Input rows are sorted according to their best probability, then the list of rows is chunked into the given number of windows. Across each independent run, shuffling occurs within each window, but the windows stay in the same total order. So it will always first add A ProperSubClassOf C. If you use a window value of 1, the rows are completely randomized and it is able to find the best solution.

balhoff commented 3 years ago

See the logging at the beginning of a run (with 100 windows requested):

2021.02.05 14:32:54:070 [zio-def...] [INFO ] org.monarchinitiative.boomer.Boom.evaluate:30 - Bin size: 1; Most probable: 0.99
2021.02.05 14:32:54:091 [zio-def...] [INFO ] org.monarchinitiative.boomer.Boom.evaluate:30 - Bin size: 2; Most probable: 0.95
2021.02.05 14:32:54:095 [zio-def...] [INFO ] org.monarchinitiative.boomer.Boom.evaluate:33 - Max possible joint probability: -0.11263692462860261

The axioms in the first bin will always be added before proceeding to the next bin. Different runs will just shuffle the order of the two items in the second bin.
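To make the effect of the bins concrete, here is a toy sketch in Python. It is not boomer's code (boomer is not written in Python); the function `ordering_for_run` and the row labels are hypothetical, and the bins are simply typed in to match the sizes reported in the log rather than derived by any real chunking rule.

```python
import random

def ordering_for_run(bins, rng):
    """Shuffle rows inside each bin, but keep the bins themselves in order."""
    order = []
    for rows in bins:
        rows = list(rows)  # copy so the input bins are left untouched
        rng.shuffle(rows)
        order.extend(rows)
    return order

# Bins matching the log above: one row at 0.99, then two rows at 0.95.
many_windows = [["A-C"], ["A-B", "B-C"]]
# Assumption: with a single window, all rows end up in one bin.
one_window = [["A-C", "A-B", "B-C"]]

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
first_many = {ordering_for_run(many_windows, rng)[0] for _ in range(1000)}
first_one = {ordering_for_run(one_window, rng)[0] for _ in range(1000)}

print(first_many)  # rows that were ever tried first with two bins
print(first_one)   # rows that were ever tried first with one bin
```

With two bins the A-C row is the only one ever tried first, no matter how many runs are requested. With a single bin any of the three rows can come first, which is what lets the search reach orderings that start with an equivalence.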

cmungall commented 3 years ago

my ticket is in error... more later

balhoff commented 3 years ago

I think we cleared this up. "windows" may not be as obvious as they ought to be but I think the UI will continue to evolve.

cmungall commented 1 year ago

still an issue

A:1 B:1 0.0 0.0 0.95    0.05
B:1 C:1 0.0 0.0 0.95    0.05
A:1 C:1 0.99    0.0 0.01    0.0

running boomer -t triangle.ptable.tsv -a triangle.owl -p prefixes.yaml -r 500 -w 1 -e 200 --output-internal-axioms true

yields

## SINGLETONS
Method: singletons
Score: -0.05129329438755058
Estimated probability: 1.0
Confidence: 1.0
Subsequent scores (max 10):

- [B:1](http://purl.obolibrary.org/obo/B_1) EquivalentTo [C:1](http://purl.obolibrary.org/obo/C_1)      (most probable) 0.95

and an incoherent output.ofn
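A quick arithmetic check on the numbers above, in case it helps whoever picks this up. It assumes the scores are natural logs of probabilities, which is a guess from the values and not something confirmed in this thread. The variable names are just for this sketch.

```python
import math

reported_score = -0.05129329438755058  # "Score" in the output above
logged_max = -0.11263692462860261      # "Max possible joint probability" in the earlier log

# Probability implied by the reported score, if it is a natural log.
print(math.exp(reported_score))

# Log of a single 0.95 row on its own.
single_equivalence = math.log(0.95)
print(single_equivalence)

# Log of the best option of all three rows taken together (0.99 * 0.95 * 0.95).
all_three_best = math.log(0.99) + 2 * math.log(0.95)
print(all_three_best)
```

Under that assumption the reported score is the log of one 0.95 row by itself, and the earlier "Max possible" value is the log of 0.99 * 0.95 * 0.95.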

INCATools / boomer

boomer selects suboptimal solution in simple 3-node problem #158