andre-martins / AD3

Alternating Directions Dual Decomposition
GNU Lesser General Public License v3.0
68 stars 38 forks source link

Surprising posterior format #3

Closed amueller closed 11 years ago

amueller commented 11 years ago

Hi Andre. I noticed some behavior of the inference results that I didn't expect. Sometimes, all posteriors for a given variable are 1 or all are zero. I think this happens only when the solution is integral, but I'm not sure. I would have expected the posteriors to always sum to one. What is the meaning of these posteriors? Thanks, Andy

andre-martins commented 11 years ago

Hi Andy,

That is weird. The posteriors for a variable (i.e. "MultiVariable" in the code) should sum to one (the sum ranging over the values of that variable). How is the factor graph you're defining? If you send me your code or data, I can try to reproduce the problem.

Best,

André

2013/4/30 Andreas Mueller notifications@github.com

Hi Andre. I noticed some behavior of the inference results that I didn't expect. Sometimes, all posteriors for a given variable are 1 or all are zero. I think this happens only when the solution is integral, but I'm not sure. I would have expected the posteriors to always sum to one. What is the meaning of these posteriors? Thanks, Andy

— Reply to this email directly or view it on GitHubhttps://github.com/andre-martins/AD3/issues/3 .

amueller commented 11 years ago

Thanks for your lightning-fast response. I was using my Python wrapper with a very simple interface, but maybe there is a bug there. I just use create_multi_variable and create_dense_factor. I'll try to produce a minimalistic example.

amueller commented 11 years ago

I think I have found the problem in my code. The variables did not belong to any factors. I have a pairwise model where some variables might not be connected to the rest. Is there a simple way to do that? Do I have to create a dense factor for each of this variables?

amueller commented 11 years ago

I guess I should be calling FixMultiVariablesWithoutFactors, right?

amueller commented 11 years ago

One more question: In the branch and bound solver, I get branching with value = 1. It seems to me that if that is the most fractional value, the solution is integer. Still, it keeps branching and returns unsolved. Any idea why?

andre-martins commented 11 years ago

Hi Andreas,

Yes, FixMultiVariablesWithoutFactors should solve the problem. An alternative is to connect each of the BinaryVariables associated with the value of a MultiVariable to a XOR factor (which is precisely what FixMultiVariablesWithoutFactors does).

I'll come back soon to your branch and bound question...

Andre'

2013/4/30 Andreas Mueller notifications@github.com

I guess I should be calling FixMultiVariablesWithoutFactors, right?

— Reply to this email directly or view it on GitHubhttps://github.com/andre-martins/AD3/issues/3#issuecomment-17229110 .

amueller commented 11 years ago

Thanks for your help. Good to know I found the right fix! I must have overlooked that before.

andre-martins commented 11 years ago

That shouldn't happen... Is it still happening after running FixMultiVariablesWithoutFactors?

2013/4/30 Andreas Mueller notifications@github.com

One more question: In the branch and bound solver, I get branching with value = 1. It seems to me that if that is the most fractional value, the solution is integer. Still, it keeps branching and returns unsolved. Any idea why?

— Reply to this email directly or view it on GitHubhttps://github.com/andre-martins/AD3/issues/3#issuecomment-17235873 .

amueller commented 11 years ago

Yes. Maybe I did something else wrong, too. I'll see if I can generate an example / investigate.

andre-martins commented 11 years ago

Thanks. Please keep me posted about this.

The branch and bound function was not much tested in this release, specially with MultiVariables, so there might eventually be a bug somewhere. I've mostly used branch and bound with binary graphs with hard constraints. This part of the code is also not optimized (e.g. we could branch differently for multi-variables, and we could cache some of the computations in subsequent calls to the LP-MAP algorithm, but that's not yet implemented...)

Andre'

amueller commented 11 years ago

I saw that the implementation was quite simple, but it works very well for me :) I have to submit a paper today but I'm sure to keep you updated if I find out anything.

andre-martins commented 11 years ago

Thanks. Good luck with your submission!

Andre'

2013/5/1 Andreas Mueller notifications@github.com

I saw that the implementation was quite simple, but it works very well for me :) I have to submit a paper today but I'm sure to keep you updated if I find out anything.

— Reply to this email directly or view it on GitHubhttps://github.com/andre-martins/AD3/issues/3#issuecomment-17270528 .

amueller commented 11 years ago

So I finally did find some time to investigate. It is a bit hard to produce the example as it happens somewhere in learning. What I get is something like this:

Iteration = 127 Dual obj = 1208.79  Primal rel obj = 1208.79    Primal obj = -1e+100    Dual residual = 1.1372e-08  Primal residual = 7.84201e-16   Best dual obj = 1208.79 Best primal rel obj = 1208.79   Best primal obj = -1e+100   Cache
Solution value after 127 iterations (AD3) = 1208.79
Took 0.008 sec.
Solution is fractional.
Branching on variable 16 at depth 5 (value = 1) (most fractional = 0.25)

To me, this looks like a solution is marked as fractional even though it is not. Could this be the cause of the problem? Thanks, Andy

amueller commented 11 years ago

Btw, I also get

python: GenericFactor.cpp:595: virtual void AD3::GenericFactor::SolveQP(const std::vector<double, std::allocator<double> >&, const std::vector<double, std::allocator<double> >&, std::vector<double, std::allocator<double> >*, std::vector<double, std::allocator<double> >*): Assertion `distribution_[i] > -1e-16' failed.
[5]    25289 abort (core dumped)  nosetests -sv tests/test_latent_svm.py -m switch_to_ad3

Do you know what might cause this?

andre-martins commented 11 years ago

I'm looking at the code https://github.com/andre-martins/AD3/blob/master/ad3/FactorGraph.cpp#L741. And you're right there's a bug there!

Instead of

double most_fractional_value = 1.0;

it should be

double most_fractional_value = 0.25;

since 0.25 = (1-0.5)*(1-0.5).

However, I'm not sure this will solve the problem, since integer solutions should have been spotted before in the call to RunAD3 which should have returned status = STATUS_OPTIMAL_INTEGER.

It's possible that the problem has to do with numeric precision issues. In that case a quick fix would be in line 1164,

bool fractional = false; value = 0.0; for (int i = 0; i < variables.size(); ++i) { if (!NEARLY_BINARY((_posteriors)[i], 1e-12)) fractional = true; value += variables[i]->GetLogPotential() * (_posteriors)[i]; } for (int i = 0; i < additional_log_potentials.size(); ++i) { _value += additional_log_potentials[i] * (_additional_posteriors)[i]; }

replacing if

(!NEARLY_BINARY((*posteriors)[i], 1e-12)) fractional = true

by a smaller threshold, say

(!NEARLY_BINARY((*posteriors)[i], 1e-6)) fractional = true

Not very elegant, but will probably solve the problem...

Any chance you have a AD3 file with the factor graph so that I can try to reproduce the problem here?

André

2013/6/28 Andreas Mueller notifications@github.com

So I finally did find some time to investigate. It is a bit hard to produce the example as it happens somewhere in learning. What I get is something like this:

Iteration = 127 Dual obj = 1208.79 Primal rel obj = 1208.79 Primal obj = -1e+100 Dual residual = 1.1372e-08 Primal residual = 7.84201e-16 Best dual obj = 1208.79 Best primal rel obj = 1208.79 Best primal obj = -1e+100 Cache Solution value after 127 iterations (AD3) = 1208.79 Took 0.008 sec. Solution is fractional. Branching on variable 16 at depth 5 (value = 1) (most fractional = 0.25)

To me, this looks like a solution is marked as fractional even though it is not. Could this be the cause of the problem? Thanks, Andy

— Reply to this email directly or view it on GitHubhttps://github.com/andre-martins/AD3/issues/3#issuecomment-20184441 .

amueller commented 11 years ago

Thanks a lot for your suggestion. I will try the precision change. I created the factor graph with the python interface and I don't have the factor parameters :-/ I can't see from the python side when that happens. I could maybe just output the model in the C++ file... I'll look into that.

amueller commented 11 years ago

Even setting the threshold to 1e-2 didn't make a difference. That seems quite odd. There is no way to dump a factor graph from the code, is there? I guess I have to hack some signal through the wrappers so I know which model it is (it happens during learning).

andre-martins commented 11 years ago

That's also odd. I suspect it may be some numeric underflow issue. You may try to comment out the assertion or perhaps to cout << distribution_[i] before aborting to see what's going on... Currently, my code is not very robust when the log-potentials are all tiny numbers (that's something I need to fix in the future).

Andre'

2013/6/28 Andreas Mueller notifications@github.com

Btw, I also get

python: GenericFactor.cpp:595: virtual void AD3::GenericFactor::SolveQP(const std::vector<double, std::allocator >&, const std::vector<double, std::allocator >&, std::vector<double, std::allocator >, std::vector<double, std::allocator >): Assertion `distribution_[i] > -1e-16' failed. [5] 25289 abort (core dumped) nosetests -sv tests/test_latent_svm.py -m switch_to_ad3

Do you know what might cause this?

— Reply to this email directly or view it on GitHubhttps://github.com/andre-martins/AD3/issues/3#issuecomment-20184640 .

andre-martins commented 11 years ago

There's a FactorGraph::Print(...) function in https://github.com/andre-martins/AD3/blob/master/ad3/FactorGraph.h#L320, which dumps a factor graph to a AD3 format file. Maybe you could wrap that function in python. That function assumes every factor type has its Print(...) implementation. However, currently, that method is not implemented in some generic factors (another thing to be done in the future). What kind of factors are you using?

Andre'

2013/6/28 Andreas Mueller notifications@github.com

Even setting the threshold to 1e-2 didn't make a difference. That seems quite odd. There is no way to dump a factor graph from the code, is there? I guess I have to hack some signal through the wrappers so I know which model it is (it happens during learning).

— Reply to this email directly or view it on GitHubhttps://github.com/andre-martins/AD3/issues/3#issuecomment-20194348 .

amueller commented 11 years ago

My factor graph is in this gist. Somehow I overlooked the Print function before.

amueller commented 11 years ago

I get Unknown factor type: 8 when trying to run the command line tool on the file, though :-/

andre-martins commented 11 years ago

I'm looking at the gist. The first factor you're plotting (in line 259) is not showing up its type. It's probably some factor which derives from GenericFactor (maybe a FactorDense) which doesn't have the Print(...) method implemented? If you tell me what type of factor is its, I can implement its Print(...) method and make a push to the repository.

However, there's something else... From the second line of the AD3 file, there should be 176 factors in the graph, and all I can see is 63 XOR factors plus the abovementioned factor in line 259. Any idea what are the missing 112 factors and why they aren't showing up in the file?

Andre'

2013/6/29 Andreas Mueller notifications@github.com

I get Unknown factor type: 8 when trying to run the command line tool on the file, though :-/

— Reply to this email directly or view it on GitHubhttps://github.com/andre-martins/AD3/issues/3#issuecomment-20232276 .

andre-martins commented 11 years ago

Never mind, I just noticed everything is collapsed in line 259. I'll implement the Print method shortly. I guess those are DENSE factors, right?

2013/6/29 Andre Martins andre.t.martins@gmail.com

I'm looking at the gist. The first factor you're plotting (in line 259) is not showing up its type. It's probably some factor which derives from GenericFactor (maybe a FactorDense) which doesn't have the Print(...) method implemented? If you tell me what type of factor is its, I can implement its Print(...) method and make a push to the repository.

However, there's something else... From the second line of the AD3 file, there should be 176 factors in the graph, and all I can see is 63 XOR factors plus the abovementioned factor in line 259. Any idea what are the missing 112 factors and why they aren't showing up in the file?

Andre'

2013/6/29 Andreas Mueller notifications@github.com

I get Unknown factor type: 8 when trying to run the command line tool on the file, though :-/

— Reply to this email directly or view it on GitHubhttps://github.com/andre-martins/AD3/issues/3#issuecomment-20232276 .

amueller commented 11 years ago

Yes they are, thank you soo much!

andre-martins commented 11 years ago

Done.

https://github.com/andre-martins/AD3/commit/2ced06d62e670193b9d3c303a610a792881ce417

Can you dump the factor graph again?

Andre'

2013/6/29 Andreas Mueller notifications@github.com

Yes they are, thank you soo much!

— Reply to this email directly or view it on GitHubhttps://github.com/andre-martins/AD3/issues/3#issuecomment-20233299 .

amueller commented 11 years ago

Thanks. I updated the gist. When I do the branch-and-bound from the command line with the generated file, everything works nicely. That is a bit weird. I did

./ad3_multi --file_graphs=graph.ad3 --file_posteriors=out.blub --max_iterations=1000 --eta=.9  --exact=true

I guess that means I did something wrong in the wrappers? That the serialization changes the result seems a surprising. I'll try to investigate more.

amueller commented 11 years ago

I guess the problem is that I serialize in RunBranchAndBound where the graph is already modified...

amueller commented 11 years ago

Tada: ../AD3/ad3_multi --file_graphs=graph_1026413173.ad3 --file_posteriors=out.blub --max_iterations=4000 --eta=.1 --exact=true | grep "value = 1" with this gist.

andre-martins commented 11 years ago

Thanks for updating the gist. I just corrected a bug in the branch and bound function (every time a variable was being branched, it was marked as a branching variable, and was not being unmarked at the end - this was causing the problem you spotted). I just rerun AD3 in your factor graph and it works now:

Solution is integer. Solution value for AD3 ILP: 478.684 Elapsed time: 0.196 sec.

Thanks a lot for reporting this! And let me know if you spot something else.

André

2013/7/1 Andreas Mueller notifications@github.com

Tada: ../AD3/ad3_multi --file_graphs=graph_1026413173.ad3 --file_posteriors=out.blub --max_iterations=4000 --eta=.1 --exact=true | grep "value = 1" with this gist https://gist.github.com/amueller/5891630 .

— Reply to this email directly or view it on GitHubhttps://github.com/andre-martins/AD3/issues/3#issuecomment-20269366 .

amueller commented 11 years ago

Thanks a lot for the quick fix :) Awesome!