blake-riley commented 1 year ago

Describe the issue From the paper: $\textrm{BIC} = n \ln(\textrm{RSS}/n)+k \ln(n)$ , where $k = 4n{\textrm{active atoms}} / t{d_{min}}$

$k$ is meant to be a description of the number of parameters (x, y, z, B) in the model that is being fit to density.

If we are not sampling B, then B is not a parameter of the model, and we should only multiply by 3n.
$1 / t{d{min}}$ is the maximum number of conformers that will be returned by the solver (i.e. an overestimate). As such, $n{\textrm{active atoms}} / t{d{min}}$ will always overestimate of the total number of atoms in the model. Since we can calculate the total number of conformers returned after the solver, should we not just use $k = 4 n\textrm{confs} n_{\textrm{active atoms}}$ ? I'm just realizing that this seems to be the reason behind @sauloho 's f97a090 commit that introduced $k$. This point may affect #328 .
There is an assumption that the parameters in the model are independent from each other. This is not true: bond lengths are constrained, so we're at least down 1 d.o.f. for every atom we introduce --- more if there are more constraints. This is particularly pertinent to qfit_ligand: we have to introduce a lot of new atoms even if they're only different from each other in 1 d.o.f., and so introducing a new conf to a ligand is often prohibitively expensive to complexity. At the moment, it seems like the simplest way to deal with this is to turn off BIC threshold selection by default during qfit-ligand. More long-term, how do we accurately represent the amount of parameters introduced via k? We might choose to introduce a tunable 'complexity penalty scale factor', $k = 4 f\textrm{penalty scaling} n\textrm{confs} n_{\textrm{active atoms}}$, where f is 0.75 by default for instance?

LMK your thoughts on this.

stephaniewankowicz commented 1 year ago

I have been playing around with this on our large test dataset (basically using the complexity penalty scale factor 0.5-0.9). I am going to run two additional tests:

1) k = (3 natoms) / threshold (representing not sampling b-factor; with b-factor sampling turned off) 2) k = (4 natoms*nconfs out of MIQP)

I will look at the 'complexity penalty scale factor' stuff this afternoon (basically tradeoff between #alt conf and #of deposited alt confs missed) will keep Rfree at ~same level. If something looks like a trend from that data, we can add it into the run scenarios above. If something else should be tested, let me know ASAP.

blake-riley commented 1 year ago

Parts 1 & 2 of this issue have been addressed by the (now closed) PR #328 .

blake-riley commented 1 year ago

As for part 3: #328 set a 'complexity penalty scale factor'=0.95 into k (98ecaa9). I still think that k—"free model parameters"—should correspond to the number of degrees of freedom in the model, though this is a lot more work.

336 (which I agree with) highlights that BIC $k$ will be based off the number of rotatable bonds (since this is the number of degrees of freedom that are sampled).

IMO, I think ultimately a similar approach should be taken for this issue:

6 dof for translation & rotations of residue
2 dof for backbone dihedrals phi + psi (and we shouldn't be sampling omega!)
1 dof if options.sample_angle
1 dof per residue.nchi

ExcitedStates / qfit-3.0

Discussion: BIC equation (in paper) overestimates number of parameters, overpenalizes more complex solutions #333

336 (which I agree with) highlights that BIC $k$ will be based off the number of rotatable bonds (since this is the number of degrees of freedom that are sampled).